metha

Command line OAI harvester and client with built-in cache.



The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a
low-barrier mechanism for repository interoperability. Data Providers are
repositories that expose structured metadata via OAI-PMH. Service Providers
then make OAI-PMH service requests to harvest that metadata. -- https://www.openarchives.org/pmh/

The metha command line tools can gather information on OAI-PMH endpoints and
harvest data incrementally. The goal of metha is to make it simple to get
access to the data; its focus is not to manage it.

Project status: Active – The project has reached a stable, usable state and is being actively developed.

The metha tool has been developed for Project finc at
Leipzig University Library.

Why yet another OAI harvester?

  • I wanted to crawl arXiv but found that existing tools would time out.
  • Some harvesters would start to download all records anew if I interrupted
    a running harvest.
  • There are many OAI endpoints out there. It is a widely used protocol and
    somewhat worth knowing.
  • I wanted something simple for the command line that is also fast and
    robust. As implemented now, metha is relatively robust and more efficient
    than requesting all records one by one (there is one annoyance which will
    hopefully be fixed soon).

How it works

The functionality is spread across a few different executables:

  • metha-sync for harvesting
  • metha-cat for viewing
  • metha-id for gathering data about endpoints
  • metha-ls for inspecting the local cache
  • metha-files for listing the associated files for a harvest

To harvest an endpoint in the default oai_dc format:

$ metha-sync http://export.arxiv.org/oai2
...

All downloaded files are written to a directory below a base directory. The base
directory is ~/.metha by default and can be adjusted with the METHA_DIR
environment variable.

When the -dir flag is set, only the directory corresponding to a harvest is printed.

$ metha-sync -dir http://export.arxiv.org/oai2
/home/miku/.metha/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
$ METHA_DIR=/tmp/harvest metha-sync -dir http://export.arxiv.org/oai2
/tmp/harvest/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
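The directory name is derived from the metadata format and the endpoint URL: the example path above is the base64 encoding of `#oai_dc#http://export.arxiv.org/oai2` (a scheme inferred here from the example path, not from documented metha behavior), so it can be inspected with a standard decoder:

```shell
# Decode the cache directory name from the example above back into the
# format and endpoint it was created for. (Note: the "#<format>#<url>"
# layout is inferred from this one example, not from metha's docs.)
echo "I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky" | base64 -d
# #oai_dc#http://export.arxiv.org/oai2
```

This means two harvests of the same endpoint in different formats land in different cache directories.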

The harvesting can be interrupted at any time and the HTTP client will
automatically retry failed requests a few times before giving up.

Currently, there is a limitation: data can only be harvested up to the last
full day. For example, if the current date were Thu Apr 21 14:28:10 CEST 2016,
the harvester would request all data between the repository's earliest date
and 2016-04-20 23:59:59.
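The cutoff in that example (the end of the previous day) can be sketched with GNU date; metha computes this internally:

```shell
# End of the last full day before a given date (assumes GNU date,
# which accepts relative items like "1 day ago" after a date string).
date -d "2016-04-21 1 day ago" +"%Y-%m-%d 23:59:59"
# 2016-04-20 23:59:59
```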

To stream the harvested XML data to stdout run:

$ metha-cat http://export.arxiv.org/oai2

You can also filter records by datestamp:

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2

This will only stream records with a datestamp equal to or later than 2016-01-01.

To just stream all data really fast, use find and zcat over the harvesting
directory.

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*.gz" -exec zcat {} +
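Building on that, a quick sketch for counting harvested records across all compressed chunks (it assumes each record is serialized as an OAI-PMH `<record>` element, one opening tag per line):

```shell
# Decompress every chunk of the harvest in one stream and count records.
# (Sketch: the grep pattern assumes one <record> opening tag per line.)
find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*.gz" -exec zcat {} + | grep -c "<record>"
```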
