metha

Command line OAI harvester and client with built-in cache.



The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a
low-barrier mechanism for repository interoperability. Data Providers are
repositories that expose structured metadata via OAI-PMH. Service Providers
then make OAI-PMH service requests to harvest that metadata. -- https://www.openarchives.org/pmh/

The metha command line tools can gather information on OAI-PMH endpoints and
harvest data incrementally. The goal of metha is to make it simple to get
access to the data; its focus is not to manage it.

Project status: Active – The project has reached a stable, usable state and is being actively developed.

The metha tool has been developed for Project finc at
Leipzig University Library.

Why yet another OAI harvester?

  • I wanted to crawl arXiv but found that existing tools would time out.
  • Some harvesters would start to download all records anew if I interrupted
    a running harvest.
  • There are many OAI endpoints out there. It is a widely used protocol and
    somewhat worth knowing.
  • I wanted something simple for the command line that is also fast and
    robust. As implemented now, metha is relatively robust and more efficient
    than requesting all records one by one (there is one annoyance which will
    hopefully be fixed soon).

How it works

The functionality is spread across a few different executables:

  • metha-sync for harvesting
  • metha-cat for viewing
  • metha-id for gathering data about endpoints
  • metha-ls for inspecting the local cache
  • metha-files for listing the associated files for a harvest

To harvest an endpoint in the default oai_dc format:

$ metha-sync http://export.arxiv.org/oai2
...

All downloaded files are written to a directory below a base directory. The base
directory is ~/.metha by default and can be adjusted with the METHA_DIR
environment variable.

When the -dir flag is set, only the directory corresponding to a harvest is printed.

$ metha-sync -dir http://export.arxiv.org/oai2
/home/miku/.metha/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
$ METHA_DIR=/tmp/harvest metha-sync -dir http://export.arxiv.org/oai2
/tmp/harvest/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
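The directory name is derived from the metadata format and the endpoint URL: the example path above is the base64 encoding of `#oai_dc#http://export.arxiv.org/oai2` (a scheme inferred here from the example path, not from documented metha behavior), so it can be inspected with a standard decoder:

```shell
# Decode the cache directory name from the example above back into the
# format and endpoint it was created for. (Note: the "#<format>#<url>"
# layout is inferred from this one example, not from metha's docs.)
echo "I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky" | base64 -d
# #oai_dc#http://export.arxiv.org/oai2
```

This means two harvests of the same endpoint in different formats land in different cache directories.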

The harvesting can be interrupted at any time and the HTTP client will
automatically retry failed requests a few times before giving up.

Currently, there is a limitation: data can only be harvested up to the last
full day. For example, if the current date were Thu Apr 21 14:28:10 CEST 2016,
the harvester would request all data between the repository's earliest date
and 2016-04-20 23:59:59.
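The cutoff in that example (the end of the previous day) can be sketched with GNU date; metha computes this internally:

```shell
# End of the last full day before a given date (assumes GNU date,
# which accepts relative items like "1 day ago" after a date string).
date -d "2016-04-21 1 day ago" +"%Y-%m-%d 23:59:59"
# 2016-04-20 23:59:59
```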

To stream the harvested XML data to stdout run:

$ metha-cat http://export.arxiv.org/oai2

You can also filter records by datestamp:

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2

This will only stream records with a datestamp equal to or later than 2016-01-01.

To just stream all data really fast, use find and zcat over the harvesting
directory.

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*.gz" -exec zcat {} +
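Building on that, a quick sketch for counting harvested records across all compressed chunks (it assumes each record is serialized as an OAI-PMH `<record>` element, one opening tag per line):

```shell
# Decompress every chunk of the harvest in one stream and count records.
# (Sketch: the grep pattern assumes one <record> opening tag per line.)
find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*.gz" -exec zcat {} + | grep -c "<record>"
```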
