metha

Command line OAI harvester and client with built-in cache.

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a
low-barrier mechanism for repository interoperability. Data Providers are
repositories that expose structured metadata via OAI-PMH. Service Providers
then make OAI-PMH service requests to harvest that metadata. -- https://www.openarchives.org/pmh/

The metha command line tools can gather information on OAI-PMH endpoints and
harvest data incrementally. The goal of metha is to make it simple to get
access to data; its focus is not to manage it.

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

The metha tool has been developed for Project finc at
Leipzig University Library.

Why yet another OAI harvester?

  • I wanted to crawl Arxiv but found that existing tools would time out.
  • Some harvesters would start to download all records anew if a running
    harvest was interrupted.
  • There are many OAI endpoints out there. It is a widely used protocol and
    somewhat worth knowing.
  • I wanted something simple for the command line; also fast and robust. As
    implemented now, metha is relatively robust and more efficient than
    requesting all records one by one (there is one annoyance which will
    hopefully be fixed soon).

How it works

The functionality is spread across a few different executables:

  • metha-sync for harvesting
  • metha-cat for viewing
  • metha-id for gathering data about endpoints
  • metha-ls for inspecting the local cache
  • metha-files for listing the associated files for a harvest
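
To get a first impression of an endpoint and of what is already cached
locally, metha-id and metha-ls can be used directly (a sketch; the exact
output depends on the endpoint and the metha version):

$ metha-id http://export.arxiv.org/oai2   # identify the endpoint, report supported formats and sets
$ metha-ls                                # list harvests in the local cache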

To harvest an endpoint in the default oai_dc format:

$ metha-sync http://export.arxiv.org/oai2
...
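
If the endpoint offers additional metadata formats, a different format can be
requested as well; recent versions expose this via a -format flag (a hedged
example, check metha-sync -h for the exact options of your version):

$ metha-sync -format marcxml http://export.arxiv.org/oai2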

All downloaded files are written to a directory below a base directory. The base
directory is ~/.metha by default and can be adjusted with the METHA_DIR
environment variable.

When the -dir flag is set, only the directory corresponding to a harvest is printed.

$ metha-sync -dir http://export.arxiv.org/oai2
/home/miku/.metha/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
$ METHA_DIR=/tmp/harvest metha-sync -dir http://export.arxiv.org/oai2
/tmp/harvest/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
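
The directory name encodes the metadata format and the endpoint URL. For the
example above, decoding the base64 basename makes this visible (a sketch
using GNU coreutils base64; the exact naming scheme may differ between
versions):

$ basename $(metha-sync -dir http://export.arxiv.org/oai2) | base64 --decode
#oai_dc#http://export.arxiv.org/oai2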

The harvesting can be interrupted at any time and the HTTP client will
automatically retry failed requests a few times before giving up.
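
Because harvesting is incremental, re-running the same command continues or
updates a harvest; only date windows that are not yet in the cache are
requested again (a hedged description of the caching behavior, no extra
flags required):

$ metha-sync http://export.arxiv.org/oai2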

Currently there is a limitation: data can only be harvested up to the end of
the previous day. Example: if the current date were Thu Apr 21 14:28:10 CEST
2016, the harvester would request all data between the repository's earliest
date and 2016-04-20 23:59:59.

To stream the harvested XML data to stdout run:

$ metha-cat http://export.arxiv.org/oai2

You can emit records based on datestamp as well:

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2

This will only stream records with a datestamp equal to or later than 2016-01-01.

To stream all data quickly, combine find and zcat over the harvest directory.
The compressed files belonging to a harvest can be listed with:

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz"
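
Piping the file list into zcat decompresses everything in one stream; a rough
record count is one possible follow-up (a sketch; the grep pattern assumes
unprefixed <record> elements in the stored responses):

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz" | xargs zcat
$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz" | xargs zcat | grep -c "<record>"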
