metha

Command line OAI harvester and client with built-in cache.

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a
low-barrier mechanism for repository interoperability. Data Providers are
repositories that expose structured metadata via OAI-PMH. Service Providers
then make OAI-PMH service requests to harvest that metadata. -- https://www.openarchives.org/pmh/
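
For illustration (not a metha command), an OAI-PMH request is a plain HTTP GET
with a verb parameter; for example, asking the arXiv endpoint used below to
identify itself:

$ curl "http://export.arxiv.org/oai2?verb=Identify"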

The metha command line tools can gather information on OAI-PMH endpoints and
harvest data incrementally. The goal of metha is to make it simple to get
access to data; its focus is not on managing it.

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

The metha tool has been developed for Project finc at
Leipzig University Library.

Why yet another OAI harvester?

  • I wanted to crawl Arxiv but found that existing tools would timeout.
  • Some harvesters would start to download all records anew if I interrupted
    a running harvest.
  • There are many OAI endpoints out there. It is a widely used protocol and
    somewhat worth knowing.
  • I wanted something simple for the command line that is also fast and
    robust. As implemented now, metha is relatively robust and more efficient
    than requesting all records one by one (there is one annoyance which will
    hopefully be fixed soon).

How it works

The functionality is spread across a few different executables:

  • metha-sync for harvesting
  • metha-cat for viewing
  • metha-id for gathering data about endpoints
  • metha-ls for inspecting the local cache
  • metha-files for listing the associated files for a harvest
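
As a quick sketch of how these fit together (the exact flags and output vary;
each tool prints its usage with -h), one might first look at an endpoint and
then at the local cache:

$ metha-id http://export.arxiv.org/oai2
$ metha-ls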

To harvest an endpoint in the default oai_dc format:

$ metha-sync http://export.arxiv.org/oai2
...

All downloaded files are written to a directory below a base directory. The base
directory is ~/.metha by default and can be adjusted with the METHA_DIR
environment variable.

When the -dir flag is set, only the directory corresponding to a harvest is printed.

$ metha-sync -dir http://export.arxiv.org/oai2
/home/miku/.metha/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
$ METHA_DIR=/tmp/harvest metha-sync -dir http://export.arxiv.org/oai2
/tmp/harvest/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
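
The directory name itself appears to be a plain base64 encoding of the
(possibly empty) set, the metadata format and the endpoint URL, joined with
"#". Assuming that, it can be inspected with standard tools:

$ basename $(metha-sync -dir http://export.arxiv.org/oai2) | base64 -d
#oai_dc#http://export.arxiv.org/oai2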

The harvesting can be interrupted at any time and the HTTP client will
automatically retry failed requests a few times before giving up.

Currently, there is a limitation which only allows harvesting data up to the
last full day. Example: if the current date were Thu Apr 21 14:28:10 CEST 2016,
the harvester would request all data between the repository's earliest date and
2016-04-20 23:59:59.

To stream the harvested XML data to stdout run:

$ metha-cat http://export.arxiv.org/oai2

You can emit records based on datestamp as well:

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2

This will only stream records with a datestamp equal to or later than 2016-01-01.
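
Since the output is plain XML on stdout, it can be piped into other tools; for
instance, a rough record count (this assumes each record element starts on its
own line, which may not always hold):

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2 | grep -c '<record'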

To stream all data quickly, use find and zcat over the harvesting directory.
The compressed chunks can be located with find:

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz"
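
To actually decompress and concatenate the chunks, pipe the file list into
zcat (a sketch; it assumes the chunks are gzip-compressed, as the .gz suffix
above suggests):

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz" | xargs zcat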

Key metrics

Overview

  • Name and owner: miku/metha
  • Main programming language: Python
  • Programming languages: Makefile (7 languages in total)
  • Platform: Linux
  • License: GNU General Public License v3.0

Owner activity

  • Created: 2016-04-16 13:18:58
  • Last pushed: 2025-04-24 18:12:58
  • Last commit: 2025-04-24 19:06:41
  • Releases: 123
  • Latest release: v0.3.27
  • First release: v0.1.3

User engagement

  • Stars: 124
  • Watchers: 8
  • Forks: 14
  • Commits: 1.3k
  • Issues enabled?
  • Issues: 30 (14 open)
  • Pull requests: 9 (0 open, 1 closed)

Project settings

  • Wiki enabled?
  • Archived?
  • Is a fork?
  • Locked?
  • Is a mirror?
  • Private?