metha

Command line OAI harvester and client with built-in cache.

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a
low-barrier mechanism for repository interoperability. Data Providers are
repositories that expose structured metadata via OAI-PMH. Service Providers
then make OAI-PMH service requests to harvest that metadata. -- https://www.openarchives.org/pmh/

The metha command line tools can gather information on OAI-PMH endpoints and
harvest data incrementally. The goal of metha is to make it simple to get
access to data; its focus is not to manage it.

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

The metha tool has been developed for Project finc at
Leipzig University Library.

Why yet another OAI harvester?

  • I wanted to crawl Arxiv but found that existing tools would time out.
  • Some harvesters would start to download all records anew if a running
    harvest was interrupted.
  • There are many OAI endpoints out there. It is a widely used protocol and
    somewhat worth knowing.
  • I wanted something simple for the command line; also fast and robust. As
    implemented now, metha is relatively robust and more efficient than
    requesting all records one by one (there is one annoyance which will
    hopefully be fixed soon).

How it works

The functionality is spread across a few different executables:

  • metha-sync for harvesting
  • metha-cat for viewing
  • metha-id for gathering data about endpoints
  • metha-ls for inspecting the local cache
  • metha-files for listing the associated files for a harvest
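
To get a first impression of an endpoint and of what is already cached
locally, metha-id and metha-ls can be used directly (a sketch; the exact
output depends on the endpoint and the metha version):

$ metha-id http://export.arxiv.org/oai2   # identify the endpoint, report supported formats and sets
$ metha-ls                                # list harvests in the local cache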

To harvest an endpoint in the default oai_dc format:

$ metha-sync http://export.arxiv.org/oai2
...
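
If the endpoint offers additional metadata formats, a different format can be
requested as well; recent versions expose this via a -format flag (a hedged
example, check metha-sync -h for the exact options of your version):

$ metha-sync -format marcxml http://export.arxiv.org/oai2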

All downloaded files are written to a directory below a base directory. The base
directory is ~/.metha by default and can be adjusted with the METHA_DIR
environment variable.

When the -dir flag is set, only the directory corresponding to a harvest is printed.

$ metha-sync -dir http://export.arxiv.org/oai2
/home/miku/.metha/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
$ METHA_DIR=/tmp/harvest metha-sync -dir http://export.arxiv.org/oai2
/tmp/harvest/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
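
The directory name encodes the metadata format and the endpoint URL. For the
example above, decoding the base64 basename makes this visible (a sketch
using GNU coreutils base64; the exact naming scheme may differ between
versions):

$ basename $(metha-sync -dir http://export.arxiv.org/oai2) | base64 --decode
#oai_dc#http://export.arxiv.org/oai2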

The harvesting can be interrupted at any time and the HTTP client will
automatically retry failed requests a few times before giving up.
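
Because harvesting is incremental, re-running the same command continues or
updates a harvest; only date windows that are not yet in the cache are
requested again (a hedged description of the caching behavior, no extra
flags required):

$ metha-sync http://export.arxiv.org/oai2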

Currently there is a limitation: data can only be harvested up to the end of
the previous day. Example: if the current date were Thu Apr 21 14:28:10 CEST
2016, the harvester would request all data between the repository's earliest
date and 2016-04-20 23:59:59.

To stream the harvested XML data to stdout run:

$ metha-cat http://export.arxiv.org/oai2

You can emit records based on datestamp as well:

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2

This will only stream records with a datestamp equal to or later than 2016-01-01.

To stream all data quickly, combine find and zcat over the harvest directory.
The compressed files belonging to a harvest can be listed with:

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz"
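
Piping the file list into zcat decompresses everything in one stream; a rough
record count is one possible follow-up (a sketch; the grep pattern assumes
unprefixed <record> elements in the stored responses):

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz" | xargs zcat
$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz" | xargs zcat | grep -c "<record>"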
