metha

Command line OAI harvester and client with built-in cache.

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a
low-barrier mechanism for repository interoperability. Data Providers are
repositories that expose structured metadata via OAI-PMH. Service Providers
then make OAI-PMH service requests to harvest that metadata. -- https://www.openarchives.org/pmh/
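
For illustration (not a metha command), an OAI-PMH request is a plain HTTP GET
with a verb parameter; for example, asking the arXiv endpoint used below to
identify itself:

$ curl "http://export.arxiv.org/oai2?verb=Identify"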

The metha command line tools can gather information on OAI-PMH endpoints and
harvest data incrementally. The goal of metha is to make it simple to get
access to data; its focus is not on managing it.

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

The metha tool has been developed for Project finc at
Leipzig University Library.

Why yet another OAI harvester?

  • I wanted to crawl Arxiv but found that existing tools would timeout.
  • Some harvesters would start to download all records anew if I interrupted
    a running harvest.
  • There are many OAI endpoints out there. It is a widely used protocol and
    somewhat worth knowing.
  • I wanted something simple for the command line that is also fast and
    robust. As implemented now, metha is relatively robust and more efficient
    than requesting all records one by one (there is one annoyance which will
    hopefully be fixed soon).

How it works

The functionality is spread across a few different executables:

  • metha-sync for harvesting
  • metha-cat for viewing
  • metha-id for gathering data about endpoints
  • metha-ls for inspecting the local cache
  • metha-files for listing the associated files for a harvest
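
As a quick sketch of how these fit together (the exact flags and output vary;
each tool prints its usage with -h), one might first look at an endpoint and
then at the local cache:

$ metha-id http://export.arxiv.org/oai2
$ metha-ls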

To harvest an endpoint in the default oai_dc format:

$ metha-sync http://export.arxiv.org/oai2
...

All downloaded files are written to a directory below a base directory. The base
directory is ~/.metha by default and can be adjusted with the METHA_DIR
environment variable.

When the -dir flag is set, only the directory corresponding to a harvest is printed.

$ metha-sync -dir http://export.arxiv.org/oai2
/home/miku/.metha/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
$ METHA_DIR=/tmp/harvest metha-sync -dir http://export.arxiv.org/oai2
/tmp/harvest/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
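
The directory name itself appears to be a plain base64 encoding of the
(possibly empty) set, the metadata format and the endpoint URL, joined with
"#". Assuming that, it can be inspected with standard tools:

$ basename $(metha-sync -dir http://export.arxiv.org/oai2) | base64 -d
#oai_dc#http://export.arxiv.org/oai2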

The harvesting can be interrupted at any time and the HTTP client will
automatically retry failed requests a few times before giving up.

Currently, there is a limitation which only allows harvesting data up to the
last full day. Example: if the current date were Thu Apr 21 14:28:10 CEST 2016,
the harvester would request all data between the repository's earliest date and
2016-04-20 23:59:59.

To stream the harvested XML data to stdout run:

$ metha-cat http://export.arxiv.org/oai2

You can emit records based on datestamp as well:

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2

This will only stream records with a datestamp equal to or later than 2016-01-01.
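
Since the output is plain XML on stdout, it can be piped into other tools; for
instance, a rough record count (this assumes each record element starts on its
own line, which may not always hold):

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2 | grep -c '<record'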

To stream all data quickly, use find and zcat over the harvesting directory.
The compressed chunks can be located with find:

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz"
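
To actually decompress and concatenate the chunks, pipe the file list into
zcat (a sketch; it assumes the chunks are gzip-compressed, as the .gz suffix
above suggests):

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz" | xargs zcat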

Key metrics

Overview

  • Name and owner: miku/metha
  • Main programming language: Python
  • Programming languages: Makefile (7 languages in total)
  • Platform: Linux
  • License: GNU General Public License v3.0

Owner activity

  • Created: 2016-04-16 13:18:58
  • Last pushed: 2025-04-24 18:12:58
  • Last commit: 2025-04-24 19:06:41
  • Releases: 123
  • Latest release: v0.3.27
  • First release: v0.1.3

User engagement

  • Stars: 124
  • Watchers: 8
  • Forks: 14
  • Commits: 1.3k
  • Issues enabled?
  • Issues: 30 (14 open)
  • Pull requests: 9 (0 open, 1 closed)

Project settings

  • Wiki enabled?
  • Archived?
  • Is a fork?
  • Locked?
  • Is a mirror?
  • Private?