github-mirror

Scripts to mirror Github in a cloudy fashion

  • Owner: gousiosg/github-mirror
  • License: BSD 2-Clause "Simplified" License

ghtorrent: Mirror and index data from the Github API

A library and a collection of scripts used to retrieve data from the Github API
and extract metadata in an SQL database, in a modular and scalable manner. The
scripts are distributed as a Gem (ghtorrent), but they can also be run by
checking out this repository.

GHTorrent can be used for a variety of purposes, such as:

  • Mirror the Github API event stream and follow links from events to actual data
    to gradually build a Github index
  • Create a queryable metadata database for a specific repository
  • Construct a data source for extracting process analytics (see for example those) for one or more repositories

Components

GHTorrent's components (which can be used individually) are:

  • APIClient: Knows how to query the Github API (both single entities and
    pages) and respect the API request limit. Can be configured to override the
    default IP address, in case of multihomed hosts.
  • Retriever: Knows how to retrieve specific Github entities (users, repositories, watchers) by name. Uses an optional persister to avoid
    retrieving data that have not changed.
  • Persister: A key/value store, which can be backed by a real key/value store,
    to store Github JSON replies and query them on request. The backing key/value
    store must support arbitrary queries to the stored JSON objects.
  • GHTorrent: Knows how to extract information from the data retrieved by
    the retriever in order to update an SQL database (see schema) with metadata.
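The interplay between the Retriever and an optional Persister can be illustrated with a minimal sketch. All class and method names below (NoopPersister, MemoryPersister, SketchRetriever, FAKE_API) are hypothetical stand-ins, not GHTorrent's actual API, and the sketch simplifies "data that have not changed" to "data already stored under the same key".

```ruby
# Hypothetical stand-ins; these are not GHTorrent's actual classes.

# A persister that stores nothing, comparable in spirit to a "noop" driver.
class NoopPersister
  def find(_collection, _key)
    nil
  end

  def store(_collection, _key, _doc)
    nil
  end
end

# An in-memory persister, standing in for a real key/value store.
class MemoryPersister
  def initialize
    @data = Hash.new { |hash, collection| hash[collection] = {} }
  end

  def find(collection, key)
    @data[collection][key]
  end

  def store(collection, key, doc)
    @data[collection][key] = doc
  end
end

# Retrieves entities by name, consulting the persister before the API.
class SketchRetriever
  attr_reader :api_calls

  # fetcher: any callable that takes a path and returns a Hash.
  def initialize(fetcher, persister = NoopPersister.new)
    @fetcher = fetcher
    @persister = persister
    @api_calls = 0
  end

  def user(login)
    cached = @persister.find(:users, login)
    return cached if cached

    @api_calls += 1
    doc = @fetcher.call("/users/#{login}")
    @persister.store(:users, login, doc)
    doc
  end
end

# A fake API client so the sketch runs offline.
FAKE_API = ->(path) { { 'login' => path.split('/').last, 'type' => 'User' } }
```

With a MemoryPersister, calling user('octocat') twice triggers one fetch; with the default NoopPersister, every call goes to the fetcher.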

Component Configuration

The Persister and GHTorrent components have configurable back ends:

  • Persister: Either uses MongoDB > 3.0 (mongo driver) or no persister (noop driver)
  • GHTorrent: GHTorrent is tested mainly with MySQL and SQLite, but can theoretically be used with any SQL database compatible with Sequel. Your mileage may vary.

For distributed mirroring you also need RabbitMQ >= 3.3.

Installation

1. Install GHTorrent

GHTorrent is written in Ruby (tested with Ruby > 2.0). To install it as a Gem, run gem install ghtorrent.

2. Install Your Preferred Database

Depending on which SQL database you want to use, install the appropriate
dependency gem (for example, mysql2 for MySQL or sqlite3 for SQLite).

Configuration

Copy config.yaml.tmpl to a file in your home directory.

All provided scripts accept the -c option, which takes the location of the
configuration file as a parameter.
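As a rough illustration of how a script might honor the -c option, the sketch below parses it with Ruby's standard OptionParser and loads the file as YAML. The helper names (config_path, load_config), the --config long form, and the default file name are assumptions made for this example; they are not taken from GHTorrent's own code.

```ruby
require 'optparse'
require 'yaml'

# Default location is an assumption: a config.yaml in the home directory.
DEFAULT_CONFIG = File.join(ENV.fetch('HOME', '.'), 'config.yaml')

# Returns the configuration path given on the command line via -c,
# or the default when the option is absent. Does not modify argv.
def config_path(argv, default = DEFAULT_CONFIG)
  path = default
  parser = OptionParser.new do |opts|
    opts.on('-c', '--config FILE', 'Location of the configuration file') do |file|
      path = file
    end
  end
  parser.parse(argv)
  path
end

# Reads the YAML file and returns its contents as a Hash
# (an empty Hash when the file is empty).
def load_config(argv)
  YAML.safe_load(File.read(config_path(argv))) || {}
end
```

A script would then call load_config(ARGV) once at startup and pass the resulting Hash to its components.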

You can find more information on how to set up a mirroring cluster of machines
to retrieve data in parallel on the Wiki.

Using GHTorrent

To mirror the event stream and capture all data:

  • ght-mirror-events.rb periodically polls Github's event
    queue (https://api.github.com/events), stores all new events in the
    configured persister, and posts them to the github exchange in
    RabbitMQ.

  • ght-data-retrieval.rb creates queues that route posted events to processor
    functions. The functions use the appropriate Github API call to retrieve the
    linked contents, extract metadata (for database storage), and store the
    retrieved data in the appropriate collection in the persister, to avoid
    duplicate API calls.
    Data in the SQL database contain pointers (the ext_ref_id field) to the
    "raw" data in the persister.
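The two-stage flow described above (store and publish new events, then route them to processor functions) can be sketched with in-memory stand-ins. EventMirror and EventRouter are hypothetical names; a Hash plays the role of the persister and an Array plays the role of the RabbitMQ exchange, so this shows only the shape of the pipeline, not GHTorrent's implementation. The 'id' and 'type' keys are the fields carried by Github API event objects.

```ruby
# Hypothetical sketch: in-memory stand-ins for the persister and the exchange.

# Stores events that have not been seen before and publishes them.
class EventMirror
  attr_reader :published

  def initialize
    @seen = {}       # event id => raw event (stand-in for the persister)
    @published = []  # [event type, event id] pairs (stand-in for the exchange)
  end

  # Ingests one polled batch and returns the number of new events.
  def ingest(events)
    count = 0
    events.each do |event|
      next if @seen.key?(event['id'])

      @seen[event['id']] = event
      @published << [event['type'], event['id']]
      count += 1
    end
    count
  end
end

# Routes published events to processor blocks keyed by event type.
class EventRouter
  def initialize
    @processors = {}
  end

  def on(type, &processor)
    @processors[type] = processor
  end

  # Returns the processor results; events without a processor are skipped.
  def dispatch(published)
    published.map do |type, id|
      processor = @processors[type]
      processor&.call(id)
    end.compact
  end
end
```

Polling the same batch twice publishes each event once, which is the property that keeps downstream processors from repeating API calls for events already handled.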

To retrieve data for a repository or user:

  • ght-retrieve-repo retrieves all data for a specific repository
  • ght-retrieve-user retrieves all data for a specific user

To perform maintenance:

  • ght-load loads selected events from the persister to the queue in order for
    the ght-data-retrieval script to reprocess them

Data

The code in this repository is used to power the data collection process of
the GHTorrent.org project.
You can find all data collected by the project on the
Downloads page.

There are two sets of data:

  • Raw events: Github's event stream. These
    are the roots for mirroring operations. The ght-data-retrieval crawler starts
    from an event and goes deep into the rabbit hole.
  • SQL dumps + Linked data: Data dumps from the SQL database and the corresponding MongoDB entities.

Bugs & Feature Requests

Please tell us about features you'd like or bugs you've discovered on our
Issue Tracker.

Patches, bug fixes, etc. are welcome. Please fork the repository and create
a pull request once your fix or feature is ready.

Citing GHTorrent in your Research

If you find GHTorrent and the accompanying datasets useful in your research,
please consider citing the following paper:

Georgios Gousios and Diomidis Spinellis, "GHTorrent: GitHub’s data from a firehose," in MSR '12: Proceedings of the 9th Working Conference on Mining Software Repositories, June 2–3, 2012. Zurich, Switzerland.

Authors

License

2-clause BSD

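Key Metrics

Overview

  • Name and owner: gousiosg/github-mirror
  • Primary language: Ruby
  • Languages: Ruby (3 languages in total)
  • License: BSD 2-Clause "Simplified" License

Owner Activity

  • Created: 2011-11-26 13:02:32
  • Last pushed: 2024-04-04 08:52:19
  • Last commit: 2024-04-04 08:52:19
  • Releases: 15
  • Latest release: v0.11 (released 2015-09-24 14:37:53)
  • First release: 0.1 (released 2012-05-16 00:04:57)

User Engagement

  • Stars: 564
  • Watchers: 26
  • Forks: 107
  • Commits: 0.9k
  • Issues: 66
  • Open issues: 28
  • Pull requests: 23
  • Open pull requests: 4
  • Closed pull requests: 12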