behemoth

Behemoth is an open-source platform for large-scale document analysis based on Apache Hadoop.

Behemoth is an open-source platform for large-scale document processing based on Apache Hadoop.

It consists of a simple annotation-based implementation of a document and a number of modules operating on these documents.
Behemoth aims to simplify the large-scale deployment of document analysers and to provide reusable modules for:

  • ingesting from common data sources (WARC, Nutch, etc.)
  • text processing (Tika, UIMA, GATE, Language Identification)
  • generating output for external tools (SOLR, Mahout)

Its modular architecture simplifies the development of custom annotators based on MapReduce.
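
To make the idea of an annotation-based document and a map-style annotator more concrete, here is a minimal, self-contained sketch in plain Java. It is not Behemoth's actual API: the names `AnnotatorSketch`, `Document`, `Annotation`, `Annotator` and `TokenAnnotator` are hypothetical and chosen for illustration only, and the Hadoop plumbing is left out entirely. The assumption is that an annotator receives one document at a time and returns it with extra stand-off annotations, which is the shape a map function would take.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: every type here is a hypothetical stand-in,
// not one of Behemoth's real classes.
public class AnnotatorSketch {

    /** A typed span over the document text, with free-form features. */
    static final class Annotation {
        final String type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset
        final Map<String, String> features = new HashMap<>();

        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** A document: its source URL, its text, and stand-off annotations. */
    static final class Document {
        final String url;
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        Document(String url, String text) {
            this.url = url;
            this.text = text;
        }

        /** Returns the text covered by an annotation. */
        String covered(Annotation a) {
            return text.substring(a.start, a.end);
        }
    }

    /** Map-style step: takes one document, returns it with annotations added. */
    interface Annotator {
        Document annotate(Document doc);
    }

    /** Toy annotator: marks every run of word characters as a "Token". */
    static final class TokenAnnotator implements Annotator {
        private static final Pattern WORD = Pattern.compile("\\w+");

        @Override
        public Document annotate(Document doc) {
            Matcher m = WORD.matcher(doc.text);
            while (m.find()) {
                Annotation a = new Annotation("Token", m.start(), m.end());
                a.features.put("length", String.valueOf(m.end() - m.start()));
                doc.annotations.add(a);
            }
            return doc;
        }
    }

    public static void main(String[] args) {
        Document doc = new Document("http://example.com/a", "Hadoop scales out");
        new TokenAnnotator().annotate(doc);
        for (Annotation a : doc.annotations) {
            System.out.println(a.type + " [" + a.start + "," + a.end + ") " + doc.covered(a));
        }
    }
}
```

Because each call to `annotate` depends only on the document it is given, a step of this shape could in principle be applied to every record of a large corpus in parallel, which is what makes the MapReduce model a natural fit for annotators.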

Note that Behemoth does not implement any NLP or Machine Learning components as such but serves as 'large-scale glueware' for existing resources. Being Hadoop-based, it inherits Hadoop's features, namely scalability, fault-tolerance and, most notably, the backing of a thriving open-source community.

Wiki: https://github.com/DigitalPebble/behemoth/wiki

Mailing list: http://groups.google.com/group/digitalpebble

StackOverflow: http://stackoverflow.com/questions/tagged/behemoth

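Key metrics

Overview
Name and owner: DigitalPebble/behemoth
Primary programming language: Java
Programming languages: Shell (number of languages: 3)
Platform:
License: Other
Owner activity
Created: 2010-06-16 00:39:09
Pushed: 2018-04-25 18:58:00
Last commit: 2018-04-25 18:57:53
Releases: 2
Latest release: behemoth-parent-1.1 (released 2014-02-10 18:49:12)
First release: behemoth-parent-1.0 (released 2012-11-09 18:09:18)
User engagement
Stars: 283
Watchers: 40
Forks: 59
Commits: 267
Issues enabled?
Issues: 42
Open issues: 12
Pull requests: 13
Open pull requests: 1
Closed pull requests: 7
Project settings
Wiki enabled?
Archived?
Is a fork?
Locked?
Is a mirror?
Is private?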