anemone

Anemone web-spider framework

  • Owner: chriskite/anemone
  • License: MIT License


= Anemone

Anemone is a web spider framework that can spider a domain and collect useful
information about the pages it visits. It is versatile, allowing you to
write your own specialized spider tasks quickly and easily.

See http://anemone.rubyforge.org for more information.

== Features

  • Multi-threaded design for high performance
  • Tracks 301 HTTP redirects
  • Built-in BFS algorithm for determining page depth
  • Allows exclusion of URLs based on regular expressions
  • Choose the links to follow on each page with focus_crawl()
  • HTTPS support
  • Records response time for each page
  • CLI program can list all pages in a domain, calculate page depths, and more
  • Obeys robots.txt
  • In-memory or persistent storage of pages during crawl, using TokyoCabinet, SQLite3, MongoDB, or Redis
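To illustrate two of the features above, page depth via breadth-first search and URL exclusion by regular expression, here is a minimal sketch in plain Ruby. It is not Anemone's implementation: the method name page_depths, the in-memory link graph, and the example paths are all assumptions made up for illustration, and no HTTP requests are made.

```ruby
# Breadth-first traversal over an in-memory link graph (hypothetical data,
# standing in for pages a real spider would fetch).
# Returns a hash of { path => depth }, where depth is the minimum number of
# links that must be followed from `start` to reach that path.
# Paths matching any regexp in `skip_patterns` are never visited.
def page_depths(links, start, skip_patterns = [])
  depths = { start => 0 }
  queue = [start]

  until queue.empty?
    page = queue.shift
    links.fetch(page, []).each do |target|
      next if depths.key?(target)                               # already seen
      next if skip_patterns.any? { |pattern| target =~ pattern } # excluded
      depths[target] = depths[page] + 1
      queue << target
    end
  end

  depths
end

# Hypothetical site: page path => outgoing link paths.
site = {
  '/'            => ['/about', '/blog', '/login'],
  '/about'       => ['/', '/team'],
  '/blog'        => ['/blog/post-1', '/about'],
  '/blog/post-1' => ['/team'],
  '/login'       => ['/admin']
}

p page_depths(site, '/', [%r{\A/login}])
```

Because the queue is processed in first-in, first-out order, each page is assigned a depth the first time it is reached, which is also its shortest distance from the start page. Excluding /login in this sketch also keeps /admin out of the result, since nothing else links to it.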

== Examples
See the scripts under the lib/anemone/cli directory for examples of several useful Anemone tasks.
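For a rough idea of what a small task can look like, the sketch below prints the depth, response time, and URL of each page visited. It assumes the API of the 0.7 series (Anemone.crawl, skip_links_like, focus_crawl, on_every_page, and the :obey_robots_txt and :depth_limit options); check the RDoc for your installed version. The helper names report_line and crawl_report, the skip patterns, and the limit of ten links per page are hypothetical choices for illustration.

```ruby
# Hypothetical helper: formats one crawl record as a tab-separated line.
# A missing response time (for example after a failed fetch) is shown as 0.
def report_line(url, depth, response_time_ms)
  "#{depth}\t#{response_time_ms.to_i}ms\t#{url}"
end

# Hypothetical task: crawl start_url and write one line per visited page.
def crawl_report(start_url, out = $stdout)
  # Loaded here so that report_line can be used without the gem installed.
  require 'anemone'

  Anemone.crawl(start_url, :obey_robots_txt => true, :depth_limit => 2) do |anemone|
    # Do not follow links whose path matches these patterns.
    anemone.skip_links_like(/\.pdf$/i, %r{/login})

    # Follow at most the first ten links found on each page.
    anemone.focus_crawl { |page| page.links.first(10) }

    anemone.on_every_page do |page|
      out.puts report_line(page.url, page.depth, page.response_time)
    end
  end
end
```

Calling crawl_report("http://www.example.com/") (a placeholder URL) would start the crawl; nothing is fetched until that call is made.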

== Requirements

  • nokogiri
  • robots

== Development
To test and develop this gem, additional requirements are:

  • rspec
  • fakeweb
  • tokyocabinet
  • kyotocabinet-ruby
  • mongo
  • redis
  • sqlite3

You will need Kyoto Cabinet and {Tokyo Cabinet}[http://fallabs.com/tokyocabinet/] installed on your system, and {MongoDB}[http://www.mongodb.org/] and {Redis}[http://code.google.com/p/redis/] installed and running.

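Key metrics

Overview

  • Name and owner: chriskite/anemone
  • Primary language: Ruby
  • Languages: Ruby (language count: 1)
  • License: MIT License

Owner activity

  • Created: 2009-04-14 18:31:48
  • Last pushed: 2020-03-20 11:27:38
  • Last commit: 2012-05-30 14:32:32
  • Releases: 6
  • Latest release: v0.7.2 (released 2012-05-30 14:33:44)
  • First release: v0.4.0 (released 2010-04-08 20:36:57)

User engagement

  • Stars: 1.6k
  • Watchers: 62
  • Forks: 323
  • Commits: 197
  • Issues: 0
  • Open issues: 0
  • Pull requests: 2
  • Open pull requests: 20
  • Closed pull requests: 24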