anemone

Anemone web-spider framework

  • Owner: chriskite/anemone
  • License: MIT License

= Anemone

Anemone is a web spider framework that can spider a domain and collect useful
information about the pages it visits. It is versatile, allowing you to
write your own specialized spider tasks quickly and easily.

See http://anemone.rubyforge.org for more information.

== Features

  • Multi-threaded design for high performance
  • Tracks 301 HTTP redirects
  • Built-in BFS algorithm for determining page depth
  • Allows exclusion of URLs based on regular expressions
  • Choose the links to follow on each page with focus_crawl()
  • HTTPS support
  • Records response time for each page
  • CLI program can list all pages in a domain, calculate page depths, and more
  • Obeys robots.txt
  • In-memory or persistent storage of pages during crawl, using TokyoCabinet, SQLite3, MongoDB, or Redis
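
The breadth-first depth calculation and regex-based URL exclusion listed above can be illustrated without the gem. The following is a minimal sketch in plain Ruby, not Anemone's own implementation: the link graph +LINKS+ and the method +page_depths+ are hypothetical names invented for this example, and the "site" is an in-memory hash rather than real HTTP responses.

```ruby
# Hypothetical in-memory link graph: each URL maps to the URLs it links to.
LINKS = {
  '/'                 => ['/about', '/blog'],
  '/about'            => ['/'],
  '/blog'             => ['/blog/post-1', '/files/report.pdf'],
  '/blog/post-1'      => ['/about'],
  '/files/report.pdf' => []
}.freeze

# Breadth-first search from +start+; returns a hash of { url => depth }.
# Because BFS visits pages in order of distance, the first time a URL is
# seen is also its shortest distance (depth) from the start page.
# URLs matching any pattern in +skip+ are never visited.
def page_depths(links, start, skip: [])
  depths = { start => 0 }
  queue = [start]
  until queue.empty?
    url = queue.shift
    links.fetch(url, []).each do |target|
      next if depths.key?(target)                             # already reached
      next if skip.any? { |pattern| pattern.match?(target) }  # excluded URL
      depths[target] = depths[url] + 1
      queue << target
    end
  end
  depths
end

depths = page_depths(LINKS, '/', skip: [/\.pdf\z/])
depths.each { |url, depth| puts "#{depth} #{url}" }
```

In this sketch the PDF link is dropped before it is ever queued, so it never receives a depth. A real crawler would apply the same check before fetching a page.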

== Examples
See the scripts under the lib/anemone/cli directory for examples of several useful Anemone tasks.
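
As a rough illustration of what such a task might look like, here is a short sketch that lists every page it reaches along with its depth and recorded response time. It assumes the +anemone+ gem is installed and relies on <tt>Anemone.crawl</tt>, +skip_links_like+, +on_every_page+, and the <tt>:depth_limit</tt> and <tt>:obey_robots_txt</tt> options as recalled from the gem's public API, so check them against the version you have installed. The names +format_page_line+ and +list_pages+ are hypothetical helpers, not part of Anemone.

```ruby
# Hypothetical helper: formats one tab-separated report line for a page.
def format_page_line(url, depth, response_time)
  "#{depth}\t#{response_time}\t#{url}"
end

# Hypothetical task: crawl +start_url+ up to +max_depth+ links deep and
# print one line per page. Assumes the anemone gem is installed.
def list_pages(start_url, max_depth: 2)
  require 'anemone' # loaded lazily so the helper above works without the gem

  Anemone.crawl(start_url, depth_limit: max_depth, obey_robots_txt: true) do |anemone|
    # Do not follow links to common binary files.
    anemone.skip_links_like(/\.(pdf|zip|jpe?g|png)\z/i)

    anemone.on_every_page do |page|
      puts format_page_line(page.url, page.depth, page.response_time)
    end
  end
end

# Run the crawl only when a start URL is given on the command line.
list_pages(ARGV[0]) if __FILE__ == $PROGRAM_NAME && ARGV[0]
```

Saved under an arbitrary file name such as list_pages.rb, the script would be run as <tt>ruby list_pages.rb http://www.example.com/</tt>. Without an argument it does nothing.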

== Requirements

  • nokogiri
  • robots

== Development
Testing and developing this gem additionally requires:

  • rspec
  • fakeweb
  • tokyocabinet
  • kyotocabinet-ruby
  • mongo
  • redis
  • sqlite3

You will need KyotoCabinet, {Tokyo Cabinet}[http://fallabs.com/tokyocabinet/], {MongoDB}[http://www.mongodb.org/], and {Redis}[http://code.google.com/p/redis/] installed on your system, with the MongoDB and Redis servers running.

Main metrics

Overview
Name With Owner: chriskite/anemone
Primary Language: Ruby
Program language: Ruby (Language Count: 1)
License: MIT License
Owner activity
Created At: 2009-04-14 18:31:48
Pushed At: 2020-03-20 11:27:38
Last Commit At: 2012-05-30 14:32:32
Release Count: 6
Last Release Name: v0.7.2 (Posted on 2012-05-30 14:33:44)
First Release Name: v0.4.0 (Posted on 2010-04-08 20:36:57)
User engagement
Stargazers Count: 1.6k
Watchers Count: 62
Fork Count: 323
Commits Count: 197
Has Issues Enabled
Issues Count: 0
Issue Open Count: 0
Pull Requests Count: 2
Pull Requests Open Count: 20
Pull Requests Close Count: 24
Project settings
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private