crawl-anywhere

Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.

  • Owner: bejean/crawl-anywhere
  • License: Apache License 2.0

Crawl-Anywhere

April 2013: starting with version 4.0, Crawl-Anywhere is an open-source project. The current version is 4.0.0.

Stable version 3.0.x is still available at http://www.crawl-anywhere.com/

Introduction

Crawl-Anywhere is primarily a web crawler, but it also includes all the components needed to build a vertical search engine.

Crawl Anywhere includes :

Project home page : http://www.crawl-anywhere.com/

A web crawler is a program that discovers and reads all the HTML pages and documents (HTML, PDF, Office, ...) on a web site, for example to index that data and build a search engine (like Google). Wikipedia provides a good description of what a web crawler is: http://en.wikipedia.org/wiki/Web_crawler.
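
To make the "discovers and reads" idea concrete, here is a minimal sketch of a breadth-first crawl in Python. It is an illustration only and not how Crawl-Anywhere itself is implemented. The names `LinkExtractor`, `crawl`, `fetch`, and `SITE` are hypothetical, and the three-page site is held in memory (an assumption made so the example runs without network access).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    fetch is a caller-supplied function (hypothetical): url -> HTML string,
    or None when the page cannot be retrieved.
    Returns a dict mapping each visited URL to its HTML content.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site held in memory, so no network access is needed.
SITE = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="http://other.test/">ext</a>',
    "http://example.test/a.html": '<a href="b.html#top">B</a> <a href="/">home</a>',
    "http://example.test/b.html": "<p>No links here.</p>",
}

pages = crawl("http://example.test/", SITE.get)
print(sorted(pages))
```

The crawl stays on the starting host and skips the external link. The returned pages are what a later stage would hand to a document processing pipeline and an indexer. A real crawler would also need to handle robots.txt, politeness delays, and non-HTML documents, all of which this sketch leaves out.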

Support

Build distribution

Pre-requisites :

  • Maven 3.0.0 or later
  • Oracle Java 7 or later

Steps :

Installation

Pre-requisites :

  • Oracle Java 7 or later
  • Apache 2.0 or later
  • PHP 5.2.x, 5.3.x, or 5.4.x
  • MongoDB 2.2 or later (64-bit)
  • Solr 4.3.0 or later (configuration files are provided for Solr 4.3.0 and 4.10.0)

Steps :

Getting Started

See the User Manual at http://www.crawl-anywhere.com/getting-started/

History

  • release 4.0.0-alpha-1: April 28, 2013
  • release 4.0.0-alpha-2: May 22, 2013
  • release 4.0.0-alpha-3: June 21, 2013
  • release 4.0.0-alpha-4: June 23, 2013
  • release 4.0.0-beta-1: August 6, 2013
  • release 4.0.0-release-candidate: October 20, 2013
  • release 4.0.0 final: December 1, 2014

Main metrics

Overview

  • Name with owner: bejean/crawl-anywhere
  • Primary language: PHP
  • Program language: Shell (language count: 6)
  • License: Apache License 2.0

Owner activity

  • Created at: 2013-01-28 10:21:11
  • Pushed at: 2017-07-01 17:59:18
  • Last commit at: 2015-01-28 16:18:56
  • Release count: 7
  • Last release name: 4.0.0 (posted on 2014-11-30 22:59:54)
  • First release name: 4.0.0-alpha-1 (posted on 2013-04-27 19:11:09)

User engagement

  • Stargazers count: 95
  • Watchers count: 23
  • Fork count: 37
  • Commits count: 218
  • Issues count: 90
  • Open issues count: 36
  • Pull requests count: 0
  • Open pull requests count: 2
  • Closed pull requests count: 1