frontera

A scalable frontier for web crawlers

  • Owner: scrapinghub/frontera
  • Platform:
  • License: BSD 3-Clause "New" or "Revised" License
  • Category:
  • Topic:


Frontera


Overview

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, which together make it possible to build a large-scale online web crawler.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by
the crawler to decide which pages to visit next, and it can do so in a distributed manner.
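
To make the idea of a crawl frontier concrete, here is a minimal in-memory sketch of "store and prioritise links, then decide which pages to visit next". It does not use Frontera's API: the `SimpleFrontier` class, its method names, and the scoring convention (higher score is visited sooner) are all assumptions made for illustration.

```python
import heapq
import itertools


class SimpleFrontier:
    """Toy frontier (hypothetical, not Frontera's API)."""

    def __init__(self):
        self._heap = []                  # entries are (-score, order, url)
        self._seen = set()               # URLs already scheduled
        self._order = itertools.count()  # tie-breaker: earlier links first

    def add_links(self, links):
        """Schedule (url, score) pairs; a higher score is visited sooner."""
        for url, score in links:
            if url in self._seen:
                continue  # ignore links that were already scheduled
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, next(self._order), url))

    def next_batch(self, max_requests):
        """Return up to max_requests URLs, best score first."""
        batch = []
        while self._heap and len(batch) < max_requests:
            _, _, url = heapq.heappop(self._heap)
            batch.append(url)
        return batch


frontier = SimpleFrontier()
frontier.add_links([("http://example.com/a", 0.2),
                    ("http://example.com/b", 0.9),
                    ("http://example.com/a", 0.5),   # duplicate, ignored
                    ("http://example.com/c", 0.5)])
first_batch = frontier.next_batch(2)
```

A real frontier also has to persist its state and share it between processes, which is the part Frontera's backends and message buses are described as handling.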

Main features

  • Online operation: small request batches, with parsing done right after fetching.
  • Pluggable backend architecture: low-level backend access logic is separated from crawling strategy.
  • Two run modes: single process and distributed.
  • Built-in SQLAlchemy, Redis and HBase backends.
  • Built-in Apache Kafka and ZeroMQ message buses.
  • Built-in crawling strategies: breadth-first, depth-first, Discovery (with support for robots.txt and sitemaps).
  • Battle tested: our biggest deployment is 60 spiders/strategy workers delivering 50-60M documents daily for 45 days without downtime.
  • Transparent data flow, making it easy to integrate custom components using Kafka.
  • Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
  • Optional use of Scrapy for fetching and parsing.
  • 3-clause BSD license, allowing use in any commercial product.
  • Python 3 support.
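
The separation between backend access logic and crawling strategy listed above can be sketched as follows. This is an illustration only, not Frontera's actual classes: `MemoryQueueBackend`, `breadth_first_score`, `depth_first_score`, and `crawl_order` are hypothetical names, and scoring pages by link depth is an assumption chosen because it makes the two strategies easy to compare.

```python
import heapq
import itertools


class MemoryQueueBackend:
    """Hypothetical storage layer: it only stores and pops scored URLs."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # stable order among equal scores

    def schedule(self, url, score):
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def breadth_first_score(depth):
    """Shallow pages get higher scores, so they are visited first."""
    return 1.0 / (depth + 1)


def depth_first_score(depth):
    """Deep pages get higher scores, so the crawl dives before widening."""
    return float(depth)


def crawl_order(graph, seed, score_fn):
    """Visit a link graph; the strategy (score_fn) never touches storage."""
    backend = MemoryQueueBackend()
    backend.schedule(seed, score_fn(0))
    depth_of = {seed: 0}
    visited = []
    while True:
        url = backend.pop()
        if url is None:
            break
        visited.append(url)
        for link in graph.get(url, []):
            if link not in depth_of:
                depth_of[link] = depth_of[url] + 1
                backend.schedule(link, score_fn(depth_of[link]))
    return visited


# A tiny site: the root links to /a and /b, each with one child page.
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/b/1"]}
bfs = crawl_order(site, "/", breadth_first_score)
dfs = crawl_order(site, "/", depth_first_score)
```

Because the strategy is just a scoring function here, swapping breadth-first for depth-first changes the visit order without any change to the storage class, which is the point of keeping the two layers apart.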

Installation

$ pip install frontera

Documentation

Community

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and
pull requests.

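Key metrics

Overview
Name and owner: scrapinghub/frontera
Primary language: Python
Languages: Python (2 languages in total)
Platform:
License: BSD 3-Clause "New" or "Revised" License
Owner activity
Created: 2014-11-22 15:42:50
Last push: 2025-06-06 14:41:02
Last commit: 2025-06-06 16:40:42
Releases: 26
Latest release: v0.7.2
First release: v0.1.0 (released 2014-11-24 11:13:45)
User engagement
Stars: 1.3k
Watchers: 156
Forks: 217
Commits: 0.9k
Issues enabled?
Issues: 156
Open issues: 78
Pull requests: 175
Open pull requests: 17
Closed pull requests: 70
Project settings
Wiki enabled?
Archived?
Fork?
Locked?
Mirror?
Private?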