scrapy-mosquitera

Restrict crawl and scraping scope using matchers.

  • Owner: scrapinghub/scrapy-mosquitera
  • License: BSD 3-Clause "New" or "Revised" License

===============================================
scrapy-mosquitera - tools for filtered scraping
===============================================

.. image:: https://travis-ci.org/scrapinghub/scrapy-mosquitera.svg?branch=master
   :target: https://travis-ci.org/scrapinghub/scrapy-mosquitera

.. image:: https://img.shields.io/pypi/v/scrapy-mosquitera.svg?maxAge=2592000
   :target: https://pypi.python.org/pypi/scrapy-mosquitera

.. image:: https://img.shields.io/pypi/pyversions/scrapy-mosquitera.svg?maxAge=2592000

.. image:: https://img.shields.io/pypi/l/scrapy-mosquitera.svg?maxAge=2592000

.. epigraph::

   How can I scrape items off a site from the last five days?

   -- Scrapy User

That question started the development of scrapy-mosquitera, a tool to help
you restrict crawling and scraping scope using matchers.

Matchers are simple Python functions that report whether an element is valid
under certain restrictions.

The project's first goal was date matching, but you can write your own
matchers to suit your crawling and scraping needs.
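
As a rough illustration of the idea, a custom matcher can be an ordinary
function that takes a scraped value plus some restrictions and returns a
boolean. The ``price_matches`` function below is hypothetical and is not part
of scrapy-mosquitera::

    def price_matches(data, minimum=None, maximum=None):
        """Return True if the price ``data`` lies within the given bounds.

        Hypothetical matcher for illustration only; it is not shipped
        with scrapy-mosquitera.
        """
        price = float(data)
        if minimum is not None and price < minimum:
            return False
        if maximum is not None and price > maximum:
            return False
        return True


    # Keep only the prices between 10 and 50 (bounds included).
    prices = ['9.99', '25', '50.00', '75.5']
    kept = [p for p in prices if price_matches(p, minimum=10, maximum=50)]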

How it works
============

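When the dates are available in the URLs, you can call the matcher function
directly in your code::

    from scrapy import Request
    from scrapy_mosquitera.matchers import date_matches

    # scrape_date_from_url is a helper you write yourself; it is not part
    # of scrapy-mosquitera.
    date = scrape_date_from_url(url)

    # Inside a spider callback: follow the URL only if its date is recent.
    if date_matches(data=date, after='5 days ago'):
        yield Request(url=url, callback=self.parse_item)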
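
The ``scrape_date_from_url`` helper above is left to you. A minimal sketch
follows, assuming the site embeds dates in its URLs as ``/YYYY/MM/DD/``; both
that layout and the ``datetime.date`` return type are assumptions, so adapt
them to your site and to whatever your matcher accepts::

    import re
    from datetime import date

    # Assumed URL layout: .../YYYY/MM/DD/...
    DATE_IN_URL = re.compile(r'/(\d{4})/(\d{2})/(\d{2})/')


    def scrape_date_from_url(url):
        """Return the date embedded in ``url``, or None if no valid date is found."""
        match = DATE_IN_URL.search(url)
        if match is None:
            return None
        year, month, day = (int(part) for part in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            # The digits matched the pattern but are not a real calendar date.
            return None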

When the date only becomes available once you scrape the items themselves,
scrapy-mosquitera provides a ``PaginationMixin`` that controls the crawl
according to the dates scraped.
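
The idea behind date-driven pagination can be sketched without Scrapy: keep
moving to the next listing page while the current page still yields at least
one item inside the date window, and stop otherwise. This is a stand-alone
illustration, not the ``PaginationMixin`` API; ``crawl_pages``, ``pages`` and
``cutoff`` are hypothetical names, and listing pages are assumed to be ordered
newest first::

    from datetime import date


    def crawl_pages(pages, cutoff):
        """Yield items dated on or after ``cutoff``, page by page.

        ``pages`` is an iterable of lists of ``(title, date)`` tuples.
        Iteration stops at the first page with no matching item.
        """
        for items in pages:
            recent = [item for item in items if item[1] >= cutoff]
            for item in recent:
                yield item
            if not recent:
                # Nothing on this page is recent enough, so later pages
                # (assumed to be older) are not visited.
                break


    pages = [
        [('a', date(2016, 6, 8)), ('b', date(2016, 6, 7))],
        [('c', date(2016, 6, 5)), ('d', date(2016, 6, 1))],
        [('e', date(2016, 5, 20))],
        [('f', date(2016, 6, 9))],  # never reached: the crawl stopped earlier
    ]
    found = list(crawl_pages(pages, cutoff=date(2016, 6, 3)))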

Head over to the rest of the documentation_ for more details.

.. _documentation: http://scrapy-mosquitera.readthedocs.io

Installation
============

The quick way::

    pip install scrapy-mosquitera

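Overview
========

  • Name and owner: scrapinghub/scrapy-mosquitera
  • Primary language: Python
  • Languages: Makefile (3 languages in total)
  • License: BSD 3-Clause "New" or "Revised" License
  • Releases: 2
  • Latest release: v0.1.1
  • First release: 0.1.0
  • Created: 2016-05-10 12:27:29
  • Last push: 2016-06-08 20:59:24
  • Last commit: 2016-06-08 16:59:12
  • Stars: 25
  • Watchers: 6
  • Forks: 6
  • Commits: 34
  • Issues: 0 (0 open)
  • Pull requests: 0 (0 open, 0 closed)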