scrapy-mosquitera

Restrict crawl and scraping scope using matchers.

  • Owner: scrapinghub/scrapy-mosquitera
  • License: BSD 3-Clause "New" or "Revised" License


===============================================
scrapy-mosquitera - tools for filtered scraping
===============================================

.. image:: https://travis-ci.org/scrapinghub/scrapy-mosquitera.svg?branch=master
   :target: https://travis-ci.org/scrapinghub/scrapy-mosquitera

.. image:: https://img.shields.io/pypi/v/scrapy-mosquitera.svg?maxAge=2592000
   :target: https://pypi.python.org/pypi/scrapy-mosquitera

.. image:: https://img.shields.io/pypi/pyversions/scrapy-mosquitera.svg?maxAge=2592000

.. image:: https://img.shields.io/pypi/l/scrapy-mosquitera.svg?maxAge=2592000

.. epigraph::

   How can I scrape items off a site from the last five days?

   -- Scrapy User

That question started the development of scrapy-mosquitera, a tool to help
you restrict crawling and scraping scope using matchers.

Matchers are simple Python functions that report whether an element is valid
under certain restrictions.

The project's first goal was date matching, but you can write your own
matchers for your own crawling and scraping needs.
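To make the idea of a custom matcher concrete, here is a minimal sketch. The function ``price_matches`` and its parameters are hypothetical and not part of scrapy-mosquitera; it only illustrates the shape described above, a plain function that takes a value plus restrictions and answers yes or no.

```python
def price_matches(data, min_price=None, max_price=None):
    """Hypothetical matcher: is ``data`` within the given price bounds?

    Bounds left as None are not checked.
    """
    if data is None:
        return False
    price = float(data)
    if min_price is not None and price < min_price:
        return False
    if max_price is not None and price > max_price:
        return False
    return True


print(price_matches("19.99", max_price=20))  # True
print(price_matches(5, min_price=10))        # False
```

A spider could call such a function before yielding a request or an item, in the same way the date matcher is used.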

How it works
------------

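When the dates are available in the URLs, you can use the matcher function
directly in your code::

    from scrapy import Request
    from scrapy_mosquitera.matchers import date_matches

    # scrape_date_from_url is your own helper, not part of scrapy-mosquitera.
    date = scrape_date_from_url(url)

    # Inside a spider callback: only follow URLs dated within the last five days.
    if date_matches(data=date, after='5 days ago'):
        yield Request(url=url, callback=self.parse_item)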
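The helper ``scrape_date_from_url`` is left to the reader. The sketch below shows one way it might look, assuming URLs embed the date as ``/YYYY/MM/DD/``. The URL layout and the ``is_recent`` function are assumptions for illustration; ``is_recent`` is a standard-library stand-in for the date check and does not reproduce the exact behaviour of ``date_matches``.

```python
import re
from datetime import date, timedelta

# Assumed URL layout: .../YYYY/MM/DD/...
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def scrape_date_from_url(url):
    """Return the date embedded in ``url``, or None if there is none."""
    found = DATE_IN_URL.search(url)
    if found is None:
        return None
    year, month, day = (int(part) for part in found.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits that look like a date but are not one, e.g. month 13.
        return None


def is_recent(url, days=5, today=None):
    """Stand-in check: True if the URL's date is within the last ``days`` days."""
    today = today or date.today()
    found = scrape_date_from_url(url)
    return found is not None and found >= today - timedelta(days=days)


print(is_recent("http://example.com/2016/06/08/post", today=date(2016, 6, 10)))  # True
print(is_recent("http://example.com/2016/05/01/old", today=date(2016, 6, 10)))   # False
```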

When the date is only available once the items themselves are scraped,
scrapy-mosquitera provides a ``PaginationMixin`` to control the crawl
according to the scraped dates.
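The idea of letting scraped dates drive pagination can be sketched without any library. The code below is not the ``PaginationMixin`` API; ``crawl_pages`` and the data layout are hypothetical. It assumes listing pages run from newest to oldest, so a page with no matching items means later pages can be skipped.

```python
from datetime import date


def crawl_pages(pages, after):
    """Yield items dated on or after ``after``; stop at the first page with none."""
    for page in pages:
        matched = [item for item in page if item["date"] >= after]
        yield from matched
        if not matched:
            # Nothing on this page is in scope, so later (older) pages are skipped.
            break


pages = [
    [{"title": "a", "date": date(2016, 6, 8)}, {"title": "b", "date": date(2016, 6, 6)}],
    [{"title": "c", "date": date(2016, 6, 5)}, {"title": "d", "date": date(2016, 6, 1)}],
    [{"title": "e", "date": date(2016, 5, 20)}],
    [{"title": "f", "date": date(2016, 6, 9)}],  # never reached
]
titles = [item["title"] for item in crawl_pages(pages, after=date(2016, 6, 4))]
print(titles)  # ['a', 'b', 'c']
```

In a real spider the pages would be fetched lazily, so stopping early also saves the requests for the remaining pages.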

Head over to the rest of the documentation_ for more details.

.. _documentation: http://scrapy-mosquitera.readthedocs.io

Installation
------------

The quick way::

    pip install scrapy-mosquitera

Overview

  • Name and owner: scrapinghub/scrapy-mosquitera
  • Primary language: Python
  • Languages: Makefile (3 languages in total)
  • License: BSD 3-Clause "New" or "Revised" License
  • Releases: 2
  • Latest release: v0.1.1
  • First release: 0.1.0
  • Created: 2016-05-10 12:27:29
  • Pushed: 2016-06-08 20:59:24
  • Last commit: 2016-06-08 16:59:12
  • Stars: 25
  • Watchers: 6
  • Forks: 6
  • Commits: 34
  • Issues: 0 (0 open)
  • Pull requests: 0 (0 open, 0 closed)