webpager

Paginating the web

  • Owner: scrapinghub/webpager

Webpager

A simple library that classifies whether an anchor on an HTML page is a pagination link.

Installation

Clone the repository, then install the package requirements
(the package requires lxml and scikit-learn)::

    $ pip install -r requirements.txt

Then install the package itself::

    $ python setup.py install

Usage

Fetch an HTML page from somewhere::

    >>> try:
    ...     from urllib.request import urlopen  # Python 3
    ... except ImportError:
    ...     from urllib import urlopen  # Python 2
    >>> url = 'http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-Trattoria_Caffe_Monteverdi-Hong_Kong.html'
    >>> html = urlopen(url).read()

Load the web pager and classify the anchors::

    >>> from webpager import WebPager
    >>> webpager = WebPager()
    >>> # paginate() yields (anchor, label) pairs; a truthy label marks a pagination link
    >>> for anchor, label in webpager.paginate(html, url):
    ...     if label:
    ...         print(anchor.get('href'))

    http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or10-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
    http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or40-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
    http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or10-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
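
The same pagination link can be reported more than once, for example when a page repeats its pager at the top and the bottom. A minimal sketch of collecting the results into a list of unique, absolute URLs could look like the following. ``pagination_urls`` is a hypothetical helper and not part of webpager; it assumes Python 3 and relies only on the ``(anchor, label)`` pairs and ``anchor.get('href')`` shown in the usage example.

```python
from urllib.parse import urljoin


def pagination_urls(pairs, base_url):
    """Return unique absolute URLs of anchors labelled as pagination links.

    Hypothetical helper, not part of webpager. `pairs` is any iterable of
    (anchor, label) tuples where `anchor` exposes `.get('href')`
    (an lxml element does, and so does a plain dict).
    """
    seen = set()
    urls = []
    for anchor, label in pairs:
        href = anchor.get('href')
        if not label or not href:
            continue  # skip non-pagination anchors and anchors without an href
        absolute = urljoin(base_url, href)  # resolve relative links against the page URL
        if absolute not in seen:  # keep the first occurrence and preserve order
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

With the objects from the usage example, this would be called as ``pagination_urls(webpager.paginate(html, url), url)``.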

Training

See train.ipynb_ for more details.

.. _train.ipynb: http://nbviewer.ipython.org/github/scrapinghub/webpager/blob/master/train.ipynb

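Key metrics

Overview

  • Name and owner: scrapinghub/webpager
  • Primary language: C
  • Languages: Python (number of languages: 2)

Owner activity

  • Created: 2013-08-16 09:54:24
  • Last push: 2014-02-11 08:52:23
  • Last commit: 2014-02-11 16:52:09
  • Releases: 0

User engagement

  • Stars: 37
  • Watchers: 129
  • Forks: 12
  • Commits: 15
  • Issues: 1
  • Open issues: 1
  • Pull requests: 1
  • Open pull requests: 0
  • Closed pull requests: 0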