webpager

Paginating the web

  • Owner: scrapinghub/webpager

Webpager

A simple library that classifies whether an anchor on an HTML page is a pagination link.

Installation

Clone the repository, then install the package requirements
(the package requires lxml and scikit-learn)::

    $ pip install -r requirements.txt

Then install the package itself::

    $ python setup.py install

Usage

Fetch an HTML page from somewhere::

    >>> # Python 2 import; on Python 3 use: from urllib.request import urlopen
    >>> from urllib import urlopen
    >>> url = 'http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-Trattoria_Caffe_Monteverdi-Hong_Kong.html'
    >>> html = urlopen(url).read()

Load WebPager and classify the anchors on the page::

    >>> from webpager import WebPager
    >>> webpager = WebPager()
    >>> # paginate() yields (anchor, label) pairs; a truthy label marks a pagination link
    >>> for anchor, label in webpager.paginate(html, url):
    ...     if label:
    ...         print(anchor.get('href'))
    ...
    http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or10-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
    http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or40-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
    http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or10-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
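
The same link can appear more than once in the output above, since a page may repeat its pagination links. The following is a minimal sketch of how the results might be collected into a list of unique URLs. The helper name ``collect_pagination_urls`` is hypothetical and not part of webpager; it only assumes that each anchor exposes ``.get('href')`` as in the example, and that the label is truthy for pagination links.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def collect_pagination_urls(pairs, base_url):
    """Return unique absolute URLs of anchors labelled as pagination links.

    `pairs` is any iterable of (anchor, label) tuples, where `anchor`
    supports .get('href'). This helper is illustrative only and is not
    part of the webpager package.
    """
    seen = set()
    urls = []
    for anchor, label in pairs:
        href = anchor.get('href')
        # Skip anchors that are not pagination links or have no href.
        if not label or not href:
            continue
        # Resolve relative links; absolute links pass through unchanged.
        absolute = urljoin(base_url, href)
        # Keep the first occurrence only, preserving page order.
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls
```

Under those assumptions, it could be called as ``collect_pagination_urls(webpager.paginate(html, url), url)``.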

Training

See train.ipynb_ for more details.

.. _train.ipynb: http://nbviewer.ipython.org/github/scrapinghub/webpager/blob/master/train.ipynb

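Key metrics

Overview

  • Name and owner: scrapinghub/webpager
  • Primary language: C
  • Languages: Python (language count: 2)

Owner activity

  • Created: 2013-08-16 09:54:24
  • Last push: 2014-02-11 08:52:23
  • Last commit: 2014-02-11 16:52:09
  • Releases: 0

User engagement

  • Stars: 37
  • Watchers: 129
  • Forks: 12
  • Commits: 15
  • Issues: 1
  • Open issues: 1
  • Pull requests: 1
  • Open pull requests: 0
  • Closed pull requests: 0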