aile

Automatic Item List Extraction

  • Owner: scrapinghub/aile
  • License: MIT License


This repository is a temporary container for experiments in the automatic extraction of lists and tables from web pages.
At some later point I will merge the surviving algorithms into either scrapely
or portia.

I document my ideas and algorithm descriptions at readthedocs.

The current approach is based on the HTML code of the page, treated as a stream of HTML tags as processed by
scrapely. An alternative approach would be to also use the web page
rendering information (this script renders a tree
of bounding boxes for each element).
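To make the "stream of HTML tags" view concrete, here is a minimal sketch of turning a page into a flat sequence of tag events. It uses the stdlib `HTMLParser` as a stand-in; scrapely has its own parser, so treat this only as an illustration of the representation, not of aile's actual code.

```python
# Sketch: flatten an HTML fragment into the tag-event stream that the
# repeating-pattern algorithms operate on. Stdlib HTMLParser is used as
# a stand-in for scrapely's parser (an assumption for this example).
from html.parser import HTMLParser

class TagStream(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append('/' + tag)

parser = TagStream()
parser.feed('<ul><li><a>one</a></li><li><a>two</a></li></ul>')
print(parser.tags)
# → ['ul', 'li', 'a', '/a', '/li', 'li', 'a', '/a', '/li', '/ul']
```

Note how the two list items produce the same `li a /a /li` subsequence: it is this kind of repetition that the algorithms below try to detect.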

Installation

pip install -r requirements.txt
python setup.py develop

Running

If you want to get a feeling for how it works, there are two demo scripts included in the repo.

  • demo1.py
    Will annotate the HTML code of a web page, marking in red the lines that form part of the repeating item
    and prefixing each with the number of the field it occupies inside the item. The output is written to the file 'annotated.html'.

    python demo1.py https://news.ycombinator.com
    

    (screenshot: annotated HTML)

  • demo2.py
    Will label, color and draw the HTML tree so that repeating elements are easy to see. The output is interactive
    (requires PyQt4).

    python demo2.py https://news.ycombinator.com
    

    (screenshot: annotated tree)

Algorithms

We are trying to auto-detect repeating patterns in the tags, not necessarily made of li, tr or td tags.

Clustering trees with a measure of similarity

The idea is to compute the distance between all subtrees in the web page and run a clustering algorithm with this distance matrix.
For a web page of size N this can be achieved in O(N^2) time. The current algorithm actually computes a kernel and derives
the distance from it. The algorithm is based on:

Kernels for semi-structured data
Hisashi Kashima, Teruo Koyanagi

Once we have computed the distance between all subtrees of the web page, DBSCAN clustering is run on the distance matrix.
The clusters are then post-processed a little to produce the final repeating items.
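The kernel-to-distance-to-DBSCAN pipeline can be sketched as follows. The kernel here is a deliberately simplified stand-in (tag-sequence agreement), not the Kashima-Koyanagi kernel aile actually uses; the trees, parameters, and helper names are assumptions for the example only.

```python
# Sketch: cluster subtrees of a page with DBSCAN on a precomputed
# distance matrix derived from a kernel via d(a,b)^2 = k(a,a) + k(b,b) - 2 k(a,b).
# The toy kernel below stands in for aile's real tree kernel.
import numpy as np
from sklearn.cluster import DBSCAN

def tag_sequence(subtree):
    """Flatten a (tag, children) tree into a depth-first tag sequence."""
    tag, children = subtree
    seq = [tag]
    for child in children:
        seq.extend(tag_sequence(child))
    return seq

def kernel(a, b):
    """Toy kernel: number of positions where the two tag sequences agree."""
    return sum(x == y for x, y in zip(tag_sequence(a), tag_sequence(b)))

def distance_matrix(subtrees):
    """Turn the kernel matrix K into a distance matrix."""
    K = np.array([[kernel(a, b) for b in subtrees] for a in subtrees], float)
    sq = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K
    return np.sqrt(np.maximum(sq, 0))

# Three structurally identical list items plus one unrelated subtree.
li = ('li', [('a', []), ('span', [])])
subtrees = [li, li, li, ('div', [('p', [])])]
labels = DBSCAN(eps=0.5, min_samples=2, metric='precomputed').fit_predict(
    distance_matrix(subtrees))
print(labels)  # the three identical <li> subtrees share a cluster; the <div> is noise
```

Passing `metric='precomputed'` is what lets DBSCAN consume an arbitrary distance matrix instead of feature vectors, which is the key to reusing any tree kernel here.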

Markov models

The problem of detecting repeating patterns in streams is known as motif discovery, and most of the literature about it seems
to be published in the field of genetics. Inspired by this, there is a branch
exploring that direction (MEME and Profile HMM algorithms).
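To illustrate the HMM idea, here is a toy Viterbi decode that labels each tag in the stream as "item" or background. The states, transition and emission probabilities are hand-set assumptions for this example; aile's Profile HMM is trained with EM and has a different structure entirely.

```python
# Toy illustration: decode a tag stream into "item" vs "bg" (background)
# states with Viterbi. All probabilities below are made up for the demo;
# a real model would learn them from data.
import math

states = ['bg', 'item']
start = {'bg': 0.9, 'item': 0.1}
trans = {'bg': {'bg': 0.8, 'item': 0.2},
         'item': {'bg': 0.2, 'item': 0.8}}
emit = {'bg': {'div': 0.6, 'li': 0.1, 'a': 0.3},
        'item': {'div': 0.1, 'li': 0.5, 'a': 0.4}}

def viterbi(obs):
    """Most likely state sequence for the observed tag stream."""
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for tag in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][tag])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(['div', 'li', 'a', 'li', 'a', 'div']))
# → ['bg', 'item', 'item', 'item', 'item', 'bg']
```

The run of li/a tags gets tagged as the item region even though the model never saw an explicit boundary, which is the behavior the Profile HMM approach scales up to full pages.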

The Markov model approach has the following problems right now:

  • Requires several web pages for training, depending on the web page type
  • Training is performed using the EM algorithm, which requires several attempts before a good optimum is reached
  • The number of hidden states is hard to determine. Some heuristics are applied, but they only work partially

These problems are not insurmountable (I think) but fixing them requires a lot of work:

  • Precision could be improved using conditional random fields.
    These could also alleviate the need for training data.
  • Training can be run largely in parallel. This is already done using joblib
    on a single PC, but it could be further improved using a cluster of computers
  • There are some papers about hidden-state merging/splitting and even an
    infinite number of states
