python-boilerpipe

Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

Github stars Tracking Chart

python-boilerpipe

A python wrapper for Boilerpipe, an excellent Java library for boilerplate removal and fulltext extraction from HTML pages.

Configuration

Dependencies:

  • jpype
  • chardet

The boilerpipe jar files will get fetched and included automatically when building the package.

Installation

Checkout the code:

git clone https://github.com/misja/python-boilerpipe.git
cd python-boilerpipe

virtualenv

virtualenv env
source env/bin/activate
pip install -r requirements.txt
python setup.py install

Fedora

sudo dnf install -y python2-jpype
sudo python setup.py install

Usage

Be sure to have set JAVA_HOME properly since jpype depends on this setting.

The constructor takes a keyword argment extractor, being one of the available boilerpipe extractor types:

  • DefaultExtractor
  • ArticleExtractor
  • ArticleSentencesExtractor
  • KeepEverythingExtractor
  • KeepEverythingWithMinKWordsExtractor
  • LargestContentExtractor
  • NumWordsRulesExtractor
  • CanolaExtractor

If no extractor is passed the DefaultExtractor will be used by default. Additional keyword arguments are either html for HTML text or url.

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor', url=your_url)

Then, to extract relevant content:

extracted_text = extractor.getText()

extracted_html = extractor.getHTML()

For KeepEverythingWithMinKWordsExtractor we have to specify kMin parameter, which defaults to 1 for now:

extractor = Extractor(extractor='KeepEverythingWithMinKWordsExtractor', url=your_url, kMin=20)

Main metrics

Overview
Name With OwnerKotlinCraft/modern-java-and-kotlin-tutorials
Primary LanguageJava
Program languagePython (Language Count: 2)
Platform
License:
所有者活动
Created At2016-03-11 12:54:11
Pushed At2020-06-26 08:05:28
Last Commit At2020-06-26 10:05:21
Release Count0
用户参与
Stargazers Count205
Watchers Count33
Fork Count75
Commits Count52
Has Issues Enabled
Issues Count2
Issue Open Count0
Pull Requests Count2
Pull Requests Open Count0
Pull Requests Close Count0
项目设置
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private