python-readability

Fast Python port of arc90's readability tool, updated to match the latest readability.js.

  • Owner: buriy/python-readability
  • License: Apache License 2.0

.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
    :target: https://travis-ci.org/buriy/python-readability

python-readability

Given an HTML document, it pulls out the main body text and cleans it up.

This is a Python port of a Ruby port of `arc90's readability project <http://lab.arc90.com/experiments/readability/>`__.

Installation

Install it with pip:

.. code-block:: bash

    $ pip install readability-lxml

Usage

.. code-block:: python

    >>> import requests
    >>> from readability import Document

    >>> response = requests.get('http://example.com')
    >>> doc = Document(response.text)
    >>> doc.title()
    'Example Domain'

    >>> doc.summary()
    """<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
    <p>This domain is established to be used for illustrative examples in documents. You may
    use this\n    domain in examples without prior coordination or asking for permission.</p>
    \n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
    \n</body>\n</div></body></html>"""

Change Log

  • 0.8beta Replaced XHTML output with HTML5 output in summary() call.
  • 0.7.1 Added support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
  • 0.7 Improved handling of HTML5 tags. Fixed stripping of unwanted HTML nodes (previously only the first matching node was removed).
  • 0.6 Finally, a release that supports Python versions 2.6, 2.7 and 3.3 to 3.6.
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4.
  • 0.4 Added video loading and allowed more images per paragraph.
  • 0.3 Added Document.encoding, positive_keywords and negative_keywords.

Licensing

This code is licensed under the `Apache License 2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__.

Thanks to

  • `Latest readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
  • Ruby port by starrhorne and iterationlabs
  • `Python port <https://github.com/gfxmonk/python-readability>`__ by gfxmonk
  • `Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>`__ to move to lxml
  • "BR to P" fix from readability.js, which improves quality for smaller texts
  • Contributions from GitHub users.

Main metrics

Overview

  • Name with owner: buriy/python-readability
  • Primary language: Python
  • Program language: Makefile (language count: 2)
  • License: Apache License 2.0

Owner activity

  • Created at: 2011-05-02 18:51:48
  • Pushed at: 2025-05-03 21:14:27
  • Last commit at: 2025-05-04 04:09:54
  • Release count: 16
  • Last release name: 0.8.4.1
  • First release name: 0.2

User engagement

  • Stargazers count: 2.8k
  • Watchers count: 96
  • Fork count: 354
  • Commits count: 265
  • Issues count: 106
  • Open issues count: 33
  • Pull requests count: 56
  • Open pull requests count: 1
  • Closed pull requests count: 29