python-readability

Fast Python port of arc90's readability tool, updated to match the latest readability.js.

  • Owner: buriy/python-readability
  • License: Apache License 2.0

.. image:: https://travis-ci.org/buriy/python-readability.svg?branch=master
    :target: https://travis-ci.org/buriy/python-readability

python-readability

Given an HTML document, it extracts the main body text and cleans it up.

This is a Python port of a Ruby port of `arc90's readability project <http://lab.arc90.com/experiments/readability/>`__.

Installation

Installation is easy with pip; just run:

.. code-block:: bash

    $ pip install readability-lxml

Usage

.. code-block:: python

    >>> import requests
    >>> from readability import Document

    >>> response = requests.get('http://example.com')
    >>> doc = Document(response.text)
    >>> doc.title()
    'Example Domain'

    >>> doc.summary()
    """<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
    <p>This domain is established to be used for illustrative examples in documents. You may
    use this\n    domain in examples without prior coordination or asking for permission.</p>
    \n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
    \n</body>\n</div></body></html>"""
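
Since ``summary()`` returns HTML rather than plain text, a common follow-up step is to strip the remaining tags. The sketch below is one possible way to do that using only the standard library; it is not part of python-readability. The names ``TextExtractor`` and ``html_to_text`` are hypothetical helpers introduced for illustration, and ``summary_html`` is a shortened copy of the output shown above.

.. code-block:: python

    from html.parser import HTMLParser


    class TextExtractor(HTMLParser):
        """Hypothetical helper: collect the visible text of an HTML fragment."""

        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            # Collapse runs of whitespace and skip whitespace-only text nodes.
            text = " ".join(data.split())
            if text:
                self.chunks.append(text)


    def html_to_text(html):
        """Return the text content of ``html``, one line per text node."""
        parser = TextExtractor()
        parser.feed(html)
        parser.close()
        return "\n".join(parser.chunks)


    # Assumed sample input: a shortened form of the summary() output above.
    summary_html = (
        '<html><body><div><body id="readabilityBody">\n<div>\n'
        "    <h1>Example Domain</h1>\n"
        "    <p>This domain is established to be used for illustrative examples "
        "in documents.</p>\n"
        '    <p><a href="http://www.iana.org/domains/example">'
        "More information...</a></p>\n"
        "</div>\n</body>\n</div></body></html>"
    )

    print(html_to_text(summary_html))

In real use, the string returned by ``doc.summary()`` would be passed to ``html_to_text`` in place of the sample. Links and headings are reduced to their text, so this approach suits indexing or previews more than faithful rendering.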

Change Log

  • 0.8beta Replaced XHTML output with HTML5 output in summary() call.
  • 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
  • 0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
  • 0.6 Finally, a release that supports Python versions 2.6, 2.7, and 3.3 - 3.6
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.4 Added video loading and allowed more images per paragraph
  • 0.3 Added Document.encoding, positive_keywords and negative_keywords

Licensing

This code is licensed under the `Apache License 2.0 <http://www.apache.org/licenses/LICENSE-2.0>`__.

Thanks to

  • Latest `readability.js <https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js>`__
  • Ruby port by starrhorne and iterationlabs
  • `Python port <https://github.com/gfxmonk/python-readability>`__ by gfxmonk
  • `Decruft effort <http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/>`__ to move to lxml
  • "BR to P" fix from readability.js, which improves quality for smaller texts
  • Contributions from GitHub users

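Key metrics

Overview

  • Name and owner: buriy/python-readability
  • Primary language: Python
  • Languages: Makefile (language count: 2)
  • License: Apache License 2.0

Owner activity

  • Created: 2011-05-02 18:51:48
  • Last pushed: 2025-05-03 21:14:27
  • Last commit: 2025-05-04 04:09:54
  • Releases: 16
  • Latest release: 0.8.4.1
  • First release: 0.2

User engagement

  • Stars: 2.8k
  • Watchers: 96
  • Forks: 354
  • Commits: 265
  • Issues: 106
  • Open issues: 33
  • Pull requests: 56
  • Open pull requests: 1
  • Closed pull requests: 29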