justext

A Go package that implements the JusText boilerplate removal algorithm

  • Owner: JalfResi/justext
  • License: MIT License


The original JusText project: http://code.google.com/p/justext/

Install

go get github.com/JalfResi/justext

And import:

import "github.com/JalfResi/justext"

Usage

Supports all stoplist files available at http://code.google.com/p/justext/source/browse/#svn%2Ftrunk%2Fjustext%2Fstoplists

Justext expects valid HTML; it is your responsibility to ensure that valid HTML is passed to Justext. To make things easier,
I have written a CGO wrapper around libtidy, which you can find here: github.com/JalfResi/GoTidy
In the future, once exp/html is part of the standard packages, I will refactor JusText to accept only valid HTML documents/strings.

Justext uses the reader-writer idiom, allowing you to set up the reader with a common configuration and just pump out
articles to the writer.

Example usage:

// Create a justext reader from another reader
reader := justext.NewReader(os.Stdin)

// Configure the reader
reader.LengthLow = 70
reader.LengthHigh = 200
reader.Stoplist = stoplist // The stoplist map[string]bool
reader.StopwordsLow = 0.3
reader.StopwordsHigh = 0.32
reader.MaxLinkDensity = 0.2
reader.MaxHeadingDistance = 200
reader.NoHeadings = false

// Read from the reader to generate a paragraph set
// (this assumes the second return value is an error)
paragraphSet, err := reader.ReadAll()
if err != nil {
	log.Fatal(err)
}

// Create a writer from another writer
writer := justext.NewWriter(os.Stdout)
// Write the paragraph set to the writer
writer.WriteAll(paragraphSet)

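Key metrics

Overview
  • Name and owner: JalfResi/justext
  • Primary language: Go
  • Languages: Go (2 languages in total)
  • License: MIT License

Owner activity
  • Created: 2012-02-04 19:54:15
  • Last push: 2022-11-06 20:08:34
  • Last commit: 2022-11-06 20:08:34
  • Releases: 0

User engagement
  • Stars: 109
  • Watchers: 4
  • Forks: 15
  • Commits: 83
  • Issues: 28
  • Open issues: 14
  • Pull requests: 6
  • Open pull requests: 0
  • Closed pull requests: 0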