justext

A Go package that implements the JusText boilerplate removal algorithm

  • Owner: JalfResi/justext
  • License: MIT License

Project page of the original JusText algorithm: http://code.google.com/p/justext/

Install

go get github.com/JalfResi/justext

And import:

import "github.com/JalfResi/justext"

Usage

Supports all stoplist files available at http://code.google.com/p/justext/source/browse/#svn%2Ftrunk%2Fjustext%2Fstoplists

Justext expects valid HTML; it is your responsibility to ensure that valid HTML is passed to Justext. To make things easier,
I have written a cgo wrapper around libtidy, which you can find here: github.com/JalfResi/GoTidy
In the future, once exp/html is part of the standard packages, I will refactor JusText to accept only valid HTML documents/strings.

Justext uses the reader-writer idiom, allowing you to set up the reader with a common configuration and just pump out
articles to the writer.

Example usage:

// Create a justext reader from another reader
reader := justext.NewReader(os.Stdin)

// Configure the reader
reader.LengthLow = 70
reader.LengthHigh = 200
reader.Stoplist = stoplist // stoplist is a map[string]bool of stop words that you supply
reader.StopwordsLow = 0.3
reader.StopwordsHigh = 0.32
reader.MaxLinkDensity = 0.2
reader.MaxHeadingDistance = 200
reader.NoHeadings = false

// Read from the reader to generate a paragraph set
paragraphSet, err := reader.ReadAll()
if err != nil {
	log.Fatal(err)
}

// Create a writer from another writer
writer := justext.NewWriter(os.Stdout)
// Write the paragraph set to the writer
writer.WriteAll(paragraphSet)

Key metrics

Overview
Name and owner: JalfResi/justext
Primary language: Go
Languages: Go (2 languages)
License: MIT License

Owner activity
Created: 2012-02-04 19:54:15
Last push: 2022-11-06 20:08:34
Last commit: 2022-11-06 20:08:34
Releases: 0

User engagement
Stars: 109
Watchers: 4
Forks: 15
Commits: 83
Issues: 28
Open issues: 14
Pull requests: 6
Open pull requests: 0
Closed pull requests: 0