justext

A Go package that implements the JusText boilerplate removal algorithm

  • Owner: JalfResi/justext
  • License: MIT License


A Go package that implements the JusText boilerplate removal algorithm (http://code.google.com/p/justext/)

Install

go get github.com/JalfResi/justext

And import:

import "github.com/JalfResi/justext"

Usage

Supports all stoplist files available at http://code.google.com/p/justext/source/browse/#svn%2Ftrunk%2Fjustext%2Fstoplists

Justext expects valid HTML; it is your responsibility to ensure that valid HTML is passed to Justext. To make things easier,
I have written a cgo wrapper around libtidy, which you can find here: github.com/JalfResi/GoTidy
In the future, once exp/html is part of the standard packages, I will refactor JusText to accept only valid HTML documents/strings.

Justext uses the reader-writer idiom, allowing you to set up the reader with a common configuration and just pump out
articles to the writer.

Example usage:

package main

import (
	"log"
	"os"

	"github.com/JalfResi/justext"
)

func main() {
	// Illustrative stoplist only; in practice, load one of the stoplist files
	stoplist := map[string]bool{"the": true, "and": true, "of": true}

	// Create a justext reader from another reader
	reader := justext.NewReader(os.Stdin)

	// Configure the reader
	reader.LengthLow = 70
	reader.LengthHigh = 200
	reader.Stoplist = stoplist // The stoplist map[string]bool
	reader.StopwordsLow = 0.3
	reader.StopwordsHigh = 0.32
	reader.MaxLinkDensity = 0.2
	reader.MaxHeadingDistance = 200
	reader.NoHeadings = false

	// Read from the reader to generate a paragraph set
	paragraphSet, err := reader.ReadAll()
	if err != nil {
		log.Fatal(err)
	}

	// Create a writer from another writer
	writer := justext.NewWriter(os.Stdout)
	// Write the paragraph set to the writer
	writer.WriteAll(paragraphSet)
}
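
To give a feel for what the threshold fields configured above control, here is a rough sketch of the context-free classification step as it is described for the original jusText algorithm. It is not taken from this package's source: the Config type and the classify and stopwordDensity functions are hypothetical names, and the sketch is simplified (it leaves out the copyright-symbol and select-element checks). The later context-sensitive stage, where MaxHeadingDistance and NoHeadings would apply, is not covered.

```go
package main

import (
	"fmt"
	"strings"
)

// Config is a hypothetical type mirroring the reader fields in the example.
type Config struct {
	LengthLow      int
	LengthHigh     int
	StopwordsLow   float64
	StopwordsHigh  float64
	MaxLinkDensity float64
	Stoplist       map[string]bool
}

// stopwordDensity returns the fraction of whitespace-separated words
// in text that appear in the stoplist.
func stopwordDensity(text string, stoplist map[string]bool) float64 {
	words := strings.Fields(text)
	if len(words) == 0 {
		return 0
	}
	n := 0
	for _, w := range words {
		if stoplist[strings.ToLower(w)] {
			n++
		}
	}
	return float64(n) / float64(len(words))
}

// classify labels one paragraph as "good", "neargood", "short" or "bad".
// linkChars is the number of characters of text that sit inside links.
func classify(text string, linkChars int, c Config) string {
	length := len([]rune(text))
	linkDensity := 0.0
	if length > 0 {
		linkDensity = float64(linkChars) / float64(length)
	}
	density := stopwordDensity(text, c.Stoplist)

	switch {
	case linkDensity > c.MaxLinkDensity:
		return "bad"
	case length < c.LengthLow:
		if linkChars > 0 {
			return "bad"
		}
		return "short"
	case density >= c.StopwordsHigh:
		if length > c.LengthHigh {
			return "good"
		}
		return "neargood"
	case density >= c.StopwordsLow:
		return "neargood"
	default:
		return "bad"
	}
}

func main() {
	// Small thresholds and a tiny stoplist, chosen only for illustration.
	cfg := Config{
		LengthLow:      10,
		LengthHigh:     30,
		StopwordsLow:   0.3,
		StopwordsHigh:  0.32,
		MaxLinkDensity: 0.2,
		Stoplist:       map[string]bool{"the": true, "and": true, "of": true, "a": true, "is": true},
	}
	fmt.Println(classify("Home", 0, cfg))
	fmt.Println(classify("the cat and the dog is a friend of the family", 0, cfg))
}
```

Read this way, LengthLow and LengthHigh bound how much text a paragraph needs, StopwordsLow and StopwordsHigh bound how much of it must look like running prose, and MaxLinkDensity rejects navigation-style blocks that are mostly links.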

Main metrics

Overview
Name With Owner: JalfResi/justext
Primary Language: Go
Program language: Go (Language Count: 2)
Platform
License: MIT License
Owner activity
Created At: 2012-02-04 19:54:15
Pushed At: 2022-11-06 20:08:34
Last Commit At: 2022-11-06 20:08:34
Release Count: 0
User engagement
Stargazers Count: 109
Watchers Count: 4
Fork Count: 15
Commits Count: 83
Has Issues Enabled
Issues Count: 28
Issue Open Count: 14
Pull Requests Count: 6
Pull Requests Open Count: 0
Pull Requests Close Count: 0
Project settings
Has Wiki Enabled
Is Archived
Is Fork
Is Locked
Is Mirror
Is Private