Sentences

A multilingual command line sentence tokenizer in Golang.


Sentences - A command line sentence tokenizer

This command line utility will convert a blob of text into a list of sentences.

Install

mac

brew tap neurosnap/sentences
brew install sentences

other

Or you can find the pre-built binaries on the GitHub releases page.

using golang

go get github.com/neurosnap/sentences
go install github.com/neurosnap/sentences/_cmd/sentences


Get it

go get github.com/neurosnap/sentences

Use it

package main

import (
    "fmt"
    "log"

    "github.com/neurosnap/sentences"
    "github.com/neurosnap/sentences/data"
)

func main() {
    text := `A perennial also-ran, Stallings won his seat when longtime lawmaker David Holmes
    died 11 days after the filing deadline. Suddenly, Stallings was a shoo-in, not
    the long shot. In short order, the Legislature attempted to pass a law allowing
    former U.S. Rep. Carolyn Cheeks Kilpatrick to file; Stallings challenged the
    law in court and won. Kilpatrick mounted a write-in campaign, but Stallings won.`

    // Compiling language specific data into a binary file can be accomplished
    // by using `make <lang>` and then loading the `json` data:
    b, err := data.Asset("data/english.json")
    if err != nil {
        log.Fatal(err)
    }

    // load the training data
    training, err := sentences.LoadTraining(b)
    if err != nil {
        log.Fatal(err)
    }

    // create the default sentence tokenizer
    tokenizer := sentences.NewSentenceTokenizer(training)

    // named "sents" so the variable does not shadow the sentences package
    sents := tokenizer.Tokenize(text)
    for _, s := range sents {
        fmt.Println(s.Text)
    }
}

English

This package attempts to fix some problems I noticed for English.

package main

import (
    "fmt"

    "github.com/neurosnap/sentences/english"
)

func main() {
    text := "Hi there. Does this really work?"

    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }

    sentences := tokenizer.Tokenize(text)
    for _, s := range sentences {
        fmt.Println(s.Text)
    }
}

Contributing

I need help maintaining this library. If you are interested in contributing
to this library, please start by looking at the golden-rules branch, which
tests the Golden Rules for English sentence tokenization created by the
Pragmatic Segmenter library.

Pick a particular failing test and submit an issue or PR for it.

I'm happy to help anyone willing to contribute.

Customizable

Sentences was built around composability; most major components of this package
can be extended.

Eager to make ad hoc changes but don't know how to start?
Have a look at github.com/neurosnap/sentences/english for a solid example.

Notice

I have not tested this tokenizer in any language other than English. By default
the command line utility loads English. I welcome anyone willing to test the
other languages to submit updates as needed.

A primary goal for this package is to be multilingual so I'm willing to help in
any way possible.

This library is a port of NLTK's Punkt tokenizer.

A Punkt Tokenizer

An unsupervised multilingual sentence boundary detection library for Go.
The Punkt system accomplishes this goal by training the tokenizer with text in
the given language. Once the likelihoods of abbreviations, collocations, and
sentence starters are determined, finding sentence boundaries becomes easier.

Many problems arise when tokenizing text into sentences, the primary issue
being abbreviations. The Punkt system attempts to determine whether a word
is an abbreviation, the end of a sentence, or both by training the system with
text in the given language. The Punkt system incorporates both token- and
type-based analysis of the text through two different phases of annotation.

Unsupervised multilingual sentence boundary detection

Performance

Using the Brown Corpus, which is annotated American English text, we compare
this package with other libraries across multiple programming languages.

| Library   | Avg Speed (s, 10 runs) | Accuracy (%) |
|-----------|------------------------|--------------|
| Sentences | 1.96                   | 98.95        |
| NLTK      | 5.22                   | 99.21        |

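Overview

Name and owner: neurosnap/sentences
Primary language: Go
Languages: Go (3 languages in total)
Platforms: Linux, Mac, Windows
License: MIT License
Releases: 16
Latest release: v1.1.2
First release: v1 (released 2015-11-23 19:42:09)
Created: 2015-08-07 01:08:20
Last pushed: 2024-02-28 04:28:03
Stars: 424
Watchers: 15
Forks: 38
Commits: 223
Issues: 17
Open issues: 5
Pull requests: 17
Open pull requests: 0
Closed pull requests: 3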