thulac4j

Chinese word segmentation tool: a Java implementation of THULAC.

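thulac4j is an efficient Java 8 implementation of THULAC, offering fast, accurate, and robust word segmentation. It supports:

  • Custom dictionaries
  • Traditional-to-Simplified Chinese conversion
  • Stop-word filtering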

Usage

To use thulac4j in your project, add the dependency (please use the latest version):

<dependency>
  <groupId>io.github.yizhiru</groupId>
  <artifactId>thulac4j</artifactId>
  <version>3.1.2</version>
</dependency>

thulac4j supports Chinese word segmentation and part-of-speech tagging. For example:

String sentence = "滔滔的流水,向着波士顿湾无声逝去";

// Word segmentation
List<String> words = Segmenter.segment(sentence);
// [滔滔, 的, 流水, ,, 向着, 波士顿湾, 无声, 逝去]

// Part-of-speech tagging; the two arguments are paths to the downloaded model files
POSTagger pos = new POSTagger("models/model_c_model.bin", "models/model_c_dat.bin");
List<SegItem> taggedWords = pos.tagging(sentence);
// [滔滔/a, 的/u, 流水/n, ,/w, 向着/p, 波士顿湾/ns, 无声/v, 逝去/v]
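
To make the idea of stop-word filtering concrete, here is a minimal plain-Java sketch that post-processes a segmented word list like the one shown above. It does not use thulac4j's own stop-word support; the class name `StopWordFilterSketch`, the method `removeStopWords`, and the stop-word set are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: plain-Java post-processing, not part of the thulac4j API.
final class StopWordFilterSketch {

    // Returns a new list with every word that is not in stopWords, in the original order.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Word list copied from the segmentation output shown in the usage example.
        List<String> words = Arrays.asList(
                "滔滔", "的", "流水", ",", "向着", "波士顿湾", "无声", "逝去");
        // Hypothetical stop-word set chosen for this example.
        Set<String> stopWords = new HashSet<>(Arrays.asList("的", ","));
        System.out.println(removeStopWords(words, stopWords));
    }
}
```

With this stop-word set, the particle 的 and the comma are dropped and the remaining six words keep their order.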

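The model data is large, so it is not bundled in the jar or the source tree. For trained-model downloads and further usage instructions, see the Wiki.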
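
Because the model files are downloaded separately, it may help to check that they are in place before constructing a tagger. The following is a small sketch using only the JDK; the class `ModelFileCheckSketch` and method `missingFiles` are hypothetical helpers, and the two paths are simply the ones used in the example above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a pre-flight check written with the JDK, not part of thulac4j.
final class ModelFileCheckSketch {

    // Returns the given paths that do not point to an existing regular file.
    static List<Path> missingFiles(String... paths) {
        List<Path> missing = new ArrayList<>();
        for (String p : paths) {
            Path path = Paths.get(p);
            if (!Files.isRegularFile(path)) {
                missing.add(path);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Paths taken from the usage example; adjust to wherever the models were saved.
        List<Path> missing = missingFiles(
                "models/model_c_model.bin", "models/model_c_dat.bin");
        if (missing.isEmpty()) {
            System.out.println("All model files found.");
        } else {
            System.out.println("Missing model files: " + missing);
        }
    }
}
```

Running the check first gives a clear message about which file is absent, rather than relying on whatever error the model loader raises.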

Finally, thanks to the THUNLP lab!

Main metrics

Overview
Name With Owner: yizhiru/thulac4j
Primary Language: Java
Program language: Java (Language Count: 1)
License: Apache License 2.0
Owner activity
Created At: 2017-03-03 01:00:21
Pushed At: 2021-04-12 06:10:51
Last Commit At: 2020-10-21 10:17:42
Release Count: 1
Last Release Name: 3.1.2
First Release Name: 3.1.2
User engagement
Stargazers Count: 86
Watchers Count: 10
Fork Count: 31
Commits Count: 33
Issues Count: 19
Issue Open Count: 0
Pull Requests Count: 2
Pull Requests Open Count: 0
Pull Requests Close Count: 0