kagome

Self-contained Japanese Morphological Analyzer written in pure Go


Kagome Japanese Morphological Analyzer

Kagome is an open source Japanese morphological analyzer written in pure Go.
The MeCab-IPADIC and UniDic (unidic-mecab) dictionaries/statistical models are packaged in the Kagome binary.


% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS

Install

% go get -u github.com/ikawaha/kagome/...

Usage

$ kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome
usage: kagome <command>
The commands are:
   [tokenize] - command line tokenize (*default)
   server - run tokenize server
   lattice - lattice viewer

tokenize [-file input_file] [-dic dic_file] [-udic userdic_file] [-sysdic (ipa, uni)] [-mode (normal, search, extended)]
  -dic string
       dic
  -file string
       input file
  -mode string
       tokenize mode (normal, search, extended) (default "normal")
  -sysdic string
       system dic type (ipa, uni) (default "ipa")
  -udic string
       user dic

Command line mode

$ go run cmd/kagome/main.go -h

or

$ go run cmd/kagome/main.go tokenize -h
Usage of tokenize:
  -dic string
       dic
  -file string
       input file
  -mode string
       tokenize mode (normal, search, extended) (default "normal")
  -sysdic string
       system dic type (ipa, uni) (default "ipa")
  -udic string
       user dic

Server mode

$ go run cmd/kagome/main.go server -h
Usage of server:
  -http string
        HTTP service address (default ":6060")
  -sysdic string
       system dic type (ipa, uni) (default "ipa")
  -udic string
        user dictionary

Like Kuromoji, Kagome provides segmentation modes for search.

  • Normal: Regular segmentation
  • Search: Uses a heuristic to do additional segmentation useful for search
  • Extended: Similar to search mode, but also splits unknown words into unigrams

| Untokenized | Normal | Search | Extended |
|:-------|:---------|:---------|:---------|
| 関西国際空港 | 関西国際空港 | 関西 国際 空港 | 関西 国際 空港 |
| 日本経済新聞 | 日本経済新聞 | 日本 経済 新聞 | 日本 経済 新聞 |
| シニアソフトウェアエンジニア | シニアソフトウェアエンジニア | シニア ソフトウェア エンジニア | シニア ソフトウェア エンジニア |
| デジカメを買った | デジカメ を 買っ た | デジカメ を 買っ た | デ ジ カ メ を 買っ た |

HTTP service
Web API
$ kagome server -http=":8080" &
$ curl -XPUT localhost:8080/a -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .
{
  "status": true,
  "tokens": [
    {
      "id": 36163,
      "start": 0,
      "end": 3,
      "surface": "すもも",
      "class": "KNOWN",
      "features": [
        "名詞",
        "一般",
        "*",
        "*",
        "*",
        "*",
        "すもも",
        "スモモ",
        "スモモ"
      ]
    },
    {
      "id": 73244,
      "start": 3,
      "end": 4,
      "surface": "も",
      "class": "KNOWN",
      "features": [
        "助詞",
        "係助詞",
        "*",
        "*",
        "*",
        "*",
        "も",
        "モ",
        "モ"
      ]
    },
    {
      "id": 74989,
      "start": 4,
      "end": 6,
      "surface": "もも",
      "class": "KNOWN",
      "features": [
        "名詞",
        "一般",
        "*",
        "*",
        "*",
        "*",
        "もも",
        "モモ",
        "モモ"
      ]
    },
    {
      "id": 73244,
      "start": 6,
      "end": 7,
      "surface": "も",
      "class": "KNOWN",
      "features": [
        "助詞",
        "係助詞",
        "*",
        "*",
        "*",
        "*",
        "も",
        "モ",
        "モ"
      ]
    },
    {
      "id": 74989,
      "start": 7,
      "end": 9,
      "surface": "もも",
      "class": "KNOWN",
      "features": [
        "名詞",
        "一般",
        "*",
        "*",
        "*",
        "*",
        "もも",
        "モモ",
        "モモ"
      ]
    },
    {
      "id": 55829,
      "start": 9,
      "end": 10,
      "surface": "の",
      "class": "KNOWN",
      "features": [
        "助詞",
        "連体化",
        "*",
        "*",
        "*",
        "*",
        "の",
        "ノ",
        "ノ"
      ]
    },
    {
      "id": 8024,
      "start": 10,
      "end": 12,
      "surface": "うち",
      "class": "KNOWN",
      "features": [
        "名詞",
        "非自立",
        "副詞可能",
        "*",
        "*",
        "*",
        "うち",
        "ウチ",
        "ウチ"
      ]
    }
  ]
}
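
A Go client can decode this response with a pair of small structs. The following is a minimal sketch based only on the JSON shape shown above; the type names `Token` and `TokenizeResponse` and the helper `decodeResponse` are hypothetical and are not part of Kagome's API. The embedded body is a shortened copy of the response (first two tokens only).

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Token mirrors one element of the "tokens" array in the response above.
// (Hypothetical type, inferred from the JSON shape.)
type Token struct {
	ID       int      `json:"id"`
	Start    int      `json:"start"`
	End      int      `json:"end"`
	Surface  string   `json:"surface"`
	Class    string   `json:"class"`
	Features []string `json:"features"`
}

// TokenizeResponse mirrors the top-level response object.
type TokenizeResponse struct {
	Status bool    `json:"status"`
	Tokens []Token `json:"tokens"`
}

// decodeResponse parses a response body into a TokenizeResponse.
func decodeResponse(body []byte) (TokenizeResponse, error) {
	var r TokenizeResponse
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	// Shortened copy of the response shown above.
	body := []byte(`{"status":true,"tokens":[
	  {"id":36163,"start":0,"end":3,"surface":"すもも","class":"KNOWN",
	   "features":["名詞","一般","*","*","*","*","すもも","スモモ","スモモ"]},
	  {"id":73244,"start":3,"end":4,"surface":"も","class":"KNOWN",
	   "features":["助詞","係助詞","*","*","*","*","も","モ","モ"]}]}`)

	resp, err := decodeResponse(body)
	if err != nil {
		log.Fatal(err)
	}
	for _, t := range resp.Tokens {
		// Print the offsets, the surface form, and the first feature.
		fmt.Printf("%d-%d\t%s\t%s\n", t.Start, t.End, t.Surface, t.Features[0])
	}
}
```

Judging from the sample, `start` and `end` look like character offsets rather than byte offsets (すもも spans 0 to 3), but that is an inference from this one response.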

Parameters

| Parameter | Type | Required | Description |
|:---|:---|:---|:---|
| sentence | string | Required | Sentence to tokenize. |
| mode | string | Optional | Tokenization mode. One of "normal", "search", or "extended"; defaults to "normal". |

Demo

Launch a server and open http://localhost:8888 in a browser.
(To draw a lattice, the demo application uses graphviz, so graphviz must be installed.)

$ kagome server -http=":8888" &


User Dictionary

The user dictionary format is the same as Kuromoji's. There is a sample in the _sample directory.

% kagome tokenize -udic _sample/userdic.txt
第68代横綱朝青龍
第	接頭詞,数接続,*,*,*,*,第,ダイ,ダイ
68	名詞,数,*,*,*,*,*
代	名詞,接尾,助数詞,*,*,*,代,ダイ,ダイ
横綱	名詞,一般,*,*,*,*,横綱,ヨコヅナ,ヨコズナ
朝青龍	カスタム人名,朝青龍,アサショウリュウ
EOS
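
To make the format concrete: a Kuromoji-style user dictionary line has four comma-separated fields (surface text, space-separated segmentation, space-separated readings, part-of-speech name). The sketch below parses such lines using only the standard library. The type `UserDicEntry` and the function `parseUserDicLine` are hypothetical names for illustration, not Kagome's loader, and the example lines are illustrative rather than copied from _sample/userdic.txt.

```go
package main

import (
	"fmt"
	"strings"
)

// UserDicEntry holds one user-dictionary record.
// (Hypothetical type for illustration only.)
type UserDicEntry struct {
	Surface  string   // text to match, e.g. 関西国際空港
	Segments []string // how the surface is split, e.g. 関西 / 国際 / 空港
	Readings []string // one reading per segment
	POS      string   // part-of-speech name, e.g. カスタム名詞
}

// parseUserDicLine parses "surface,segmentation,readings,pos".
func parseUserDicLine(line string) (UserDicEntry, error) {
	fields := strings.Split(strings.TrimSpace(line), ",")
	if len(fields) != 4 {
		return UserDicEntry{}, fmt.Errorf("want 4 comma-separated fields, got %d", len(fields))
	}
	e := UserDicEntry{
		Surface:  fields[0],
		Segments: strings.Fields(fields[1]),
		Readings: strings.Fields(fields[2]),
		POS:      fields[3],
	}
	// Each segment needs exactly one reading.
	if len(e.Segments) != len(e.Readings) {
		return UserDicEntry{}, fmt.Errorf("%d segments but %d readings", len(e.Segments), len(e.Readings))
	}
	return e, nil
}

func main() {
	lines := []string{
		"朝青龍,朝青龍,アサショウリュウ,カスタム人名",
		"関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞",
	}
	for _, line := range lines {
		e, err := parseUserDicLine(line)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s -> %v %v (%s)\n", e.Surface, e.Segments, e.Readings, e.POS)
	}
}
```

The first line corresponds to the 朝青龍 token in the output above, where the features are the part-of-speech name, the segment, and its reading.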

Utility

The lattice command is a debugging tool that outputs the lattice built during tokenization in graphviz dot format.

$ kagome lattice -v すもももももももものうち | dot -Tpng -o lattice.png
すもも	  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	  助詞,係助詞,*,*,*,*,も,モ,モ
もも	  名詞,一般,*,*,*,*,もも,モモ,モモ
も	  助詞,係助詞,*,*,*,*,も,モ,モ
もも	  名詞,一般,*,*,*,*,もも,モモ,モモ
の	  助詞,連体化,*,*,*,*,の,ノ,ノ
うち	  名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS


Programming example

Below is a short Go example that segments a sentence into tokens.

Sample code:

package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome/tokenizer"
)

func main() {
	t := tokenizer.New()
	tokens := t.Tokenize("寿司が食べたい。") // t.Analyze("寿司が食べたい。", tokenizer.Normal)
	for _, token := range tokens {
		if token.Class == tokenizer.DUMMY {
			// BOS: Begin Of Sentence, EOS: End Of Sentence.
			fmt.Printf("%s\n", token.Surface)
			continue
		}
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}

output:

BOS
寿司    名詞,一般,*,*,*,*,寿司,スシ,スシ
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ    動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい    助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
。      記号,句点,*,*,*,*,。,。,。
EOS

Working with Google App Engine

The system dictionary UniDic is too large to upload to Google App Engine.
For Google App Engine, please use kagome.ipadic which is a small dictionary version of kagome.

For details, see https://github.com/ikawaha/kagome/issues/86

Docker image


docker pull ikawaha/kagome:latest

Examples

Show help

$ docker run --rm ikawaha/kagome -h

Show tokenize command's help

$ docker run --rm ikawaha/kagome tokenize -h

Interactive mode

$ docker run --rm -it ikawaha/kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
^C
$

Server mode as Web API

  • Run the server detached in the background, published on host port 8888.
$ docker run --rm -d -p 8888:80 ikawaha/kagome server -http=":80"
  • Access the API from a client.
$ curl -s -XPUT localhost:8888/a -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .

Contributing

Issues and pull requests are always welcome. Please base code changes on the develop branch, not on the master branch.

License

Kagome is licensed under the Apache License v2.0 and uses the MeCab-IPADIC and UniDic dictionaries/statistical models. See NOTICE.txt for license details.
