wordninja

Probabilistically split concatenated words using NLP based on English Wikipedia unigram frequencies.

  • 所有者: keredson/wordninja
  • 平台:
  • 許可證: MIT License
  • 分類:
  • 主題:
  • 喜歡:
    0
      比較:

Github星跟蹤圖

image

Word Ninja

Slice your munged together words! Seriously, Take anything, 'imateapot' for example, would become ['im', 'a', 'teapot']. Useful for humanizing stuff (like database tables when people don't like underscores).

This project is repackaging the excellent work from here: http://stackoverflow.com/a/11642687/2449774

Usage

$ python
>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
>>> wordninja.split('imateapot')
['im', 'a', 'teapot']
>>> wordninja.split('heshotwhointhewhatnow')
['he', 'shot', 'who', 'in', 'the', 'what', 'now']
>>> wordninja.split('thequickbrownfoxjumpsoverthelazydog')
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Performance

It's super fast!

>>> def f():
...   wordninja.split('imateapot')
... 
>>> timeit.timeit(f, number=10000)
0.40885152100236155

It can handle long strings:

>>> wordninja.split('wethepeopleoftheunitedstatesinordertoformamoreperfectunionestablishjusticeinsuredomestictranquilityprovideforthecommondefencepromotethegeneralwelfareandsecuretheblessingsoflibertytoourselvesandourposteritydoordainandestablishthisconstitutionfortheunitedstatesofamerica')
['we', 'the', 'people', 'of', 'the', 'united', 'states', 'in', 'order', 'to', 'form', 'a', 'more', 'perfect', 'union', 'establish', 'justice', 'in', 'sure', 'domestic', 'tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'welfare', 'and', 'secure', 'the', 'blessings', 'of', 'liberty', 'to', 'ourselves', 'and', 'our', 'posterity', 'do', 'ordain', 'and', 'establish', 'this', 'constitution', 'for', 'the', 'united', 'states', 'of', 'america']

And scales well. (This string takes ~7ms to compute.)

How to Install

pip3 install wordninja

Custom Language Models

#1 most requested feature! If you want to do something other than english (or want to specify your own model of english), this is how you do it.

>>> lm = wordninja.LanguageModel('my_lang.txt.gz')
>>> lm.split('derek')
['der','ek']

Language files must be gziped text files with one word per line in decreasing order of probability.

If you want to make your model the default, set:

wordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('my_lang.txt.gz')

主要指標

概覽
名稱與所有者keredson/wordninja
主編程語言Python
編程語言Python (語言數: 1)
平台
許可證MIT License
所有者活动
創建於2017-04-20 22:05:42
推送於2023-02-19 00:44:37
最后一次提交2023-02-14 12:23:58
發布數0
用户参与
星數853
關注者數9
派生數111
提交數19
已啟用問題?
問題數21
打開的問題數14
拉請求數4
打開的拉請求數5
關閉的拉請求數2
项目设置
已啟用Wiki?
已存檔?
是復刻?
已鎖定?
是鏡像?
是私有?