wordvectors

Pre-trained word vectors of 30+ languages

This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as segmentation and word vectors. Second, and more importantly, some people are probably looking for pre-trained word vector models for non-English languages. Alas! English has received far more attention than any other language. Check this to see how easily you can get a variety of pre-trained English word vectors. I think it's time to turn our eyes to a multilingual version of this.

Nearing the end of the work, I found out that a similar project named polyglot already exists. I strongly encourage you to check out this great project. How embarrassing! Nevertheless, I decided to release this project anyway. You will see that my work has its own flavor, after all.

Requirements

  • nltk >= 1.11.1
  • regex >= 2016.6.24
  • lxml >= 3.3.3
  • numpy >= 1.11.2
  • konlpy >= 0.4.4 (Only for Korean)
  • mecab (Only for Japanese)
  • pythai >= 0.1.3 (Only for Thai)
  • pyvi >= 0.0.7.2 (Only for Vietnamese)
  • jieba >= 0.38 (Only for Chinese)
  • gensim >= 0.13.1 (for Word2Vec)
  • fastText (for fastText)

Background / References

  • Check this to learn what a word embedding is.
  • Check this to quickly get a picture of Word2vec.
  • Check this to install fastText.
  • Watch this to really understand what's happening under the hood of Word2vec.
  • Go get various English word vectors here if needed.

Work Flow

  • STEP 1. Download the Wikipedia database backup dump of the language you want.
  • STEP 2. Extract the running text to the data/ folder.
  • STEP 3. Run build_corpus.py.
  • STEP 4-1. Run make_wordvector.sh to get Word2Vec word vectors.
  • STEP 4-2. Run fasttext.sh to get fastText word vectors.
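
Steps 1 to 3 can be sketched roughly as follows. This is a minimal illustration, not the contents of build_corpus.py: the function names `iter_article_texts`, `to_tokens`, and `build_corpus` are hypothetical, the markup stripping is deliberately crude, and whitespace splitting stands in for the language-specific tokenizers listed under Requirements.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Crude patterns for non-nested templates, HTML-like tags, link brackets
# (dropping the link target before "|"), and bold/italic quote runs.
_MARKUP = re.compile(r"\{\{[^{}]*\}\}|<[^>]+>|\[\[(?:[^|\]]*\|)?|\]\]|'{2,}")


def iter_article_texts(stream):
    """Yield the raw wikitext of each <text> element in a MediaWiki XML dump."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Dump elements carry an XML namespace, so compare only the local name.
        if elem.tag.rsplit("}", 1)[-1] == "text" and elem.text:
            yield elem.text
        elem.clear()  # release memory; real dumps are large


def to_tokens(wikitext):
    """Strip markup and punctuation, lowercase, and split on whitespace.

    Assumption: whitespace splitting is only a placeholder. Languages such
    as Chinese, Japanese, Korean, Thai, and Vietnamese need a real segmenter.
    """
    plain = _MARKUP.sub("", wikitext)
    plain = re.sub(r"[^\w\s]", " ", plain)
    return plain.lower().split()


def build_corpus(dump_path, out_path):
    """Write one space-separated, tokenized article per line."""
    with bz2.open(dump_path, "rb") as dump, \
            open(out_path, "w", encoding="utf-8") as out:
        for text in iter_article_texts(dump):
            tokens = to_tokens(text)
            if tokens:
                out.write(" ".join(tokens) + "\n")
```

A file with one tokenized article (or sentence) per line is the kind of input that Word2Vec and fastText trainers typically accept, which is why the sketch writes its output that way.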

Pre-trained models

Two types of pre-trained models are provided. w and f represent word2vec and fastText respectively.

| Language | ISO 639-1 | Vector Size | Corpus Size | Vocabulary Size | Training |
|---|---|---|---|---|---|
| Bengali (w), Bengali (f) | bn | 300 | 147M | 10059 | negative sampling |
| Catalan (w), Catalan (f) | ca | 300 | 967M | 50013 | negative sampling |
| Chinese (w), Chinese (f) | zh | 300 | 1G | 50101 | negative sampling |
| Danish (w), Danish (f) | da | 300 | 295M | 30134 | negative sampling |
| Dutch (w), Dutch (f) | nl | 300 | 1G | 50160 | negative sampling |
| Esperanto (w), Esperanto (f) | eo | 300 | 1G | 50597 | negative sampling |
| Finnish (w), Finnish (f) | fi | 300 | 467M | 30029 | negative sampling |
| French (w), French (f) | fr | 300 | 1G | 50130 | negative sampling |
| German (w), German (f) | de | 300 | 1G | 50006 | negative sampling |
| Hindi (w), Hindi (f) | hi | 300 | 323M | 30393 | negative sampling |
| Hungarian (w), Hungarian (f) | hu | 300 | 692M | 40122 | negative sampling |
| Indonesian (w), Indonesian (f) | id | 300 | 402M | 30048 | negative sampling |
| Italian (w), Italian (f) | it | 300 | 1G | 50031 | negative sampling |
| Japanese (w), Japanese (f) | ja | 300 | 1G | 50108 | negative sampling |
| Javanese (w), Javanese (f) | jv | 100 | 31M | 10019 | negative sampling |
| Korean (w), Korean (f) | ko | 200 | 339M | 30185 | negative sampling |
| Malay (w), Malay (f) | ms | 100 | 173M | 10010 | negative sampling |
| Norwegian (w), Norwegian (f) | no | 300 | 1G | 50209 | negative sampling |
| Norwegian Nynorsk (w), Norwegian Nynorsk (f) | nn | 100 | 114M | 10036 | negative sampling |
| Polish (w), Polish (f) | pl | 300 | 1G | 50035 | negative sampling |
| Portuguese (w), Portuguese (f) | pt | 300 | 1G | 50246 | negative sampling |
| Russian (w), Russian (f) | ru | 300 | 1G | 50102 | negative sampling |
| Spanish (w), Spanish (f) | es | 300 | 1G | 50003 | negative sampling |
| Swahili (w), Swahili (f) | sw | 100 | 24M | 10222 | negative sampling |
| Swedish (w), Swedish (f) | sv | 300 | 1G | 50052 | negative sampling |
| Tagalog (w), Tagalog (f) | tl | 100 | 38M | 10068 | negative sampling |
| Thai (w), Thai (f) | th | 300 | 696M | 30225 | negative sampling |
| Turkish (w), Turkish (f) | tr | 200 | 370M | 30036 | negative sampling |
| Vietnamese (w), Vietnamese (f) | vi | 100 | 74M | 10087 | negative sampling |
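
Once a model is downloaded, a common first check is to look up the nearest neighbors of a word. The sketch below assumes the vectors are available in the plain-text layout that fastText writes to its `.vec` file (a header line with vocabulary size and dimension, then one word followed by its values per line). Whether a particular download from this project uses that layout is an assumption, and `load_text_vectors`, `cosine`, and `most_similar` are hypothetical helper names. It uses only the standard library; for the full-size models, numpy or gensim would be much faster.

```python
import math


def load_text_vectors(lines):
    """Parse '<word> <v1> ... <vd>' lines preceded by a '<vocab> <dim>' header."""
    lines = iter(lines)
    _vocab_size, dim = map(int, next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=5):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]
```

With a file in that layout, usage would look like `vectors = load_text_vectors(open("ko.vec", encoding="utf-8"))` followed by `most_similar(vectors, some_word)`, where `ko.vec` is a placeholder file name.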

Key metrics

Overview
Name and owner: Kyubyong/wordvectors
Primary programming language: Python
Programming languages: Python (number of languages: 2)
License: MIT License
Owner activity
Created: 2016-12-21 01:55:34
Pushed: 2018-10-11 20:54:06
Last commit: 2017-09-13 20:48:37
Releases: 0
User engagement
Stars: 2.2k
Watchers: 85
Forks: 391
Commits: 8
Issues: 24
Open issues: 16
Pull requests: 0
Open pull requests: 3
Closed pull requests: 2