Probabilistic FastText for Multi-Sense Word Embeddings

ACL 2018 paper: Probabilistic FastText for Multi-Sense Word Embeddings (Athiwaratkun et al., 2018)

  • Owner: benathi/multisense-prob-fasttext
  • Platform: Linux
  • License: Other

Probabilistic FastText for Multi-Sense Word Embeddings

This repository contains the implementation of the models in Athiwaratkun et al., Probabilistic FastText for Multi-Sense Word Embeddings, ACL 2018. The paper can be accessed on arXiv here.

Similar to our previous work in Athiwaratkun and Wilson, Multimodal Word Distributions, ACL 2017, we represent each word in the dictionary as a Gaussian mixture distribution that can capture multiple meanings. We use FastText as our subword representation to enhance semantic estimation of rare words and words outside the training vocabulary.
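As a rough illustration of this representation (a sketch, not the repository's actual code), the similarity between two words can be taken as the expected likelihood kernel between their Gaussian mixtures, which has a closed form for spherical Gaussians:

```python
import numpy as np

def gaussian_overlap(mu1, mu2, var1, var2):
    # Closed-form inner product of two spherical Gaussians:
    # integral of N(x; mu1, var1*I) * N(x; mu2, var2*I) dx
    #   = N(mu1; mu2, (var1 + var2) * I)
    d = mu1.shape[0]
    var = var1 + var2
    diff = mu1 - mu2
    return np.exp(-0.5 * diff @ diff / var) / (2 * np.pi * var) ** (d / 2)

def mixture_log_kernel(mus_a, mus_b, var=0.5):
    # Log expected likelihood kernel between two uniform-weight Gaussian
    # mixtures; mus_a and mus_b hold one mean vector per sense component.
    total = sum(gaussian_overlap(ma, mb, var, var)
                for ma in mus_a for mb in mus_b)
    return np.log(total / (len(mus_a) * len(mus_b)))
```

A word whose mixture places one component near a neighbor's component scores higher than an unrelated word, even if its other sense components are far away, which is what lets multi-sense embeddings handle polysemy.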

The BibTeX entry for the paper is:

@InProceedings{athi_multift_2018,
    author = {Ben Athiwaratkun and Andrew Gordon Wilson and Anima Anandkumar},
    title = {Probabilistic FastText for Multi-Sense Word Embeddings},
    booktitle = {Conference of the Association for Computational Linguistics (ACL)},
    year = {2018}
}

0. What's in this Library?

We provide

(1) scripts to train the multi-sense FastText embeddings. We give instructions on how to train the model in Section 1.

(2) Python scripts to evaluate the trained models on word similarity in Section 2. Our scripts allow the subword model to be loaded directly into a Python object, which can then be used for other tasks.

(3) a pre-trained model and an evaluation script in Section 3. This section includes instructions on how to load a pre-trained FastText model (single sense) into our format, which allows it to be loaded as a Python object.

1. Train

1.1 Compile the C++ files. This step requires a compiler with C++11 support, such as g++-4.7.2 or newer or clang-3.3 or newer. It also requires make, which can be installed via sudo apt-get install build-essential on Ubuntu.

Once you have make and a C++ compiler, you can compile our code by executing:

make

This command will generate multift, an executable of our model.

1.2 Obtain text data for training. We include scripts to download text8 and text9 in data/.

bash data/get_text8.sh
bash data/get_text9.sh

In our paper, we use the concatenation of ukWaC and WaCkypedia_EN as our English text corpus. Both datasets can be requested here.

The foreign language datasets deWac (German), itWac (Italian), and frWac (French) can be requested using the above link as well.

1.3 Run sample training scripts for text8 or text9.

bash exps/train_text8_multi.sh

After the training is complete, the following files will be saved:

modelname.words         List of words in the dictionary
modelname.bin           A binary file for the subword embedding model
modelname.in            The subword embeddings
modelname.in2           The embeddings for the second Gaussian component
modelname.subword       The final representation of words in the dictionary. Note that representations for words outside the dictionary can be computed with the provided Python module from the *.in and *.in2 files.

2. Evaluate

2.1 The provided Python module multift.py can be used to load the multi-sense FastText object.

ft = multift.MultiFastText(basename="modelfiles/modelname", multi=True)

Note that loading the model for the first time can be quite slow. However, the module saves .npy files for later use, which makes subsequent loads much faster.
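The caching behavior described above follows a standard parse-once pattern; a minimal sketch of the idea (the file names and text format here are hypothetical, not the module's actual layout):

```python
import os
import numpy as np

def load_vectors(text_path, cache_path):
    # Hypothetical illustration of the parse-once caching pattern:
    # the slow text parse happens only on the first call; later calls
    # load the cached binary .npy file, which is much faster.
    if os.path.exists(cache_path):
        return np.load(cache_path)
    vecs = np.loadtxt(text_path)  # slow first-time parse
    np.save(cache_path, vecs)     # cache for subsequent loads
    return vecs
```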

We can query for nearest neighbors given a word, or evaluate the embeddings against word similarity datasets.

2.2 The script eval/eval_model_wordsim.py calculates Spearman's rank correlation on multiple word similarity datasets for a given model. An example invocation is shown below.

python eval/eval_model_wordsim.py --modelname modelfiles/multi_text8_e10_d300_vs2e-4_lr1e-5_margin1
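The core computation behind such a word-similarity evaluation can be sketched as follows (the toy data and function names are illustrative, not the script's actual interface): rank-correlate the model's cosine similarities with human-annotated gold scores over the same word pairs.

```python
import numpy as np

def ranks(a):
    # Convert scores to ranks (1 = smallest); assumes no ties.
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    return r

def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the ranks.
    rx, ry = ranks(np.asarray(x)), ranks(np.asarray(y))
    rx -= rx.mean()
    ry -= ry.mean()
    return (rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry))

def eval_wordsim(vectors, pairs, gold_scores):
    # Score each word pair by cosine similarity, then rank-correlate
    # the model scores with the human judgments.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    model_scores = [cos(vectors[w1], vectors[w2]) for w1, w2 in pairs]
    return spearman(model_scores, gold_scores)
```

Spearman's correlation is used rather than Pearson's because only the ranking of similarities needs to agree with the human scores, not their absolute scale.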

Key Metrics

Overview
  • Name and owner: benathi/multisense-prob-fasttext
  • Main programming language: C++
  • Programming languages: Makefile (number of languages: 4)
  • Platform: Linux
  • License: Other

Owner Activity
  • Created: 2018-05-11 16:01:45
  • Last pushed: 2018-06-11 04:31:44
  • Last commit: 2018-06-11 00:31:43
  • Releases: 0

User Engagement
  • Stars: 148
  • Watchers: 9
  • Forks: 30
  • Commits: 5
  • Issues enabled?
  • Issues: 7
  • Open issues: 6
  • Pull requests: 0
  • Open pull requests: 0
  • Closed pull requests: 0

Project Settings
  • Wiki enabled?
  • Archived?
  • Is a fork?
  • Locked?
  • Is a mirror?
  • Is private?