subword-nmt

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation

Github星跟踪图

Subword Neural Machine Translation

This repository contains preprocessing scripts to segment text into subword
units. The primary purpose is to facilitate the reproduction of our experiments
on Neural Machine Translation with subword units (see below for reference).

INSTALLATION

install via pip (from PyPI):

pip install subword-nmt

install via pip (from Github):

pip install https://github.com/rsennrich/subword-nmt/archive/master.zip

alternatively, clone this repository; the scripts are executable stand-alone.

USAGE INSTRUCTIONS

Check the individual files for usage instructions.

To apply byte pair encoding to word segmentation, invoke these commands:

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}
subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}

To segment rare words into character n-grams, do the following:

subword-nmt get-vocab --train_file {train_file} --vocab_file {vocab_file}
subword-nmt segment-char-ngrams --vocab {vocab_file} -n {order} --shortlist {size} < {test_file} > {out_file}

The original segmentation can be restored with a simple replacement:

sed -r 's/(@@ ), (@@ ?$)//g'

If you cloned the repository and did not install a package, you can also run the individual commands as scripts:

./subword_nmt/learn_bpe.py -s {num_operations} < {train_file} > {codes_file}

BEST PRACTICE ADVICE FOR BYTE PAIR ENCODING IN NMT

We found that for languages that share an alphabet, learning BPE on the
concatenation of the (two or more) involved languages increases the consistency
of segmentation, and reduces the problem of inserting/deleting characters when
copying/transliterating names.

However, this introduces undesirable edge cases in that a word may be segmented
in a way that has only been observed in the other language, and is thus unknown
at test time. To prevent this, apply_bpe.py accepts a --vocabulary and a
--vocabulary-threshold option so that the script will only produce symbols
which also appear in the vocabulary (with at least some frequency).

To use this functionality, we recommend the following recipe (assuming L1 and L2
are the two languages):

Learn byte pair encoding on the concatenation of the training text, and get resulting vocabulary for each:

cat {train_file}.L1 {train_file}.L2, subword-nmt learn-bpe -s {num_operations} -o {codes_file}
subword-nmt apply-bpe -c {codes_file} < {train_file}.L1, subword-nmt get-vocab > {vocab_file}.L1
subword-nmt apply-bpe -c {codes_file} < {train_file}.L2, subword-nmt get-vocab > {vocab_file}.L2

more conventiently, you can do the same with with this command:

subword-nmt learn-joint-bpe-and-vocab --input {train_file}.L1 {train_file}.L2 -s {num_operations} -o {codes_file} --write-vocabulary {vocab_file}.L1 {vocab_file}.L2

re-apply byte pair encoding with vocabulary filter:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {train_file}.L1 > {train_file}.BPE.L1
subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L2 --vocabulary-threshold 50 < {train_file}.L2 > {train_file}.BPE.L2

as a last step, extract the vocabulary to be used by the neural network. Example with Nematus:

nematus/data/build_dictionary.py {train_file}.BPE.L1 {train_file}.BPE.L2

[you may want to take the union of all vocabularies to support multilingual systems]

for test/dev data, re-use the same options for consistency:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {test_file}.L1 > {test_file}.BPE.L1

PUBLICATIONS

The segmentation methods are described in:

Rico Sennrich, Barry Haddow and Alexandra Birch (2016):
Neural Machine Translation of Rare Words with Subword Units
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

HOW IMPLEMENTATION DIFFERS FROM Sennrich et al. (2016)

This repository implements the subword segmentation as described in Sennrich et al. (2016),
but since version 0.2, there is one core difference related to end-of-word tokens.

In Sennrich et al. (2016), the end-of-word token </w> is initially represented as a separate token, which can be merged with other subwords over time:

u n d </w>
f u n d </w>

Since 0.2, end-of-word tokens are initially concatenated with the word-final character:

u n d</w>
f u n d</w>

The new representation ensures that when BPE codes are learned from the above examples and then applied to new text, it is clear that a subword unit und is unambiguously word-final, and un is unambiguously word-internal, preventing the production of up to two different subword units from each BPE merge operation.

apply_bpe.py is backward-compatible and continues to accept old-style BPE files. New-style BPE files are identified by having the following first line: #version: 0.2

ACKNOWLEDGMENTS

This project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland, and from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).

主要指标

概览
名称与所有者rsennrich/subword-nmt
主编程语言Python
编程语言Python (语言数: 1)
平台
许可证MIT License
所有者活动
创建于2015-09-01 10:50:32
推送于2024-08-07 14:21:16
最后一次提交2024-08-07 16:21:06
发布数6
最新版本名称v0.3.8 (发布于 2021-12-08 11:01:34)
第一版名称v0.1 (发布于 )
用户参与
星数2.2k
关注者数53
派生数469
提交数144
已启用问题?
问题数87
打开的问题数3
拉请求数21
打开的拉请求数0
关闭的拉请求数12
项目设置
已启用Wiki?
已存档?
是复刻?
已锁定?
是镜像?
是私有?