WAT 2018

The 5th Workshop on Asian Translation
Baseline Systems
Data preparation by BPE for Japanese, English, Chinese, and Korean

BPE

We assume that the tokenized files (test.src, test.tgt, dev.src, dev.tgt, train.src, and train.tgt) are in corpus.tok/.
MOSES_SCRIPT=${path}/mosesdecoder-RELEASE-4.0/scripts
SCRIPT_DIR=${path}/script.converter.distribution
mkdir corpus.bpe
cd corpus.bpe
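
Before building the model, it may help to confirm that the six tokenized files assumed above are actually present and non-empty. The following is a minimal sketch; the function name check_tok_files is hypothetical and not part of the baseline scripts.

```shell
# Hypothetical helper (not part of the WAT baseline scripts):
# report any of the six expected tokenized files that is missing or empty.
check_tok_files() {
    tok_dir=$1
    missing=0
    for name in train dev test; do
        for side in src tgt; do
            # -s is true only if the file exists and has a size greater than zero
            if [ ! -s "${tok_dir}/${name}.${side}" ]; then
                echo "missing or empty: ${tok_dir}/${name}.${side}" >&2
                missing=1
            fi
        done
    done
    return "${missing}"
}
```

From inside corpus.bpe/, it could be called as: check_tok_files ../corpus.tok || echo "fix corpus.tok/ first"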

  • Building a BPE model
      subword-nmt learn-joint-bpe-and-vocab --input ../corpus.tok/train.src ../corpus.tok/train.tgt -s 100000 -o bpe_codes --write-vocabulary vocab.src vocab.tgt

  • Applying the BPE model
      for name in train dev test; do
        subword-nmt apply-bpe -c bpe_codes --vocabulary vocab.src --vocabulary-threshold 10 < ../corpus.tok/${name}.src > ${name}.src
        subword-nmt apply-bpe -c bpe_codes --vocabulary vocab.tgt --vocabulary-threshold 10 < ../corpus.tok/${name}.tgt > ${name}.tgt
      done
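
BPE segmentation splits words but should not add or drop lines, so each file in corpus.bpe/ is expected to have the same number of lines as its counterpart in corpus.tok/. A minimal sketch of such a check follows; the function name check_line_counts is hypothetical.

```shell
# Hypothetical helper (not part of the WAT baseline scripts):
# compare line counts of tokenized files and their BPE-segmented counterparts.
check_line_counts() {
    tok_dir=$1
    bpe_dir=$2
    status=0
    for name in train dev test; do
        for side in src tgt; do
            # $(( ... )) strips the padding some wc implementations print
            before=$(( $(wc -l < "${tok_dir}/${name}.${side}") ))
            after=$(( $(wc -l < "${bpe_dir}/${name}.${side}") ))
            if [ "${before}" -ne "${after}" ]; then
                echo "line count mismatch in ${name}.${side}: ${before} vs ${after}" >&2
                status=1
            fi
        done
    done
    return "${status}"
}
```

From inside corpus.bpe/, it could be called as: check_line_counts ../corpus.tok .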

cd ..
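
For reference, subword-nmt marks subword boundaries with "@@ " by default, so segmented text can be turned back into ordinary tokens by deleting those markers. The sketch below wraps the usual sed expression in a hypothetical function named remove_bpe; the sample sentence is made up for illustration and does not come from the corpus.

```shell
# Hypothetical helper: undo BPE segmentation by removing the default "@@ " markers.
# Reads segmented text on stdin and writes restored text to stdout.
remove_bpe() {
    sed -E 's/(@@ )|(@@ ?$)//g'
}

# Illustrative input only (not taken from the corpus).
printf 'the new@@ est sys@@ tem\n' | remove_bpe
# prints: the newest system
```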

    NICT (National Institute of Information and Communications Technology)
    Kyoto University
    Last Modified: 2018-07-30