WAT 2014

The 1st Workshop on Asian Translation

Baseline Systems

Data preparation for the JE and EJ subtasks


Setup

We assume that there are ASPEC text files (test.txt, dev.txt, train-1.txt, train-2.txt, and train-3.txt) in corpus.org/.

MOSES_SCRIPT=${path}/mosesdecoder-RELEASE-2.1.1/scripts
SCRIPT_DIR=${path}/script.converter.distribution

cd corpus.org/
Extracting sentences
for name in dev test; do
  perl -ne 'chomp; @a=split/ \|\|\| /; print $a[2], "\n";' < ${name}.txt > ${name}.ja.txt
  perl -ne 'chomp; @a=split/ \|\|\| /; print $a[3], "\n";' < ${name}.txt > ${name}.en.txt
done
for name in train-1 train-2 train-3; do
  perl -ne 'chomp; @a=split/ \|\|\| /; print $a[3], "\n";' < ${name}.txt > ${name}.ja.txt
  perl -ne 'chomp; @a=split/ \|\|\| /; print $a[4], "\n";' < ${name}.txt > ${name}.en.txt
done

(Removing date expressions at EOS in Japanese in the training and development data to reduce noise)
for file in train-1 train-2 train-3 dev; do
  mv ${file}.ja.txt ${file}.ja.txt.org
  cat ${file}.ja.txt.org | perl -Mencoding=utf8 -pe 's/(.)［[０-９．]+］$/${1}/;' > ${file}.ja.txt
done
cd ..

For Phrase-based and Hierarchical Phrase-based SMT

mkdir corpus.tok
cd corpus.tok
Tokenizing sentences in Japanese
for file in train-1 train-2 train-3 dev test; do
  cat ../corpus.org/${file}.ja.txt | \
    perl -Mencoding=utf8 -pe 's/　/ /g;' | \
    juman -b | \
    perl -ne 'chomp; if($_ eq "EOS"){print join(" ",@b),"\n"; @b=();} else {@a=split/ /; push @b, $a[0];}' | \
    perl -pe 's/^ +//; s/ +$//; s/ +/ /g;' | \
    perl -Mencoding=utf8 -pe 'tr/\|[]/｜［］/; ' \
    > ${file}.ja
done
Tokenizing sentences in English
for file in train-1 train-2 train-3 dev test; do
  cat ../corpus.org/${file}.en.txt | \
    perl ${SCRIPT_DIR}/z2h-utf8.pl | \
    perl ${MOSES_SCRIPT}/tokenizer/tokenizer.perl -l en \
    > ${file}.tok.en
done
Training truecaser for English
cat train-1.tok.en train-2.tok.en train-3.tok.en dev.tok.en > train_dev.tok.en
${MOSES_SCRIPT}/recaser/train-truecaser.perl --model truecase-model.en --corpus train_dev.tok.en
Truecasing English sentences
for file in train-1 train-2 train-3 dev test; do
  ${MOSES_SCRIPT}/recaser/truecase.perl --model truecase-model.en < ${file}.tok.en > ${file}.en
done
Building training data for language models
cat train-1.ja train-2.ja train-3.ja > train-all.ja
cat train-1.en train-2.en train-3.en > train-all.en
Cleaning training data for translation models
perl ${MOSES_SCRIPT}/training/clean-corpus-n.perl train-1 ja en train-clean 1 40
cd ..
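After cleaning, it can be worth confirming that both sides of the parallel data still have the same number of lines. The helper below is a hypothetical sanity check, not part of the baseline scripts; it is demonstrated on two throwaway files so that it runs without the corpus.

```shell
# Hypothetical helper: compare the line counts of the two sides of a
# parallel corpus and report a mismatch on stderr.
check_parallel() {
  # Arithmetic expansion strips any padding that wc may print.
  n1=$(($(wc -l < "$1")))
  n2=$(($(wc -l < "$2")))
  if [ "$n1" -eq "$n2" ]; then
    echo "OK: $n1 lines in both files"
  else
    echo "MISMATCH: $1 has $n1 lines, $2 has $n2 lines" >&2
    return 1
  fi
}

# Demonstration on temporary two-line files.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/demo.ja"
printf 'x\ny\n' > "$tmp/demo.en"
check_parallel "$tmp/demo.ja" "$tmp/demo.en"
rm -r "$tmp"
```

In corpus.tok/, the corresponding call would be `check_parallel train-clean.ja train-clean.en`.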

For String-to-Tree and Tree-to-String SMT

mkdir corpus.tree
cd corpus.tree
for file in train-1 dev test; do
  ln -s ../corpus.tok/${file}.ja
done
Tokenizing sentences in English
for file in train-1 train-2 train-3 dev test; do
  cat ../corpus.org/${file}.en.txt | \
    perl ${SCRIPT_DIR}/z2h-utf8.pl | \
    perl ${MOSES_SCRIPT}/tokenizer/tokenizer.perl -l en -penn \
    > ${file}.tok.en
done
Training truecaser for English
cat train-1.tok.en train-2.tok.en train-3.tok.en dev.tok.en > train_dev.tok.en
${MOSES_SCRIPT}/recaser/train-truecaser.perl --model truecase-model.en --corpus train_dev.tok.en
Truecasing English sentences
for file in train-1 train-2 train-3; do
  ${MOSES_SCRIPT}/recaser/truecase.perl --model truecase-model.en < ${file}.tok.en > ${file}.en
done
Building training data for a language model
cat train-1.en train-2.en train-3.en | perl -pe 's/\-LRB\-/\(/g; s/\-RRB\-/\)/g;' > train-all.en
Parsing English sentences
ln -s train-1.ja train-1.tok.ja
perl ${MOSES_SCRIPT}/training/clean-corpus-n.perl train-1.tok ja en train.reduced.tok 1 40
for file in train.reduced dev test; do
  cat ${file}.tok.en | \
    perl ${MOSES_SCRIPT}/training/wrappers/parse-de-berkeley.perl \
      -binarize \
      -ja ${path}/BerkeleyParser-1.7/BerkeleyParser-1.7.jar \
      -gr ${path}/BerkeleyParser-1.7/eng_sm6.gr \
    > ${file}.tok.xml.en
done
Cleaning training data for translation models
ln -s train.reduced.tok.ja train.reduced.tok.xml.ja
perl ${MOSES_SCRIPT}/training/clean-corpus-n.perl \
  train.reduced.tok.xml ja en \
  train-clean.tok.xml 1 40 \
  --ignore-xml
Truecasing English sentences
for file in train-clean dev test; do
  ${MOSES_SCRIPT}/recaser/truecase.perl --model truecase-model.en < ${file}.tok.xml.en > ${file}.en
done
ln -s train-clean.tok.xml.ja train-clean.ja
cd ..

JST (Japan Science and Technology Agency)
NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2014-07-07
