WAT 2020

The 7th Workshop on Asian Translation

Baseline Systems

Data preparation for the JC and CJ subtasks

Setup

We assume that the ASPEC text files (test.txt, dev.txt, and train.txt) are located in corpus.org/.

MOSES_SCRIPT=${path}/mosesdecoder-RELEASE-2.1.1/scripts
SCRIPT_DIR=${path}/script.converter.distribution

cd corpus.org/
Extracting sentences
for name in dev test train; do
  perl -ne 'chomp; @a=split/ \|\|\| /; print $a[1], "\n";' < ${name}.txt > ${name}.ja.txt
  perl -ne 'chomp; @a=split/ \|\|\| /; print $a[2], "\n";' < ${name}.txt > ${name}.zh.txt
done
cd ..

For NMT

mkdir corpus.tok
cd corpus.tok
Tokenizing sentences in Japanese
for file in train dev test; do
  cat ../corpus.org/${file}.ja.txt | \
    perl -CSD -Mutf8 -pe 's/　/ /g;' | \
    juman -b | \
    perl -ne 'chomp; if($_ eq "EOS"){print join(" ",@b),"\n"; @b=();} else {@a=split/ /; push @b, $a[0];}' | \
    perl -pe 's/^ +//; s/ +$//; s/ +/ /g;' | \
    perl -CSD -Mutf8 -pe 'tr/\|[]/｜［］/; ' \
    > ${file}.ja
done
Tokenizing sentences in Chinese
for name in dev test train; do
  ${path}/stanford-segmenter-2014-01-04/segment.sh ctb ../corpus.org/${name}.zh.txt UTF-8 0 | \
    perl -CSD -Mutf8 -pe 'tr/\|[]/｜［］/; ' \
    > ${name}.zh
done
cd ..
Applying BPE (see Data preparation by BPE)

NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2020-07-08
