WAT 2020

The 7th Workshop on Asian Translation

Baseline Systems

Data preparation for the JE and EJ patent subtasks


Setup

The commands below assume that the JPC text files (test.txt, dev.txt, and train.txt) are in corpus.org/.

MOSES_SCRIPT=${path}/mosesdecoder-RELEASE-2.1.1/scripts
SCRIPT_DIR=${path}/script.converter.distribution

cd corpus.org/
Extracting sentences
for name in dev test train; do
    perl -ne 'chomp; @a=split/ \|\|\| /; print $a[4], "\n";' < ${name}.txt > ${name}.ja.txt
    perl -ne 'chomp; @a=split/ \|\|\| /; print $a[3], "\n";' < ${name}.txt > ${name}.en.txt
done
cd ..

For NMT

mkdir corpus.tok
cd corpus.tok
Tokenizing sentences in Japanese
for file in train dev test; do
    cat ../corpus.org/${file}.ja.txt | \
    perl -CSD -Mutf8 -pe 's/　/ /g;' | \
    juman -b | \
    perl -ne 'chomp; if($_ eq "EOS"){print join(" ",@b),"\n"; @b=();} else {@a=split/ /; push @b, $a[0];}' | \
    perl -pe 's/^ +//; s/ +$//; s/ +/ /g;' | \
    perl -CSD -Mutf8 -pe 'tr/\|[]/｜［］/; ' \
    > ${file}.ja
done
Tokenizing sentences in English
for file in train dev test; do
    cat ../corpus.org/${file}.en.txt | \
    perl ${SCRIPT_DIR}/z2h-utf8.pl | \
    perl ${MOSES_SCRIPT}/tokenizer/tokenizer.perl -l en -no-escape \
    > ${file}.en
done
cd ..
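Since both tokenization loops read line-aligned input, each .ja file should end up with the same number of lines as its .en counterpart. The following is an optional sanity check, not part of the original WAT instructions; check_parallel is a hypothetical helper name, and the check assumes it is run from the directory that contains corpus.tok/.

```shell
# check_parallel is a hypothetical helper, not part of the WAT scripts.
# It succeeds only when both files have the same number of lines.
check_parallel() {
    n_ja=$(( $(wc -l < "$1") ))
    n_en=$(( $(wc -l < "$2") ))
    [ "$n_ja" -eq "$n_en" ]
}

# Report any split whose tokenized files are no longer line-aligned.
for file in train dev test; do
    if [ -f "corpus.tok/${file}.ja" ] && [ -f "corpus.tok/${file}.en" ]; then
        check_parallel "corpus.tok/${file}.ja" "corpus.tok/${file}.en" \
            || echo "line count mismatch: ${file}" >&2
    fi
done
```

A mismatch usually means that one stage dropped or merged a line, so it is worth resolving before applying BPE.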
Applying BPE (see Data preparation by BPE)

NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2020-07-08
