
WAT 2014

The 1st Workshop on Asian Translation
Baseline Systems
Data preparation for the JC and CJ subtasks


Setup

We assume that the ASPEC text files (test.txt, dev.txt, and train.txt) are in corpus.org/. In the commands below, ${path} stands for the directory where the tools are installed.
MOSES_SCRIPT=${path}/mosesdecoder-RELEASE-2.1.1/scripts
SCRIPT_DIR=${path}/script.converter.distribution
cd corpus.org/

  • Extracting sentences

    for name in dev test train; do
      perl -ne 'chomp; @a=split/ \|\|\| /; print $a[1], "\n";' < ${name}.txt > ${name}.ja.txt
      perl -ne 'chomp; @a=split/ \|\|\| /; print $a[2], "\n";' < ${name}.txt > ${name}.zh.txt
    done

    cd ..

For Phrase-based and Hierarchical Phrase-based SMT

    mkdir corpus.tok
    cd corpus.tok

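  • Tokenizing sentences in Japanese

    # \x{3000} is the full-width (ideographic) space. \x{FF5C}, \x{FF3B}, and \x{FF3D}
    # are the full-width forms of | [ ], characters with a special meaning in Moses.
    for file in train dev test; do
      cat ../corpus.org/${file}.ja.txt | \
        perl -Mencoding=utf8 -pe 's/ /\x{3000}/g;' | \
        juman -b | \
        perl -ne 'chomp; if($_ eq "EOS"){print join(" ",@b),"\n"; @b=();} else {@a=split/ /; push @b, $a[0];}' | \
        perl -pe 's/^ +//; s/ +$//; s/ +/ /g;' | \
        perl -Mencoding=utf8 -pe 'tr/\|[]/\x{FF5C}\x{FF3B}\x{FF3D}/;' \
        > ${file}.ja
    done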

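  • Tokenizing sentences in Chinese

    # As for Japanese, | [ ] are mapped to their full-width forms (\x{FF5C}, \x{FF3B}, \x{FF3D}).
    for name in dev test train; do
      ${path}/stanford-segmenter-2014-01-04/segment.sh ctb ../corpus.org/${name}.zh.txt UTF-8 0 | \
        perl -Mencoding=utf8 -pe 'tr/\|[]/\x{FF5C}\x{FF3B}\x{FF3D}/;' \
        > ${name}.zh
    done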

  • Cleaning training data for translation models

    perl ${MOSES_SCRIPT}/training/clean-corpus-n.perl train ja zh train-clean 1 40

    cd ..
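
The two trailing arguments, 1 and 40, are the minimum and maximum sentence lengths in tokens. The sketch below imitates only that length filter on toy data with `paste` and `awk`; it is not the Moses script, which applies further checks (for example on the length ratio between the two sides). The file names and contents are hypothetical.

```shell
# Toy parallel files (hypothetical content); pair 2 has an empty source side.
tmp=$(mktemp -d)
printf '%s\n' 'a b c' '' 'd e' > "$tmp/toy.ja"
printf '%s\n' 'x y'   'z' 'w'  > "$tmp/toy.zh"

# Keep a pair only if both sides have between min and max tokens.
paste "$tmp/toy.ja" "$tmp/toy.zh" | \
  awk -F'\t' -v min=1 -v max=40 '{
    n1 = split($1, a, " "); n2 = split($2, b, " ");
    if (n1 >= min && n1 <= max && n2 >= min && n2 <= max) print
  }' > "$tmp/kept.tsv"

kept=$(wc -l < "$tmp/kept.tsv" | tr -d ' ')
echo "kept $kept of 3 pairs"
```

Because pairs are dropped as a unit, the cleaned Japanese and Chinese files stay aligned line by line.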


For String-to-Tree and Tree-to-String SMT

    mkdir corpus.tree
    cd corpus.tree

  • Parsing Chinese sentences

    for file in train-clean dev test; do
      perl ${MOSES_SCRIPT}/training/wrappers/parse-de-berkeley.perl \
        -binarize \
        -jar ${path}/BerkeleyParser-1.7/BerkeleyParser-1.7.jar \
        -gr ${path}/BerkeleyParser-1.7/chn_sm5.gr \
        < ../corpus.tok/${file}.zh > ${file}.parsed.zh
    done
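
The parsed Chinese files are later paired line by line with the tokenized Japanese files, so each parsed file should have exactly as many lines as its input. A minimal check could look like the sketch below; `same_line_count` is a hypothetical helper, not part of Moses, and the demonstration uses toy files.

```shell
# Return success only if both files have the same number of lines.
same_line_count() {
  [ "$(wc -l < "$1" | tr -d ' ')" = "$(wc -l < "$2" | tr -d ' ')" ]
}

# Toy demonstration with temporary files (hypothetical content).
tmp=$(mktemp -d)
printf '%s\n' 's1' 's2' > "$tmp/a.ja"
printf '%s\n' 't1' 't2' > "$tmp/a.parsed.zh"
printf '%s\n' 't1'      > "$tmp/b.parsed.zh"

same_line_count "$tmp/a.ja" "$tmp/a.parsed.zh" && echo "aligned"
same_line_count "$tmp/a.ja" "$tmp/b.parsed.zh" || echo "line counts differ"
```

In the actual directory one would call it as, for example, `same_line_count ../corpus.tok/train-clean.zh train-clean.parsed.zh`.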

  • Cleaning training data for translation models

    ln -s ../corpus.tok/train-clean.ja train-clean.parsed.ja
    perl ${MOSES_SCRIPT}/training/clean-corpus-n.perl \
      train-clean.parsed ja zh \
      train-clean 1 40 \
      --ignore-xml

    ln -s dev.parsed.zh dev.zh
    ln -s test.parsed.zh test.zh

    cd ..
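
The `--ignore-xml` option matters here because the parsed Chinese side carries tree markup, which should not count toward the 1 to 40 token limits. The sketch below shows the difference on a hand-written tree in Moses-style XML; the sentence and labels are illustrative, and the `sed` expression is a rough approximation of what the script does, not its actual code.

```shell
# Hand-written example of a parsed sentence in Moses-style tree markup (hypothetical).
parsed='<tree label="IP"> <tree label="NP"> <tree label="PN"> 这 </tree> </tree> <tree label="VP"> <tree label="VC"> 是 </tree> <tree label="NP"> <tree label="NN"> 笔 </tree> </tree> </tree> </tree>'

# Whitespace-separated token count with the markup left in place.
with_tags=$(printf '%s\n' "$parsed" | awk '{ print NF }')

# Token count after replacing every <...> tag with a space.
without_tags=$(printf '%s\n' "$parsed" | sed -e 's/<[^>]*>/ /g' | awk '{ print NF }')

echo "with tags: $with_tags, words only: $without_tags"
```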


JST (Japan Science and Technology Agency)
NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2014-07-07