WAT 2016

The 3rd Workshop on Asian Translation

Baseline Systems

Data preparation for the EH and HE subtasks

Setup

We assume that the HINDEN corpus text files (test.en/hi, dev.en/hi, and train.en/hi) are located in corpus.org/.
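Before running any of the steps, it can help to confirm that all six files are actually present. The following is only a minimal sketch: check_corpus_files is a hypothetical helper, not part of the baseline scripts, and it assumes the file naming described above.

```shell
# Hypothetical helper (not part of the WAT baseline scripts):
# report every expected corpus file that is missing or empty in a directory.
check_corpus_files() {
  dir=$1
  missing=0
  for split in train dev test; do
    for lang in en hi; do
      # -s is true only if the file exists and is non-empty
      if [ ! -s "$dir/$split.$lang" ]; then
        echo "missing or empty: $dir/$split.$lang" >&2
        missing=1
      fi
    done
  done
  # Exit status 0 if all six files were found, 1 otherwise
  return "$missing"
}
```

Calling check_corpus_files corpus.org prints one line per problem file and returns a non-zero status if any file is absent.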

MOSES_SCRIPT=${path}/mosesdecoder-RELEASE-2.1.1/scripts
SCRIPT_DIR=${path}/script.converter.distribution
INDIC_LIBRARY=/path/to/indic/library/cloned/from/bitbucket

For Phrase-based and Hierarchical Phrase-based SMT

mkdir corpus.tok
cd corpus.tok
Tokenizing sentences in English
for file in train dev test; do
  ${MOSES_SCRIPT}/tokenizer/tokenizer.perl -l en < ../corpus.org/${file}.en > ${file}.en
done
Tokenizing/Normalizing sentences in Hindi
for file in dev test train; do
  python ${INDIC_LIBRARY}/src/indicnlp/normalize/indic_normalize.py ../corpus.org/${file}.hi ${file}.normalized.hi hi
  python ${INDIC_LIBRARY}/src/indicnlp/tokenize/indic_tokenize.py ${file}.normalized.hi ${file}.hi hi
done

We recommend that participants also try unsupervised morphological analysis and transliteration, both of which are available in the Indic NLP library.
Cleaning training data for translation models
perl ${MOSES_SCRIPT}/training/clean-corpus-n.perl train en hi train-clean 1 40
cd ..
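
The cleaning step writes train-clean.en and train-clean.hi, which must stay aligned line by line. As a rough sanity check, the line counts of the two sides can be compared. This is a sketch only: check_parallel is a hypothetical helper and is not part of Moses or the baseline scripts.

```shell
# Hypothetical helper (not part of Moses or the WAT baseline scripts):
# succeed only if both sides of a parallel corpus have the same line count.
check_parallel() {
  if [ ! -r "$1" ] || [ ! -r "$2" ]; then
    echo "cannot read $1 or $2" >&2
    return 2
  fi
  # Arithmetic expansion strips any padding that wc adds on some systems
  src_lines=$(( $(wc -l < "$1") ))
  tgt_lines=$(( $(wc -l < "$2") ))
  if [ "$src_lines" -ne "$tgt_lines" ]; then
    echo "line count mismatch: $1 has $src_lines, $2 has $tgt_lines" >&2
    return 1
  fi
  echo "OK: $src_lines sentence pairs"
}
```

For example, check_parallel corpus.tok/train-clean.en corpus.tok/train-clean.hi can be run after the cleaning step. Equal line counts do not prove the sentences are correctly paired, but a mismatch reliably signals that something went wrong.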

JST (Japan Science and Technology Agency)
NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2016-06-11
