We will evaluate Japanese <--> Russian translation as an example of an extremely low-resource, non-English-centric language pair. WAT provides manually aligned, cleaned and filtered training, development and test corpora for Japanese <--> Russian, Japanese <--> English and English <--> Russian, intended for studying extremely low-resource settings for distant language pairs.
The training data in this repository is from the Global Voices domain, whereas the development and test sets are from the News Commentary domain. We encourage the use of large out-of-domain corpora from KFTT, JESC, TED, ASPEC, UN and Yandex. We also allow the use of the parallel and monolingual corpora for Japanese-English and Russian-English from WMT 2020, except for the News Commentary corpora. We provide a filtered version of the Russian-English News Commentary corpus (v15) here, from which the development and test set sentences have been removed. We encourage participants to explore multilingual and domain-adaptation-based solutions that also incorporate monolingual pre-training and back-translation. Please contact the organizers before using any corpora other than those listed above, or if you notice anything strange in the data.
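Participants who want to check their own additional data for overlap with the development and test sets could do something along the following lines. This is only a minimal sketch, not the organizers' actual filtering procedure: the function names `normalize` and `filter_overlap`, the exact-match criterion after lowercasing and whitespace normalization, and the toy sentences are all assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(sentence.lower().split())


def filter_overlap(
    pairs: Iterable[Tuple[str, str]],
    held_out: Iterable[str],
) -> List[Tuple[str, str]]:
    """Drop every sentence pair whose source or target side appears
    (after normalization) among the held-out dev/test sentences."""
    blocked: Set[str] = {normalize(s) for s in held_out}
    return [
        (src, tgt)
        for src, tgt in pairs
        if normalize(src) not in blocked and normalize(tgt) not in blocked
    ]


# Toy data (hypothetical, not taken from the task corpora).
corpus = [
    ("Hello world.", "Привет, мир."),
    ("Good morning.", "Доброе утро."),
    ("See you  soon.", "До скорой встречи."),
]
held_out = ["hello world.", "До скорой встречи."]

filtered = filter_overlap(corpus, held_out)
print(filtered)
```

Exact matching after normalization is the simplest possible criterion; near-duplicate detection would need a fuzzier comparison and is left out of this sketch.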
For questions, comments, etc., please email "prajdabre -at- gmail -dot- com" and cc "wat-organizer -at- googlegroups -dot- com".