NICT-SAP's Unstructured Document Translation Task (IT and Wikinews domains)
In collaboration with SAP and NICT, WAT will evaluate Hindi/Thai/Malay/Indonesian/Vietnamese <--> English translation for two domains: the IT domain (software documentation) and the Wikinews domain (ALT). The purpose is to determine the feasibility of multilingualism, domain adaptation or document-level knowledge when little to no clean parallel data is available for training. This task is the same as in 2020 and 2021, except for the addition of Vietnamese this year.
The IT and Wikinews domains are extremely low-resource for machine translation, especially for languages such as Hindi, Thai, Malay, Indonesian and Vietnamese. For Wikinews there are clean but extremely small parallel corpora (approx. 18,000 lines), and for the IT domain there are no clean corpora at all. In low-resource settings, it is often helpful to leverage monolingual or bilingual corpora from multiple languages and domains to boost translation quality. Additionally, since the evaluation sets for both domains contain document-level splits (as metadata), it should be possible to leverage extended context to improve translation quality. Thus, the purpose of this task is to determine how far translation quality can be pushed in such a setting via a combination of multilingualism, domain adaptation and document-level knowledge. The specific details of this task are:
For the IT domain evaluation, we use only the evaluation data created by SAP. Metadata giving the document IDs is also provided (check the repo).
We provide the Wikinews domain development and test sets for evaluation, along with training data from the ALT corpus, here. We also provide document IDs as metadata (see README).
We encourage the use of ANY monolingual and parallel corpora available through WAT (any year), WMT (any year) or OPUS, for training.
Some specific parallel corpora that we recommend are the Samanantar corpus v0.3 (which contains Hindi--English sentences) and OPUS corpora such as (but not limited to) GNOME, KDE4 and Ubuntu (for the IT domain).
Before using any corpora other than those listed above (i.e., corpora not from WAT, WMT or OPUS), kindly ask the organizers.
About Thai evaluation: We do not provide segmented Thai data, and we encourage the use of any segmentation tool of your choice. When submitting Thai translations, kindly submit unsegmented sentences (no spaces). For this language only, we will evaluate translations using character-level BLEU (and possibly chrF).
Generally speaking, participants are welcome to submit translations for any language pair in any domain.
However, a solution that exploits one or a combination of multilingualism, domain adaptation and document-level knowledge for all language pairs (to and from English) is strongly encouraged, given its potential to boost translation quality efficiently and significantly.
- In particular, we encourage submissions that use document-level knowledge.
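To illustrate the Thai evaluation described above, the following is a minimal sketch of character-level BLEU in Python. It is not the organizers' scoring script. The function names (`to_chars`, `char_bleu`), the single-reference setup, the absence of smoothing, and the treatment of each Unicode code point as one character are all assumptions made for illustration, so scores from the official evaluation may differ.

```python
import math
from collections import Counter


def to_chars(sentence):
    """Drop all whitespace and split the sentence into characters.

    Assumption: one Unicode code point counts as one character, so Thai
    vowel signs and tone marks are counted separately.
    """
    return [ch for ch in sentence if not ch.isspace()]


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU over characters, one reference per hypothesis.

    No smoothing is applied: if any n-gram order has zero matches,
    the score is 0.0.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = to_chars(hyp), to_chars(ref)
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clipped matches: each hypothesis n-gram is credited at most
            # as often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty in log space: 0 when the hypothesis is at least as
    # long as the reference, negative otherwise.
    log_brevity = min(0.0, 1.0 - ref_len / hyp_len)
    return 100.0 * math.exp(log_brevity + log_precision)


if __name__ == "__main__":
    # A segmented hypothesis and an unsegmented reference compare equal,
    # because whitespace is removed before scoring.
    print(char_bleu(["สวัสดี ครับ"], ["สวัสดีครับ"]))
```

Because whitespace is stripped before n-grams are counted, the choice of Thai segmentation tool during training does not affect this score. Submissions should still be unsegmented, as requested above.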
For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com". For questions related to this task, contact "prajdabre -at- gmail -dot- com".
NICT (National Institute of Information and Communications Technology)
Last Modified: 2021-01-04