NICT-SAP's Unstructured Document Translation Task (IT and Wikinews domains)



In collaboration with SAP and NICT, WAT will evaluate Hindi/Thai/Malay/Indonesian/Vietnamese <--> English translation for two domains: the IT domain (Software Documentation) and the Wikinews domain (ALT). The purpose is to determine the feasibility of multilingualism, domain adaptation, or document-level knowledge given little to no clean parallel corpora for training. This task is the same as in 2020 and 2021, except for the addition of Vietnamese this year.


The IT and Wikinews domains are extremely low-resource for machine translation, especially for languages such as Hindi, Thai, Malay, Indonesian and Vietnamese. For Wikinews there are clean but extremely small parallel corpora (approx. 18,000 lines); for the IT domain there are no clean corpora at all. In low-resource settings, it is often helpful to leverage monolingual or bilingual corpora from multiple languages and domains to boost translation quality. Additionally, because the evaluation sets for both domains contain document-level splits (as metadata), it should be possible to leverage extended context to improve translation quality. Thus, the purpose of this task is to determine how far translation quality can be pushed in such a setting via a combination of multilingualism, domain adaptation, or document-level knowledge. The specific details of this task are:


Submission Details



For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com". For questions related to this task, contact "prajdabre -at- gmail -dot- com".


NICT (National Institute of Information and Communications Technology)
Last Modified: 2021-01-04