NICT-SAP's Structured Document Translation Task



Recently, the quality of sentence-level machine translation has begun to saturate for a number of language pairs. Translation systems are often used to translate web pages, which contain a variety of structured information. Structured pages and documents contain sentences annotated with rich meta-information. For example, "This is a <b>sentence</b>." is a sentence from a structured document. Its Spanish translation should be "Esta es una <b>frase</b>.", where the <b> tag encloses the translation of the word "sentence". Structured document translation is challenging because the translation system has to align the content enclosed in tags across languages. To facilitate research on the translation of structured data, WAT, in collaboration with SAP and NICT, will evaluate Japanese/Chinese (T/S)/Korean <--> English structured document translation for the IT domain (software documentation), where T = Traditional and S = Simplified. Additionally, we will evaluate 6 non-English-centric directions between Japanese, Chinese (S) and Korean. (Chinese (T) is not included in the non-English-centric evaluation.)
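As a rough illustration of what "appropriately encloses" implies, the minimal sketch below checks whether a translation carries the same set of tags as its source. This is only a necessary condition (it does not verify that each tag wraps the right words), it is not the task's evaluation procedure, and the function names `tag_counts` and `same_tags` are hypothetical.

```python
import re
from collections import Counter

# Matches simple opening and closing tags such as <b> or </b>.
TAG = re.compile(r"</?[A-Za-z][^<>]*>")

def tag_counts(text):
    """Count every tag occurring in a markup sentence."""
    return Counter(TAG.findall(text))

def same_tags(source, translation):
    """True if both sentences contain the same tags the same number of times."""
    return tag_counts(source) == tag_counts(translation)

src = "This is a <b>sentence</b>."
print(same_tags(src, "Esta es una <b>frase</b>."))  # tags preserved
print(same_tags(src, "Esta es una frase."))         # tags lost
```

Tags are compared as a multiset rather than as a sequence because word order, and therefore tag order, can legitimately differ between languages.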


The purpose of this task is to identify solutions for structured document translation in the software documentation domain. Since there is no well-established source of training data containing structured information, participants will have to rely on parallel data containing plain (non-structured) sentences, together with word aligners. Additionally, leveraging document context may lead to better performance. This year we also encourage participants to leverage large language models (LLMs) such as BLOOM or XGLM for this task.
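One way to combine plain-sentence translation with a word aligner is to strip the tags, translate the plain text, and then project each tag onto the target words aligned to the source words it enclosed. The sketch below shows only the strip and project steps; it is not a prescribed method for the task. The helper names are hypothetical, the input is assumed to be pre-tokenized with well-formed, attribute-free tags, and the alignment is written by hand here, whereas in practice it would come from a word aligner.

```python
import re

TAG = re.compile(r"</?[^<>]+>")

def strip_markup(tokens):
    """Return (plain_tokens, spans); each span is (tag_name, start, end),
    with inclusive indices into plain_tokens."""
    plain, spans, stack = [], [], []
    for tok in tokens:
        if not TAG.fullmatch(tok):
            plain.append(tok)
        elif tok.startswith("</"):
            name, start = stack.pop()          # assumes well-formed input
            spans.append((name, start, len(plain) - 1))
        else:
            stack.append((tok[1:-1], len(plain)))
    return plain, spans

def project_tags(spans, alignment, target_tokens):
    """Wrap each tag around the target tokens aligned to its source span."""
    opens, closes = {}, {}
    for name, start, end in spans:             # inner spans come first
        tgt = [t for s, t in alignment if start <= s <= end]
        if not tgt:
            continue                           # unaligned span: tag is dropped
        opens.setdefault(min(tgt), []).insert(0, f"<{name}>")
        closes.setdefault(max(tgt), []).append(f"</{name}>")
    out = []
    for i, tok in enumerate(target_tokens):
        out += opens.get(i, []) + [tok] + closes.get(i, [])
    return out

source = ["This", "is", "a", "<b>", "sentence", "</b>", "."]
plain, spans = strip_markup(source)
target = ["Esta", "es", "una", "frase", "."]   # translation of the plain text
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]  # hand-written (source, target) pairs
print(" ".join(project_tags(spans, alignment, target)))
```

The sketch wraps the smallest contiguous target range covering the aligned words, so noisy alignments or heavy reordering can produce spans that are too wide or that cross each other; a practical system would need to handle those cases.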


Submission Details



For general questions and comments, please email "wat-organizer -at- googlegroups -dot- com". For questions related to this task, contact "prajdabre -at- gmail -dot- com".


NICT (National Institute of Information and Communications Technology)
Last Modified: 2021-01-04