NICT-SAP's Structured Document Translation Task



Recently, the quality of sentence-level machine translation has begun to saturate for a number of language pairs. Translation systems are often used to translate web pages, which contain a variety of structured information. Structured pages and documents contain sentences annotated with rich meta-information. For example, "This is a <b>sentence</b>." is a sentence from a structured document. Its Spanish translation should be "Esta es una <b>frase</b>.", where the <b> tag encloses the translation of the word "sentence". Structured document translation is challenging because the translation system has to align the content enclosed in tags across languages. To facilitate research on the translation of structured data, WAT, in collaboration with SAP and NICT, will evaluate Japanese/Chinese (T/S)/Korean <--> English structured document translation for the IT domain (software documentation), where T = Traditional and S = Simplified.


The purpose of this task is to identify solutions for structured document translation in the software documentation domain. Since there is no well-established source of training data containing structured information, participants will have to rely on parallel data containing plain (unstructured) sentences, together with word aligners. Additionally, leveraging document context may lead to better performance.
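
One way to combine plain parallel data with a word aligner, sketched here under simplifying assumptions, is to strip the tags from the source sentence, translate the plain text, and then project each tag span onto the target tokens through the word alignment. In the sketch below, the function names (strip_tags, project_tags, detokenize), the regex tokenizer, and the hand-written alignment are all illustrative assumptions and not part of the task definition; a real system would take the translation from an MT model and the alignment from a word aligner.

```python
import re

# Matches an opening or closing tag such as <b>, </b>, or <a href="x">.
TAG_RE = re.compile(r"(</?[A-Za-z][^>]*>)")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def strip_tags(text):
    """Split tagged text into plain tokens and (tag_name, start, end) token spans.

    Assumes well-formed, paired tags only (no self-closing tags).
    """
    tokens, spans, stack = [], [], []
    for piece in TAG_RE.split(text):
        if not piece:
            continue
        if TAG_RE.fullmatch(piece):
            if piece.startswith("</"):
                name, start = stack.pop()
                spans.append((name, start, len(tokens)))
            else:
                name = piece.strip("</>").split()[0]
                stack.append((name, len(tokens)))
        else:
            tokens.extend(TOKEN_RE.findall(piece))
    return tokens, spans


def project_tags(src_spans, alignment, tgt_tokens):
    """Wrap the target tokens aligned to each source span in that span's tag.

    alignment is a list of (source_index, target_index) pairs.
    Spans with no aligned target token are dropped.
    """
    opens, closes = {}, {}
    for name, start, end in src_spans:
        tgt = [t for s, t in alignment if start <= s < end]
        if not tgt:
            continue
        # Inner spans come first in src_spans, so outer opening tags are
        # placed before inner ones and inner closing tags before outer ones.
        opens.setdefault(min(tgt), []).insert(0, f"<{name}>")
        closes.setdefault(max(tgt), []).append(f"</{name}>")
    return [
        "".join(opens.get(i, [])) + tok + "".join(closes.get(i, []))
        for i, tok in enumerate(tgt_tokens)
    ]


def detokenize(tokens):
    """Join tokens with spaces and reattach common punctuation marks."""
    return re.sub(r"\s+([.,!?;:])", r"\1", " ".join(tokens))


src_tokens, src_spans = strip_tags("This is a <b>sentence</b>.")
tgt_tokens = TOKEN_RE.findall("Esta es una frase.")  # stand-in for MT output
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]  # stand-in for aligner output
print(detokenize(project_tags(src_spans, alignment, tgt_tokens)))
# prints: Esta es una <b>frase</b>.
```

The projection wraps everything between the first and last aligned target token, so a non-contiguous alignment can pull unrelated words inside the tag, and spans that reorder differently in the target can produce crossing tags. Handling those cases is part of what makes the task difficult.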


Submission Details



For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com". For questions related to this task, contact "prajdabre -at- gmail -dot- com".


NICT (National Institute of Information and Communications Technology)
Last Modified: 2021-01-04