NICT-SAP's Structured Document Translation Task
Recently, the quality of sentence-level machine translation has begun to saturate for a number of language pairs. Translation systems are often used to translate web pages, which contain a variety of structured information. Structured pages and documents contain sentences annotated with rich meta information. For example, "This is a <b>sentence</b>." is a sentence from a structured document. Its Spanish translation should be "Esta es una <b>frase</b>.", where the <b> tag encloses the translation of the word "sentence". Structured document translation is challenging because the translation system has to align the content enclosed in tags across languages. To facilitate research on the translation of structured data, WAT, in collaboration with SAP and NICT, will evaluate Japanese/Chinese(T/S)/Korean <--> English structured document translation for the IT domain (Software Documentation). (T=Traditional, S=Simplified)
The purpose of this task is to identify solutions for structured document translation in the software documentation domain. Since there is no well-established source of training data containing structured information, participants will have to rely on parallel data containing plain (non-structured) sentences, together with word aligners. Additionally, leveraging document context may lead to better performance.
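One way to combine plain parallel data with a word aligner is to strip the tags from the source, translate the plain sentence, and then project each tag pair onto the target tokens aligned to the words it enclosed. The following is a minimal sketch of that idea, not a prescribed method for the task. The function names, the whitespace tokenization, and the hand-written translation and alignment are all assumptions made for illustration; a real system would take both from an MT model and a word aligner.

```python
import re

def tokenize(text):
    """Split a sentence into tag tokens and whitespace-delimited word tokens."""
    return re.findall(r"</?[^>]+>|[^\s<>]+", text)

def strip_tags(tokens):
    """Return plain tokens and (open_tag, close_tag, start, end) spans over them.

    Assumes well-formed, properly paired tags (no self-closing tags).
    """
    plain, spans, stack = [], [], []
    for tok in tokens:
        if tok.startswith("</"):
            open_tag, start = stack.pop()
            spans.append((open_tag, tok, start, len(plain)))
        elif tok.startswith("<"):
            stack.append((tok, len(plain)))
        else:
            plain.append(tok)
    return plain, spans

def project_tags(target_tokens, spans, alignment):
    """Re-insert each tag pair around the target tokens aligned to its span.

    alignment maps a source token index to a list of target token indices.
    """
    opens, closes = {}, {}
    # Process outer spans first so nested tags open and close in order.
    for open_tag, close_tag, start, end in sorted(spans, key=lambda s: (s[2], -s[3])):
        aligned = sorted(j for i in range(start, end) for j in alignment.get(i, []))
        if not aligned:
            continue  # nothing aligned: this sketch simply drops the tag
        opens.setdefault(aligned[0], []).append(open_tag)
        closes.setdefault(aligned[-1], []).insert(0, close_tag)
    out = []
    for j, tok in enumerate(target_tokens):
        out.extend(opens.get(j, []))
        out.append(tok)
        out.extend(closes.get(j, []))
    return out

plain, spans = strip_tags(tokenize("This is a <b>sentence</b>."))
# Hypothetical MT output and word alignment for the plain sentence.
target = ["Esta", "es", "una", "frase", "."]
alignment = {0: [0], 1: [1], 2: [2], 3: [3], 4: [4]}
print(" ".join(project_tags(target, spans, alignment)))
# Esta es una <b> frase </b> .
```

The sketch wraps the smallest contiguous target range that covers all aligned tokens, so reordering between languages can produce crossing tags. Handling that case, and detokenizing the output, is left out here.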
For the software documentation domain evaluation, we only use the evaluation data created by SAP.
- Plain sentences, their structured counterparts, and document-level metadata are provided.
- Basic details about the file formats. (Check repo for more information)
- This is an example of a bilingual structured document.
- This is the extracted set of English sentences (with XML annotated content) which should be translated.
- Participants should ensure that the Japanese translations of these English sentences look like this.
- This is the plain text version of the English sentences without any annotation.
- Test sets will be released 1-2 weeks before the submission deadlines.
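The relationship between the XML-annotated sentences and their plain text versions can be illustrated with a short sketch. It assumes the annotations are simple inline tags, and the helper name `to_plain` is hypothetical. The plain files released with the task remain the authoritative version and may have been produced differently.

```python
import re

TAG = re.compile(r"</?[^>]+>")

def to_plain(annotated):
    """Drop inline tags and collapse any whitespace left behind."""
    return re.sub(r"\s+", " ", TAG.sub("", annotated)).strip()

print(to_plain("This is a <b>sentence</b>."))  # This is a sentence.
```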
We encourage the use of ANY parallel corpora available through WAT (any year), WMT (any year) or OPUS, for training.
For software documentation, corpora from OPUS such as (but not limited to) gnome, kde4 and ubuntu should be most relevant.
The structured document parallel data from Salesforce should be helpful for Chinese and Japanese if you plan to train models directly on structured parallel data.
Before using any corpora other than those listed above (that is, corpora not from WAT, WMT or OPUS), please ask the organizers.
Generally speaking participants are welcome to submit translations for any language pair.
- Any and all monolingual and parallel corpora may be used, keeping the following in mind:
- Monolingual data taken from common crawl or any other source might contain dumps from SAP websites which are the source of the evaluation data.
- In principle we discourage training on data explicitly taken from SAP websites.
- Note that the Chinese data is available in both Traditional and Simplified form. Although the content is the same, we encourage participants to submit results for both.
- Evaluation: We will use metrics from this paper:
- We will use the XML BLEU metric, which computes BLEU while keeping the content enclosed in tags separate from the content not enclosed in tags.
- Additionally, we will use plain sentence BLEU, XML structure accuracy and XML matching accuracy.
- Participants may use the scripts in this repo to evaluate their results locally.
- Use the script evaluate.py to get your scores; XML BLEU is the main metric.
- Participants should submit one translation per line. A translation may contain structured content, depending on the line being translated.
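Before running the official scripts, it can be useful to check a submission file for lines whose tags do not line up with the reference. The sketch below is a rough local sanity check, not the official XML structure accuracy or XML BLEU. Those are defined in the paper and implemented in evaluate.py. All function names and the sample sentences are hypothetical.

```python
import re

TAG = re.compile(r"</?[^>]+>")

def tag_sequence(line):
    """Return the tags of a line in order of appearance."""
    return TAG.findall(line)

def text_segments(line):
    """Return the non-empty text pieces separated by tags."""
    return [seg.strip() for seg in TAG.split(line) if seg.strip()]

def structure_match_rate(hypotheses, references):
    """Fraction of lines whose tag sequence equals the reference's."""
    matches = sum(
        tag_sequence(h) == tag_sequence(r)
        for h, r in zip(hypotheses, references)
    )
    return matches / len(references)

# Hypothetical reference and system output, one translation per line.
references = ["Esta es una <b>frase</b>.", "Otra <i>frase</i>."]
hypotheses = ["Esta es una <b>frase</b>.", "Otra frase."]

print(text_segments(references[0]))  # ['Esta es una', 'frase', '.']
print(structure_match_rate(hypotheses, references))  # 0.5
```

Comparing tag sequences only shows whether the same tags appear in the same order. It says nothing about whether each tag encloses the right words, which is what a segment-aware metric such as XML BLEU is meant to capture.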
For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com". For questions related to this task, contact "prajdabre -at- gmail -dot- com".
NICT (National Institute of Information and Communications Technology)
Last Modified: 2021-01-04