Many ready-to-use parallel corpora exist for training machine translation systems, but most of them consist of written language: web crawl, news commentary, patents, scientific papers and so on. The few parallel corpora that do cover spoken language are mostly monologues by a single speaker (TED talks) or contain a lot of noise (OpenSubtitles). Most other MT evaluation campaigns adopt written-language, monologue or noisy dialogue parallel corpora for their translation tasks. The traditional ASPEC translation tasks are sentence-level, and translation quality on them seems to have saturated. We think it is high time to move on to document-level evaluation. For this first year, WAT uses the BSD (Business Scene Dialogue) corpus as the dataset, including training, development and test data. Participants in this task must obtain a copy of the BSD corpus themselves.
Participants in this task need to translate all the sentences in the test.json file and submit the translations. For English-to-Japanese translation, every "en_sentence" value must be translated into Japanese; for Japanese-to-English translation, the reverse applies.
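As a rough illustration of pulling the source sentences out of test.json, the following Python sketch may help. It rests on assumptions that this page does not confirm: that the file is a JSON list of scenarios, that each scenario holds its turns under a "conversation" key, and that the Japanese field is named "ja_sentence". Only "en_sentence" is named in the task description, so check the layout against your own copy of the corpus. The function names are hypothetical.

```python
import json

# Assumed layout (not confirmed by the task description):
#   [ {"conversation": [ {"en_sentence": "...", "ja_sentence": "..."}, ... ]}, ... ]
# Only "en_sentence" is named on this page; the other keys are assumptions.
SOURCE_FIELD = {"en-ja": "en_sentence", "ja-en": "ja_sentence"}


def load_scenarios(path="test.json"):
    """Read the test file as UTF-8 JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def collect_source_sentences(scenarios, direction):
    """Return the source-side sentences in file order, scenario by scenario."""
    field = SOURCE_FIELD[direction]
    return [
        turn[field]
        for scenario in scenarios
        for turn in scenario["conversation"]
    ]
```

Keeping the sentences in file order and grouped by scenario preserves the dialogue context, which matters if the translation system works at the document level rather than sentence by sentence.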
All the translated sentences must be contained in one text file with the following conditions:
Please note that you need to register for WAT2020 before submitting your translation results.
Automatic evaluation will be conducted by the automatic evaluation server. Sampled scenarios of the test data will be human-evaluated. The evaluation criteria will be announced later.
For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com". For questions related to this task, contact "nakazawa -at- logos -dot- t -dot- u-tokyo -dot- ac -dot- jp".
Last Modified: 2020-08-12