WAT 2021 Newswire Tasks Description
1. Subtasks
1.1 There are two language directions:
• Japanese to English
• English to Japanese
1.2 There are two types of test sets:
• Test set I: Pairs of test and reference sentences. The references were automatically extracted from English newswire sentences and were not manually checked. No context data is provided.
• Test set II: A test set introduced in WAT 2020. It consists of pairs of test and reference sentences, plus context data, namely the articles that contain the test sentences. The references were automatically extracted from English newswire sentences and then manually selected, so the references of test set II are of higher quality than those of test set I.
Participants submit translation results for one or more of the test sets.
2. Official Data
2.1 Files
Filename | Contents
train.txt | 0.2 million sentence pairs
devtest.txt | 2,000 sentence pairs
dev.txt | 2,000 sentence pairs
test.txt | 2,000 sentence pairs
devc.tsv | 479 sentence pairs
testc.tsv | 1,851 sentence pairs
context-devc.ja.tsv | 132 articles
context-devc.en.tsv | 132 articles
context-testc.ja.tsv | 546 articles
context-testc.en.tsv | 546 articles
3. Definition of Data Use
Direction | Test set | Use | Contents
Japanese to English | - | Training | train.txt, devtest.txt, dev.txt, devc.tsv, context-devc.ja.tsv, and context-devc.en.tsv
Japanese to English | Test set I | To be translated | Japanese sentences in test.txt
Japanese to English | Test set I | Reference | English sentences in test.txt
Japanese to English | Test set II | To be translated | Japanese sentences in testc.tsv
Japanese to English | Test set II | Context | context-testc.ja.tsv
Japanese to English | Test set II | Reference | English sentences in testc.tsv
English to Japanese | - | Training | train.txt, devtest.txt, dev.txt, devc.tsv, context-devc.ja.tsv, and context-devc.en.tsv
English to Japanese | Test set I | To be translated | English sentences in test.txt
English to Japanese | Test set I | Reference | Japanese sentences in test.txt
English to Japanese | Test set II | To be translated | English sentences in testc.tsv
English to Japanese | Test set II | Context | context-testc.en.tsv
English to Japanese | Test set II | Reference | Japanese sentences in testc.tsv
4. Data Format
Files with the extension ".txt" are plain-text files in which each line has the format "<Japanese text> ||| <English text>".
Files with the extension ".tsv" contain tab-separated values. The columns of {devc,testc}.tsv are Sentence ID, Japanese text, and the corresponding English text. The columns of context-{devc,testc}.{en,ja}.tsv are Article ID, Sentence ID, Japanese text, and the corresponding English text.
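The two file formats described above could be read along the following lines. This is a minimal sketch, not an official loader: the function and class names (read_txt, read_sentence_tsv, read_context_tsv, SentenceRow, ContextRow) are hypothetical, and it assumes UTF-8 text, no header row, and no quoting inside the TSV files, none of which is stated in this description.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO, Tuple

TXT_SEPARATOR = " ||| "  # separates Japanese and English text in .txt files


class SentenceRow(NamedTuple):
    """One row of {devc,testc}.tsv."""
    sentence_id: str
    ja: str
    en: str


class ContextRow(NamedTuple):
    """One row of context-{devc,testc}.{en,ja}.tsv."""
    article_id: str
    sentence_id: str
    ja: str
    en: str


def read_txt(stream: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (Japanese, English) pairs from a .txt file."""
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines carry no data
        ja, sep, en = line.partition(TXT_SEPARATOR)
        if not sep:
            raise ValueError(f"line {line_no}: missing {TXT_SEPARATOR!r}")
        yield ja, en


def read_sentence_tsv(stream: TextIO) -> Iterator[SentenceRow]:
    """Yield rows of a three-column sentence file such as devc.tsv."""
    # QUOTE_NONE treats quote characters as ordinary text (an assumption).
    for row in csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row:
            yield SentenceRow(*row)  # raises TypeError on a wrong column count


def read_context_tsv(stream: TextIO) -> Iterator[ContextRow]:
    """Yield rows of a four-column context file such as context-devc.ja.tsv."""
    for row in csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row:
            yield ContextRow(*row)


if __name__ == "__main__":
    sample = io.StringIO("日本語の文 ||| An English sentence\n")
    for ja, en in read_txt(sample):
        print(ja, "->", en)
```

With real files, the streams would be opened with something like open("train.txt", encoding="utf-8"); the encoding is an assumption and should be checked against the distributed data.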
One line may contain more than one sentence; in that case, the sentences are separated by " || ".
Some personal names were anonymized by replacing them with 〇〇〇〇.
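The sentence separator and the anonymization placeholder described above can be handled with a few string operations. The following is a rough sketch under stated assumptions: the helper names (split_sentences, align_segments, has_anonymized_name) are hypothetical, and pairing sentences one-to-one when both sides contain the same number of segments is a guess about the data, not something this description guarantees.

```python
from typing import List, Tuple

SENTENCE_SEPARATOR = " || "   # separates sentences within one side of a line
ANONYMIZED_NAME = "〇〇〇〇"    # placeholder used for anonymized personal names


def split_sentences(text: str) -> List[str]:
    """Split one side of a line into its individual sentences."""
    return [part.strip() for part in text.split(SENTENCE_SEPARATOR)]


def align_segments(ja: str, en: str) -> List[Tuple[str, str]]:
    """Pair up sentences when both sides have the same number of segments.

    Assumption: equal segment counts imply a one-to-one correspondence.
    Otherwise the line is kept as a single unsplit pair.
    """
    ja_parts = split_sentences(ja)
    en_parts = split_sentences(en)
    if len(ja_parts) == len(en_parts):
        return list(zip(ja_parts, en_parts))
    return [(ja, en)]


def has_anonymized_name(text: str) -> bool:
    """Report whether the text contains the anonymization placeholder."""
    return ANONYMIZED_NAME in text
```

Because " ||| " does not contain " || " as a substring, a ".txt" line can be split on " ||| " first and each side then split on " || " without the two separators interfering.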
5. Policy for External Resource Usage and Declaration
Participants may use external data in addition to the official data for their MT systems, provided that they declare the resources used for each submission.
When participants use external resources to train their MT systems, they may not use JIJI news articles published between January 2018 and March 2018, because test set II was selected from JIJI news articles published in this period.
When using JIJI news articles published before June 2017, participants must also exclude any sentences that are included in test set I, because test set I was selected from JIJI news articles published before June 2017 (the specific period is unknown).
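The two restrictions above could be applied to external JIJI data with a filter such as the one below. This is only an illustrative sketch: the function name may_use is hypothetical, and the code assumes that the restricted period covers the whole months from 1 January 2018 to 31 March 2018, that "before June, 2017" means before 1 June 2017, and that test set I sentences can be detected by exact string match. None of these details is specified in this description.

```python
from datetime import date
from typing import Set

# Assumed boundaries; the policy only names the months.
TEST_SET_II_START = date(2018, 1, 1)
TEST_SET_II_END = date(2018, 3, 31)
TEST_SET_I_CUTOFF = date(2017, 6, 1)


def may_use(published: date, sentence: str, test_set_i: Set[str]) -> bool:
    """Return True if an external JIJI sentence may be used for training."""
    # Rule 1: nothing published in the period test set II was drawn from.
    if TEST_SET_II_START <= published <= TEST_SET_II_END:
        return False
    # Rule 2: in articles before June 2017, drop sentences found in test set I.
    # Assumption: exact string match is enough to detect such sentences.
    if published < TEST_SET_I_CUTOFF and sentence in test_set_i:
        return False
    return True
```

A stricter variant could drop test set I sentences regardless of publication date, or use normalized matching instead of exact matching; which one is appropriate depends on how the external data was collected.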
CHANGE LOG
2021-1-24: the title was updated
2020-4-11: initial version