WAT 2021 Newswire Tasks Description

 

1.    Subtasks

1.1  There are two language directions:

•  Japanese to English

•  English to Japanese

1.2  There are two types of test sets:

•  Test set I: Pairs of test and reference sentences. The references were automatically extracted from English newswire sentences and were not manually checked. No context data is provided.

•  Test set II: A test set introduced in WAT 2020. It consists of pairs of test and reference sentences plus context data, which are the articles that contain the test sentences. The references were automatically extracted from English newswire sentences and then manually selected, so the references of test set II are of higher quality than those of test set I.

Participants submit translation results for one or more of the test sets.

 

2.    Official Data

2.1  Files

Filename                  Contents

train.txt                 0.2 million sentence pairs
devtest.txt               2,000 sentence pairs
dev.txt                   2,000 sentence pairs
test.txt                  2,000 sentence pairs
devc.tsv                  479 sentence pairs
testc.tsv                 1,851 sentence pairs
context-devc.ja.tsv       132 articles
context-devc.en.tsv       132 articles
context-testc.ja.tsv      546 articles
context-testc.en.tsv      546 articles

 

3.    Definition of Data Use 

 

Use                       Contents

Japanese to English
  Training                train.txt, devtest.txt, dev.txt, devc.tsv, context-devc.ja.tsv, and context-devc.en.tsv
  Test set I
    To be translated      Japanese sentences in test.txt
    Reference             English sentences in test.txt
  Test set II
    To be translated      Japanese sentences in testc.tsv
    Context               context-testc.ja.tsv
    Reference             English sentences in testc.tsv

English to Japanese
  Training                train.txt, devtest.txt, dev.txt, devc.tsv, context-devc.ja.tsv, and context-devc.en.tsv
  Test set I
    To be translated      English sentences in test.txt
    Reference             Japanese sentences in test.txt
  Test set II
    To be translated      English sentences in testc.tsv
    Context               context-testc.en.tsv
    Reference             Japanese sentences in testc.tsv

 

4.    Data Format

Files with the extension ".txt" are text files in which each line has the format "<Japanese text> ||| <English text>".
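As an illustration of the ".txt" format, the following is a minimal Python sketch that splits one line into its Japanese and English parts. The function name parse_txt_line is hypothetical and is not part of the official data or tools; the sketch assumes the delimiter " ||| " appears exactly as written above, with one space on each side.

```python
from typing import Tuple


def parse_txt_line(line: str) -> Tuple[str, str]:
    """Split one '<Japanese text> ||| <English text>' line into (ja, en).

    Hypothetical helper for illustration only.
    """
    # partition() splits at the first occurrence of the delimiter.
    ja, sep, en = line.rstrip("\n").partition(" ||| ")
    if not sep:
        raise ValueError("missing ' ||| ' delimiter: {!r}".format(line))
    return ja, en
```

For example, parse_txt_line("こんにちは ||| Hello\n") yields the pair ("こんにちは", "Hello").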

Files with the extension ".tsv" are tab-separated values files. The columns of {devc,testc}.tsv are Sentence ID, Japanese text, and the corresponding English text. The columns of context-{devc,testc}.{en,ja}.tsv are Article ID, Sentence ID, Japanese text, and the corresponding English text.
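The ".tsv" column layouts can be read with Python's standard csv module, as in the sketch below. The column keys (sentence_id, article_id, ja, en) and the function name read_tsv are hypothetical names chosen for this example. The sketch also assumes that the files have no header row and that quote characters in the text are literal, neither of which is stated in this description.

```python
import csv
import io
from typing import Dict, Iterable, List

# Hypothetical column keys, following the column order described above.
SENTENCE_COLUMNS = ["sentence_id", "ja", "en"]                # {devc,testc}.tsv
CONTEXT_COLUMNS = ["article_id", "sentence_id", "ja", "en"]   # context-*.tsv


def read_tsv(lines: Iterable[str], columns: List[str]) -> List[Dict[str, str]]:
    """Read tab-separated lines into a list of dicts keyed by `columns`."""
    # QUOTE_NONE treats quotation marks in the text as ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) != len(columns):
            raise ValueError("expected {} columns, got {}".format(len(columns), len(fields)))
        rows.append(dict(zip(columns, fields)))
    return rows


# Usage with an in-memory sample; a real file would be opened with
# open(path, encoding="utf-8", newline="") instead.
sample = io.StringIO("s1\t日本語の文\tA Japanese sentence\n")
sample_rows = read_tsv(sample, SENTENCE_COLUMNS)
```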

One line may contain more than one sentence. In that case, the sentences are separated by " || ".
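A text field that holds several sentences can be split on the " || " separator, as in this minimal Python sketch. The function name split_sentences is hypothetical. Note that " || " does not match inside the " ||| " delimiter of the ".txt" files, because the second bar there is followed by another bar rather than a space.

```python
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split a text field into sentences on the ' || ' separator.

    Hypothetical helper for illustration only.
    """
    return [sentence.strip() for sentence in text.split(" || ")]
```

A field with a single sentence is returned as a one-element list, so the same call works for every line.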

Some personal names were anonymized by replacing them with 〇〇〇〇.

 

5.    Policy for External Resource Usage and Declaration

Participants may use external data in addition to the official data for their MT systems, provided that they declare the resources used for each submission.

When participants use external resources to train their MT systems, they may not use JIJI news articles published between January 2018 and March 2018, because test set II was selected from JIJI news articles published in this period.

When participants use JIJI news articles published before June 2017, they also may not use sentences that are included in test set I, because test set I was selected from JIJI news articles published before June 2017 (the specific period is unknown).
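One possible way to apply these two restrictions to external JIJI data is sketched below in Python. Everything in it is an assumption made for illustration: the record layout (publication date, sentence), the function name filter_external, and the reading of "January 2018 to March 2018" as the inclusive range 2018-01-01 to 2018-03-31. The sketch drops test set I sentences regardless of publication date, which is stricter than the rule requires.

```python
from datetime import date
from typing import Iterable, List, Set, Tuple

# Assumed inclusive bounds for the period test set II was drawn from.
EXCLUDED_START = date(2018, 1, 1)
EXCLUDED_END = date(2018, 3, 31)


def filter_external(
    records: Iterable[Tuple[date, str]],
    test_set_i_sentences: Set[str],
) -> List[Tuple[date, str]]:
    """Keep only (published, sentence) records that pass both restrictions."""
    kept = []
    for published, sentence in records:
        if EXCLUDED_START <= published <= EXCLUDED_END:
            continue  # published in the test set II source period
        if sentence in test_set_i_sentences:
            continue  # sentence is included in test set I
        kept.append((published, sentence))
    return kept
```

Exact string matching is only a starting point; sentences that differ from test set I in whitespace or normalization would need additional checks.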

 

CHANGE LOG

2021-1-24: the title was updated

2020-4-11: initial version