The document-level newswire translation subtask uses the JIJI Corpus, which was constructed by Jiji Press Ltd. in collaboration with the National Institute of Information and Communications Technology (NICT) and NHK. The corpus is a Japanese-English news corpus of 200K parallel sentences. The data come from Jiji Press news in various categories, including politics, economy, nation, business, markets, and sports. The original news articles were distributed to many newspaper companies, TV stations, and portal sites. Jiji Press aims to introduce machine translation technologies into its daily editorial work in the future.
Task description is here.
The JIJI Corpus includes:
Data Type | File Name | Number of sentences |
---|---|---|
TRAIN | train.txt | 200,000 |
DEV | dev.txt | 2,000 |
DEVTEST | devtest.txt | 2,000 |
TEST | test.txt | 2,000 |
Data Type | File Name | Quantity |
---|---|---|
DEV | devc.tsv | 479 sentence pairs |
DEV | context-devc.en.tsv | 132 articles |
DEV | context-devc.ja.tsv | 132 articles |
TEST | testc.tsv | 1,851 sentence pairs |
TEST | context-testc.en.tsv | 546 articles |
TEST | context-testc.ja.tsv | 546 articles |
UMEMOTO, Itsuro
President’s Office
JIJI Press LTD.
5-15-8 Ginza, Chuo-ku,
Tokyo 104-8178, JAPAN
104-8178
東京都中央区銀座5-15-8
時事通信社長室
梅本逸郎
Expressions including personal information cannot be used as examples in papers or presentations.
Personal information must be anonymized when expressions including personal information are used as examples in papers or presentations.
For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
2021-1-24: agreement forms were updated for WAT2021
2019-6-12: corpus and agreement forms were updated for WAT2020
2019-4-22: agreement forms were updated for WAT2019
2018-8-16: agreement forms were updated for WAT2018
2017-6-12: site open
NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2021-01-24