The JIJI Corpus was constructed by Jiji Press Ltd. in collaboration with the National Institute of Information and Communications Technology (NICT). It is a Japanese-English news corpus of 200K parallel sentences. The data come from Jiji Press news in various categories, including politics, economy, nation, business, markets, and sports. The original news was distributed to many newspaper companies, TV stations, and portal sites. Jiji Press aims to introduce machine translation technologies into its daily editorial work in the future.
JIJI Corpus includes:
Data Type | File Name | Number of sentences |
---|---|---|
TRAIN | train.txt | 200,000 |
DEV | dev.txt | 2,000 |
DEVTEST | devtest.txt | 2,000 |
TEST | test.txt | 2,000 |
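
As a quick sanity check after obtaining the data, the sentence counts in the table can be compared against the files on disk. The following is a minimal sketch only: it assumes each file is UTF-8 text with one sentence (or sentence pair) per line, which this page does not state, and the helper names `count_lines` and `check_corpus` are hypothetical, not part of any distributed tooling.

```python
from pathlib import Path

# Sentence counts as listed in the table above.
EXPECTED_COUNTS = {
    "train.txt": 200_000,
    "dev.txt": 2_000,
    "devtest.txt": 2_000,
    "test.txt": 2_000,
}


def count_lines(path):
    """Count non-empty lines in a text file.

    ASSUMPTION: the file is UTF-8 and holds one sentence per line;
    the corpus page does not document the actual file layout.
    """
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_corpus(corpus_dir, expected=EXPECTED_COUNTS):
    """Return {file name: (found, expected)} for files whose count differs.

    A missing file is reported with a found count of None.
    """
    mismatches = {}
    for name, want in expected.items():
        path = Path(corpus_dir) / name
        found = count_lines(path) if path.is_file() else None
        if found != want:
            mismatches[name] = (found, want)
    return mismatches
```

Calling `check_corpus("path/to/corpus")` returns an empty dictionary when every file matches the table. If the files use a different layout (for example, several lines per sentence pair), the counting rule in `count_lines` would need to be adapted.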
UMEMOTO, Itsuro
President’s Office
JIJI Press LTD.
5-15-8 Ginza, Chuo-ku,
Tokyo 104-8178, JAPAN
104-8178
東京都中央区銀座5-15-8
時事通信社長室
梅本逸郎
For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
2018-8-16: agreement forms were updated for WAT2018
2017-6-12: site open
NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2019-04-22