JIJI Corpus

INTRODUCTION

JIJI Corpus was constructed by Jiji Press Ltd. in collaboration with the National Institute of Information and Communications Technology (NICT). It is a Japanese-English news corpus of 200K parallel sentences. The data come from Jiji Press news in various categories, including politics, economy, nation, business, markets, and sports. The original news articles were distributed to many newspaper companies, TV stations, and portal sites. Jiji Press aims to introduce machine translation technologies into its daily editorial work in the future.

DETAIL

JIJI Corpus includes:

The numbers of sentences are as follows:

Data Type   File Name     Number of sentences
TRAIN       train.txt     200,000
DEV         dev.txt       2,000
DEVTEST     devtest.txt   2,000
TEST        test.txt      2,000

Data Type   File Name              Quantity
DEV         devc.tsv               479 sentence pairs
DEV         context-devc.en.tsv    132 articles
DEV         context-devc.ja.tsv    132 articles
TEST        testc.tsv              1,851 sentence pairs
TEST        context-testc.en.tsv   546 articles
TEST        context-testc.ja.tsv   546 articles
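
The sentence counts listed above can serve as a quick sanity check on a downloaded copy. The following is a minimal sketch, not an official tool. It assumes that each of the four .txt files holds one sentence pair per non-empty line and is UTF-8 encoded, neither of which is stated on this page. The names count_lines, check_counts, and EXPECTED_COUNTS are hypothetical helpers introduced only for this illustration.

```python
from pathlib import Path

# Sentence counts copied from the table on this page.
EXPECTED_COUNTS = {
    "train.txt": 200_000,
    "dev.txt": 2_000,
    "devtest.txt": 2_000,
    "test.txt": 2_000,
}


def count_lines(path):
    """Count non-empty lines in a text file.

    Assumption: one sentence pair per non-empty line, UTF-8 encoded.
    """
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_counts(corpus_dir, expected=EXPECTED_COUNTS):
    """Return {file name: (expected, found)} for every file whose count differs.

    A file that does not exist is reported with found set to None.
    """
    mismatches = {}
    for name, want in expected.items():
        path = Path(corpus_dir) / name
        found = count_lines(path) if path.is_file() else None
        if found != want:
            mismatches[name] = (want, found)
    return mismatches
```

Calling check_counts with the directory that holds the corpus files returns an empty dictionary when every file matches the table. If the files turn out to use a different layout, for example several lines per sentence pair, only count_lines would need to change.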

HOW TO OBTAIN

AGREEMENT

Instructions for the use of JIJI Corpus

Expressions including personal information cannot be used as examples in papers or presentations.

Personal information must be anonymized when expressions including personal information are used as examples in papers or presentations.

CONTACT

For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".

CHANGE LOG

2019-6-12: corpus and agreement forms were updated for WAT2020
2019-4-22: agreement forms were updated for WAT2019
2018-8-16: agreement forms were updated for WAT2018
2017-6-12: site open


NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2019-04-22