JIJI Corpus

INTRODUCTION

JIJI Corpus was constructed by Jiji Press Ltd in collaboration with the National Institute of Information and Communications Technology (NICT). It is a Japanese-English news corpus of 200K parallel sentences. The data come from Jiji Press news in various categories, including politics, economy, nation, business, markets, and sports. The original news articles were distributed to many newspaper companies, TV stations, and portal sites. Jiji Press aims to introduce machine translation technologies into its daily editorial work in the future.

DETAIL

JIJI Corpus includes:

Japanese-English news corpus

The numbers of sentences are as follows:

Data Type	File Name	Number of sentences
TRAIN	train.txt	200,000
DEV	dev.txt	2,000
DEVTEST	devtest.txt	2,000
TEST	test.txt	2,000

Test set II: New test set added at WAT 2020

Data Type	File Name	Quantity
DEV	devc.tsv	479 sentence pairs
	context-devc.en.tsv	132 articles
	context-devc.ja.tsv	132 articles
TEST	testc.tsv	1,851 sentence pairs
	context-testc.en.tsv	546 articles
	context-testc.ja.tsv	546 articles
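
After downloading, it may be useful to check the files against the counts listed above. The following is a minimal sketch, not an official tool. It assumes that each sentence file is UTF-8 text with one sentence pair per non-empty line and no header row; the file layout is not documented on this page, so adjust accordingly. The context files are counted in articles rather than lines and are therefore not checked. The function names `count_records` and `check_counts` are illustrative only.

```python
from pathlib import Path

# Expected counts copied from the tables above.
EXPECTED_COUNTS = {
    "train.txt": 200_000,
    "dev.txt": 2_000,
    "devtest.txt": 2_000,
    "test.txt": 2_000,
    "devc.tsv": 479,
    "testc.tsv": 1_851,
}


def count_records(path):
    """Count non-empty lines in a file.

    Assumption: one sentence pair per line, UTF-8 encoded, no header row.
    """
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_counts(corpus_dir, expected=EXPECTED_COUNTS):
    """Return {file_name: (expected, found)} for files whose counts differ.

    A missing file is reported with found=None.
    """
    mismatches = {}
    for name, want in expected.items():
        path = Path(corpus_dir) / name
        found = count_records(path) if path.is_file() else None
        if found != want:
            mismatches[name] = (want, found)
    return mismatches
```

Calling `check_counts` on the directory that holds the downloaded files returns an empty dictionary when every file matches; any other result lists the files to inspect.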

HOW TO OBTAIN

Training Data:
1. Complete and sign the license agreement.
2. Scan and email the signed agreement to Jiji Press (umemoto -at- jiji.co.jp), and also send the original copy of the agreement to the following address:
  
  UMEMOTO, Itsuro
  President’s Office
  JIJI Press LTD.
  5-15-8 Ginza, Chuo-ku,
  Tokyo 104-8178, JAPAN
  
  104-8178
  東京都中央区銀座5-15-8
  時事通信社長室
  梅本逸郎
3. WAT organizers will email the applicant a link to download the corpus once Jiji Press Ltd receives the original copy and approves the application. (Please note that Jiji Press Ltd will provide the applicant's e-mail address to WAT.)

AGREEMENT

form

Instructions for the use of JIJI Corpus

Expressions that include personal information cannot be used as examples in papers or presentations unless the personal information is anonymized.

CONTACT

For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".

CHANGE LOG

2019-6-12: corpus and agreement forms were updated for WAT2020
2019-4-22: agreement forms were updated for WAT2019
2018-8-16: agreement forms were updated for WAT2018
2017-6-12: site open

NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2019-04-22