JIJI Corpus

INTRODUCTION

JIJI Corpus was constructed by Jiji Press Ltd in collaboration with the National Institute of Information and Communications Technology (NICT). It is a Japanese-English news corpus of 200K parallel sentences. The data come from Jiji Press news in various categories, including politics, economy, nation, business, markets, and sports. The original news articles were distributed to many newspaper companies, TV stations, and portal sites. Jiji Press aims to introduce machine translation technologies into its daily editorial work in the future.

DETAIL

JIJI Corpus includes:

Japanese-English news corpus

The numbers of sentences are as follows:

Data Type	File Name	Number of sentences
TRAIN	train.txt	200,000
DEV	dev.txt	2,000
DEVTEST	devtest.txt	2,000
TEST	test.txt	2,000
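
After downloading, it may be useful to confirm that each file has the number of sentences listed in the table above. The following is a minimal sketch, not an official tool: it assumes that every file is UTF-8 text with one sentence pair per line, which this page does not state, and the names `count_lines` and `check_corpus` are hypothetical helpers introduced here for illustration.

```python
from pathlib import Path

# Expected sentence counts, taken from the table above.
EXPECTED_COUNTS = {
    "train.txt": 200_000,
    "dev.txt": 2_000,
    "devtest.txt": 2_000,
    "test.txt": 2_000,
}


def count_lines(path):
    """Count lines in a text file.

    Assumption: UTF-8 encoding and one sentence pair per line.
    """
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_corpus(corpus_dir, expected=EXPECTED_COUNTS):
    """Return {file name: (actual, expected)} for every mismatch.

    A missing file is reported with an actual count of None.
    An empty result means all counts agree with the table.
    """
    mismatches = {}
    for name, want in expected.items():
        path = Path(corpus_dir) / name
        got = count_lines(path) if path.exists() else None
        if got != want:
            mismatches[name] = (got, want)
    return mismatches
```

Calling `check_corpus("path/to/corpus")` on the directory holding the four files returns an empty dictionary when all counts match; if the files use a different layout (for example, separate files per language), the line-counting assumption needs to be adjusted accordingly.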

HOW TO OBTAIN

Training Data:
1. Complete and sign the license agreement.
2. Scan and email the signed agreement to Jiji Press (umemoto -at- jiji.co.jp), and also send the original copy of the agreement to the following address:
  
  UMEMOTO, Itsuro
  President’s Office
  JIJI Press LTD.
  5-15-8 Ginza, Chuo-ku,
  Tokyo 104-8178, JAPAN
  
  104-8178
  東京都中央区銀座5-15-8
  時事通信社長室
  梅本逸郎
3. Once Jiji Press Ltd receives the original copy and approves the application, the WAT organizers will email the applicant a link to download the corpus. (Please note that Jiji Press Ltd will provide the applicant's e-mail address to WAT.)

AGREEMENT

form

CONTACT

For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".

CHANGE LOG

2018-8-16: agreement forms were updated for WAT2018
2017-6-12: site open

NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2019-04-22