WAT 2020

Khmer-English Parallel Data

Registration for use of the ECCC data is now open (2020/07/17)

INTRODUCTION

The parallel data for the Khmer-English translation tasks at WAT2020 consist of two corpora: the ALT corpus and the ECCC corpus.

The ALT corpus is part of the Asian Language Treebank (ALT) Project and consists of twenty thousand Khmer-English parallel sentences from news articles.
The ECCC corpus is extracted from document pairs of Khmer-English bilingual records of the Extraordinary Chambers in the Courts of Cambodia, collected by the National Institute of Posts, Telecoms & ICT, Cambodia.

DETAIL

The numbers of sentences are as follows:

Data Type	File Name	Number of Sentences
TRAIN	train.eccc.[km|en]	104,660
TRAIN	train.alt.[km|en]	18,088
DEV	dev.alt.[km|en]	1,000
TEST	test.alt.[km|en]	1,018
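
After unpacking the data, it may be worth checking that each file has the number of sentences listed above. The following is a minimal Python sketch, not an official tool. It assumes that each file holds one sentence per line, is encoded in UTF-8, and that all eight files sit in a single directory; none of this is stated on this page, and the function names `count_lines` and `check_corpus` are made up for illustration.

```python
from pathlib import Path

# Expected sentence counts, copied from the table above.
EXPECTED = {
    "train.eccc": 104660,
    "train.alt": 18088,
    "dev.alt": 1000,
    "test.alt": 1018,
}


def count_lines(path):
    """Count lines in a text file (assumed: UTF-8, one sentence per line)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_corpus(data_dir, expected=EXPECTED):
    """Return a list of problems found; the list is empty if all files match."""
    problems = []
    for prefix, n in expected.items():
        for lang in ("km", "en"):
            path = Path(data_dir) / f"{prefix}.{lang}"
            if not path.exists():
                problems.append(f"{path.name}: missing")
                continue
            found = count_lines(path)
            if found != n:
                problems.append(
                    f"{path.name}: expected {n} lines, found {found}"
                )
    return problems
```

Calling `check_corpus("path/to/data")` with the directory that holds the unpacked files returns an empty list when every file is present with the expected line count. Because the Khmer and English sides are checked against the same number, a mismatch on one side also signals that the two sides are no longer aligned line by line.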

HOW TO OBTAIN

Khmer-English Parallel Data for WAT2020

Please cite the following paper when using the Khmer ALT corpus.

@article{ding2018nova,
	title={NOVA: A Feasible and Flexible Annotation System for Joint Tokenization and Part-of-Speech Tagging},
	author={Ding, Chenchen and Utiyama, Masao and Sumita, Eiichiro},
	journal={ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)},
	volume={18},
	number={2},
	pages={17},
	year={2018},
	publisher={ACM}
}

How to get the ECCC corpus
- Please fill in and sign the ECCC-corpus-licence under /wat2020.km-en/eccc/ in the zipped file.
- Send a scan of the signed form to the following e-mail address to apply for the data.
- E-mail: chenchen.ding -at- nict.go.jp

CONTACT

For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".

CHANGE LOG

2020-07-17: site open

NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2020-07-17