The parallel data for the Khmer-English translation tasks at WAT2020 consist of two corpora: the ALT corpus and the ECCC corpus.
The numbers of sentences are as follows:
| Data Type | File Name | Number of Sentences |
|---|---|---|
| TRAIN | train.eccc.[km\|en] | 104,660 |
| | train.alt.[km\|en] | 18,088 |
| DEV | dev.alt.[km\|en] | 1,000 |
| TEST | test.alt.[km\|en] | 1,018 |
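
As a quick sanity check after downloading, the file names and sentence counts in the table can be verified with a short script. The following is a minimal sketch, not part of the official distribution. It assumes that each file is UTF-8 plain text with one sentence per line, that `[km|en]` expands to a pair of files such as `train.eccc.km` and `train.eccc.en`, and that all files sit in a single directory. The helper names `read_parallel` and `check_counts` and the directory name used in the usage note are hypothetical.

```python
from pathlib import Path

# Expected sentence counts, copied from the table above.
EXPECTED_COUNTS = {
    "train.eccc": 104_660,
    "train.alt": 18_088,
    "dev.alt": 1_000,
    "test.alt": 1_018,
}


def read_lines(path):
    """Read a text file as a list of lines, stripping only the newline."""
    # Assumption: files are UTF-8 with one sentence per line.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def read_parallel(data_dir, stem):
    """Return (km, en) sentence pairs for a file stem such as 'dev.alt'."""
    data_dir = Path(data_dir)
    km = read_lines(data_dir / f"{stem}.km")
    en = read_lines(data_dir / f"{stem}.en")
    if len(km) != len(en):
        raise ValueError(
            f"{stem}: {len(km)} Khmer lines but {len(en)} English lines"
        )
    return list(zip(km, en))


def check_counts(data_dir, expected=EXPECTED_COUNTS):
    """Return {stem: (found, expected)} for every stem whose count differs."""
    mismatches = {}
    for stem, want in expected.items():
        got = len(read_parallel(data_dir, stem))
        if got != want:
            mismatches[stem] = (got, want)
    return mismatches
```

Calling `check_counts("wat2020_km_en")` on a hypothetical download directory returns an empty dictionary when every file pair matches the table. A non-empty result lists the stems whose line counts differ.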
Khmer-English Parallel Data for WAT2020
@article{ding2018nova,
title={NOVA: A Feasible and Flexible Annotation System for Joint Tokenization and Part-of-Speech Tagging},
author={Ding, Chenchen and Utiyama, Masao and Sumita, Eiichiro},
journal={ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)},
volume={18},
number={2},
pages={17},
year={2018},
publisher={ACM}
}
For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
2020-07-17: site open
NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2020-07-17