The parallel data for the Khmer-English translation tasks at WAT2020 consist of two corpora: the ALT corpus and the ECCC corpus.
The numbers of sentences are as follows:
Data Type | File Name | Number of Sentences |
TRAIN | train.eccc.[km|en] | 104,660 |
TRAIN | train.alt.[km|en] | 18,088 |
DEV | dev.alt.[km|en] | 1,000 |
TEST | test.alt.[km|en] | 1,018 |
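As a quick sanity check after downloading, the sentence counts in the table above can be compared against the files on disk. The following is a minimal sketch, not an official tool: it assumes each file is UTF-8 text with one sentence per line and that the `[km|en]` pattern expands to a `.km` and a `.en` file sharing the same prefix. The helper names `count_lines` and `check_corpus` are hypothetical.

```python
from pathlib import Path

# Expected sentence counts, copied from the table above.
EXPECTED = {
    "train.eccc": 104660,
    "train.alt": 18088,
    "dev.alt": 1000,
    "test.alt": 1018,
}


def count_lines(path):
    """Count lines in a UTF-8 text file (assumption: one sentence per line)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_corpus(data_dir, expected=EXPECTED):
    """Return {prefix: (km_count, en_count, ok)} for each expected file pair.

    `ok` is True only when both sides have the expected number of lines,
    which also implies the Khmer and English sides are the same length.
    """
    results = {}
    for prefix, n in expected.items():
        km_n = count_lines(Path(data_dir) / f"{prefix}.km")
        en_n = count_lines(Path(data_dir) / f"{prefix}.en")
        results[prefix] = (km_n, en_n, km_n == en_n == n)
    return results
```

Calling `check_corpus("path/to/data")` on the unpacked directory (the path is a placeholder) returns one entry per file pair; any entry whose last element is `False` points to a truncated or misaligned file.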
Khmer-English Parallel Data for WAT2020
@article{ding2018nova,
  title={NOVA: A Feasible and Flexible Annotation System for Joint Tokenization and Part-of-Speech Tagging},
  author={Ding, Chenchen and Utiyama, Masao and Sumita, Eiichiro},
  journal={ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)},
  volume={18},
  number={2},
  pages={17},
  year={2018},
  publisher={ACM}
}
For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
2020-07-17: site open
NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2020-07-17