This is the Indic Languages Multilingual Parallel Corpus. It covers the following languages: Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese, Urdu and English. The corpus has been collected from OPUS and belongs to the spoken language (OpenSubtitles) domain.
The Multilingual Indic Languages Tasks cover 7 Indic languages (Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese and Urdu) and English. There are a total of 7 language directions. The tasks focus on the spoken language domain, and the corpus used for them comes from the OpenSubtitles datasets on OPUS.
The files for each language pair are present in: bilingual/X-en. For each language pair the names of the files are:
| Language Pair | Train Size | Dev Size | Test Size |
The files for each language are present in: monolingual/. For each language the name of the file is monolingual.X, where X is one of bn, hi, ml, ta, te, si, en or ur.
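The directory layout described above can be sketched in Python as follows. This is only an illustration: the root directory name "corpus" and the helper names bilingual_dir and monolingual_file are hypothetical and not part of the distribution. The names of the files inside each bilingual/X-en directory are not assumed here.

```python
from pathlib import Path

# Language codes as listed in the text; every parallel pair is X-en.
INDIC_CODES = ["bn", "hi", "ml", "ta", "te", "si", "ur"]
ALL_CODES = INDIC_CODES + ["en"]


def bilingual_dir(root, code):
    """Return the directory bilingual/X-en for one Indic language code.

    `root` is a hypothetical location where the corpus was unpacked.
    """
    if code not in INDIC_CODES:
        raise ValueError(f"not an Indic language code: {code!r}")
    return Path(root) / "bilingual" / f"{code}-en"


def monolingual_file(root, code):
    """Return the path monolingual/monolingual.X for one language code."""
    if code not in ALL_CODES:
        raise ValueError(f"unknown language code: {code!r}")
    return Path(root) / "monolingual" / f"monolingual.{code}"


if __name__ == "__main__":
    # List every expected location under the assumed root "corpus".
    for code in INDIC_CODES:
        print(bilingual_dir("corpus", code).as_posix())
    for code in ALL_CODES:
        print(monolingual_file("corpus", code).as_posix())
```

Rejecting unknown codes early keeps a typo such as "sn" from silently producing a path that does not exist in the corpus.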
The data is free to download. Participants are kindly requested to email WAT to let us know that they intend to participate in the task.
Indic Language Multilingual Parallel Corpus
For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
2018-07-18: site open
NICT (National Institute of Information and Communications Technology)
Last Modified: 2018-07-18