Indic Languages Multilingual Parallel Corpus
for Indic Languages Multilingual Tasks



This is the Indic Languages Multilingual Parallel Corpus. It covers the following languages: Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese, Urdu and English. The corpus has been collected from OPUS and belongs to the spoken language (OpenSubtitles) domain.


The Multilingual Indic Languages Tasks cover 7 Indic languages (Bengali, Hindi, Malayalam, Tamil, Telugu, Sinhalese and Urdu) and English, for a total of 7 language directions. The tasks focus on the spoken language domain, and the corpus used for them comes from the OpenSubtitles datasets in OPUS.


Bilingual Corpora

The files for each language pair are present in: bilingual/X-en
For each language pair the names of the files are:

where X is one of bn, hi, ml, ta, te, si or ur.
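
As a rough sketch of how the directory layout above might be used, the following Python snippet builds the bilingual/X-en directory name for a language code and reads two files into sentence pairs. The helper names pair_dir and read_parallel are hypothetical. The file names are left to the caller because they are not listed here, and the snippet assumes (without confirmation from this page) that the two files of a pair are UTF-8 text, line-aligned, with one sentence per line.

```python
from pathlib import Path

# Language codes listed above; each one is paired with English (en).
INDIC_CODES = ["bn", "hi", "ml", "ta", "te", "si", "ur"]


def pair_dir(root, code):
    """Return the directory for one language pair, e.g. <root>/bilingual/bn-en."""
    if code not in INDIC_CODES:
        raise ValueError(f"unknown language code: {code}")
    return Path(root) / "bilingual" / f"{code}-en"


def read_parallel(src_path, tgt_path):
    """Read two files assumed to be line-aligned; return (source, target) pairs."""
    with open(src_path, encoding="utf-8") as f:
        src = [line.rstrip("\n") for line in f]
    with open(tgt_path, encoding="utf-8") as f:
        tgt = [line.rstrip("\n") for line in f]
    # Assumption: line i of one file corresponds to line i of the other.
    if len(src) != len(tgt):
        raise ValueError(f"line count mismatch: {len(src)} vs {len(tgt)}")
    return list(zip(src, tgt))
```

Raising an error on a line-count mismatch is a design choice for this sketch: silently truncating to the shorter file would hide a misaligned download.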


Language Pair        Train Size   Dev Size   Test Size
Bengali-English         337,428        500       1,000
Hindi-English            84,557        500       1,000
Malayalam-English       359,423        500       1,000
Tamil-English            26,217        500       1,000
Telugu-English           22,165        500       1,000
Urdu-English             26,619        500       1,000
Sinhalese-English       521,726        500       1,000
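
The training sizes in the table above differ widely between language pairs. The short Python sketch below copies those numbers into a dictionary and computes the total and each pair's share of it. The names TRAIN_SIZES and train_shares are hypothetical, and reading the sizes as sentence-pair counts is an assumption rather than something stated on this page.

```python
# Training-set sizes copied from the table above
# (assumed to be counts of sentence pairs).
TRAIN_SIZES = {
    "bn": 337_428,
    "hi": 84_557,
    "ml": 359_423,
    "ta": 26_217,
    "te": 22_165,
    "ur": 26_619,
    "si": 521_726,
}


def train_shares(sizes):
    """Return each language pair's fraction of the combined training data."""
    total = sum(sizes.values())
    return {code: n / total for code, n in sizes.items()}


total = sum(TRAIN_SIZES.values())
shares = train_shares(TRAIN_SIZES)
largest = max(TRAIN_SIZES, key=TRAIN_SIZES.get)
smallest = min(TRAIN_SIZES, key=TRAIN_SIZES.get)

print(total)              # 1378135
print(largest, smallest)  # si te
```

Under these numbers the largest pair (Sinhalese-English) has more than 20 times the training data of the smallest (Telugu-English), which may matter when mixing pairs in a single multilingual model.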

Monolingual Corpora

The files for each language are present in: monolingual/
For each language the name of the file is: monolingual.X
where X is one of bn, hi, ml, ta, te, si, en or ur.

Language          Size
Bengali        453,859
Hindi          104,967
Malayalam      402,761
Tamil           30,268
Telugu          24,750
Urdu            29,086
Sinhalese      705,793
English      2,891,079
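
After downloading, one way to sanity-check the monolingual files is to compare their line counts with the sizes listed above. The Python sketch below does this under two assumptions that this page does not confirm: that each monolingual.X file holds one sentence per line, and that the Size column counts those lines. The names EXPECTED_LINES, monolingual_path, count_lines and check_sizes are hypothetical.

```python
from pathlib import Path

# Sizes copied from the table above (assumed to be line counts).
EXPECTED_LINES = {
    "bn": 453_859,
    "hi": 104_967,
    "ml": 402_761,
    "ta": 30_268,
    "te": 24_750,
    "ur": 29_086,
    "si": 705_793,
    "en": 2_891_079,
}


def monolingual_path(root, code):
    """Return <root>/monolingual/monolingual.<code>, following the layout above."""
    return Path(root) / "monolingual" / f"monolingual.{code}"


def count_lines(path):
    """Count lines without loading the whole file into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_sizes(root, expected=EXPECTED_LINES):
    """Return {code: (expected, actual)} for files whose line count differs.

    Files that do not exist are reported with an actual count of None.
    """
    mismatches = {}
    for code, want in expected.items():
        path = monolingual_path(root, code)
        got = count_lines(path) if path.exists() else None
        if got != want:
            mismatches[code] = (want, got)
    return mismatches
```

An empty dictionary from check_sizes would suggest that the download is complete, at least under the assumptions stated above.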


The data is free to download. Participants are kindly requested to email WAT to let us know that they intend to participate in the task.

Indic Language Multilingual Parallel Corpus



For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".



2018-07-18: site open

NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2018-07-18