MultiIndicMT: An Indic Language Multilingual Task
Given the growing sizes of monolingual and parallel training data for Indic languages, we extend the WAT 2020 Indic languages task with additional languages and n-way evaluation corpora.
The task covers 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu) and English. We will evaluate the submissions on 20 translation directions (English-Indic and Indic-English). We are also exploring the possibility of evaluating some Indian language pairs directly, and will keep you updated on that.
Individually, Indic languages are resource-poor, which hampers translation quality, but translation quality can be substantially boosted by leveraging multilingualism and abundant monolingual corpora. The purpose of this task is to validate the utility of MT techniques that focus on multilingualism and/or monolingual data.
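As a quick check on the count of 20 evaluation directions, the sketch below enumerates them from the 10 Indic languages and English. The two-letter language codes and the function name are assumptions made for illustration; the task data may use different identifiers.

```python
# Assumed ISO 639-1 codes for the 10 Indic languages named in the task.
INDIC_LANGS = ["bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]


def translation_directions(indic_langs=INDIC_LANGS, english="en"):
    """Return (source, target) pairs: English-Indic first, then Indic-English."""
    en_to_indic = [(english, lang) for lang in indic_langs]
    indic_to_en = [(lang, english) for lang in indic_langs]
    return en_to_indic + indic_to_en


if __name__ == "__main__":
    for src, tgt in translation_directions():
        print(f"{src}-{tgt}")
```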
- We provide a single repository of training, development and testing data here.
- The training corpus comprises approximately 11 million sentence pairs between English and Indian languages.
The evaluation data (development and test sets) is sourced from the PM India dataset and is 11-way parallel. We have removed these sentences from the training splits we provide. Furthermore, we have also removed from the training splits those sentences that belong to the WAT 2020 Indic Task's evaluation set (CVIT-MKB). This ensures that models trained on this year's data can also be safely (and fairly) evaluated on last year's development and test sets.
- The training parallel corpora come from:
Wiki Titles (ta, gu),
- Note 1: By using English as a pivot language, you should be able to mine roughly 250,000 sentence pairs on average for most non-English language pairs.
- Note 2: There are cases where one English sentence has multiple translations, which can inflate the number of erroneously mined pairs when pivoting. It is best to pivot only on English sentences that have a unique translation.
- Note 3: Including non-English parallel corpora may be beneficial during training, especially for translation between non-English pairs.
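The pivoting idea in the notes above can be sketched as follows. This is a minimal illustration, not the organizers' procedure: it assumes each corpus is already loaded as a list of (English, non-English) sentence tuples, and the function names are hypothetical. Following Note 2, English sentences with more than one distinct translation in either corpus are skipped.

```python
from collections import defaultdict


def unique_translations(pairs):
    """Map each English sentence to its translation, keeping only
    English sentences that have exactly one distinct translation."""
    by_english = defaultdict(set)
    for english, other in pairs:
        by_english[english.strip()].add(other.strip())
    return {
        english: next(iter(translations))
        for english, translations in by_english.items()
        if len(translations) == 1
    }


def mine_pivot_pairs(en_x_pairs, en_y_pairs):
    """Mine X-Y sentence pairs by exact match on the shared English side."""
    x_by_english = unique_translations(en_x_pairs)
    y_by_english = unique_translations(en_y_pairs)
    shared = sorted(x_by_english.keys() & y_by_english.keys())
    return [(x_by_english[en], y_by_english[en]) for en in shared]


if __name__ == "__main__":
    # Toy data for illustration only.
    en_hi = [
        ("Hello", "नमस्ते"),
        ("Thank you", "धन्यवाद"),
        ("Thank you", "शुक्रिया"),  # second translation, so "Thank you" is skipped
    ]
    en_ta = [("Hello", "வணக்கம்"), ("Thank you", "நன்றி")]
    print(mine_pivot_pairs(en_hi, en_ta))
```

Exact string matching on the English side is the simplest possible criterion; normalizing case, punctuation or whitespace before matching would likely recover more pairs, at the cost of some additional noise.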
With regard to monolingual data, we encourage the use of the AI4Bharat-IndicCorp monolingual corpus collection.
9th February, 2021: We recently became aware of the CCAligned corpus collection, which contains aligned document pairs as well as aligned parallel sentence pairs. Although we cannot guarantee the quality of this data, participants are welcome to try it out.
13th April, 2021: Some users asked us to consider the use of PMI monolingual data. We have filtered it to eliminate overlaps with the development and test sets for the 2020 and 2021 tasks. Feel free to download the filtered corpus from here and use it for back-translation and similar purposes.
Before using any corpora other than those listed above, kindly ask the organizers.
We expect participants to work on multilingual solutions spanning all languages for this task. Specifically, we encourage one-to-many, many-to-one or many-to-many models.
As contrastive solutions, participants may develop language-pair-specific models, but only the multilingual model submissions will be considered official.
For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com". For questions related to this task, contact "prajdabre -at- gmail -dot- com" or "anoop.kunchukuttan -at- gmail -dot- com".
NICT (National Institute of Information and Communications Technology)
Last Modified: 2021-01-04