MultiIndicMT: An Indic Language Multilingual Task
Given the growing size of monolingual and parallel training data for Indic languages, we extend the WAT 2021 Indic languages task with additional languages and n-way evaluation corpora.
The task covers 15 Indic languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sindhi, Sinhala, Tamil, Telugu and Urdu) and English. We will evaluate submissions on 30 translation directions (English-Indic and Indic-English). We will also evaluate performance on four Indic-Indic pairs: Bengali-Hindi, Tamil-Telugu, Hindi-Malayalam and Sindhi-Punjabi. Individually, Indic languages are resource-poor relative to European languages, which hampers translation quality; by leveraging multilingualism and abundant monolingual corpora, however, translation quality can be substantially boosted. The purpose of this task is to validate the utility of MT techniques that focus on multilingualism and/or monolingual data.
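For concreteness, the 30 official directions can be enumerated from the 15 Indic languages. A minimal sketch (the ISO 639-1 codes shown here are illustrative; the task description itself does not prescribe a code scheme):

```python
# The 15 Indic languages of the task, as illustrative ISO 639-1 codes:
# Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali,
# Oriya, Punjabi, Sindhi, Sinhala, Tamil, Telugu, Urdu
indic = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "ne",
         "or", "pa", "sd", "si", "ta", "te", "ur"]

# English-to-Indic plus Indic-to-English gives the 30 evaluated directions
directions = [("en", lang) for lang in indic] + [(lang, "en") for lang in indic]
print(len(directions))  # → 30
```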
- Evaluation data: the Indic subset of FLORES101. Use the dev subset for development; the devtest subset will be used as the test set.
- Parallel corpora sources:
- We encourage the use of the Samanantar v0.3 corpus for the following 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu.
- This training corpus comprises approximately 50 million sentence pairs between English and Indic languages.
- Additionally, we encourage the use of the Indic-Indic corpora.
- For the languages not covered by Samanantar, we suggest obtaining corpora from OPUS. In particular, CCMatrix, CCAligned, WikiMatrix, TED, Bible and ParaCrawl should be good sources. Anything on OPUS is allowed in principle.
- Monolingual corpora sources:
- AI4Bharat-IndicCorp monolingual corpus collection.
- Oscar corpus.
- Multilingual C4 corpus. Follow the instructions here (see the JSON format part) to obtain it.
- There may be overlaps between these sources, so it will be up to you to deal with duplicates. We may release a joint, deduplicated version; check back later to see whether one is available.
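Since deduplication across the corpora above is left to participants, one simple approach is to hash a normalized form of each sentence pair and keep only the first occurrence. A minimal sketch (the normalization choices here are illustrative, not prescribed by the task):

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, NFKC-normalize, and collapse whitespace so trivial
    variants of the same sentence hash identically."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def dedup_pairs(pairs):
    """Keep the first occurrence of each (source, target) pair,
    comparing normalized forms across all input corpora."""
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = hashlib.sha1(
            f"{normalize(src)}\t{normalize(tgt)}".encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique


# Hypothetical example: two corpus dumps sharing one near-duplicate pair
corpus_a = [("Hello world.", "नमस्ते दुनिया।")]
corpus_b = [("Hello  world.", "नमस्ते दुनिया।"), ("Good morning.", "सुप्रभात।")]
print(len(dedup_pairs(corpus_a + corpus_b)))  # → 2
```

A stricter pipeline might also filter pairs that overlap with the FLORES101 dev/devtest sets to avoid evaluation-data contamination.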
Before using any corpora other than those listed above kindly ask the organizers.
We expect participants to work on multilingual solutions spanning all languages for this task. Specifically, we encourage one-to-many, many-to-one or many-to-many models.
The use of massively multilingual pre-trained models and back-translation is also encouraged.
As contrastive solutions, participants may develop language-pair-specific models, but only multilingual model submissions will be considered official.
For general questions, comments, etc. please email to "wat-organizer -at- googlegroups -dot- com". For questions related to this task contact "prajdabre -at- gmail -dot- com" or "anoop.kunchukuttan -at- gmail -dot- com".
NICT (National Institute of Information and Communications Technology)
Last Modified: 2022-03-24