MultiIndicMT: An Indic Language Multilingual Task


INTRODUCTION

This year we expand the WAT 2022 Indic languages task from 15 to 19 Indic languages.

TASK DESCRIPTION

The task covers 19 Indic languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sindhi [Arabic script], Sinhala, Tamil, Telugu, Santali, Kashmiri [Arabic as well as Devanagari script], Maithili, Sanskrit and Urdu) and English. We will evaluate the submissions on 38 translation directions (English-Indic and Indic-English). We will also evaluate performance on the Bengali-Hindi, Tamil-Telugu, Hindi-Malayalam and Sindhi-Punjabi Indic language pairs. Individually, Indic languages are resource-poor relative to European languages, which hampers translation quality. However, by leveraging multilingualism and abundant monolingual corpora, translation quality can be substantially boosted. The purpose of this task is to validate the utility of MT techniques that focus on multilingualism and/or monolingual data.
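The evaluation directions described above can be enumerated with a short sketch. The language names come from the task description; the variable and function names are illustrative, and the Indic-Indic pairs are stored in the order they are written above without assuming which direction(s) will be scored.

```python
from typing import List, Tuple

# The 19 Indic languages listed in the task description.
INDIC_LANGUAGES: List[str] = [
    "Assamese", "Bengali", "Gujarati", "Hindi", "Kannada", "Malayalam",
    "Marathi", "Nepali", "Oriya", "Punjabi", "Sindhi", "Sinhala", "Tamil",
    "Telugu", "Santali", "Kashmiri", "Maithili", "Sanskrit", "Urdu",
]

# Indic-Indic pairs named in the task description (order as written there;
# directionality is not assumed here).
INDIC_INDIC_PAIRS: List[Tuple[str, str]] = [
    ("Bengali", "Hindi"),
    ("Tamil", "Telugu"),
    ("Hindi", "Malayalam"),
    ("Sindhi", "Punjabi"),
]


def english_centric_directions(languages: List[str]) -> List[Tuple[str, str]]:
    """Return (source, target) tuples for English-Indic and Indic-English."""
    directions = []
    for lang in languages:
        directions.append(("English", lang))  # English-Indic
        directions.append((lang, "English"))  # Indic-English
    return directions


if __name__ == "__main__":
    directions = english_centric_directions(INDIC_LANGUAGES)
    print(len(directions), "English-centric directions")
    for src, tgt in INDIC_INDIC_PAIRS:
        print(f"Indic-Indic pair: {src}-{tgt}")
```

A list like this can be handy for checking that a multilingual system produces output for every direction before submission.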

Corpora

Evaluation data: Indic subset of FLORES200. Use the dev subset for development; the devtest subset will be used as the test set.
Parallel corpora sources:
- We encourage the use of the Samanantar v0.3 corpus for the following 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu.
- We also allow the use of the NLLB dataset, which covers more languages.
- Additionally, we encourage the use of the Indic-Indic corpora.
- For the languages not covered by Samanantar, we suggest obtaining corpora from OPUS. In particular, CCMatrix, CCAligned, WikiMatrix, TED, Bible and Paracrawl should be good sources. Anything on OPUS is allowed in principle.
- A larger corpus surpassing NLLB and Samanantar is expected to be released by the middle of May 2023.
Monolingual corpora sources:
- AI4Bharat-IndicCorp monolingual corpus collection. (A larger monolingual corpus is expected to be released by mid-May 2023)
- Oscar corpus.
- Multilingual C4 corpus. Follow the instructions here (see JSON format part) to get this.
- There may be overlaps between these sources, so it will be up to you to deal with duplicates. We may consider doing this ourselves, so check back later to see whether a joint, deduplicated version is available.
Before using any corpora other than those listed above kindly ask the organizers.
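
Since the monolingual sources above may overlap, one simple way to handle duplicates is exact-match deduplication at the sentence (line) level. The following is a minimal sketch under stated assumptions: each source is an iterable of text lines, and two lines count as duplicates if they are identical after Unicode NFC normalization and whitespace collapsing. The function names are illustrative, and this is not the organizers' procedure. Near-duplicate detection is out of scope here.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(line: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", line).split())


def dedup_lines(sources: Iterable[Iterable[str]]) -> Iterator[str]:
    """Yield each distinct normalized line once, in first-seen order.

    Only a hash digest per line is kept in memory, not the line itself,
    which keeps the footprint manageable for large corpora.
    """
    seen = set()
    for source in sources:
        for line in source:
            text = normalize(line)
            if not text:
                continue  # skip empty lines
            digest = hashlib.sha1(text.encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                yield text


if __name__ == "__main__":
    # Toy in-memory "corpora"; real use would pass open file handles instead.
    corpus_a = ["first sentence\n", "shared   sentence\n"]
    corpus_b = ["shared sentence\n", "", "second sentence\n"]
    for sentence in dedup_lines([corpus_a, corpus_b]):
        print(sentence)
```

Because open text files are iterables of lines, the same function can be applied to files from different sources for one language by passing a list of file handles.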

Submission Details

We expect participants to work on multilingual solutions spanning all languages for this task. Specifically, we encourage one-to-many, many-to-one or many-to-many models.
Additionally, the use of massively multilingual pre-trained models and back-translation is also encouraged. You may fine-tune models like NLLB, mT5, IndicBART, or anything you consider relevant.
Do ask the organizers before using models other than the aforementioned ones.
As contrastive solutions, participants may develop language-pair-specific models, but only the multilingual model submissions will be considered official.
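
As an illustration of how a single one-to-many model might be trained, a commonly used approach is to prepend a target-language tag to each source sentence so that one model can serve several target languages. The sketch below assumes this tagging approach; the tag format `<2Language>` and the function names are hypothetical choices, not a requirement of the task or the input format of any particular pre-trained model.

```python
from typing import Dict, Iterable, Iterator, Tuple


def tag_source(source_sentence: str, target_language: str) -> str:
    """Prepend a target-language tag (hypothetical format) to the source."""
    return f"<2{target_language}> {source_sentence}"


def build_one_to_many(
    corpora: Dict[str, Iterable[Tuple[str, str]]]
) -> Iterator[Tuple[str, str]]:
    """Merge per-language parallel corpora into one tagged training stream.

    `corpora` maps a target-language name to (English, target) sentence pairs.
    """
    for target_language, pairs in corpora.items():
        for english, target in pairs:
            yield tag_source(english, target_language), target


if __name__ == "__main__":
    # Placeholder strings stand in for real target-language sentences.
    toy_corpora = {
        "Hindi": [("Hello.", "<Hindi translation>")],
        "Tamil": [("Hello.", "<Tamil translation>")],
    }
    for tagged_source, target in build_one_to_many(toy_corpora):
        print(tagged_source, "=>", target)
```

For a many-to-one model the tag is typically unnecessary because the target is always English, and a many-to-many model could reuse the same tagging on both English and Indic source sentences.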

CONTACT

For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com". For questions related to this task, contact "prajdabre -at- gmail -dot- com" or "anoop.kunchukuttan -at- gmail -dot- com".

NICT (National Institute of Information and Communications Technology)
Last Modified: 2023-05-02