Indic Languages Multilingual Task
INTRODUCTION
Given the growing volume of monolingual and parallel training data, as well as the availability of good-quality evaluation data for Indic languages, we have decided to resume the 2018 Indic Multilingual task.
TASK DESCRIPTION
The task covers 7 Indic languages (Bengali, Hindi, Malayalam, Tamil, Telugu, Gujarati and Marathi) and English. We will evaluate a total of 14 translation directions. Individually, Indic languages are resource-poor, which hampers translation quality, but leveraging multilingualism and abundant monolingual corpora can boost translation quality substantially. The purpose of this task is to validate the utility of MT techniques that focus on multilingualism and/or monolingual data.
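As a quick check on the count above, the following minimal sketch enumerates the directions, assuming the 14 directions are each Indic language paired with English in both directions (7 x 2). The two-letter language codes are ISO 639-1 codes chosen here for illustration; they are not identifiers prescribed by the task.

```python
# Assumption: the 14 directions are Indic->English and English->Indic
# for each of the 7 Indic languages. Codes are ISO 639-1, used only
# for illustration.
INDIC_LANGUAGES = {
    "bn": "Bengali",
    "hi": "Hindi",
    "ml": "Malayalam",
    "ta": "Tamil",
    "te": "Telugu",
    "gu": "Gujarati",
    "mr": "Marathi",
}
ENGLISH = "en"


def translation_directions(indic_codes, english=ENGLISH):
    """Return (source, target) pairs for both directions of every Indic-English pair."""
    directions = []
    for code in indic_codes:
        directions.append((code, english))  # Indic -> English
        directions.append((english, code))  # English -> Indic
    return directions


directions = translation_directions(INDIC_LANGUAGES)
print(len(directions))  # 14
```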
Corpora
- The evaluation data belongs to the general domain and is sourced from the Indian Prime Minister's multilingual articles, called "CVIT Mann Ki Baat". NOTE: The English side of the Bengali-English pair was fixed on 1st August 2020. Everything else is the same.
- We provide filtered training data from the PM India dataset. Please use the Mann Ki Baat evaluation data and the PM India training data provided by us, because the original datasets overlap significantly, which would lead to incorrect evaluation.
- Next, we encourage the use of the relevant parallel corpora from: CVIT-PIB (UPDATED 28th July, 2020), IITB, Mechanical Turk, JW, NLPC, UFAL EnTam, Tsardia, Wiki Titles, ALT.
- We also suggest the use of parallel corpora from OPUS, such as: bible-uedin, globalvoices, gnome, kde4, opensubtitles, tanzil, tatoeba, ubuntu, wikimedia.
- With regard to monolingual data, we encourage the use of the Indic NLP monolingual corpus collection.
- Before using any corpora other than those listed above, please ask the organizers.
Submission Details
- We expect participants to work on multilingual solutions spanning all languages for this task. Specifically, we encourage one-to-many, many-to-one or many-to-many models.
- As contrastive solutions, participants may develop language-pair-specific models, but only the multilingual model submissions will be considered official.
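One common way to train a single one-to-many or many-to-many model is to prepend a target-language token to each source sentence and then merge all language pairs into one training set. The sketch below illustrates that idea only; it is not a method prescribed by the organizers. The `<2xx>` tag format, the function names and the toy sentences are assumptions made for illustration.

```python
# Sketch of target-language tagging for multilingual training data.
# Assumptions: the "<2xx>" tag format, the helper names and the toy
# sentence pairs are illustrative, not part of the task definition.

def tag_source(sentence, tgt_lang):
    """Prepend a token telling the model which language to translate into."""
    return f"<2{tgt_lang}> {sentence}"


def build_multilingual_corpus(corpora):
    """Merge per-direction corpora into one list of (tagged source, target) pairs.

    `corpora` maps (source_lang, target_lang) to a list of
    (source_sentence, target_sentence) tuples.
    """
    merged = []
    for (_src_lang, tgt_lang), pairs in corpora.items():
        for src_sentence, tgt_sentence in pairs:
            merged.append((tag_source(src_sentence, tgt_lang), tgt_sentence))
    return merged


# Toy one-to-many example: English into Hindi and Tamil.
toy_corpora = {
    ("en", "hi"): [("Hello", "नमस्ते")],
    ("en", "ta"): [("Hello", "வணக்கம்")],
}
merged = build_multilingual_corpus(toy_corpora)
for tagged_source, target in merged:
    print(tagged_source, "->", target)
```

For a many-to-one model the tag is always the same target language and can be dropped; for many-to-many, the same tagging applies to every direction in the merged corpus.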
For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com". For questions related to this task, contact "prajdabre -at- gmail -dot- com" or "anoop.kunchukuttan -at- gmail -dot- com".
NICT (National Institute of Information and Communications Technology)
Last Modified: 2018-07-18