WAT 2023 Non-repetitive Translation Task

 

1.    Task Description

Writing style is especially important in news. To deliver high-quality texts to readers, English news writing follows many rules. For example, in broadcast news, writers are encouraged to follow the rules listed below [1]:

- Use the active voice rather than the passive voice
- Use positive phrases rather than negative phrases
- Avoid redundant phrases

For use in news production, machine translation systems should also be adapted to such rules. This task focuses specifically on the style of word/phrase repetition. In general, the repetition of simple words/phrases can create a monotonous or awkward impression, and in such cases it should be avoided [2]. In fact, in Jiji Japanese-English news articles, repetitive words/phrases in a Japanese news text are often translated with omission or paraphrase. Here are two examples:

Example 1 (omission):
Ja: 開発費を参加国間で分担できるため、開発に比べて費用を安く抑える事が可能となる。
En: It will allow the government to cut spending compared with full domestic development by sharing costs with partner countries.

Example 2 (paraphrase):
Ja: 昨年3月の声明は「戦争・軍事目的の科学研究を行わない」とする過去2回の声明を継承。
En: In the March 2017 statement, the council pledged to follow its two previous documents highlighting its determination not to conduct scientific research for military purposes.

In the first example, "開発費" is translated simply as "costs" rather than "development costs", probably because "development" is obvious from the context. In the second example, the two occurrences of the same word ("声明") are translated differently in English ("statement" and "documents").

The goal of this task is to produce translations that follow this style. Specifically, participants are required to control a machine translation system so that, for certain repetitive words/phrases in a source sentence, it avoids outputting the same words/phrases by using omission or paraphrase. (The translation direction is Japanese to English only.) In a sense, this task has the opposite aim to studies on improving lexical translation consistency.

It should be noted that, in principle, writers should first consider removing unnecessary words/phrases, and only then try paraphrasing [2]. Paraphrasing can obscure the meaning and should not be used excessively [3]. For simplicity, we assume that the test sentences in this task, which were actually translated with omission or paraphrase in Jiji news articles, all warrant omission or paraphrase. In addition, this task relaxes the above priority and assumes that paraphrasing is equally effective in avoiding repetition.

 

2.    Dataset

We provide development and test sets for this task, which are referred to as the Jiji 2023 development/test set in this description. In both data sets, all Japanese sentences contain some repetitive words/phrases that are translated into English with omission or paraphrase. (In this task, combining repetitive words/phrases into a single word/phrase is also considered an omission.) Repetitive words/phrases and their omitted/paraphrased translations are marked with tags in both data sets. The tagged words/phrases are the evaluation targets. Examples are as follows:

Input:
Src: マイナス金利や長期金利の0%誘導といった現在の政策に加え、「新たな政策も考えていきたい」と言及した。

Gold data for evaluation:
Src: マイナス金利や長期金利の0%誘導といった 現在の<target>政策</target>に加え、 「新たな<target>政策</target>も考えていきたい」と言及した。
Ref: "I want to consider new <target>steps</target>," in addition to the existing <target>measures</target>, such as guiding 10-year Japanese government bond yields to around zero and imposing a negative interest rate on part of current account deposits with the central bank.

Input:
Src: 報告書によると、県内外で避難生活を続ける人に加え、避難先で自宅を再建した人や元の住まいに戻った人が増え、状況は多様になった。

Gold data for evaluation:
Src: 報告書によると、 県内外で<target>避難</target>生活を続ける人に加え、 <target>避難</target>先で 自宅を再建した人や元の住まいに戻った人が増え、状況は多様になった。
Ref: According to a report on the survey, their situations have become diverse. || Some people continued to <target>live as evacuees</target>, while an increasing number had new homes constructed in <target>the locations they took shelter</target> or returned to their homes in the county.
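The tagged format above can be read with a few lines of code. The following is a minimal sketch, not an official tool: the function names are hypothetical, the English sentence is a shortened version of the first reference above, and it assumes the tags are never nested.

```python
import re

# Matches one tagged span; the tag name "target" is taken from the examples above.
TARGET_RE = re.compile(r"<target>(.*?)</target>")


def extract_targets(tagged: str) -> list[str]:
    """Return the text of every <target>...</target> span, in order of appearance."""
    return TARGET_RE.findall(tagged)


def strip_tags(tagged: str) -> str:
    """Remove the tags but keep the text they enclose."""
    return TARGET_RE.sub(r"\1", tagged)


# Shortened versions of the first example pair (for illustration only).
src = "現在の<target>政策</target>に加え、「新たな<target>政策</target>も考えていきたい」と言及した。"
ref = "I want to consider new <target>steps</target>, in addition to the existing <target>measures</target>."

print(extract_targets(src))
print(extract_targets(ref))
print(strip_tags(ref))
```

Listing the tagged spans side by side in this way makes it easy to see how each repeated Japanese expression was rendered in the reference.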

Specifically, the following files are contained in the Jiji 2023 development and test sets:

File name           Sentences                 Contents
nrep-dev.ja         70 (revised from 80)      Raw Japanese sentences
nrep-dev.en         70 (revised from 80)      Raw English sentences
tag-nrep-dev.ja     70 (revised from 80)      Japanese sentences with tags
tag-nrep-dev.en     70 (revised from 80)      English sentences with tags
nrep-test.ja        173 (revised from 196)    Raw Japanese sentences
tag-nrep-test.ja    173 (revised from 196)    Japanese sentences with tags
tag-nrep-test.en    173 (revised from 196)    English sentences with tags


When submitting the system results, participants cannot use "tag-nrep-test.ja" and must use "nrep-test.ja" instead. ("tag-nrep-test.ja" and "tag-nrep-test.en" are distributed for reference.) It should be noted that, unfortunately, the Jiji 2023 development and test sets, whose examples are retrieved from Jiji news articles, are not necessarily clean sentence pairs. (Therefore, the evaluation is done by manual inspection.) Since this task places value on the style of the original news writers, the original texts are also used as the development and test sets.


We also provide all the data from the WAT2020 Newswire tasks, which were also constructed from Jiji news articles. Specifically, these data have been continuously used from the WAT2020 Newswire tasks to the WAT2022 Newswire tasks. For simplicity we refer to these as the Jiji 2020 training, development and test sets. Participants can use all the Jiji 2020 data sets in the non-repetitive translation task, provided that the test set is not used for training and validation. The Jiji 2020 test set can be used to calculate automatic metrics (e.g., BLEU) to measure system performance for reference, although this is not directly related to the evaluation. The main files in the Jiji 2020 data sets are summarized below:

File name    Sentences    Contents
train.txt    200K         Parallel sentences (optionally remove noisy examples if necessary)
devc.txt     479          Parallel sentences (which partially overlap with the Jiji 2023 development data)
testc.txt    1851         Parallel sentences (which partially overlap with the Jiji 2023 test data)


In addition, we allow participants to optionally use any other external data, such as JParaCrawl v3.0 [4], since the size of the Jiji 2020 training set is limited.
When using external data, be sure to:

- Check the license of the data.
- Confirm that it does not contain the Jiji 2023 test data. (We have confirmed that JParaCrawl v3.0 meets this requirement.)
- Check the "Used Other Resources" box when submitting the results.
- Include an explanation of the data in the paper.
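One way to carry out the second check in the list above is to compare normalized sentences. The sketch below is only an illustration under stated assumptions: the function names and the normalization (NFKC plus whitespace removal) are our own choices, not a procedure prescribed by the organizers, and it detects only exact matches after normalization.

```python
import unicodedata


def normalize(sentence: str) -> str:
    """NFKC-normalize and drop all whitespace so trivial differences do not hide a match."""
    return "".join(unicodedata.normalize("NFKC", sentence).split())


def find_overlap(external: list[str], test: list[str]) -> list[str]:
    """Return the external sentences that also occur in the test set after normalization."""
    test_keys = {normalize(s) for s in test}
    return [s for s in external if normalize(s) in test_keys]


# Toy data for illustration; real use would read the sentences from the data files.
test_sentences = ["報告書によると、状況は多様になった。"]
external_sentences = ["報告書によると、 状況は多様になった。", "別の文です。"]

print(find_overlap(external_sentences, test_sentences))
```

Any sentence returned by such a check should be removed from the external data before training. Near-duplicates with small wording differences would require a fuzzier comparison.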


In summary, the dataset in this task is as follows:

- Training data: the Jiji 2020 training data (and optionally external data)
- Development data: the Jiji 2023 development data (and optionally the Jiji 2020 development data and external data)
- Test data: the Jiji 2023 test data

 

3.    Evaluation

Both the adequacy and the repetition of words/phrases in the translations are checked by hand. (This task does not use automatic metrics for the evaluation.) Human annotators will be assigned to this evaluation by the organizers. The model performance is evaluated by the total number of acceptable, i.e., both adequate and non-repetitive, translations. Some Japanese sentences lack contextual information (e.g., zero pronouns), and thus translations for such information are partially excluded from the adequacy evaluation.
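The scoring rule described above can be written down compactly. This is a minimal sketch with a hypothetical record type; the judgments themselves come from human annotators, and nothing here automates them.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One annotator decision for one translated sentence (hypothetical record type)."""
    adequate: bool
    non_repetitive: bool


def count_acceptable(judgments: list[Judgment]) -> int:
    """A translation is acceptable only if it is both adequate and non-repetitive."""
    return sum(1 for j in judgments if j.adequate and j.non_repetitive)


# Toy judgments for three sentences.
judgments = [
    Judgment(adequate=True, non_repetitive=True),
    Judgment(adequate=True, non_repetitive=False),
    Judgment(adequate=False, non_repetitive=True),
]

print(count_acceptable(judgments))
```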

As for the repetition check, even if a target expression is translated with omission in the reference, a paraphrase is also allowed as long as the meaning is appropriate, and vice versa. As mentioned above, this task simplifies the writing style principle and evaluates omissions and paraphrases equally. For omissions, only redundant words/phrases must be removed while keeping necessary words/phrases. For paraphrases, repetition is basically judged in terms of lexical word stems. Take the sentence above as an example:

Src: 昨年3月の声明は「戦争・軍事目的の科学研究を行わない」とする過去2回の声明を継承。
Ref: In the March 2017 statement, the council pledged to follow its two previous documents highlighting its determination not to conduct scientific research for military purposes.
Sys 1: In the March 2017 statement, the council pledged to follow its two previous statements highlighting its determination not to conduct scientific research for military purposes.
Sys 2: In the March 2017 declaration, the council pledged to follow its two previous documents highlighting its determination not to conduct scientific research for military purposes.

In this case, the lexical choices in English for "声明" are checked. While System 1 repeats the same word, System 2 successfully selects different words. Note that systems do not have to use the same expressions as the references. Alternative expressions (e.g., "declaration") can be used as long as their meaning is appropriate. (The tagged test sentences are only for reference, to show how news writers translated repetitive words/phrases.) In this task, conversions between passive and active (e.g., "attack" and "be attacked") and between parts of speech (e.g., "problematic" and "problem") are not considered paraphrases. Conversions to idioms (e.g., "visit" and "pay a visit") are considered paraphrases.
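During development, a rough automatic check can approximate the stem-based judgment illustrated by Systems 1 and 2. The sketch below is not the evaluation procedure, which is manual: the stemmer, the stopword list, and the function names are all simplifying assumptions, and the caller must supply the two English expressions that correspond to the repeated source word.

```python
import re

# Small, hypothetical stopword list; function words are ignored in the comparison.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "its", "their"}


def crude_stem(word: str) -> str:
    """Very rough stemmer (an assumption for this sketch): handles simple plurals only."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def content_stems(phrase: str) -> set[str]:
    """Stems of the non-stopword tokens in a phrase."""
    tokens = re.findall(r"[A-Za-z]+", phrase)
    return {crude_stem(t) for t in tokens if t.lower() not in STOPWORDS}


def is_repetitive(first: str, second: str) -> bool:
    """True if the two expressions share at least one content-word stem."""
    return bool(content_stems(first) & content_stems(second))


print(is_repetitive("statement", "statements"))    # System 1
print(is_repetitive("declaration", "documents"))   # System 2
```

Such a check flags System 1 and passes System 2, but it is deliberately crude. For instance, it would not recognize "problematic" and "problem" as sharing a stem, although the task does not count that conversion as a paraphrase, so its output should be treated only as a hint before human inspection.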

We also plan to compute the number of successful omissions and paraphrases in each system, as a reference.

 

4.    Application

See the WAT2023 application page.

 

5.    How to Obtain

- The Jiji 2023 and 2020 Data:

1. Complete and sign the license agreement. (See 6. Agreement)
2. Scan and email the signed agreement to Jiji Press (fukuyama -at- jiji.co.jp), and also send the original copy of the agreement to the following address:

FUKUYAMA, Toru
President’s Office
JIJI Press LTD.
5-15-8 Ginza, Chuo-ku,
Tokyo 104-8178, JAPAN

104-8178
東京都中央区銀座5-15-8
時事通信社長室
福山亨

3. Once Jiji Press Ltd. receives the original copy and approves the application, the WAT organizers will email the applicant a link to download the corpus. (Please note that Jiji Press Ltd. will provide the applicant's e-mail address to WAT.)

 

6.    Agreement

English, Japanese

 

7.    Instructions for the use of JiJi Corpus

Personal information must be anonymized if such expressions are used as examples in papers or presentations.

 

8.    Schedule

- The dataset and evaluation description in this document will be detailed by May 9, 2023
- Dataset distribution start date: May 9, 2023
- Shared Task Submission Deadline: July 7, 2023 (revised from June 16, 2023 and July 18, 2023)
- System Description Paper for Shared Tasks Submission Deadline: July 14, 2023 (revised from June 30, 2023 and August 1, 2023)
- Review Feedback of System Description Papers: July 28, 2023 (revised from July 28, 2023 and August 29, 2023)
- Camera-ready Deadline: August 4, 2023 (revised from August 4, 2023 and September 5, 2023)
- Workshop Dates: September 4, 2023 (revised from October 17, 2023)

 

9.    Contact

For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".

 

10.    Acknowledgements

These research results were obtained from the commissioned research (No. 225) by the National Institute of Information and Communications Technology (NICT), Japan.

 

11.    References

[1] Robert A. Papper, Broadcast News and Writing Stylebook, seventh edition, Routledge, 2020.
[2] Strunk & White, The Elements of Style Summary: Writing Tips from the Most Influential Guide to Writing, https://effectiviology.com/writing-tips-from-the-elements-of-style/#Avoid_repetition.
[3] Rene J. Cappon, The Associated Press Guide to News Writing, fourth edition, Peterson’s, 2019.
[4] Morishita et al., JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus, in Proc. of LREC, 2022.

 

CHANGE LOG

2023-4-25: initial version
2023-5-9: update 2. Dataset and 3. Evaluation
2023-5-16: update 2. Dataset and 3. Evaluation
2023-5-19: update 2. Dataset, 3. Evaluation, 8. Schedule and 11. Acknowledgements
2023-5-25: update 2. Dataset, 5. How to obtain and 8. Schedule

 

NHK (Japan Broadcasting Corporation)

 

Copyright©Jiji Press, Ltd. All rights reserved for the example sentences in this document.