WAT 2023 Non-repetitive Translation Task
1. Task Description
Writing style is especially important in news. To deliver high-quality texts to readers,
English news writing follows many rules. For example, in broadcast news, writers are encouraged to follow the rules
listed below [1]:
- Use the active voice rather than the passive voice
- Use positive phrases rather than negative phrases
- Avoid redundant phrases
For use in news production, machine translation systems should also be adapted to such rules. This task
focuses specifically on the style of word/phrase repetition. In general, the repetition of
simple words/phrases can create a monotonous or awkward impression, and in this case, it should be
properly avoided [2]. In fact, in Jiji Japanese-English news articles, repetitive words/phrases in a
Japanese news text are often translated with omission or paraphrase. Here are two examples:
Example 1 (omission):
Ja: 開発費を参加国間で分担できるため、開発に比べて費用を安く抑える事が可能となる。
En: It will allow the government to cut spending compared with full domestic development by sharing costs with partner countries.
Example 2 (paraphrase):
Ja: 昨年3月の声明は「戦争・軍事目的の科学研究を行わない」とする過去2回の声明を継承。
En: In the March 2017 statement, the council pledged to follow its two previous documents highlighting its determination not to conduct scientific research for military purposes.
In the first example, "開発費" is simply translated as "costs" instead of "development costs", probably because "development" is obvious from the context. The second example shows that the two occurrences of the same word (i.e., "声明") are translated differently in English.
The goal of this task is to produce translations that follow this style. Specifically, participants are required to control a machine translation system so that, for certain repetitive words/phrases in a source sentence, it avoids outputting the same words/phrases by using omission or paraphrase. (The translation direction is Japanese to English only.) In a sense, this task has the opposite aim to studies on improving lexical translation consistency.
It should be noted that, in principle, writers should first consider removing unnecessary words/phrases, and only then try paraphrasing [2]. Paraphrasing can obscure the meaning and should not be used excessively [3]. For simplicity, we assume that the test sentences in this task, which were actually translated with omission or paraphrase in Jiji news articles, are all worth omitting or paraphrasing. In addition, this task relaxes the above priority and assumes that paraphrasing is equally effective in avoiding repetition.
2. Dataset
We provide development and test sets for this task, which are referred to as the Jiji 2023
development/test set in this description. In both data sets, all Japanese sentences contain some
repetitive words/phrases that are translated into English with omission or paraphrase. (In this task,
combining repetitive words/phrases into a single word/phrase is also considered
as an omission.)
Repetitive words/phrases and their omitted/paraphrased translations are marked with tags
in both data sets. Tagged words/phrases are the evaluation targets. Examples are as
follows:
Input:
Src: マイナス金利や長期金利の0%誘導といった現在の政策に加え、「新たな政策も考えていきたい」と言及した。
Gold data for evaluation:
Src: マイナス金利や長期金利の0%誘導といった現在の<target>政策</target>に加え、「新たな<target>政策</target>も考えていきたい」と言及した。
Ref: "I want to consider new <target>steps</target>," in addition to the existing <target>measures</target>, such as guiding 10-year Japanese government bond yields to around zero and imposing a negative interest rate on part of current account deposits with the central bank.
Input:
Src: 報告書によると、県内外で避難生活を続ける人に加え、避難先で自宅を再建した人や元の住まいに戻った人が増え、状況は多様になった。
Gold data for evaluation:
Src: 報告書によると、県内外で<target>避難</target>生活を続ける人に加え、<target>避難</target>先で自宅を再建した人や元の住まいに戻った人が増え、状況は多様になった。
Ref: According to a report on the survey, their situations have become diverse. || Some people continued to <target>live as evacuees</target>, while an increasing number had new homes constructed in <target>the locations they took shelter</target> or returned to their homes in the county.
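The tagged examples above can be read programmatically. The following is a minimal sketch, assuming that each tagged line marks its evaluation targets with plain `<target>...</target>` pairs as shown in the examples; the function names are hypothetical and not part of the distributed data or any official tool.

```python
import re
from typing import List

# Matches one tagged span. The tag name "target" is taken from the examples above.
TARGET_RE = re.compile(r"<target>(.*?)</target>")

def extract_targets(tagged_line: str) -> List[str]:
    """Return the text of every <target>...</target> span, in order of appearance."""
    return [span.strip() for span in TARGET_RE.findall(tagged_line)]

def strip_tags(tagged_line: str) -> str:
    """Remove the tags while keeping the text inside them."""
    return TARGET_RE.sub(r"\1", tagged_line)

ref = ('"I want to consider new <target>steps</target>," in addition to '
       'the existing <target>measures</target>')
print(extract_targets(ref))  # ['steps', 'measures']
print(strip_tags(ref))
```

How omitted targets are represented in the tagged English files is not stated here, so this sketch makes no assumption about that case.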
Specifically, the following files are contained in the Jiji 2023 development and test sets:
File name | Sentences | Contents |
---|---|---|
nrep-dev.ja | | Raw Japanese sentences |
nrep-dev.en | | Raw English sentences |
tag-nrep-dev.ja | | Japanese sentences with tags |
tag-nrep-dev.en | | English sentences with tags |
nrep-test.ja | | Raw Japanese sentences |
tag-nrep-test.ja | | Japanese sentences with tags |
tag-nrep-test.en | | English sentences with tags |
When submitting system results, participants cannot use "tag-nrep-test.ja" and must use
"nrep-test.ja" instead.
("tag-nrep-test.ja" and "tag-nrep-test.en" are distributed for reference.)
It should be noted that, unfortunately, the Jiji 2023 development and test sets, whose examples are retrieved from Jiji
news articles, are not necessarily clean sentence pairs. (Therefore,
the evaluation is done by manual inspection.)
Since this task places value on the style of the original news writers, the original texts are also used as
development and test sets.
We also provide all the data from the WAT2020 Newswire tasks, which were also constructed from Jiji news articles.
Specifically, these data have been used continuously from the WAT2020 Newswire tasks to the WAT2022 Newswire tasks.
For simplicity, we refer to these as the Jiji 2020 training, development and test sets.
Participants can use all the Jiji 2020 data sets in the non-repetitive translation task, provided that
the test set is not used for training or validation.
The Jiji 2020 test set can be used to calculate automatic metrics (e.g., BLEU) to measure system
performance for reference, although this is not directly related to the evaluation.
The main files in the Jiji 2020 data sets are summarized below:
File name | Sentences | Contents |
---|---|---|
train.txt | 200K | Parallel sentences (optionally remove noisy examples if necessary) |
devc.txt | 479 | Parallel sentences (which partially overlap with the Jiji 2023 development data) |
testc.txt | 1851 | Parallel sentences (which partially overlap with the Jiji 2023 test data) |
In addition, we allow participants to optionally use any other external data, such as JParaCrawl v3.0 [4], since
the size of the Jiji 2020 training set is limited.
When using external data, be sure to
- Check the license of the data.
- Confirm that it does not contain the Jiji 2023 test data. (We have confirmed that JParaCrawl
v3.0 meets this requirement.)
- Check the "Used Other Resources" box when submitting the results.
- Include an explanation of the data in the paper.
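The second check in the list above can be approximated with an exact-match comparison. The sketch below is only an illustration under stated assumptions: the normalization (NFKC plus whitespace removal) and the function names are hypothetical choices, not a procedure prescribed by the organizers, and near-duplicates that differ in wording would not be detected.

```python
import unicodedata

def normalize(sentence: str) -> str:
    """NFKC-normalize and drop all whitespace so trivial variants compare equal."""
    return "".join(unicodedata.normalize("NFKC", sentence).split())

def find_overlap(external_sentences, test_sentences):
    """Return external sentences whose normalized form also occurs in the test set."""
    test_keys = {normalize(s) for s in test_sentences}
    return [s for s in external_sentences if normalize(s) in test_keys]

# Toy data for illustration only; real input would be read from the data files.
test_set = ["状況は多様になった。"]
external = ["状況は 多様になった。", "新たな政策も考えていきたい。"]
print(find_overlap(external, test_set))  # ['状況は 多様になった。']
```

Any sentence reported by such a check would need to be removed from the external data before training.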
In summary, the dataset in this task is as follows:
- Training data: the Jiji 2020 training data (and optionally external data)
- Development data: the Jiji 2023 development data (and optionally the Jiji 2020 development data and external data)
- Test data: the Jiji 2023 test data
3. Evaluation
Both the adequacy and the repetition of words/phrases in the translations are checked by hand. (This task does not use automatic metrics for the evaluation.) Human annotators will be assigned to this evaluation by the organizers. The model performance is evaluated by the total number of acceptable, i.e., both adequate and non-repetitive, translations. Some Japanese sentences lack contextual information (e.g., zero pronouns), and thus translations for such information are partially excluded from the adequacy evaluation.
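The scoring rule described above can be written down as a simple count. This is a minimal sketch, assuming that each translation receives two boolean judgments from the annotators; the `Judgment` class and its field names are hypothetical, and the actual annotation format is not specified in this description.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    adequate: bool        # annotator judged the translation adequate
    non_repetitive: bool  # annotator judged the target words/phrases non-repetitive

def count_acceptable(judgments):
    """Count translations that are both adequate and non-repetitive."""
    return sum(1 for j in judgments if j.adequate and j.non_repetitive)

judgments = [Judgment(True, True), Judgment(True, False), Judgment(False, True)]
print(count_acceptable(judgments))  # 1
```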
As for the repetition check, even if a target expression is translated with omission in the
reference, a paraphrase is also allowed as long as the meaning is appropriate, and vice versa. As mentioned
above, this task simplifies the writing style principle and evaluates omissions and paraphrases equally.
For omissions, only redundant words/phrases must be removed while keeping necessary words/phrases. For
paraphrases, repetition is basically judged in terms of lexical word stems. Take the sentence above as
an example:
Src: 昨年3月の声明は「戦争・軍事目的の科学研究を行わない」とする過去2回の声明を継承。
Ref: In the March 2017 statement, the council pledged to follow its two previous documents highlighting its determination not to conduct scientific research for military purposes.
Sys 1: In the March 2017 statement, the council pledged to follow its two previous statements highlighting its determination not to conduct scientific research for military purposes.
Sys 2: In the March 2017 declaration, the council pledged to follow its two previous documents highlighting its determination not to conduct scientific research for military purposes.
In this case, the lexical choices in English for "声明" are checked. While System 1 repeats the same word, System 2 successfully selects different words. Note that systems do not have to use the same expressions as the references. Alternative expressions (e.g., "declaration") can be used as long as their meaning is appropriate. (The tagged test sentences are only for reference to show how news writers translated repetitive words/phrases.) In this task, conversions between passive and active (e.g., "attack" and "be attacked") and between parts of speech (e.g., "problematic" and "problem") are not considered paraphrases. Conversions to idioms (e.g., "visit" and "pay a visit") are considered paraphrases.
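For development purposes, a rough automatic approximation of the stem-based check may be useful, although the official evaluation is manual. The sketch below is an assumption-laden illustration: `crude_stem` is a hypothetical, very simple suffix stripper standing in for a real stemmer, and the stopword list is arbitrary. It does not implement the rules on passive/active or part-of-speech conversions (for example, it would not relate "problematic" to "problem").

```python
import re

# Arbitrary small stopword list so function words do not count as repetition.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "as", "they", "its", "their"}

def crude_stem(word: str) -> str:
    """Very rough suffix stripping; a stand-in for a real stemmer or lemmatizer."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word

def stems(phrase: str) -> set:
    """Stems of the content words in a phrase."""
    words = re.findall(r"[A-Za-z]+", phrase)
    return {crude_stem(w) for w in words if w.lower() not in STOPWORDS}

def is_repetitive(first: str, second: str) -> bool:
    """True if the two target translations share at least one content-word stem."""
    return bool(stems(first) & stems(second))

print(is_repetitive("statement", "statements"))   # System 1: True
print(is_repetitive("declaration", "documents"))  # System 2: False
```

An omitted target can be passed as an empty string, in which case no stems are shared and the pair is treated as non-repetitive.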
We also plan to compute the number of successful omissions and paraphrases in each system, as a reference.
4. Application
See the WAT2023 application page.
5. How to Obtain
- The Jiji 2023 and 2020 Data:
1. Complete and sign the license agreement. (See 6. Agreement)
2. Scan and email the signed agreement to Jiji Press (fukuyama -at- jiji.co.jp),
and also send the original copy of the agreement to the following address:
FUKUYAMA, Toru
President’s Office
JIJI Press LTD.
5-15-8 Ginza, Chuo-ku,
Tokyo 104-8178, JAPAN
104-8178
東京都中央区銀座5-15-8
時事通信社長室
福山亨
3. WAT organizers will email the applicant a link to download this corpus
once Jiji Press Ltd. receives the original copy and approves the application.
(Please note that Jiji Press Ltd. will provide the applicant's e-mail address to WAT.)
6. Agreement
7. Instructions for the use of the Jiji Corpus
Personal information must be anonymized if such expressions are used as
examples in papers or presentations.
8. Schedule
- The dataset and evaluation description in this document will be detailed by May 9, 2023
- Dataset distribution start date: May 9, 2023
- Shared Task Submission Deadline: June 16, 2023 July 18, 2023 July 7, 2023
- System Description Paper for Shared Tasks Submission Deadline: June 30, 2023 August 1, 2023 July 14, 2023
- Review Feedback of System Description Papers: July 28, 2023 August 29, 2023 July 28, 2023
- Camera-ready Deadline: August 4, 2023 September 5, 2023 August 4, 2023
- Workshop Dates: October 17, 2023 September 4, 2023
9. Contact
For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
10. Acknowledgements
These research results were obtained from the commissioned research (No. 225) by the National
Institute of Information and Communications Technology (NICT), Japan.
11. Reference
[1] Robert A. Papper, Broadcast News and Writing Stylebook, seventh edition, Routledge, 2020.
[2] Strunk & White, The Elements of Style Summary: Writing Tips from the Most Influential Guide to Writing, https://effectiviology.com/writing-tips-from-the-elements-of-style/#Avoid_repetition.
[3] Rene J. Cappon, The Associated Press Guide to News Writing, fourth edition, Peterson’s, 2019.
[4] Morishita et al., JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus, in Proc. of LREC, 2022.
CHANGE LOG
2023-4-25: initial version
2023-5-9: update 2. Dataset and 3. Evaluation
2023-5-16: update 2. Dataset and 3. Evaluation
2023-5-19: update 2. Dataset, 3. Evaluation, 8. Schedule and 11. Acknowledgements
2023-5-25: update 2. Dataset, 5. How to obtain and 8. Schedule
NHK (Japan Broadcasting Corporation)
Copyright©Jiji Press, Ltd. All rights reserved for the example sentences in this document.