This page is for reference; the original (in Japanese) is here.
This page describes notes on the corpus.
The 'Timely Disclosure Documents Corpus' was constructed by Japan Exchange Group (JPX) and provided to WAT to encourage the development of machine translation.
Timely Disclosure Documents Corpus

1. Notes of Sources
 1.1. Unbalanced information
  1.1.1. English translation of nouns and pronouns
  1.1.2. Omission of figures
 1.2. Improper characters
 1.3. Improper alignment procedures
2. Specification of this corpus
 2.1. General
 2.2. Items
 2.3. Split into texts and items
 2.4. Statistics
3. Data Splitting of TRAIN, DEV, DEVTEST, and TEST
4. Evaluation
5. Normalization
 5.1. Replaced characters
 5.2. Unicode normalization
 5.3. Deleted characters
 5.4. Deleted spaces
 5.5. Deleted pairs of sentences
6. FAQ
[Important] Dataset Update (announced on 2019-06-07)
Sentences added to DEV and DEVTEST (announced on 2019-06-12)
Additional information pertaining to registration
Specifications of dataset before update (2019-06-07 or earlier)
Related links
Change log
Author
Item | Description |
---|---|
Language pair | Japanese - English |
Source documents | Timely Disclosure Documents (about 16,000 documents) |
Author of Source documents | Companies listed on Tokyo Stock Exchange |
Disclosure date of Source documents | January 2016 to June 2018 |
Sort order of sentences | In no particular order |
Sentence Alignment | Manual |
The corpus consists of 1.4M Japanese-English parallel sentences made from past timely disclosure documents.
We made this corpus by manually aligning sentences from past timely disclosure documents (PDF).
Not all timely disclosure documents are translated into English.
Not all sentences in the timely disclosure documents (PDF) are included in this corpus (e.g. page numbers).
Not all sentences correspond one-to-one.
Timely disclosure documents contain important figures (e.g. sales, profits, dates) and proper nouns (e.g. names of persons, places, companies, businesses, and products).
This is critical information for investors, so mistranslations should be avoided and overall translation quality should be improved.
Japanese timely disclosure documents sometimes contain Chinese proper nouns.
Since this corpus does not preserve context, it includes pairs of sentences whose Japanese and English information is not equivalent (unbalanced).
However, in this task, accurate translation based on the contexts described below is not required.
Abbreviation of subjects and objects in Japanese
There are cases where some companies omit the subject and the object in Japanese but supply proper nouns in English.
Preferential use of personal pronouns in Japanese
In Japanese timely disclosure documents, some companies frequently use pronouns such as '当社 (the Company)' and '同氏 (the Person)', but in English these are replaced with proper nouns.
There are cases where some companies omit dates in Japanese but supply them in English.
Accounting periods
The following reasons cause improper characters (e.g. character corruption) in this corpus:
These characters may be replaced by half-width question marks (?).
When copying sentences from a timely disclosure document (PDF), symbols at the beginning and end of sentences may have been omitted unintentionally.
Item | Description |
---|---|
File format | TSV |
Character code | UTF-8 |
Newline code | CRLF |
Delimiter | Tab (U+0009) |
Quote character | None |
Escape character | Backslash (U+005C) |
Prohibited characters | Tab (U+0009), Newline codes (U+000D, U+000A) |
The escape character is NOT the yen sign (¥, U+00A5) but the backslash (\, U+005C).
These two characters may be rendered with the same glyph on some platforms.
An example of importing this corpus in Python:
import csv
with open('train.tsv', encoding='utf-8', newline='') as f:
    corpus_data = list(csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE, escapechar='\\'))
Col number | Name | Data type | Required |
---|---|---|---|
1 | Document hash | String | TRUE |
2 | Sentence hash | String | TRUE |
3 | Japanese sentences | String | TRUE |
4 | English sentences | String | FALSE |
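As a sketch of how rows in this four-column layout can be parsed, the snippet below uses hypothetical hash values and sentences; the empty English field in the second row reflects the fact that not every document is translated.

```python
import csv
import io

# Hypothetical rows in the documented layout:
# document hash, sentence hash, Japanese sentence, English sentence (optional).
sample = (
    "d41d8cd9\ta1b2c3\t当社は配当を実施します。\tThe Company will pay a dividend.\r\n"
    "d41d8cd9\td4e5f6\t注記\t\r\n"
)

rows = list(csv.reader(io.StringIO(sample), delimiter="\t",
                       quoting=csv.QUOTE_NONE, escapechar="\\"))
for doc_hash, sent_hash, ja, en in rows:
    # The English field may be empty because not all documents are translated.
    print(doc_hash, ja, en or "(no English translation)")
```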
document_hash = hash(salt + document_id)
sentence_hash = hash(salt + document_id + sentence_id)
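The hash scheme above can be illustrated as follows; note that the actual salt, ID formats, and hash algorithm are not published, so every concrete value below (including the choice of SHA-256) is an assumption.

```python
import hashlib

# Illustration only: the real salt, ID formats, and hash algorithm are not
# published; SHA-256 and the values below are assumptions.
def jpx_hash(salt: str, *parts: str) -> str:
    return hashlib.sha256((salt + "".join(parts)).encode("utf-8")).hexdigest()

salt = "SECRET_SALT"            # hypothetical
document_id = "20180101-0001"   # hypothetical
sentence_id = "0001"            # hypothetical

document_hash = jpx_hash(salt, document_id)
sentence_hash = jpx_hash(salt, document_id, sentence_id)
```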
DEV / DEVTEST / TEST are each split into two (2) sub-datasets: X_TEXTS, which consists of texts, and X_ITEMS, which consists of the others.
Sentences which end with the Japanese full stop (。, U+3002) are classified as 'texts'; the others are classified as 'items'.
The following are examples of sentences classified as items:
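As an illustration (with hypothetical sentences), the punctuation-based rule can be sketched as:

```python
def classify(ja_sentence: str) -> str:
    # A Japanese sentence ending with the full stop 。 (U+3002) is a 'text';
    # anything else (headings, table cells, labels) is an 'item'.
    return "texts" if ja_sentence.endswith("\u3002") else "items"

print(classify("当期の業績は好調でした。"))  # texts
print(classify("1. 配当の内容"))             # items
```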
Data Type | File Name | Number of sentences | Number of unique pairs | Number of source documents |
---|---|---|---|---|
TRAIN_2016-2017 | train_2016-2017.tsv | 1,089,346 | 614,817 | 12,663 |
TRAIN_2018 | train_2018.tsv | 314,649 | 218,495 | 3,128 |
DEV_ITEMS | dev_items.tsv | 2,845 | 2,650 | 242 |
DEV_TEXTS | dev_texts.tsv | 1,153 | 1,148 | 210 |
DEVTEST_ITEMS | devtest_items.tsv | 2,900 | 2,671 | 244 |
DEVTEST_TEXTS | devtest_texts.tsv | 1,114 | 1,111 | 209 |
TEST_ITEMS | test_items.tsv | 2,129 | 1,763 | 164 |
TEST_TEXTS | test_texts.tsv | 1,153 | 1,135 | 144 |
Range of Disclosure date of Source documents:
The numbers of Source documents of DEV / DEVTEST / TEST before splitting:
The dataset TRAIN_2016-2017 was created from documents disclosed from January 1, 2016 to December 31, 2017.
The datasets DEV, DEVTEST, and TEST were created by the following procedures:
The dataset TRAIN_2018 was created from the documents that were not targeted by the above-mentioned extraction.
Therefore, the sets of source documents for TRAIN, DEV, DEVTEST, and TEST are independent of each other.
The DEV, DEVTEST, and TEST datasets contain sentences with proper nouns and figures, for which translation quality is emphasized.
(TBD)
We normalized the sentences in this corpus as follows:
As described in Reference files, character substitution of the specified code is performed.
The examples are as follows:
Code (Before) | Code (After) | Symbol (Before) | Name (Before) | Symbol (After) | Name (After) |
---|---|---|---|---|---|
FF5E | 301C | ~ | FULLWIDTH TILDE | 〜 | WAVE DASH |
007E | 301C | ~ | TILDE | 〜 | WAVE DASH |
02F7 | 301C | ˷ | MODIFIER LETTER LOW TILDE | 〜 | WAVE DASH |
2053 | 301C | ⁓ | SWUNG DASH | 〜 | WAVE DASH |
223C | 301C | ∼ | TILDE OPERATOR | 〜 | WAVE DASH |
22BF | 25B3 | ⊿ | RIGHT TRIANGLE | △ | WHITE UP-POINTING TRIANGLE |
25B5 | 25B3 | ▵ | WHITE UP-POINTING SMALL TRIANGLE | △ | WHITE UP-POINTING TRIANGLE |
25FF | 25B3 | ◿ | LOWER RIGHT TRIANGLE | △ | WHITE UP-POINTING TRIANGLE |
2B26 | 25C7 | ⬦ | WHITE MEDIUM DIAMOND | ◇ | WHITE DIAMOND |
2B28 | 25C7 | ⬨ | WHITE MEDIUM LOZENGE | ◇ | WHITE DIAMOND |
2B2B | 25C7 | ⬫ | WHITE SMALL LOZENGE | ◇ | WHITE DIAMOND |
25CA | 25C7 | ◊ | LOZENGE | ◇ | WHITE DIAMOND |
2662 | 25C7 | ♢ | WHITE DIAMOND SUIT | ◇ | WHITE DIAMOND |
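A minimal sketch of this substitution, using a subset of the table above as a Python translation table:

```python
# A subset of the substitution table above, as a Python translation table.
REPLACEMENTS = str.maketrans({
    "\uFF5E": "\u301C",  # FULLWIDTH TILDE -> WAVE DASH
    "\u007E": "\u301C",  # TILDE           -> WAVE DASH
    "\u223C": "\u301C",  # TILDE OPERATOR  -> WAVE DASH
    "\u25CA": "\u25C7",  # LOZENGE         -> WHITE DIAMOND
})

print("2016年\uFF5E2017年".translate(REPLACEMENTS))  # 2016年〜2017年
```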
Mainly NFKC (Normalization Form Compatibility Composition) is applied, with the following exceptions:
Numbers enclosed within a circle (U+2460 - U+2473)
Two dot leaders (U+2025)
Horizontal ellipsis (U+2026)
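A sketch of NFKC normalization with these exceptions, applied per character for simplicity (this ignores composition across neighboring characters, which the actual pipeline may handle differently):

```python
import unicodedata

# Code points exempt from NFKC normalization:
# circled numbers U+2460-U+2473, two dot leader U+2025, horizontal ellipsis U+2026.
EXEMPT = {chr(c) for c in range(0x2460, 0x2474)} | {"\u2025", "\u2026"}

def nfkc_with_exceptions(text: str) -> str:
    # Per-character NFKC: exempt characters pass through unchanged.
    return "".join(ch if ch in EXEMPT else unicodedata.normalize("NFKC", ch)
                   for ch in text)

print(nfkc_with_exceptions("\u2460\uFF21\u2026"))  # ① and … are kept; Ａ -> A
```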
(Reference) Unicode, Inc., UAX #15: Unicode Normalization Forms
(Reference) SADAHIRO Tomoyuki, Unicode正規化 (Unicode Normalization)
Code | Symbol | Name |
---|---|---|
2412 | ␒ | SYMBOL FOR DEVICE CONTROL TWO |
2413 | ␓ | SYMBOL FOR DEVICE CONTROL THREE |
2414 | ␔ | SYMBOL FOR DEVICE CONTROL FOUR |
0327 | | COMBINING CEDILLA |
0332 | | COMBINING LOW LINE |
0337 | | COMBINING SHORT SOLIDUS OVERLAY |
05B9 | | HEBREW POINT HOLAM |
FFFC | | OBJECT REPLACEMENT CHARACTER |
FFFD | � | REPLACEMENT CHARACTER |
2028 | | LINE SEPARATOR |
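A sketch of deleting the characters listed above:

```python
# Translation table that deletes the characters listed above
# (mapping a code point to None removes it).
DELETED = dict.fromkeys([0x2412, 0x2413, 0x2414, 0x0327, 0x0332,
                         0x0337, 0x05B9, 0xFFFC, 0xFFFD, 0x2028])

print("売上\u2028高".translate(DELETED))  # 売上高
```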
Extra spaces are removed as follows:
Pairs of sentences that meet the following conditions were deleted:
Does this corpus include sentences from CG Reports (Corporate Governance Reports)?
The source documents of this corpus include CG Reports.
Problems regarding machine translation of CG Reports are described in the following documents.
Why do some datasets include the same sentences? (Why are sentences in TEST included in other datasets?)
Datasets provided before 2019-06-07 will be updated.
The outline of the update is as follows:
Existing TEST will be renamed to DEVTEST, and TEST will be newly created.
Two (2) new fields, Document hash and Sentence hash, will be added to TRAIN / DEV / DEVTEST / TEST.
Roughly 200 new sentences will be added to DEV and DEVTEST.
TRAIN will be split into two (2) sub-datasets according to the Disclosure date of Source documents, as stated below:
TRAIN_2016-2017: 2016-01-01 - 2017-12-31 (24 months)
TRAIN_2018: 2018-01-01 - 2018-06-30 (6 months)
There are no updates to TRAIN other than the data splitting stated above.
DEV / DEVTEST / TEST will each be split into two (2) sub-datasets: X_ITEMS, which consists of nouns and phrases, and X_TEXTS, which consists of texts.
Given updates explained above, the composition of entire data sets will be as follows:
Before | After | Remarks |
---|---|---|
TRAIN (train.tsv) | TRAIN_2016-2017 (train_2016-2017.tsv) | Range of Disclosure date is independent of the other data sets |
TRAIN_2018 (train_2018.tsv) | Range of Disclosure date overlaps with DEV / DEVTEST / TEST | |
DEV (dev.tsv) | DEV_ITEMS (dev_items.tsv) | Nouns and phrases extracted from Before dev.tsv after 200 sentences are added |
DEV_TEXTS (dev_texts.tsv) | Texts extracted from Before dev.tsv after 200 sentences are added | |
TEST (test.tsv) | DEVTEST_ITEMS (devtest_items.tsv) | Nouns and phrases extracted from Before test.tsv after 200 sentences are added |
DEVTEST_TEXTS (devtest_texts.tsv) | Texts extracted from Before test.tsv after 200 sentences are added | |
TEST_ITEMS (test_items.tsv) | Newly created | |
TEST_TEXTS (test_texts.tsv) | Newly created |
105 and 138 new sentences were added to DEV and DEVTEST, respectively.
The added sentences in each dataset come from a single source document.
The Document hashes of those source documents are as follows:
When registering on the Leaderboard, please submit the result for the new TEST, NOT the result for DEVTEST.
Please state which TRAIN was used in the comment field of the Leaderboard.
Col number | Name | Data type | Required |
---|---|---|---|
1 | Japanese sentences | String | TRUE |
2 | English sentences | String | TRUE |
Data Type | File Name | Number of sentences | Number of unique pairs | Number of original documents |
---|---|---|---|---|
TRAIN | train.tsv | 1,403,995 | 762,095 | 15,791 |
DEV | dev.tsv | 3,893 | 3,671 | 250 |
TEST | test.tsv | 3,877 | 3,620 | 251 |
Range of Disclosure date of Source documents:
2019-06-21: Update Statistics of the numbers of Source documents, Append Related links, etc
2019-06-14: Update Statistics of TEST
2019-06-12: Update methods of split into texts and items, Statistics of DEV and DEVTEST, etc
2019-06-10: Update Statistics of TRAIN, etc
2019-06-07: Update dataset, Append FAQ, etc
2019-05-11: Open
Japan Exchange Group, Inc.