This page is for reference; the original (in Japanese) is here.
This page describes notes on the corpus.
The 'Timely Disclosure Documents Corpus' was constructed by Japan Exchange Group (JPX) and provided to WAT to encourage the development of machine translation.
Timely Disclosure Documents Corpus

1. Notes of Sources
 1.1. Unbalanced information
  1.1.1. English translation of nouns and pronouns
  1.1.2. Omission of figures
 1.2. Improper characters
 1.3. Improper alignment procedures
2. Specification of this corpus
 2.1. General
 2.2. Items
 2.3. Split into texts and items
 2.4. Statistics
3. Data Splitting of TRAIN, DEV, DEVTEST, and TEST
4. Evaluation
5. Normalization
 5.1. Replaced characters
 5.2. Unicode normalization
 5.3. Deleted characters
 5.4. Deleted spaces
 5.5. Deleted pairs of sentences
6. FAQ
[Important] Dataset Update (announced on 2019-06-07)
Sentences added to DEV and DEVTEST (announced on 2019-06-12)
Additional information pertaining to registration
Specifications of dataset before update (2019-06-07 or earlier)
Related links
Change log
Author
Item | Description |
---|---|
Language pair | Japanese - English |
Source documents | Timely Disclosure Documents (about 16,000 documents) |
Author of Source documents | Companies listed on Tokyo Stock Exchange |
Disclosure date of Source documents | January 2016 to June 2018 |
Sort order of sentences | In no particular order |
Sentence Alignment | Manual |
The corpus consists of 1.4M Japanese-English parallel sentences made from past timely disclosure documents.
We made this corpus by manually aligning sentences from past timely disclosure documents (PDF).
Not all timely disclosure documents are translated into English.
Not all sentences in the timely disclosure documents (PDF) are included in this corpus (e.g. page numbers).
Not all sentences correspond one-to-one.
Timely disclosure documents contain important figures (e.g. sales, profits, dates) and proper nouns (e.g. names of persons, places, companies, businesses, and products).
This is critical information for investors, so mistranslations should be avoided and overall translation quality should be improved.
Japanese timely disclosure documents sometimes contain Chinese proper nouns.
Since this corpus does not preserve context, it includes pairs of sentences whose Japanese and English information is not equivalent (unbalanced).
However, in this task, accurate translation based on the contexts described below is not required.
Abbreviation of subjects and objects in Japanese
There are cases where some companies omit the subject and the object in Japanese but supply proper nouns in English.
Preferential use of personal pronouns in Japanese
In Japanese timely disclosure documents, some companies frequently use pronouns such as '当社 (the Company)' and '同氏 (the Person)', but in English these are replaced with proper nouns.
There are cases where some companies omit dates in Japanese but supply them in English.
Accounting periods
The following reasons cause improper characters (e.g. character corruption) in this corpus:
These characters may be replaced by half-width question marks (?).
When copying sentences from a timely disclosure document (PDF), symbols at the beginning and end of sentences may have been omitted unintentionally.
Item | Description |
---|---|
File format | TSV |
Character code | UTF-8 |
Newline code | CRLF |
Delimiter | Tab (U+0009) |
Quote character | None |
Escape character | Backslash (U+005C) |
Prohibited characters | Tab (U+0009), Newline codes (U+000D, U+000A) |
The escape character is NOT the yen sign (¥, U+00A5) but the backslash (\, U+005C).
These two characters may be rendered with the same glyph on some platforms.
An example of importing this corpus in Python:
import csv
with open('train.tsv', encoding='utf-8', newline='') as f:
    corpus_data = list(csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE, escapechar='\\'))
Col number | Name | Data type | Required |
---|---|---|---|
1 | Document hash | String | TRUE |
2 | Sentence hash | String | TRUE |
3 | Japanese sentences | String | TRUE |
4 | English sentences | String | FALSE |
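As a sketch of how rows in this four-column layout can be parsed, the snippet below uses hypothetical hash values and sentences; the empty English field in the second row reflects the fact that not every document is translated.

```python
import csv
import io

# Hypothetical rows in the documented layout:
# document hash, sentence hash, Japanese sentence, English sentence (optional).
sample = (
    "d41d8cd9\ta1b2c3\t当社は配当を実施します。\tThe Company will pay a dividend.\r\n"
    "d41d8cd9\td4e5f6\t注記\t\r\n"
)

rows = list(csv.reader(io.StringIO(sample), delimiter="\t",
                       quoting=csv.QUOTE_NONE, escapechar="\\"))
for doc_hash, sent_hash, ja, en in rows:
    # The English field may be empty because not all documents are translated.
    print(doc_hash, ja, en or "(no English translation)")
```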
document_hash = hash(salt + document_id)
sentence_hash = hash(salt + document_id + sentence_id)
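The hash scheme above can be illustrated as follows; note that the actual salt, ID formats, and hash algorithm are not published, so every concrete value below (including the choice of SHA-256) is an assumption.

```python
import hashlib

# Illustration only: the real salt, ID formats, and hash algorithm are not
# published; SHA-256 and the values below are assumptions.
def jpx_hash(salt: str, *parts: str) -> str:
    return hashlib.sha256((salt + "".join(parts)).encode("utf-8")).hexdigest()

salt = "SECRET_SALT"            # hypothetical
document_id = "20180101-0001"   # hypothetical
sentence_id = "0001"            # hypothetical

document_hash = jpx_hash(salt, document_id)
sentence_hash = jpx_hash(salt, document_id, sentence_id)
```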
DEV / DEVTEST / TEST are each split into two (2) sub-datasets: X_TEXTS, which consists of texts, and X_ITEMS, which consists of the others.
Sentences which end with the Japanese full stop (。, U+3002) are classified as 'texts'; the others are classified as 'items'.
The following are examples of sentences classified as items:
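As an illustration (with hypothetical sentences), the punctuation-based rule can be sketched as:

```python
def classify(ja_sentence: str) -> str:
    # A Japanese sentence ending with the full stop 。 (U+3002) is a 'text';
    # anything else (headings, table cells, labels) is an 'item'.
    return "texts" if ja_sentence.endswith("\u3002") else "items"

print(classify("当期の業績は好調でした。"))  # texts
print(classify("1. 配当の内容"))             # items
```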
Data Type | File Name | Number of sentences | Number of unique pairs | Number of source documents |
---|---|---|---|---|
TRAIN_2016-2017 | train_2016-2017.tsv | 1,089,346 | 614,817 | 12,663 |
TRAIN_2018 | train_2018.tsv | 314,649 | 218,495 | 3,128 |
DEV_ITEMS | dev_items.tsv | 2,845 | 2,650 | 242 |
DEV_TEXTS | dev_texts.tsv | 1,153 | 1,148 | 210 |
DEVTEST_ITEMS | devtest_items.tsv | 2,900 | 2,671 | 244 |
DEVTEST_TEXTS | devtest_texts.tsv | 1,114 | 1,111 | 209 |
TEST_ITEMS | test_items.tsv | 2,129 | 1,763 | 164 |
TEST_TEXTS | test_texts.tsv | 1,153 | 1,135 | 144 |
Range of Disclosure date of Source documents:
The numbers of Source documents of DEV / DEVTEST / TEST before splitting:
The dataset TRAIN_2016-2017 was created from documents disclosed from January 1, 2016 to December 31, 2017.
The datasets DEV, DEVTEST, and TEST were created by the following procedures:
The dataset TRAIN_2018 was created from the documents that were not targeted by the above-mentioned extraction.
Therefore, the sets of source documents for TRAIN, DEV, DEVTEST, and TEST are independent of each other.
The DEV, DEVTEST, and TEST datasets contain sentences with proper nouns and figures, for which translation quality is emphasized.
(TBD)
We normalized the sentences in this corpus as follows:
As described in Reference files, character substitution of the specified code is performed.
The examples are as follows:
Code (Before) | Code (After) | Symbol (Before) | Name (Before) | Symbol (After) | Name (After) |
---|---|---|---|---|---|
FF5E | 301C | ~ | FULLWIDTH TILDE | 〜 | WAVE DASH |
007E | 301C | ~ | TILDE | 〜 | WAVE DASH |
02F7 | 301C | ˷ | MODIFIER LETTER LOW TILDE | 〜 | WAVE DASH |
2053 | 301C | ⁓ | SWUNG DASH | 〜 | WAVE DASH |
223C | 301C | ∼ | TILDE OPERATOR | 〜 | WAVE DASH |
22BF | 25B3 | ⊿ | RIGHT TRIANGLE | △ | WHITE UP-POINTING TRIANGLE |
25B5 | 25B3 | ▵ | WHITE UP-POINTING SMALL TRIANGLE | △ | WHITE UP-POINTING TRIANGLE |
25FF | 25B3 | ◿ | LOWER RIGHT TRIANGLE | △ | WHITE UP-POINTING TRIANGLE |
2B26 | 25C7 | ⬦ | WHITE MEDIUM DIAMOND | ◇ | WHITE DIAMOND |
2B28 | 25C7 | ⬨ | WHITE MEDIUM LOZENGE | ◇ | WHITE DIAMOND |
2B2B | 25C7 | ⬫ | WHITE SMALL LOZENGE | ◇ | WHITE DIAMOND |
25CA | 25C7 | ◊ | LOZENGE | ◇ | WHITE DIAMOND |
2662 | 25C7 | ♢ | WHITE DIAMOND SUIT | ◇ | WHITE DIAMOND |
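A minimal sketch of this substitution, using a subset of the table above as a Python translation table:

```python
# A subset of the substitution table above, as a Python translation table.
REPLACEMENTS = str.maketrans({
    "\uFF5E": "\u301C",  # FULLWIDTH TILDE -> WAVE DASH
    "\u007E": "\u301C",  # TILDE           -> WAVE DASH
    "\u223C": "\u301C",  # TILDE OPERATOR  -> WAVE DASH
    "\u25CA": "\u25C7",  # LOZENGE         -> WHITE DIAMOND
})

print("2016年\uFF5E2017年".translate(REPLACEMENTS))  # 2016年〜2017年
```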
Mainly NFKC (Normalization Form Compatibility Composition) is applied, with the following exceptions:
Numbers enclosed within a circle (U+2460 - U+2473)
Two dot leaders (U+2025)
Horizontal ellipsis (U+2026)
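A sketch of NFKC normalization with these exceptions, applied per character for simplicity (this ignores composition across neighboring characters, which the actual pipeline may handle differently):

```python
import unicodedata

# Code points exempt from NFKC normalization:
# circled numbers U+2460-U+2473, two dot leader U+2025, horizontal ellipsis U+2026.
EXEMPT = {chr(c) for c in range(0x2460, 0x2474)} | {"\u2025", "\u2026"}

def nfkc_with_exceptions(text: str) -> str:
    # Per-character NFKC: exempt characters pass through unchanged.
    return "".join(ch if ch in EXEMPT else unicodedata.normalize("NFKC", ch)
                   for ch in text)

print(nfkc_with_exceptions("\u2460\uFF21\u2026"))  # ① and … are kept; Ａ -> A
```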
(Reference) Unicode, Inc., UAX #15: Unicode Normalization Forms
(Reference) SADAHIRO Tomoyuki, Unicode正規化 (Unicode Normalization)
Code | Symbol | Name |
---|---|---|
2412 | ␒ | SYMBOL FOR DEVICE CONTROL TWO |
2413 | ␓ | SYMBOL FOR DEVICE CONTROL THREE |
2414 | ␔ | SYMBOL FOR DEVICE CONTROL FOUR |
0327 | | COMBINING CEDILLA |
0332 | | COMBINING LOW LINE |
0337 | | COMBINING SHORT SOLIDUS OVERLAY |
05B9 | | HEBREW POINT HOLAM |
FFFC | | OBJECT REPLACEMENT CHARACTER |
FFFD | � | REPLACEMENT CHARACTER |
2028 | | LINE SEPARATOR |
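A sketch of deleting the characters listed above:

```python
# Translation table that deletes the characters listed above
# (mapping a code point to None removes it).
DELETED = dict.fromkeys([0x2412, 0x2413, 0x2414, 0x0327, 0x0332,
                         0x0337, 0x05B9, 0xFFFC, 0xFFFD, 0x2028])

print("売上\u2028高".translate(DELETED))  # 売上高
```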
Extra spaces are removed as follows:
Pairs of sentences that meet the following conditions were deleted:
Does this corpus include sentences from CG Reports (Corporate Governance Reports)?
The source documents of this corpus include CG Reports.
Problems regarding machine translation of CG Reports are described in the following documents.
Why do some datasets include the same sentences? (Why are sentences in TEST included in other datasets?)
Datasets provided before 2019-06-07 will be updated.
The outline of the update is as follows:
Existing TEST will be renamed to DEVTEST, and TEST will be newly created.
Two (2) new fields, Document hash and Sentence hash, will be added to TRAIN / DEV / DEVTEST / TEST.
Roughly 200 new sentences will be added to DEV and DEVTEST.
TRAIN will be split into two (2) sub-datasets according to the Disclosure date of Source documents, as stated below:
TRAIN_2016-2017: 2016-01-01 - 2017-12-31 (24 months)
TRAIN_2018: 2018-01-01 - 2018-06-30 (6 months)
There are no updates to TRAIN other than the data splitting stated above.
DEV / DEVTEST / TEST will each be split into two (2) sub-datasets: X_ITEMS, which consists of nouns and phrases, and X_TEXTS, which consists of texts.
Given updates explained above, the composition of entire data sets will be as follows:
Before | After | Remarks |
---|---|---|
TRAIN (train.tsv) | TRAIN_2016-2017 (train_2016-2017.tsv) | Range of Disclosure date is independent of the other data sets |
TRAIN_2018 (train_2018.tsv) | Range of Disclosure date overlaps with DEV / DEVTEST / TEST | |
DEV (dev.tsv) | DEV_ITEMS (dev_items.tsv) | Nouns and phrases extracted from Before dev.tsv after 200 sentences are added |
DEV_TEXTS (dev_texts.tsv) | Texts extracted from Before dev.tsv after 200 sentences are added | |
TEST (test.tsv) | DEVTEST_ITEMS (devtest_items.tsv) | Nouns and phrases extracted from Before test.tsv after 200 sentences are added |
DEVTEST_TEXTS (devtest_texts.tsv) | Texts extracted from Before test.tsv after 200 sentences are added | |
TEST_ITEMS (test_items.tsv) | Newly created | |
TEST_TEXTS (test_texts.tsv) | Newly created |
105 and 138 new sentences were added to DEV and DEVTEST, respectively.
The added sentences in each dataset come from a single source document.
The Document hashes of those source documents are as follows:
When registering on the Leaderboard, please submit the result for the new TEST, NOT the result for DEVTEST.
Please state which TRAIN was used in the comment field of the Leaderboard.
Col number | Name | Data type | Required |
---|---|---|---|
1 | Japanese sentences | String | TRUE |
2 | English sentences | String | TRUE |
Data Type | File Name | Number of sentences | Number of unique pairs | Number of original documents |
---|---|---|---|---|
TRAIN | train.tsv | 1,403,995 | 762,095 | 15,791 |
DEV | dev.tsv | 3,893 | 3,671 | 250 |
TEST | test.tsv | 3,877 | 3,620 | 251 |
Range of Disclosure date of Source documents:
2019-06-21: Update Statistics of the numbers of Source documents, Append Related links, etc
2019-06-14: Update Statistics of TEST
2019-06-12: Update methods of split into texts and items, Statistics of DEV and DEVTEST, etc
2019-06-10: Update Statistics of TRAIN, etc
2019-06-07: Update dataset, Append FAQ, etc
2019-05-11: Open
Japan Exchange Group, Inc.