This English page is provided for reference; the original page is in Japanese.

 

Timely Disclosure Documents Corpus

This page provides notes on the corpus.

The Timely Disclosure Documents Corpus was constructed by Japan Exchange Group (JPX) and provided to WAT to encourage the development of machine translation.

 

Timely Disclosure Documents Corpus
  1. Notes of Sources
    1.1. Unbalanced information
      1.1.1. English translation of nouns and pronouns
      1.1.2. Omission of figures
    1.2. Improper characters
    1.3. Improper alignment procedures
  2. Specification of this corpus
    2.1. General
    2.2. Items
    2.3. Split into texts and items
    2.4. Statistics
  3. Data Splitting of TRAIN, DEV, DEVTEST, and TEST
  4. Evaluation
  5. Normalization
    5.1 Replaced characters
    5.2 Unicode normalization
    5.3 Deleted characters
    5.4. Delete spaces
    5.5 Deleted pairs of sentences
  6. FAQ
  [Important] Dataset Update (announced on 2019-06-07)
  Sentences added to DEV and DEVTEST (announced on 2019-06-12)
  Additional information pertaining to registration
  Specifications of dataset before update (2018-06-07 or earlier)
  Related links
  Change log
  Author

1. Notes of Sources

 

| Item | Description |
|---|---|
| Language pair | Japanese - English |
| Source documents | Timely Disclosure Documents (about 16,000 documents) |
| Author of source documents | Companies listed on Tokyo Stock Exchange |
| Disclosure date of source documents | January 2016 to June 2018 |
| Sort order of sentences | In no particular order |
| Sentence alignment | Manual |

 

 

 

1.1. Unbalanced information

1.1.1. English translation of nouns and pronouns

 

1.1.2. Omission of figures

 

1.2. Improper characters

 

1.3. Improper alignment procedures

 

 

2. Specification of this corpus

 

2.1. General

| Item | Description |
|---|---|
| File format | TSV |
| Character code | UTF-8 |
| Newline code | CRLF |
| Delimiter | Tab (U+0009) |
| Quote character | None |
| Escape character | Backslash (U+005C) |
| Prohibited characters | Tab (U+0009), newline codes (U+000D, U+000A) |

 

2.2. Items

| Col number | Name | Data type | Required |
|---|---|---|---|
| 1 | Document hash | String | TRUE |
| 2 | Sentence hash | String | TRUE |
| 3 | Japanese sentences | String | TRUE |
| 4 | English sentences | String | |
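
The format settings in 2.1 and the column layout in 2.2 can be turned into a small reader. The following is a minimal sketch in Python, not an official loader: the names `Pair`, `read_pairs`, and `open_corpus` are hypothetical, and treating an empty fourth column as a missing English sentence is an assumption based on that column not being marked as required.

```python
import csv
import io
from typing import Iterator, NamedTuple, Optional, TextIO


class Pair(NamedTuple):
    doc_hash: str
    sent_hash: str
    ja: str
    en: Optional[str]  # column 4 is not marked as required


def read_pairs(stream: TextIO) -> Iterator[Pair]:
    """Yield one Pair per TSV row from an already opened text stream."""
    reader = csv.reader(
        stream,
        delimiter="\t",          # Delimiter: Tab (U+0009)
        quoting=csv.QUOTE_NONE,  # Quote character: None
        escapechar="\\",         # Escape character: Backslash (U+005C)
    )
    for row in reader:
        if not row:
            continue  # skip blank lines, if any
        row = row + [""] * (4 - len(row))  # pad rows with a missing column 4
        doc_hash, sent_hash, ja, en = row[:4]
        # Assumption: an empty English field means "no translation given".
        yield Pair(doc_hash, sent_hash, ja, en or None)


def open_corpus(path: str) -> TextIO:
    """Open a corpus file as UTF-8; newline="" leaves CRLF handling to csv."""
    return open(path, encoding="utf-8", newline="")


if __name__ == "__main__":
    # Two made-up rows standing in for a real file such as dev_texts.tsv.
    sample = "doc1\tsent1\t売上高\tNet sales\r\ndoc1\tsent2\t営業利益\t\r\n"
    for pair in read_pairs(io.StringIO(sample, newline="")):
        print(pair)
```

With a real file, `read_pairs(open_corpus("train_2018.tsv"))` would iterate over the rows in the same way.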

 

 

2.3. Split into texts and items

 

2.4. Statistics

| Data Type | File Name | Number of sentences | Number of unique pairs | Number of source documents |
|---|---|---|---|---|
| TRAIN_2016-2017 | train_2016-2017.tsv | 1,089,346 | 614,817 | 12,663 |
| TRAIN_2018 | train_2018.tsv | 314,649 | 218,495 | 3,128 |
| DEV_ITEMS | dev_items.tsv | 2,845 | 2,650 | 242 |
| DEV_TEXTS | dev_texts.tsv | 1,153 | 1,148 | 210 |
| DEVTEST_ITEMS | devtest_items.tsv | 2,900 | 2,671 | 244 |
| DEVTEST_TEXTS | devtest_texts.tsv | 1,114 | 1,111 | 209 |
| TEST_ITEMS | test_items.tsv | 2,129 | 1,763 | 164 |
| TEST_TEXTS | test_texts.tsv | 1,153 | 1,135 | 144 |
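
Counts like those in the table above could be recomputed from a file with a few lines of Python. This is only a sketch under stated assumptions: it takes "unique pairs" to mean distinct (Japanese, English) string pairs and counts source documents by distinct document hash, which this page does not define explicitly, so the results may not match the published figures. The function name `corpus_stats` is hypothetical.

```python
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, str, str]  # (document hash, sentence hash, Japanese, English)


def corpus_stats(rows: Iterable[Row]) -> Dict[str, int]:
    """Count rows, distinct (ja, en) pairs, and distinct document hashes."""
    sentences = 0
    pairs = set()
    documents = set()
    for doc_hash, _sent_hash, ja, en in rows:
        sentences += 1
        pairs.add((ja, en))      # assumption: uniqueness is on the text pair
        documents.add(doc_hash)  # assumption: one hash per source document
    return {
        "sentences": sentences,
        "unique_pairs": len(pairs),
        "source_documents": len(documents),
    }


if __name__ == "__main__":
    # Made-up rows; the first two share the same sentence pair.
    rows = [
        ("doc1", "s1", "売上高", "Net sales"),
        ("doc2", "s2", "売上高", "Net sales"),
        ("doc2", "s3", "営業利益", "Operating income"),
    ]
    print(corpus_stats(rows))
```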

 

3. Data Splitting of TRAIN, DEV, DEVTEST, and TEST

 

4. Evaluation

(TBD)

 

5. Normalization

We normalized the sentences in this corpus as follows:

 

5.1 Replaced characters

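| Code (Before) | Code (After) | Symbol (Before) | Name (Before) | Symbol (After) | Name (After) |
|---|---|---|---|---|---|
| FF5E | 301C | ～ | FULLWIDTH TILDE | 〜 | WAVE DASH |
| 007E | 301C | ~ | TILDE | 〜 | WAVE DASH |
| 02F7 | 301C | ˷ | MODIFIER LETTER LOW TILDE | 〜 | WAVE DASH |
| 2053 | 301C | ⁓ | SWUNG DASH | 〜 | WAVE DASH |
| 223C | 301C | ∼ | TILDE OPERATOR | 〜 | WAVE DASH |
| 22BF | 25B3 | ⊿ | RIGHT TRIANGLE | △ | WHITE UP-POINTING TRIANGLE |
| 25B5 | 25B3 | ▵ | WHITE UP-POINTING SMALL TRIANGLE | △ | WHITE UP-POINTING TRIANGLE |
| 25FF | 25B3 | ◿ | LOWER RIGHT TRIANGLE | △ | WHITE UP-POINTING TRIANGLE |
| 2B26 | 25C7 | ⬦ | WHITE MEDIUM DIAMOND | ◇ | WHITE DIAMOND |
| 2B28 | 25C7 | ⬨ | WHITE MEDIUM LOZENGE | ◇ | WHITE DIAMOND |
| 2B2B | 25C7 | ⬫ | WHITE SMALL LOZENGE | ◇ | WHITE DIAMOND |
| 25CA | 25C7 | ◊ | LOZENGE | ◇ | WHITE DIAMOND |
| 2662 | 25C7 | ♢ | WHITE DIAMOND SUIT | ◇ | WHITE DIAMOND |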
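
To apply the same replacements to other text (for example, system output before evaluation), the table above can be expressed as a translation map. This is a minimal sketch in Python, not the tool used to build the corpus; the names `REPLACEMENTS` and `replace_characters` are hypothetical, and the code points are copied from the table.

```python
# Code point (before) -> code point (after), copied from the table in 5.1.
REPLACEMENTS = {
    0xFF5E: 0x301C,  # FULLWIDTH TILDE -> WAVE DASH
    0x007E: 0x301C,  # TILDE -> WAVE DASH
    0x02F7: 0x301C,  # MODIFIER LETTER LOW TILDE -> WAVE DASH
    0x2053: 0x301C,  # SWUNG DASH -> WAVE DASH
    0x223C: 0x301C,  # TILDE OPERATOR -> WAVE DASH
    0x22BF: 0x25B3,  # RIGHT TRIANGLE -> WHITE UP-POINTING TRIANGLE
    0x25B5: 0x25B3,  # WHITE UP-POINTING SMALL TRIANGLE -> WHITE UP-POINTING TRIANGLE
    0x25FF: 0x25B3,  # LOWER RIGHT TRIANGLE -> WHITE UP-POINTING TRIANGLE
    0x2B26: 0x25C7,  # WHITE MEDIUM DIAMOND -> WHITE DIAMOND
    0x2B28: 0x25C7,  # WHITE MEDIUM LOZENGE -> WHITE DIAMOND
    0x2B2B: 0x25C7,  # WHITE SMALL LOZENGE -> WHITE DIAMOND
    0x25CA: 0x25C7,  # LOZENGE -> WHITE DIAMOND
    0x2662: 0x25C7,  # WHITE DIAMOND SUIT -> WHITE DIAMOND
}


def replace_characters(text: str) -> str:
    """Replace each listed character with its normalized counterpart."""
    return text.translate(REPLACEMENTS)


if __name__ == "__main__":
    print(replace_characters("1~2 ⊿10"))
```

How this step is ordered relative to the other normalization steps is not stated here, so the sketch treats it in isolation.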

 

5.2 Unicode normalization

 

5.3 Deleted characters

 

CodeSymbokName
2412SYMBOL FOR DEVICE CONTROL TWO
2413SYMBOL FOR DEVICE CONTROL THREE
2414SYMBOL FOR DEVICE CONTROL FOUR
0327 COMBINING CEDILLA
0332 COMBINING LOW LINE
0337 COMBINING SHORT SOLIDUS OVERLAY
05B9 HEBREW POINT HOLAM
FFFC OBJECT REPLACEMENT CHARACTER
FFFDREPLACEMENT CHARACTER
2028 LINE SEPARATOR
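
The deletions listed above can be sketched in the same way, by mapping each listed code point to nothing. The names `DELETED_CODE_POINTS` and `delete_characters` are hypothetical, and this is an illustration of the table rather than the actual procedure used for the corpus.

```python
# Code points copied from the table in 5.3.
DELETED_CODE_POINTS = [
    0x2412,  # SYMBOL FOR DEVICE CONTROL TWO
    0x2413,  # SYMBOL FOR DEVICE CONTROL THREE
    0x2414,  # SYMBOL FOR DEVICE CONTROL FOUR
    0x0327,  # COMBINING CEDILLA
    0x0332,  # COMBINING LOW LINE
    0x0337,  # COMBINING SHORT SOLIDUS OVERLAY
    0x05B9,  # HEBREW POINT HOLAM
    0xFFFC,  # OBJECT REPLACEMENT CHARACTER
    0xFFFD,  # REPLACEMENT CHARACTER
    0x2028,  # LINE SEPARATOR
]

# str.translate removes a character when its code point maps to None.
DELETE_TABLE = dict.fromkeys(DELETED_CODE_POINTS)


def delete_characters(text: str) -> str:
    """Remove every character listed in DELETED_CODE_POINTS from text."""
    return text.translate(DELETE_TABLE)


if __name__ == "__main__":
    print(delete_characters("Net\u2028 sales\ufffd"))
```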

 

5.4. Delete spaces

 

5.5 Deleted pairs of sentences

 

 

6. FAQ

 

 

 


 

[Important] Dataset Update (announced on 2019-06-07)

| Before | After | Remarks |
|---|---|---|
| TRAIN (train.tsv) | TRAIN_2016-2017 (train_2016-2017.tsv) | Range of disclosure dates is independent of the other data sets |
| | TRAIN_2018 (train_2018.tsv) | Range of disclosure dates overlaps with DEV / DEVTEST / TEST |
| DEV (dev.tsv) | DEV_ITEMS (dev_items.tsv) | Nouns and phrases extracted from the previous dev.tsv after 200 sentences were added |
| | DEV_TEXTS (dev_texts.tsv) | Texts extracted from the previous dev.tsv after 200 sentences were added |
| TEST (test.tsv) | DEVTEST_ITEMS (devtest_items.tsv) | Nouns and phrases extracted from the previous test.tsv after 200 sentences were added |
| | DEVTEST_TEXTS (devtest_texts.tsv) | Texts extracted from the previous test.tsv after 200 sentences were added |
| | TEST_ITEMS (test_items.tsv) | Newly created |
| | TEST_TEXTS (test_texts.tsv) | Newly created |

 

figure1

 

Sentences added to DEV and DEVTEST (announced on 2019-06-12)

 

Additional information pertaining to registration

 

Specifications of dataset before update (2018-06-07 or earlier)

 

| Col number | Name | Data type | Required |
|---|---|---|---|
| 1 | Japanese sentences | String | TRUE |
| 2 | English sentences | String | TRUE |

 

| Data Type | File Name | Number of sentences | Number of unique pairs | Number of original documents |
|---|---|---|---|---|
| TRAIN | train.tsv | 1,403,995 | 762,095 | 15,791 |
| DEV | dev.tsv | 3,893 | 3,671 | 250 |
| TEST | test.tsv | 3,877 | 3,620 | 251 |

 

 

Related links

 

Change log

2019-06-21: Updated the statistics on the number of source documents, added Related links, etc.

2019-06-14: Updated the statistics of TEST.

2019-06-12: Updated the method of splitting into texts and items, the statistics of DEV and DEVTEST, etc.

2019-06-10: Updated the statistics of TRAIN, etc.

2019-06-07: Updated the dataset, added the FAQ, etc.

2019-05-11: Page opened.

 

Author

Japan Exchange Group, Inc.