Timely Disclosure Documents Corpus

INTRODUCTION

The Tokyo Stock Exchange is one of the largest capital markets in the world, with over 3,600 companies listed as of the end of 2018. Listed companies are obliged to disclose material information, including financial statements, corporate actions, and corporate governance policies, to the public in a timely manner. These 'timely disclosure documents' are an important basis for investment decisions.

Global investors have invested in Japanese companies and now account for 30-40% of shareholdings. Although tens of thousands of original Japanese documents are disclosed each year (77,000 in 2018), English disclosure documents are still limited in availability. Because Japanese-English translation must be done in a timely manner, strong demand for machine translation is expected from both listed companies and global investors.

The 'Timely Disclosure Documents Corpus' was constructed by Japan Exchange Group (JPX) and provided to WAT (Workshop on Asian Translation) to encourage the development of machine translation. Built from past timely disclosure documents, the corpus consists of about 1.4M Japanese-English parallel sentences.

Timely disclosure documents contain important figures (e.g. sales, profits, dates) and proper nouns (e.g. names of people, places, companies, businesses, and products). This information is critical for investors, so mistranslations should be avoided and overall translation quality improved.

You can see the original 'Timely Disclosure Documents' below:

SAMPLES

The samples of this corpus are as follows:

| Japanese | English |
| --- | --- |
| 株式会社日本取引所グループ | Japan Exchange Group, Inc. |
| 業績予想及び配当予想の修正に関するお知らせ | Notice of Revision to Earnings Forecast and Dividend Forecast |
| 当社は、2017年10月30日に開示しました2018年3月期(2017年4月1日〜2018年3月31日)の通期連結業績予想及び期末の1株当たり配当予想について、下記のとおり修正することとしましたので、お知らせいたします。 | We hereby announce that the consolidated earnings forecast and year-end dividend forecast for the fiscal year ending March 31, 2018 released on October 30, 2017 have been revised as follows. |
| 剰余金の配当に関するお知らせ | Notice of Dividend from Surplus |
| これにより、2018年3月期の期末の1株当たり配当金は、普通配当33円に加え、記念配当10円を合わせた43円となります。 | As a result, the year-end dividend per share for the fiscal year ended March 31, 2018 will be ¥43 (ordinary dividend of ¥33 plus commemorative dividend of ¥10). |
| 投資活動によるキャッシュ・フローは、無形資産の取得による支出105億37百万円等により、261億64百万円の支出となりました。 | There was cash outflow of ¥26,164 million from investment activities due mainly to ¥10,537 million in purchase of intangible assets. |
| 発行済株式数に占める当社保有株式の比率 | Shareholding ratio of JPX |
| SGXが保有する自己株式(515,063株)を含む。 | Including the shares held by SGX as treasury stock (515,063 shares). |

If you need more samples, you can obtain them from here.
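
The cash-flow sample shows why figures are easy to mistranslate: the Japanese side groups digits by 億 (10^8) and 百万 (10^6), while the English side reports millions of yen. The following is a minimal sketch of that arithmetic, not part of the corpus or its tooling; the function name `yen_amount` and the set of supported units are assumptions made for illustration.

```python
import re

# Multipliers for Japanese numeric units (an illustrative subset).
UNITS = {"兆": 10**12, "億": 10**8, "百万": 10**6, "万": 10**4, "千": 10**3}

# A run of digits (commas allowed) optionally followed by one unit.
_TOKEN = re.compile(r"([0-9,]+)(兆|億|百万|万|千)?")


def yen_amount(text):
    """Convert an amount such as '105億37百万円' to an integer number of yen."""
    body = text[:-1] if text.endswith("円") else text
    parts = _TOKEN.findall(body)
    # Reject input that is not fully covered by digit/unit tokens.
    if not parts or "".join(d + u for d, u in parts) != body:
        raise ValueError("unsupported amount format: " + text)
    return sum(int(d.replace(",", "")) * UNITS.get(u, 1) for d, u in parts)


# Amounts taken from the cash-flow sample sentence.
for amount in ("105億37百万円", "261億64百万円"):
    millions = yen_amount(amount) // 10**6
    print(amount, "->", f"¥{millions:,} million")
# 105億37百万円 -> ¥10,537 million
# 261億64百万円 -> ¥26,164 million
```

A check like this could be used to compare figures on both sides of a translated pair, although real documents contain more number formats than this sketch handles.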

DETAILS

Timely Disclosure Documents Corpus includes:

The numbers of sentences are as follows:

| Data Type | File Name | Number of sentences | Number of unique pairs | Number of original documents |
| --- | --- | --- | --- | --- |
| TRAIN_2016-2017 | train_2016-2017.tsv | 1,089,346 | 614,817 | 12,663 |
| TRAIN_2018 | train_2018.tsv | 314,649 | 218,495 | 3,128 |
| DEV_ITEMS | dev_items.tsv | 2,845 | 2,650 | 242 |
| DEV_TEXTS | dev_texts.tsv | 1,153 | 1,148 | 210 |
| DEVTEST_ITEMS | devtest_items.tsv | 2,900 | 2,671 | 244 |
| DEVTEST_TEXTS | devtest_texts.tsv | 1,114 | 1,111 | 209 |
| TEST_ITEMS | test_items.tsv | 2,129 | 1,763 | 164 |
| TEST_TEXTS | test_texts.tsv | 1,153 | 1,135 | 144 |
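
The difference between the sentence count and the unique-pair count reflects repeated sentence pairs, which are common in formulaic disclosure documents. Below is a minimal sketch of how both counts could be computed for one file. The column layout is not specified on this page, so the sketch assumes UTF-8 text with the Japanese sentence in the first tab-separated column, the English sentence in the second, and no header row; `count_pairs` is a hypothetical helper, not part of the corpus distribution.

```python
import csv
import io


def count_pairs(lines):
    """Return (number of sentence pairs, number of unique pairs).

    Assumes each line holds a Japanese sentence and an English sentence
    separated by a tab (an assumption, not a documented format).
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = 0
    unique = set()
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        total += 1
        unique.add((row[0], row[1]))
    return total, len(unique)


# In-memory stand-in for a corpus file, using pairs from the samples.
sample = io.StringIO(
    "剰余金の配当に関するお知らせ\tNotice of Dividend from Surplus\n"
    "株式会社日本取引所グループ\tJapan Exchange Group, Inc.\n"
    "剰余金の配当に関するお知らせ\tNotice of Dividend from Surplus\n"
)
print(count_pairs(sample))  # (3, 2)
```

For a real file, the same function can be given an open file object, for example `open("train_2018.tsv", encoding="utf-8", newline="")`, provided the file actually follows the assumed layout.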

Datasets of DEV and TEST contain sentences that focus on the translation quality of proper nouns and figures.


Further details are as follows:

Language pair: Japanese - English
Source documents: Timely Disclosure Documents (16,292 documents)
Author of source documents: Companies listed on Tokyo Stock Exchange
Disclosure date of source documents: January 2016 to June 2018
Sort order of sentences: In no particular order
Sentence alignment: Manual

NOTE: This section aggregates important points in using this corpus.

HOW TO OBTAIN

AGREEMENT

CONTACT

For questions or comments, please email "wat-organizer -at- googlegroups -dot- com".

CHANGE LOG

2019-06-21: update DETAILS
2019-05-11: site open


NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2019-05-11