The JPO Patent Corpus was constructed by the Japan Patent Office (JPO). It consists of Chinese-Japanese, Korean-Japanese, and English-Japanese patent description corpora, each with approximately 1M parallel sentences. Each corpus is divided into four sections based on the International Patent Classification (IPC): Chemistry, Electricity, Mechanical Engineering, and Physics.
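The JPO Patent Corpus includes the following data for each language pair (CJ: Chinese-Japanese, KJ: Korean-Japanese, EJ: English-Japanese):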
Parallel Corpus | Data Type | File Name | Section | Number of sentences |
---|---|---|---|---|
CJ | TRAIN | train.txt | Chemistry | 250,000 |
CJ | TRAIN | train.txt | Electricity | 250,000 |
CJ | TRAIN | train.txt | Mechanical Engineering | 250,000 |
CJ | TRAIN | train.txt | Physics | 250,000 |
CJ | DEV | dev.txt | Chemistry | 500 |
CJ | DEV | dev.txt | Electricity | 500 |
CJ | DEV | dev.txt | Mechanical Engineering | 500 |
CJ | DEV | dev.txt | Physics | 500 |
CJ | DEVTEST | devtest.txt | Chemistry | 500 |
CJ | DEVTEST | devtest.txt | Electricity | 500 |
CJ | DEVTEST | devtest.txt | Mechanical Engineering | 500 |
CJ | DEVTEST | devtest.txt | Physics | 500 |
CJ | TEST | test.txt | Chemistry | 500 |
CJ | TEST | test.txt | Electricity | 500 |
CJ | TEST | test.txt | Mechanical Engineering | 500 |
CJ | TEST | test.txt | Physics | 500 |
KJ | TRAIN | train.txt | Chemistry | 250,000 |
KJ | TRAIN | train.txt | Electricity | 250,000 |
KJ | TRAIN | train.txt | Mechanical Engineering | 250,000 |
KJ | TRAIN | train.txt | Physics | 250,000 |
KJ | DEV | dev.txt | Chemistry | 500 |
KJ | DEV | dev.txt | Electricity | 500 |
KJ | DEV | dev.txt | Mechanical Engineering | 500 |
KJ | DEV | dev.txt | Physics | 500 |
KJ | DEVTEST | devtest.txt | Chemistry | 500 |
KJ | DEVTEST | devtest.txt | Electricity | 500 |
KJ | DEVTEST | devtest.txt | Mechanical Engineering | 500 |
KJ | DEVTEST | devtest.txt | Physics | 500 |
KJ | TEST | test.txt | Chemistry | 500 |
KJ | TEST | test.txt | Electricity | 500 |
KJ | TEST | test.txt | Mechanical Engineering | 500 |
KJ | TEST | test.txt | Physics | 500 |
EJ | TRAIN | train.txt | Chemistry | 250,000 |
EJ | TRAIN | train.txt | Electricity | 250,000 |
EJ | TRAIN | train.txt | Mechanical Engineering | 250,000 |
EJ | TRAIN | train.txt | Physics | 250,000 |
EJ | DEV | dev.txt | Chemistry | 500 |
EJ | DEV | dev.txt | Electricity | 500 |
EJ | DEV | dev.txt | Mechanical Engineering | 500 |
EJ | DEV | dev.txt | Physics | 500 |
EJ | DEVTEST | devtest.txt | Chemistry | 500 |
EJ | DEVTEST | devtest.txt | Electricity | 500 |
EJ | DEVTEST | devtest.txt | Mechanical Engineering | 500 |
EJ | DEVTEST | devtest.txt | Physics | 500 |
EJ | TEST | test.txt | Chemistry | 500 |
EJ | TEST | test.txt | Electricity | 500 |
EJ | TEST | test.txt | Mechanical Engineering | 500 |
EJ | TEST | test.txt | Physics | 500 |
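
The split sizes in the table above can be totalled with a short script. This is only a minimal sketch: the per-section counts are transcribed from the table, the names `PER_SECTION`, `split_totals`, and `pair_total` are hypothetical, and nothing here reads the actual corpus files, whose internal format is not described on this page.

```python
# Per-section sentence counts for each data type, transcribed from the table.
SECTIONS = ["Chemistry", "Electricity", "Mechanical Engineering", "Physics"]
PAIRS = ["CJ", "KJ", "EJ"]
PER_SECTION = {"TRAIN": 250_000, "DEV": 500, "DEVTEST": 500, "TEST": 500}


def split_totals():
    """Return the number of sentences per data type for one language pair."""
    return {split: n * len(SECTIONS) for split, n in PER_SECTION.items()}


def pair_total():
    """Return the number of sentences across all data types for one language pair."""
    return sum(split_totals().values())


if __name__ == "__main__":
    for split, total in split_totals().items():
        print(f"{split}: {total:,} sentences per language pair")
    print(f"All data types: {pair_total():,} sentences per language pair")
    print(f"All pairs: {pair_total() * len(PAIRS):,} sentences")
```

Under these counts, each language pair has 1,000,000 training sentences and 2,000 sentences in each of the DEV, DEVTEST, and TEST sets.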
The JPO Patent Corpus was constructed from Chinese-Japanese, Korean-Japanese,
and English-Japanese patent description sentence pairs.
The data include information assets created jointly by the JPO and
the National Institute of Information and Communications Technology (NICT)
under an agreement between the JPO and the NICT.
For another patent corpus, see the NTCIR-10 PatentMT Research Purpose Use of Test Collection.
For questions, comments, etc., please email "wat -at- nlp.ist.i.kyoto-u.ac.jp".
2018-7-26: updated (for WAT2018)
2017-6-12: updated (for WAT2017)
2016-6-13: site open
JST (Japan Science and Technology Agency)
NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2022-3-30