The JPO Patent Corpus was constructed by the Japan Patent Office (JPO). It consists of Chinese-Japanese, Korean-Japanese, and English-Japanese patent description corpora of 1M parallel sentences each, covering four sections based on the International Patent Classification (IPC): Chemistry (Ch), Electricity (El), Mechanical engineering (Me), and Physics (Ph).
These tasks evaluate the performance of a translation model for each language pair.
Unlike the previous patent tasks at WAT2016-2017, new test sets have been added.
Corpus statistics:
Language Pair | Data Type | File Name | Size | Sections:Ratios | Published Years | Sentence Alignment |
---|---|---|---|---|---|---|
ZH<-->JA | TRAIN | train.{zh,ja} | 1,000,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
ZH<-->JA | DEV | dev.{zh,ja} | 2,000 | | | |
ZH<-->JA | DEVTEST | devtest.{zh,ja} | 2,000 | | | |
ZH<-->JA | TEST | test-n1.{zh,ja} | 2,000 | | | |
ZH<-->JA | TEST | test-n2.{zh,ja} | 3,000 | Ch/El/Me/Ph:Unknown | 2016-2017 | Manual |
ZH<-->JA | TEST | test-n3.{zh,ja} | 204 | | | |
ZH<-->JA | TEST | test-n.{zh,ja} | 5,204 | | 2011-2013, 2016-2017 | Automatic/Manual |
KO<-->JA | TRAIN | train.{ko,ja} | 1,000,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
KO<-->JA | DEV | dev.{ko,ja} | 2,000 | | | |
KO<-->JA | DEVTEST | devtest.{ko,ja} | 2,000 | | | |
KO<-->JA | TEST | test-n1.{ko,ja} | 2,000 | | | |
KO<-->JA | TEST | test-n2.{ko,ja} | 3,000 | Ch/El/Me/Ph:Unknown | 2016-2017 | Manual |
KO<-->JA | TEST | test-n3.{ko,ja} | 230 | | | |
KO<-->JA | TEST | test-n.{ko,ja} | 5,230 | | 2011-2013, 2016-2017 | Automatic/Manual |
EN<-->JA | TRAIN | train.{en,ja} | 1,000,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
EN<-->JA | DEV | dev.{en,ja} | 2,000 | | | |
EN<-->JA | DEVTEST | devtest.{en,ja} | 2,000 | | | |
EN<-->JA | TEST | test-n1.{en,ja} | 2,000 | | | |
EN<-->JA | TEST | test-n2.{en,ja} | 3,000 | Ch/El/Me/Ph:Unknown | 2016-2017 | Manual |
EN<-->JA | TEST | test-n3.{en,ja} | 668 | | | |
EN<-->JA | TEST | test-n.{en,ja} | 5,668 | | 2011-2013, 2016-2017 | Automatic/Manual |
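
The sizes above are consistent with each test-n file being the combination of test-n1, test-n2, and test-n3 (for example, 2,000 + 3,000 + 204 = 5,204 for ZH<-->JA), although this page does not say so explicitly. The following is a minimal Python sketch that checks this arithmetic and loads one file pair. The helper names `combined_size_matches` and `read_parallel` are hypothetical, and the file layout (UTF-8, one sentence per line, line-aligned across the two languages) is an assumption, not something stated here.

```python
from pathlib import Path

# Sizes copied from the corpus statistics table (sentence pairs per file).
TEST_SIZES = {
    "zh": {"test-n1": 2000, "test-n2": 3000, "test-n3": 204, "test-n": 5204},
    "ko": {"test-n1": 2000, "test-n2": 3000, "test-n3": 230, "test-n": 5230},
    "en": {"test-n1": 2000, "test-n2": 3000, "test-n3": 668, "test-n": 5668},
}


def combined_size_matches(sizes):
    """Return True if test-n equals test-n1 + test-n2 + test-n3."""
    parts = sizes["test-n1"] + sizes["test-n2"] + sizes["test-n3"]
    return parts == sizes["test-n"]


def _read_lines(path):
    # Assumption: UTF-8 text with one sentence per line.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(directory, stem, src, tgt="ja"):
    """Read <stem>.<src> and <stem>.<tgt> as a list of (source, target) pairs.

    Assumption: line i of the source file is aligned with line i of the
    target file. A mismatch in line counts raises ValueError.
    """
    base = Path(directory)
    src_lines = _read_lines(base / f"{stem}.{src}")
    tgt_lines = _read_lines(base / f"{stem}.{tgt}")
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"{stem}: {len(src_lines)} source lines vs {len(tgt_lines)} target lines"
        )
    return list(zip(src_lines, tgt_lines))
```

With the data in a local directory, `read_parallel("data", "dev", "zh")` would return the dev pairs for ZH<-->JA under the assumptions above; the directory name is only a placeholder.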
This task evaluates the performance of a translation model for each predefined category of expression patterns: title of invention (TIT), abstract (ABS), scope of claim (CLM), or description (DES). The train/dev/devtest sets are the same data as those of the normal C<-->J tasks. The test set consists of sentences, each annotated with its corresponding category of expression patterns.
Corpus statistics:
Language Pair | Data Type | File Name | Size | Sections | Published Years |
---|---|---|---|---|---|
ZH->JA | TEST | test-ep.{zh,ja} | 1,151 | Ch/El/Me/Ph | 2011-2013 |
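
As an illustration of per-category evaluation, the sketch below groups sentence-level scores by expression-pattern category and averages them. It is not the official scoring procedure: the record layout `(category, hypothesis, reference)`, the function `score_by_category`, and the toy `exact_match` metric are all assumptions made for this example, and averaging sentence-level scores is not equivalent to a corpus-level metric such as BLEU.

```python
from collections import defaultdict

# Category labels named in the task description.
CATEGORIES = ("TIT", "ABS", "CLM", "DES")


def score_by_category(records, metric):
    """Average a sentence-level metric separately for each category.

    records: iterable of (category, hypothesis, reference) tuples
             (hypothetical layout; the real annotation format may differ).
    metric:  callable (hypothesis, reference) -> float.
    """
    buckets = defaultdict(list)
    for category, hypothesis, reference in records:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        buckets[category].append(metric(hypothesis, reference))
    # Only categories that actually occur in the records are reported.
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}


def exact_match(hypothesis, reference):
    """Toy metric for illustration only: 1.0 if the strings are identical."""
    return 1.0 if hypothesis == reference else 0.0
```

Any sentence-level scoring function with the same two-argument shape can be passed in place of `exact_match`.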
For questions, comments, etc., please email "wat -at- nlp.ist.i.kyoto-u.ac.jp".
2022-4-27: Training data size shown in corpus statistics was modified.
2018-8-7: "DETAIL" was updated
2018-7-24: "HOW TO OBTAIN" was updated
2018-7-19: updated
2018-6-30: site opened
NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2022-3-30