The WAT2024 patent task uses the JPO Patent Corpus (JPC).
The JPC was constructed by the Japan Patent Office (JPO). It consists of Chinese-Japanese, Korean-Japanese, and English-Japanese patent description corpora, each with 1M parallel training sentences. Most sentences in the corpus come from four International Patent Classification (IPC) sections: Chemistry (Ch), Electricity (El), Mechanical engineering (Me), and Physics (Ph).
Since 2022, the ko-ja test-N2 set has been removed due to a technical problem, new test-N4 sets have been added, and the test-2022 sets replace the previous test-N sets. (See the previous patent tasks at WAT2018--2021.)
Corpus statistics:
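Language Pair | Data Type | File Name | Size (sentence pairs) | Sections:Ratios | Published Years | Sentence Alignment |
---|---|---|---|---|---|---|
ZH<-->JA | TRAIN | train.{zh,ja} | 1,000,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
ZH<-->JA | DEV | dev.{zh,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
ZH<-->JA | DEVTEST | devtest.{zh,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
ZH<-->JA | TEST | test-n1.{zh,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
ZH<-->JA | TEST | test-n2.{zh,ja} | 3,000 | Ch/El/Me/Ph:Unknown | 2016-2017 | Automatic |
ZH<-->JA | TEST | test-n3.{zh,ja} | 204 | Ch/El/Me/Ph:Unknown | 2016-2017 | Manual |
ZH<-->JA | TEST | test-n4.{zh,ja} | 5,000 | Uncontrolled | 2019-2020 | Automatic |
ZH<-->JA | TEST | test-2022.{zh,ja} | 10,204 | Uncontrolled | 2011-2020 | Automatic/Manual |
KO<-->JA | TRAIN | train.{ko,ja} | 1,000,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
KO<-->JA | DEV | dev.{ko,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
KO<-->JA | DEVTEST | devtest.{ko,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
KO<-->JA | TEST | test-n1.{ko,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
KO<-->JA | TEST | test-n2.{ko,ja} | 0 | N/A | N/A | N/A |
KO<-->JA | TEST | test-n3.{ko,ja} | 230 | Ch/El/Me/Ph:Unknown | 2016-2017 | Manual |
KO<-->JA | TEST | test-n4.{ko,ja} | 5,000 | Uncontrolled | 2019-2020 | Automatic |
KO<-->JA | TEST | test-2022.{ko,ja} | 7,230 | Uncontrolled | 2011-2020 | Automatic/Manual |
EN<-->JA | TRAIN | train.{en,ja} | 1,000,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
EN<-->JA | DEV | dev.{en,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
EN<-->JA | DEVTEST | devtest.{en,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
EN<-->JA | TEST | test-n1.{en,ja} | 2,000 | Ch/El/Me/Ph:25%/25%/25%/25% | 2011-2013 | Automatic |
EN<-->JA | TEST | test-n2.{en,ja} | 3,000 | Ch/El/Me/Ph:Unknown | 2016-2017 | Automatic |
EN<-->JA | TEST | test-n3.{en,ja} | 668 | Ch/El/Me/Ph:Unknown | 2016-2017 | Manual |
EN<-->JA | TEST | test-n4.{en,ja} | 5,000 | Uncontrolled | 2019-2020 | Automatic |
EN<-->JA | TEST | test-2022.{en,ja} | 10,668 | Uncontrolled | 2011-2020 | Automatic/Manual |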
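
For each language pair, the test-2022 size listed in the corpus statistics equals the sum of the test-n1 to test-n4 sizes, which is consistent with test-2022 being the combination of those sets. This is an observation from the numbers, not something the organizers state here. A minimal Python sketch of the check follows; the dictionary layout and the function names `sum_of_test_n` and `check_sizes` are illustrative assumptions, and only the sizes are taken from the table.

```python
# Test-set sizes copied from the corpus statistics table.
# The dictionary layout and key names are illustrative, not an official format.
TEST_SIZES = {
    "zh-ja": {"test-n1": 2000, "test-n2": 3000, "test-n3": 204,
              "test-n4": 5000, "test-2022": 10204},
    "ko-ja": {"test-n1": 2000, "test-n2": 0, "test-n3": 230,
              "test-n4": 5000, "test-2022": 7230},
    "en-ja": {"test-n1": 2000, "test-n2": 3000, "test-n3": 668,
              "test-n4": 5000, "test-2022": 10668},
}


def sum_of_test_n(sizes):
    """Add up the test-n1 ... test-n4 sizes for one language pair."""
    return sum(size for name, size in sizes.items() if name.startswith("test-n"))


def check_sizes(table=TEST_SIZES):
    """Map each language pair to True when test-2022 equals the sum of test-n1..n4."""
    return {
        pair: sum_of_test_n(sizes) == sizes["test-2022"]
        for pair, sizes in table.items()
    }


if __name__ == "__main__":
    for pair, ok in check_sizes().items():
        print(pair, "consistent" if ok else "mismatch")
```

The ko-ja total is smaller than the other two because its test-n2 set is empty and its test-n3 set has a different size.
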
Translation result submission site (JPC3)
For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
2023-04-23: Updated for WAT2024.
2023-04-20: "HOW TO OBTAIN" was updated.
2022-05-18: "HOW TO OBTAIN" was updated.
2022-04-27: Training data size shown in corpus statistics was modified.
2021-03-30: Site opened.
NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2023-04-20