JPO Patent Corpus

INTRODUCTION

The JPO Patent Corpus was constructed by the Japan Patent Office (JPO). It consists of Chinese-Japanese, Korean-Japanese, and English-Japanese patent description corpora, each with 1M parallel sentences of training data.

DETAIL

Most sentences in the corpus come from four International Patent Classification (IPC) sections: Chemistry (Ch), Electricity (El), Mechanical engineering (Me), and Physics (Ph). Compared with the previous patent tasks at WAT2018-2021, the ko-ja test-N2 set has been removed due to a technical problem, new test-N4 sets have been added, and test-2022 sets are provided in place of the previous test-N sets.

Corpus statistics:

Language Pair  Data Type  File Name          Size (sentences)  Sections:Ratios              Published Years  Sentence Alignment
ZH<-->JA       TRAIN      train.{zh,ja}      1,000,000         Ch/El/Me/Ph:25%/25%/25%/25%  2011-2013        Automatic
               DEV        dev.{zh,ja}        2,000
               DEVTEST    devtest.{zh,ja}    2,000
               TEST       test-n1.{zh,ja}    2,000
               TEST       test-n2.{zh,ja}    3,000             Ch/El/Me/Ph:Unknown          2016-2017        Automatic
               TEST       test-n3.{zh,ja}    204                                                             Manual
               TEST       test-n4.{zh,ja}    5,000             Uncontrolled                 2019-2020        Automatic
               TEST       test-2022.{zh,ja}  10,204                                         2011-2020        Automatic/Manual
KO<-->JA       TRAIN      train.{ko,ja}      1,000,000         Ch/El/Me/Ph:25%/25%/25%/25%  2011-2013        Automatic
               DEV        dev.{ko,ja}        2,000
               DEVTEST    devtest.{ko,ja}    2,000
               TEST       test-n1.{ko,ja}    2,000
               TEST       test-n2.{ko,ja}    0                 N/A
               TEST       test-n3.{ko,ja}    230               Ch/El/Me/Ph:Unknown          2016-2017        Manual
               TEST       test-n4.{ko,ja}    5,000             Uncontrolled                 2019-2020        Automatic
               TEST       test-2022.{ko,ja}  7,230                                          2011-2020        Automatic/Manual
EN<-->JA       TRAIN      train.{en,ja}      1,000,000         Ch/El/Me/Ph:25%/25%/25%/25%  2011-2013        Automatic
               DEV        dev.{en,ja}        2,000
               DEVTEST    devtest.{en,ja}    2,000
               TEST       test-n1.{en,ja}    2,000
               TEST       test-n2.{en,ja}    3,000             Ch/El/Me/Ph:Unknown          2016-2017        Automatic
               TEST       test-n3.{en,ja}    668                                                             Manual
               TEST       test-n4.{en,ja}    5,000             Uncontrolled                 2019-2020        Automatic
               TEST       test-2022.{en,ja}  10,668                                         2011-2020        Automatic/Manual
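
As a rough illustration of the file naming used in the corpus statistics, the following Python sketch reads one split as sentence pairs and checks the listed test-set sizes against each other. It assumes, which this page does not state, that each file is UTF-8 text with one sentence per line and that line i of the two files in a pair is a translation pair. The names read_parallel, n_sets_total, and TEST_SIZES are hypothetical helpers introduced only for this example; they are not part of the corpus distribution.

```python
from pathlib import Path

# Test-set sizes copied from the corpus statistics (per language pair with Japanese).
TEST_SIZES = {
    "zh": {"test-n1": 2000, "test-n2": 3000, "test-n3": 204, "test-n4": 5000, "test-2022": 10204},
    "ko": {"test-n1": 2000, "test-n2": 0, "test-n3": 230, "test-n4": 5000, "test-2022": 7230},
    "en": {"test-n1": 2000, "test-n2": 3000, "test-n3": 668, "test-n4": 5000, "test-2022": 10668},
}


def read_parallel(corpus_dir, split, src, tgt="ja"):
    """Return a list of (source, target) sentence pairs for one split.

    Assumption: UTF-8 files named <split>.<lang>, one sentence per line,
    with line i of both files forming a pair.
    """
    base = Path(corpus_dir)
    with open(base / f"{split}.{src}", encoding="utf-8") as f_src, \
         open(base / f"{split}.{tgt}", encoding="utf-8") as f_tgt:
        src_lines = [line.rstrip("\n") for line in f_src]
        tgt_lines = [line.rstrip("\n") for line in f_tgt]
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"{split}: {len(src_lines)} source lines vs {len(tgt_lines)} target lines"
        )
    return list(zip(src_lines, tgt_lines))


def n_sets_total(lang):
    """Sum of the listed test-n1 to test-n4 sizes for one language pair."""
    sizes = TEST_SIZES[lang]
    return sum(sizes[f"test-n{i}"] for i in range(1, 5))
```

For every language pair, n_sets_total(lang) equals the listed test-2022 size (for example 2,000 + 3,000 + 204 + 5,000 = 10,204 for zh-ja). The sizes are therefore consistent with test-2022 combining the four test-n sets, although this page does not say so explicitly. A call such as read_parallel("path/to/corpus", "dev", "zh") would load dev.zh and dev.ja, where the directory path is a placeholder.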

HOW TO OBTAIN

  1. Complete the license agreement (English/Japanese). A stamp or signature is not required.
  2. Email the agreement to the Japan Patent Office (PA0630 -at- jpo.go.jp).
  3. Once the JPO receives the original agreement and approves the application, the WAT organizers or JPO staff will email the applicant a link to download the corpus.

CONTACT

For questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".

CHANGE LOG

2023-04-20: "HOW TO OBTAIN" was updated.
2022-05-18: "HOW TO OBTAIN" was updated.
2022-04-27: Training data size shown in corpus statistics was modified.
2021-03-30: Site opened.


NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2023-04-20