JPO Patent Corpus for WAT2018-2021

INTRODUCTION

The JPO Patent Corpus was constructed by the Japan Patent Office (JPO). It consists of Chinese-Japanese, Korean-Japanese, and English-Japanese patent description corpora, each with 1M parallel sentences for training, drawn from four sections of the International Patent Classification (IPC): Chemistry (Ch), Electricity (El), Mechanical engineering (Me), and Physics (Ph).

DETAIL

Datasets for Normal Tasks

These tasks evaluate the performance of a translation model for each language pair. Unlike the previous patent tasks at WAT2016-2017, new test sets have been added as follows:

test-N: Union of the following three sets

test-N1: Patent documents from patent families published between 2011 and 2013. This is the same data as the test sets used in previous years' WAT.

test-N2: Patent documents from patent families published between 2016 and 2017.

test-N3: Patent documents published between 2016 and 2017. Target sentences are manually created by translating source sentences.
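
The relationship between the test sets can be sketched as follows. This is a minimal illustration, assuming that "union" means the three subsets are disjoint and are simply concatenated in the order N1, N2, N3; the ordering is not stated here, and the function name build_test_n is hypothetical. The sizes in the corpus statistics (for ZH<-->JA, 2,000 + 3,000 + 204 = 5,204) are consistent with a disjoint union.

```python
from typing import List, Tuple

# A sentence pair: (source sentence, target sentence).
Pair = Tuple[str, str]


def build_test_n(n1: List[Pair], n2: List[Pair], n3: List[Pair]) -> List[Pair]:
    """Concatenate the three test subsets into one list.

    Assumption: the subsets are disjoint and their order is N1, N2, N3.
    """
    return [*n1, *n2, *n3]


# Toy data standing in for the real subsets.
toy_n1 = [("src a", "tgt a")]
toy_n2 = [("src b", "tgt b"), ("src c", "tgt c")]
toy_n3 = [("src d", "tgt d")]
toy_n = build_test_n(toy_n1, toy_n2, toy_n3)
print(len(toy_n))  # 4
```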

Corpus statistics:

Language Pair	Data Type	File Name	Size	Sections:Ratios	Published Years	Sentence Alignment
ZH<-->JA	TRAIN	train.{zh,ja}	1,000,000	Ch/El/Me/Ph:25%/25%/25%/25%	2011-2013	Automatic
	DEV	dev.{zh,ja}	2,000
	DEVTEST	devtest.{zh,ja}	2,000
	TEST	test-n1.{zh,ja}	2,000
	TEST	test-n2.{zh,ja}	3,000	Ch/El/Me/Ph:Unknown	2016-2017	Manual
	TEST	test-n3.{zh,ja}	204		2016-2017	Manual
	TEST	test-n.{zh,ja}	5,204		2011-2013, 2016-2017	Automatic/Manual
KO<-->JA	TRAIN	train.{ko,ja}	1,000,000	Ch/El/Me/Ph:25%/25%/25%/25%	2011-2013	Automatic
	DEV	dev.{ko,ja}	2,000
	DEVTEST	devtest.{ko,ja}	2,000
	TEST	test-n1.{ko,ja}	2,000
	TEST	test-n2.{ko,ja}	3,000	Ch/El/Me/Ph:Unknown	2016-2017	Manual
	TEST	test-n3.{ko,ja}	230		2016-2017	Manual
	TEST	test-n.{ko,ja}	5,230		2011-2013, 2016-2017	Automatic/Manual
EN<-->JA	TRAIN	train.{en,ja}	1,000,000	Ch/El/Me/Ph:25%/25%/25%/25%	2011-2013	Automatic
	DEV	dev.{en,ja}	2,000
	DEVTEST	devtest.{en,ja}	2,000
	TEST	test-n1.{en,ja}	2,000
	TEST	test-n2.{en,ja}	3,000	Ch/El/Me/Ph:Unknown	2016-2017	Manual
	TEST	test-n3.{en,ja}	668		2016-2017	Manual
	TEST	test-n.{en,ja}	5,668		2011-2013, 2016-2017	Automatic/Manual
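
A file pair such as dev.{en,ja} could be loaded along the following lines. This is a rough sketch, assuming the files are UTF-8 plain text with one sentence per line and that line i of one file is aligned with line i of the other; the file format is not specified here, and read_parallel is a hypothetical helper, not part of any distributed tooling.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_parallel(src_path: Path, tgt_path: Path) -> List[Tuple[str, str]]:
    """Read two line-aligned files into (source, target) pairs.

    Assumption: UTF-8 text, one sentence per line, line i of the source
    file aligned with line i of the target file.
    """
    with src_path.open(encoding="utf-8") as f:
        src_lines = [line.rstrip("\n") for line in f]
    with tgt_path.open(encoding="utf-8") as f:
        tgt_lines = [line.rstrip("\n") for line in f]
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )
    return list(zip(src_lines, tgt_lines))


# Demonstration on throwaway files named after the dev.{en,ja} pattern.
# The sentences are toy data, not taken from the corpus.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "dev.en"
    tgt = Path(tmp) / "dev.ja"
    src.write_text("A first sentence.\nA second sentence.\n", encoding="utf-8")
    tgt.write_text("最初の文。\n二番目の文。\n", encoding="utf-8")
    pairs = read_parallel(src, tgt)

print(len(pairs))  # 2
```

Comparing len(pairs) against the Size column above is a quick way to confirm that a download is complete.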

Datasets for Expression Pattern Task

This task evaluates the performance of a translation model for each predefined category of expression patterns: title of invention (TIT), abstract (ABS), scope of claim (CLM), or description (DES). The train/dev/devtest sets are the same data as those of the normal ZH<-->JA tasks. The test set for this task consists of sentences, each annotated with its corresponding expression-pattern category.

Corpus statistics:

Language Pair	Data Type	File Name	Size	Sections	Published Years
ZH->JA	TEST	test-ep.{zh,ja}	1,151	Ch/El/Me/Ph	2011-2013
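
Per-category reporting for this task could look roughly like the sketch below. It assumes the category annotations are available as a list of labels parallel to the test sentences and that some per-sentence score has already been computed; the annotation format is not described here, the scores are toy values, and this is not the official WAT evaluation procedure. The function name mean_score_by_category is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

# Category labels as listed in the task description.
CATEGORIES = ("TIT", "ABS", "CLM", "DES")


def mean_score_by_category(
    categories: Sequence[str], scores: Sequence[float]
) -> Dict[str, float]:
    """Average per-sentence scores within each expression-pattern category.

    Assumption: categories[i] is the label of the sentence scored by scores[i].
    """
    if len(categories) != len(scores):
        raise ValueError("categories and scores must have the same length")
    buckets: Dict[str, List[float]] = defaultdict(list)
    for cat, score in zip(categories, scores):
        if cat not in CATEGORIES:
            raise ValueError(f"unknown category: {cat}")
        buckets[cat].append(score)
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}


# Toy labels and scores, for illustration only.
toy_result = mean_score_by_category(["TIT", "CLM", "CLM"], [1.0, 0.5, 0.25])
print(toy_result)  # {'TIT': 1.0, 'CLM': 0.375}
```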

HOW TO OBTAIN

WAT2018 has finished. Please wait for the next WAT announcement.

CONTACT

For questions, comments, etc., please email "wat -at- nlp.ist.i.kyoto-u.ac.jp".

CHANGE LOG

2022-4-27: Training data size shown in corpus statistics was modified.
2018-8-7: "DETAIL" was updated
2018-7-24: "HOW TO OBTAIN" was updated
2018-7-19: updated
2018-6-30: site opened

NICT (National Institute of Information and Communications Technology)
Kyoto University
Last Modified: 2022-3-30