WP7: Case Study Patents

The work will start with the provision of user requirements (WP9) and the preparation of a parallel patent corpus (EPO) to fuel the training of statistical MT (UPC). In parallel, UGOT will work on grammars covering the domain and subsequently, together with UPC, apply the hybrid MT (WP2, WP5) to abstracts and claims. Ontotext will provide the semantic infrastructure, loaded with existing structured data sets (WP4) from the patent domain (IPC, patent ontology, biomedical and pharmaceutical knowledge bases, e.g. LLD). Based on the use case requirements, Ontotext will build a prototype (D7.1, D7.2) exposing multiple cross-lingual retrieval paradigms and MT of patent sections. Translation accuracy will be evaluated regularly through both automatic (e.g. BLEU scoring) and human-based (e.g. TAUS) means (WP9).
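As an illustration of the automatic evaluation mentioned above, the following is a minimal sketch of a sentence-level BLEU score. It is not the evaluation tooling used in the project: the function names `ngrams` and `sentence_bleu` are hypothetical, the score is unsmoothed and uses a single reference with whitespace tokenisation, and the example sentences are made up. BLEU is normally reported at corpus level with an established implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU for one sentence (illustrative only)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            # Without smoothing, any zero precision gives a zero score.
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the device comprises a sensor"
    print(sentence_bleu("the device comprises a sensor unit", reference))
```

Because the score is unsmoothed, a short sentence with no matching 4-gram scores zero, which is one reason corpus-level BLEU is usually preferred for reporting.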

Task List

The work package is split into 9 major tasks as follows:

Task 7.1 User Requirements and Scenarios (Task Lead: UPC)
Task 7.2 Patent corpora (Task Lead: UPC)
Task 7.3 Grammars for the patent domain (Task Lead: UGOT)
Task 7.4 Ontologies and document indexation (Task Lead: Ontotext)
Task 7.5 Prototype (Task Lead: Ontotext)
Task 7.6 SMT and Hybrid MT (Task Lead: UPC)
Task 7.7 Prototype (user interface) (Task Lead: Ontotext)
Task 7.8 Human evaluation (Task Lead: TBD)
Task 7.9 Patent Case Study: Final Report (Task Lead: UPC)

Month 10-15 plan

Task 7.2 starts in M10 and is due to provide a first set of corpora at the end of M16. The final revision depends on the availability of the EPO data.
Task 7.3 starts in M10 and is due to provide a preliminary report at the end of M16.

Month 16-21 plan

Task 7.1 starts at M15 and is due to provide a preliminary version at the beginning of M17.
Task 7.3 will produce a more complete report by the beginning of M19.
Task 7.4 starts at M16 and is due to provide a description of the type of queries at the end of M16.
Task 7.5 starts at M16 and is due to provide a description of the Prototype architecture at the end of M16.
Task 7.6 starts along with WP5 and will produce an SMT baseline for the Patents prototype.
D7.1 deadline is M21.

User Requirements

Start: 1 Jun 2010

End: 1 Jul 2010

Timezone: Europe/Stockholm

ID:

7.1

Workpackage:

Case Study: Patents

Assignees:

aarne.ranta

Assignees:

meritxell.gonzalez

Status:

Completed

Timeframe:

May 2011 - Oct 2011

The patents case study comprises two basic scenarios: online patent retrieval and patent translation. In this prototype we tackle these two scenarios separately, as shown in Figure 1, even though they can be viewed as a single multilingual patent retrieval paradigm. In the future, we plan to study how to automate the reciprocal inputs between the two processes, i.e. the annotation of translations and the translation of semantically annotated documents.

From a general perspective, two user roles may be defined in this case study: end-users looking for information related to patents, and editors adding new patent documents to a hypothetical repository.

Details are given in D7.1.

Patent Corpora

Start: 1 Jun 2010 15:17

Timezone: Europe/Vienna

ID:

7.2

Workpackage:

Case Study: Patents

Task leader:

meritxell.gonzalez

Assignees:

cristina.españa

Assignees:

lluis.marquez

Status:

Completed

Timeframe:

Jun 2011 - Oct 2012

Completed on:

30 November, 2012 - 23:00

Determining and gathering bilingual and monolingual corpora for the patent case study.

The SMT system is trained with the MAREC corpus (WP5).
The EPO dataset is used for testing purposes (WP5).
The www-EPO dataset will be used to fill the retrieval databases (WP7).

Grammars for the patent domain

Start: 1 Aug 2010

Timezone: Europe/Stockholm

ID:

7.3

Workpackage:

Case Study: Patents

Task leader:

ramona.enache

Assignees:

aarne.ranta

Assignees:

ramona.enache

Status:

Ongoing

Timeframe:

Jan 2011 - Nov 2012

There are two subtasks here:

Grammars for translation of the patent documents.
Grammars for online translation of CNL queries.

Ontologies and Document Indexation

Start: 0

Timezone:

ID:

7.4

Workpackage:

Case Study: Patents

Task leader:

meritxell.gonzalez

Assignees:

borislav.popov

Assignees:

mariana.damova

Dependencies:

Grammars for the patent domain

Status:

Ongoing

Timeframe:

Jun 2011 - Oct 2012

Developing an ontology that captures the structure of patent documents, and indexing the patent documents according to this semantic knowledge.

Patents Retrieval System

Start: 1 Jul 2010

Timezone: Europe/Vienna

ID:

7.5

Workpackage:

Case Study: Patents

Task leader:

lluis.marquez

Assignees:

borislav.popov

Assignees:

milen.chechev

Assignees:

petar

Relevant Deliverables:

Patent Case Study Final Report

Relevant Deliverables:

Patent MT and Retrieval Prototype

Relevant Deliverables:

Patent MT and Retrieval Prototype Beta

Dependencies:

Grammars for the patent domain

Dependencies:

Ontologies and Document Indexation

Dependencies:

Patent Corpora

Status:

Completed

Timeframe:

Jun 2011 - Dec 2012

Contact @UPC: Lluis and Cristina

DEPENDENCIES:

TASK 1, 2, 3 and 4
WP4. Knowledge Engineering
TASK 8 (for final version of prototype)

Participants:

Ontotext,
UGOT,
UPC

Contact point @Ontotext: Borislav Popov

DEADLINES: Beta = M21; Final = M27

Machine Translation Systems

Start: 22 Mar 2010

Timezone: Europe/Vienna

ID:

7.6

Workpackage:

Case Study: Patents

Assignees:

aarne.ranta

Assignees:

cristina.españa

Assignees:

lluis.marquez

Assignees:

meritxell.gonzalez

Assignees:

ramona.enache

Relevant Deliverables:

Patent Case Study Final Report

Relevant Deliverables:

Patent MT and Retrieval Prototype

Relevant Deliverables:

Patent MT and Retrieval Prototype Beta

Dependencies:

Grammars for the patent domain

Status:

Completed

Timeframe:

Jan 2012 - Dec 2012

Completed on:

11 January, 2013 (All day)

Contact @UPC: Lluis and Cristina

DEPENDENCIES:

TASK 2, 3
WP5. A baseline of the WP5 system will be integrated in the prototype.

Patent abstracts and claims are translated using the baseline of the hybrid system.

Prototype (User Interface)

Start: 1 Dec 2010

End: 31 Oct 2011

Timezone: Europe/Vienna

ID:

7.7

Workpackage:

Case Study: Patents

Task leader:

borislav.popov

Assignees:

borislav.popov

Assignees:

cristina.españa

Assignees:

lluis.marquez

Assignees:

meritxell.gonzalez

Assignees:

milen.chechev

Relevant Deliverables:

Patent MT and Retrieval Prototype

Relevant Deliverables:

Patent MT and Retrieval Prototype Beta

Dependencies:

Machine Translation Systems

Dependencies:

Patents Retrieval System

Status:

Completed

Timeframe:

Jun 2011 - Sep 2012

DEPENDENCIES:

TASK 1
TASK 8 (for final version of prototype)

Participants:

Ontotext,
UGOT,
UPC

Contact point @Ontotext: Borislav Popov

DEADLINES: Beta = M21; Final = M27

Evaluations

Start: 1 Jun 2011

End: 30 Jun 2012

Timezone: Europe/Vienna

ID:

7.8

Workpackage:

Case Study: Patents

Assignees:

aarne.ranta

Assignees:

maria.mateva

Assignees:

meritxell.gonzalez

Relevant Deliverables:

Patent Case Study Final Report

Dependencies:

Grammars for the patent domain

Dependencies:

Ontologies and Document Indexation

Dependencies:

Patents Retrieval System

Dependencies:

Prototype (User Interface)

Status:

Planned

DEPENDENCIES:

TASK 5

Note: Deadlines have been delayed 3 months due to the WP delay.

DEADLINE: M31 (to allow for final report)

Subtasks

Preparation starts M19 (at the very latest)
Hiring translators
Producing guidelines for translators
Full evaluation starts at M28 at the latest
- Evaluation will make use of the TAUS criteria

TAUS Evaluation Criteria:

Excellent (4):
- Accurately transfers all info; correct terminology, correct grammar. Understanding not improved by reading the source text.
Good (3):
- Contains minor mistakes; would not need to refer to source text to correct the mistakes.
Medium (2):
- Significant errors in output. Would need to read the source text to correct the errors.
Poor (1):
- Serious errors in output. Would need to read the source text to understand the output. Would probably need to retranslate from scratch.
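The four-point scale above lends itself to a simple tally once the translators' judgements are collected. The following is a minimal sketch under stated assumptions: the names `TAUS_LABELS` and `summarize_taus` are hypothetical, each judgement is assumed to be recorded as an integer from 1 to 4, and the sample scores are invented for illustration rather than taken from the evaluation.

```python
from statistics import mean

# Labels of the four-point scale listed above.
TAUS_LABELS = {4: "Excellent", 3: "Good", 2: "Medium", 1: "Poor"}


def summarize_taus(scores):
    """Return the mean score and the number of judgements per label."""
    if not scores:
        raise ValueError("no scores given")
    for score in scores:
        if score not in TAUS_LABELS:
            raise ValueError(f"score outside the 1-4 scale: {score!r}")
    counts = {label: 0 for label in TAUS_LABELS.values()}
    for score in scores:
        counts[TAUS_LABELS[score]] += 1
    return {"mean": mean(scores), "counts": counts}


if __name__ == "__main__":
    # Invented judgements for four translated segments.
    print(summarize_taus([4, 3, 3, 1]))
```

Reporting the per-label counts alongside the mean keeps the distribution visible, since an average of 2.75 could hide a mix of excellent and poor segments.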