WP7: Case Study Patents

The work will start with the provision of user requirements (WP9) and the preparation of a parallel patent corpus (EPO) to support the training of statistical MT (UPC). In parallel, UGOT will work on grammars covering the domain and subsequently, together with UPC, apply the hybrid MT (WP2, WP5) to abstracts and claims. Ontotext will provide the semantic infrastructure, loaded with existing structured data sets (WP4) from the patent domain (IPC, patent ontology, biomedical and pharmaceutical knowledge bases, e.g. LLD). Based on the use case requirements, Ontotext will build a prototype (D7.1, D7.2) exposing multiple cross-lingual retrieval paradigms and MT of patent sections. Translation accuracy will be evaluated regularly through both automatic (e.g. BLEU scoring) and human-based (e.g. TAUS) means (WP9).
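For readers unfamiliar with the automatic metric mentioned above, the following is a minimal sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams, combined with a brevity penalty). It assumes a single reference per sentence, pre-tokenized input and no smoothing; the function names `ngrams` and `corpus_bleu` are illustrative and do not describe the evaluation tooling actually used in the project.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis, no smoothing.

    Assumption: both arguments are lists of token lists of equal length.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty only applies when the output is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyp = "the cat sat on the mat".split()
    ref = "the cat sat on a mat".split()
    print(f"BLEU = {corpus_bleu([hyp], [ref]):.4f}")
```

Because no smoothing is applied, the score is only meaningful when computed over a whole test set rather than over individual short sentences.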

Task List

The work package is split into 9 major tasks as follows:

  • Task 7.1 User Requirements and Scenarios (Task Lead: UPC)
  • Task 7.2 Patent corpora (Task Lead: UPC)
  • Task 7.3 Grammars for the patent domain (Task Lead: UGOT)
  • Task 7.4 Ontologies and document indexation (Task Lead: Ontotext)
  • Task 7.5 Prototype (Task Lead: Ontotext)
  • Task 7.6 SMT and Hybrid MT (Task Lead: UPC)
  • Task 7.7 Prototype (user interface) (Task Lead: Ontotext)
  • Task 7.8 Human evaluation (Task Lead: TBD)
  • Task 7.9 Patent Case Study: Final Report (Task Lead: UPC)


Month 10-15 plan

  • Task 7.2 starts in M10 and is due to provide a first set of corpora at the end of M16. Final revision depends on the availability of the EPO data.
  • Task 7.3 starts in M10 and is due to provide a preliminary report at the end of M16.

Month 16-21 plan

  • Task 7.1 starts at M15 and is due to provide a preliminary version at the beginning of M17.
  • Task 7.3 will produce a more complete report by the beginning of M19.
  • Task 7.4 starts at M16 and is due to provide a description of the type of queries at the end of M16.
  • Task 7.5 starts at M16 and is due to provide a description of the Prototype architecture at the end of M16.
  • Task 7.6 starts along with WP5 and will produce an SMT baseline for the Patents prototype.
  • D7.1 deadline is M21.

User Requirements

1 Jun 2010
1 Jul 2010
Europe/Stockholm
ID: 
7.1
Workpackage: 
Case Study: Patents
Assignees: 
aarne.ranta
Assignees: 
meritxell.gonzalez
Status: 
Completed
Timeframe: 
May 2011 - Oct 2011

The patents case study comprises two basic scenarios: online patent retrieval and patent translation. In this prototype we tackle the two scenarios separately, as shown in Figure 1, even though they can be viewed as a single multilingual patent retrieval paradigm. In the future, we plan to study how to automate the reciprocal inputs between the two processes, i.e. the annotation of translations and the translation of semantically annotated documents.

From a general perspective, two user roles may be defined in this case study: end-users looking for information related to the patents and editors adding new patent documents to a hypothetical repository.

Details are given in D7.1.

Patent Corpora

1 Jun 2010 15:17
Europe/Vienna
ID: 
7.2
Workpackage: 
Case Study: Patents
Task leader: 
meritxell.gonzalez
Assignees: 
cristina.españa
Assignees: 
lluis.marquez
Status: 
Completed
Timeframe: 
Jun 2011 - Oct 2012
Completed on: 
30 November, 2012 - 23:00

Determining and gathering bilingual and monolingual corpora for the patent case study.

  • The SMT system is trained with the MAREC corpus (WP5).
  • The EPO dataset is used for testing purposes (WP5).
  • The www-EPO dataset will be used to fill the retrieval databases (WP7).

Grammars for the patent domain

1 Aug 2010
Europe/Stockholm
ID: 
7.3
Workpackage: 
Case Study: Patents
Task leader: 
ramona.enache
Assignees: 
aarne.ranta
Assignees: 
ramona.enache
Status: 
Ongoing
Timeframe: 
Jan 2011 - Nov 2012

There are two subtasks here:

  • Grammars for the translation of patent documents.
  • Grammars for the online translation of CNL queries.

Ontologies and Document Indexation

ID: 
7.4
Workpackage: 
Case Study: Patents
Task leader: 
meritxell.gonzalez
Assignees: 
borislav.popov
Assignees: 
mariana.damova
Status: 
Ongoing
Timeframe: 
Jun 2011 - Oct 2012

Developing an ontology that captures the structure of patent documents, and indexing the patent documents according to this semantic knowledge.

Patents Retrieval System

1 Jul 2010
Europe/Vienna
ID: 
7.5
Workpackage: 
Case Study: Patents
Task leader: 
lluis.marquez
Assignees: 
borislav.popov
Assignees: 
milen.chechev
Assignees: 
petar
Relevant Deliverables: 
Patent Case Study Final Report
Relevant Deliverables: 
Patent MT and Retrieval Prototype
Relevant Deliverables: 
Patent MT and Retrieval Prototype Beta
Dependencies: 
Patent Corpora
Status: 
Completed
Timeframe: 
Jun 2011 - Dec 2012

Contact @UPC: Lluis and Cristina

DEPENDENCIES:

  • TASK 1, 2, 3 and 4
  • WP4. Knowledge Engineering
  • TASK 8 (for final version of prototype)

Participants:

  • Ontotext,
  • UGOT,
  • UPC

Contact point @Ontotext: Borislav Popov

DEADLINES: Beta = M21; Final = M27

Machine Translation Systems

22 Mar 2010
Europe/Vienna
ID: 
7.6
Workpackage: 
Case Study: Patents
Assignees: 
aarne.ranta
Assignees: 
cristina.españa
Assignees: 
lluis.marquez
Assignees: 
meritxell.gonzalez
Assignees: 
ramona.enache
Relevant Deliverables: 
Patent Case Study Final Report
Relevant Deliverables: 
Patent MT and Retrieval Prototype
Relevant Deliverables: 
Patent MT and Retrieval Prototype Beta
Status: 
Completed
Timeframe: 
Jan 2012 - Dec 2012
Completed on: 
11 January, 2013 (All day)

Contact @UPC: Lluis and Cristina

DEPENDENCIES:

  • TASK 2, 3
  • WP5. A baseline of the WP5 system will be integrated in the prototype.

Patent abstracts and claims are translated using the baseline of the hybrid system.

Prototype (User Interface)

1 Dec 2010
31 Oct 2011
Europe/Vienna
ID: 
7.7
Workpackage: 
Case Study: Patents
Task leader: 
borislav.popov
Assignees: 
borislav.popov
Assignees: 
cristina.españa
Assignees: 
lluis.marquez
Assignees: 
meritxell.gonzalez
Assignees: 
milen.chechev
Relevant Deliverables: 
Patent MT and Retrieval Prototype
Relevant Deliverables: 
Patent MT and Retrieval Prototype Beta
Dependencies: 
Machine Translation Systems
Dependencies: 
Patents Retrieval System
Status: 
Completed
Timeframe: 
Jun 2011 - Sep 2012

DEPENDENCIES:

  • TASK 1
  • TASK 8 (for final version of prototype)

Participants:

  • Ontotext,
  • UGOT,
  • UPC

Contact point @Ontotext: Borislav Popov

DEADLINES: Beta = M21; Final = M27

Evaluations

1 Jun 2011
30 Jun 2012
Europe/Vienna
ID: 
7.8
Workpackage: 
Case Study: Patents
Assignees: 
aarne.ranta
Assignees: 
maria.mateva
Assignees: 
meritxell.gonzalez
Relevant Deliverables: 
Patent Case Study Final Report
Status: 
Planned

DEPENDENCIES:

  • TASK 5

Note: Deadlines have been delayed 3 months due to the WP delay.

DEADLINE: M31 (to allow for final report)

Subtasks

  • Preparation starts M19 (at the very latest)
  • Hiring translators
  • Producing guidelines for translators
  • Full evaluation starts at M28 at the latest
    • Evaluation will make use of the TAUS criteria






TAUS Evaluation Criteria:

  • Excellent (4):
    • Accurately transfers all info; correct terminology, correct grammar. Understanding not improved by reading the source text.
  • Good (3):
    • Contains minor mistakes; would not need to refer to source text to correct the mistakes.
  • Medium (2):
    • Significant errors in output. Would need to read the source text to correct the errors.
  • Poor (1):
    • Serious errors in output. Would need to read the source text to understand the output. Would probably need to retranslate from scratch.
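
As an illustration of how ratings on the four-point scale above could be summarized, here is a minimal sketch. It assumes each translated segment receives one integer rating from 1 to 4; reporting the mean score and the per-label counts is an assumption made for illustration, not the project's stated protocol, and the names `TAUS_SCALE` and `summarize_ratings` are hypothetical.

```python
from collections import Counter
from statistics import mean

# Labels of the four-point scale listed above.
TAUS_SCALE = {4: "Excellent", 3: "Good", 2: "Medium", 1: "Poor"}


def summarize_ratings(ratings):
    """Return (mean score, counts per label) for a list of 1-4 ratings."""
    if not ratings:
        raise ValueError("no ratings given")
    invalid = [r for r in ratings if r not in TAUS_SCALE]
    if invalid:
        raise ValueError(f"ratings outside the 1-4 scale: {invalid}")
    counts = Counter(ratings)
    # Include labels with zero occurrences so the summary always has four rows.
    distribution = {label: counts.get(score, 0) for score, label in TAUS_SCALE.items()}
    return mean(ratings), distribution


if __name__ == "__main__":
    average, distribution = summarize_ratings([4, 3, 3, 2])
    print(average, distribution)
```

Keeping the per-label counts alongside the mean shows whether an average score hides a large share of "Poor" segments.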