Patent Corpora

1 Jun 2010 15:17 Europe/Vienna
ID: 7.2
Workpackage: Case Study: Patents
Task leader: meritxell.gonzalez
Assignees: cristina.españa, lluis.marquez
Status: Completed
Timeframe: Jun 2011 - Oct 2012
Completed on: 30 November, 2012 - 23:00

Determining and gathering of bilingual and monolingual corpora for the patent case study.

  • The SMT system is trained with the MAREC corpus (WP5).
  • The EPO dataset is used for testing purposes (WP5).
  • The www-EPO dataset will be used to fill the retrieval databases (WP7).

Comments

Utrecht meeting notes

The corpus gathered from the EPO website was translated and added to the retrieval databases. However, the German and French texts were annotated using the general pipeline developed for English.

The ongoing work on this task is the translation of the corpus from English to French and German together with its annotations, so that the annotations are preserved in these two languages.

Notes from 2nd review in Barcelona

Contact Pluto project to share corpus pre-processing tools

Zürich meeting minutes

Alternative corpus of 7705 documents directly from EPO site:

  • 6M lines with claims
  • 3M lines are trilingual
  • 2M documents with claims only in English
  • 66 documents with claims only in German
  • 34 documents with claims only in French
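
A breakdown like the one above could be tallied along the following lines. This is only a minimal sketch: it assumes each document is available as a mapping from a language code to its claims text, and the data layout, function names, and sample documents are hypothetical, not the project's actual tooling or data.

```python
from collections import Counter


def claim_language_profile(claims_by_lang):
    """Return the sorted tuple of languages that have non-empty claims text."""
    return tuple(
        sorted(
            lang
            for lang, text in claims_by_lang.items()
            if text and text.strip()
        )
    )


def tally_profiles(documents):
    """Count how many documents fall into each claim-language profile."""
    return Counter(claim_language_profile(doc) for doc in documents)


# Made-up toy documents (assumption: one dict per document, keyed by language).
docs = [
    {"en": "A device comprising ...", "de": "Vorrichtung mit ...", "fr": "Dispositif comprenant ..."},
    {"en": "A method for ...", "de": "", "fr": ""},
    {"de": "Verfahren zur ..."},
]

counts = tally_profiles(docs)
for profile, n in sorted(counts.items()):
    print("+".join(profile), n)
```

Documents whose profile contains all three languages correspond to the trilingual part of the corpus; single-language profiles correspond to the English-only, German-only, and French-only groups.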

Proposal:

  • use this for building a new language model
  • include all these in the retrieval system

Work in Progress:

  • Preparing the data for translation. Currently we have FR2EN.