Patent Corpora

Start: 1 Jun 2010 15:17

Timezone: Europe/Vienna

ID:

7.2

Workpackage:

Case Study: Patents

Task leader:

meritxell.gonzalez

Assignees:

cristina.españa

lluis.marquez

Status:

Completed

Timeframe:

Jun 2011 - Oct 2012

Completed on:

30 November, 2012 - 23:00

Identify and gather bilingual and monolingual corpora for the patent case study.

The SMT system is trained with the MAREC corpus (WP5).
The EPO dataset is used for testing purposes (WP5).
The www-EPO dataset will be used to populate the retrieval databases (WP7).

Comments

Utrecht meeting notes

Submitted by meritxell.gonzalez on 1 October, 2012 - 11:34.

The corpus gathered from the EPO website was translated and added to the retrieval databases. However, the German and French texts were annotated using the general pipeline developed for English.

Ongoing work on this task is the translation of the corpus from English into French and German with the annotations included, so that they are preserved in both languages.

notes from 2nd review in Barcelona

Submitted by meritxell.gonzalez on 20 March, 2012 - 14:37.

Contact the Pluto project to share corpus pre-processing tools.

Zürich meeting minutes

Submitted by olga.caprotti on 8 March, 2012 - 14:21.

Alternative corpus of 7705 documents taken directly from the EPO site:

6M lines with claims
3M lines are trilingual
2M documents with claims only in English
66 documents with claims only in German
34 documents with claims only in French

Proposal:

use this for building a new language model
include all these in the retrieval system

Work in Progress:

Preparing the data for translation. Currently we have FR2EN.
