Parallel corpus compilation in Patents domain

ID:

5.1

Workpackage:

Statistical and Robust Translation

Task leader:

cristina.españa

Assignees:

cristina.españa

Relevant Deliverables:

Description of the final collection of corpora

Dependencies:

Patent Corpora

Status:

Ongoing

Timeframe:

Sep 2010 - Dec 2010

Compilation of a parallel in-domain corpus. The corpus will be built from the set of patents provided by WP7. The task also includes defining development and test sets drawn from the corpus.

M16. UPC. (Corresponding to D51, M18)
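As an illustration of how development and test sets could be drawn from the corpus, here is a minimal sketch. It assumes the corpus is a list of aligned segments and that the held-out sets are sampled at random with a fixed seed; the function name `split_corpus`, the set sizes, and the random sampling itself are assumptions made for the example, not the procedure decided for this task.

```python
import random


def split_corpus(segments, dev_size, test_size, seed=0):
    """Split aligned segments into train, dev and test (hypothetical helper).

    Each element of `segments` is assumed to be one aligned unit, for
    example a tuple (en, fr, de), so the translations stay together.
    """
    if dev_size + test_size > len(segments):
        raise ValueError("dev_size + test_size exceeds the corpus size")

    # Shuffle indices with a fixed seed so the split is reproducible.
    indices = list(range(len(segments)))
    random.Random(seed).shuffle(indices)
    dev_idx = set(indices[:dev_size])
    test_idx = set(indices[dev_size:dev_size + test_size])

    train, dev, test = [], [], []
    for i, segment in enumerate(segments):
        if i in dev_idx:
            dev.append(segment)
        elif i in test_idx:
            test.append(segment)
        else:
            train.append(segment)
    return train, dev, test


# Toy data standing in for aligned claims; not taken from the real corpus.
toy = [("claim %d" % i, "revendication %d" % i, "Anspruch %d" % i)
       for i in range(10)]
train, dev, test = split_corpus(toy, dev_size=2, test_size=2)
```

Keeping the three translations of a segment in one tuple means a segment can never end up in training for one language pair and in test for another.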

Comments

A first corpus is available

Submitted by cristina.españa on 6 October, 2010 - 11:39.

Part of the MAREC corpus has been obtained from IRF under a personal research license. The data correspond to the corpus given to participants of CLEF-IP '10 (http://www.ir-facility.org/research/evaluation/clef-ip-10).

Some details of the ongoing work:

I've been gathering the Patents corpus, and it may be useful for you to have more examples. I have several versions now:

Here are the patent claims in a simple XML format. All of them are classified as A61P.
http://www.lsi.upc.edu/~cristinae/dwnld/patents000000.en

A clean version in raw text format (this file covers more patents, in fact all of them, but contains fewer claims, because only the claims that are well aligned in the 3 languages are kept):
http://www.lsi.upc.edu/~cristinae/dwnld/patsA61P.train.en
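To make the filtering idea concrete, here is a minimal sketch of keeping only the claims available in all three languages. The data layout (one dictionary per language, keyed by patent id and claim number), the helper name `keep_trilingual`, and the test used (the claim exists and is non-empty in every language) are all assumptions for illustration; the actual criterion for "well aligned" is not described here.

```python
def keep_trilingual(claims_by_lang, langs=("en", "fr", "de")):
    """Keep only the claims present in every language (hypothetical helper).

    `claims_by_lang` maps a language code to a dict whose keys are
    (patent_id, claim_number) pairs and whose values are the claim text.
    Returns one list per language, with the lists in parallel order.
    """
    # Keys shared by all languages.
    shared = set(claims_by_lang[langs[0]])
    for lang in langs[1:]:
        shared &= set(claims_by_lang[lang])

    # Drop claims that are empty in any language, then fix the order.
    kept = sorted(
        key for key in shared
        if all(claims_by_lang[lang][key].strip() for lang in langs)
    )
    return {lang: [claims_by_lang[lang][key] for key in kept] for lang in langs}


# Toy data, not taken from the real corpus.
toy_claims = {
    "en": {("P1", 1): "A compound.", ("P1", 2): "A use.", ("P2", 1): "A method."},
    "fr": {("P1", 1): "Un composé.", ("P1", 2): "  ", ("P2", 1): "Un procédé."},
    "de": {("P1", 1): "Eine Verbindung.", ("P2", 1): "Ein Verfahren."},
}
parallel = keep_trilingual(toy_claims)
```

In this toy case both patents survive, but claim 2 of P1 is dropped, which matches the "more patents, fewer claims" effect described above.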

And finally, I've also computed a high-order language model. It's 2 GB; tell me if you want it, because I have to think about where to host it. In the meantime, here is a file with the most frequent 10-grams (with numbers replaced by XX):
http://www.lsi.upc.edu/~cristinae/dwnld/10grams.en.csv

and the same for 5-grams
http://www.lsi.upc.edu/~cristinae/dwnld/5grams.en.csv
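For anyone who wants to reproduce this kind of list on their own text, here is a minimal sketch of counting the most frequent n-grams with numbers replaced by XX. It assumes whitespace tokenisation and a simple regular expression for numbers; both are assumptions for the example, and the files above were not necessarily produced this way. The names `mask_numbers` and `count_ngrams` are made up for the sketch.

```python
import re
from collections import Counter

# Integers and decimals such as 10, 2.5 or 1,000 (an assumed definition).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def mask_numbers(line):
    """Replace every number in the line with the placeholder XX."""
    return NUMBER.sub("XX", line)


def count_ngrams(lines, n):
    """Count n-grams of whitespace-separated tokens, one line at a time."""
    counts = Counter()
    for line in lines:
        tokens = mask_numbers(line).split()
        # n-grams never cross line boundaries in this sketch.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy sentences, not taken from the real corpus.
sample = [
    "a dose of 10 mg per day",
    "a dose of 2.5 mg per day",
]
fivegrams = count_ngrams(sample, 5)
most_frequent = fivegrams.most_common(3)
```

Masking the numbers first is what lets the two toy sentences collapse onto the same 5-grams; without it, each dosage value would produce its own low-count entry.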

All this is for English, but I can generate the same for French and German, and for whatever n-gram order you prefer!

Olga, I don't know whether Sebastià got the Maths corpus, but if the information in the language model is useful for building the grammars, we can see whether we can use it there as well.
