Parallel corpus compilation in Patents domain
ID:
5.1
Workpackage:
Statistical and Robust Translation
Task leader:
cristina.españa
Assignees:
cristina.españa
Relevant Deliverables:
Description of the final collection of corpora
Dependencies:
Patent Corpora
Status:
Ongoing
Timeframe:
Sep 2010 - Dec 2010
Compilation of a parallel in-domain corpus. It will be built from the set of Patents provided by WP7. The task also includes defining development and test sets drawn from the corpus.
M16. UPC. (Corresponding to D51, M18)
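As an illustration of how development and test sets could be drawn from the corpus, here is a minimal sketch that holds out random aligned segments. It is not the procedure actually used in this task: the function name `split_parallel`, the set sizes, and the fixed seed are all hypothetical, and the corpus is assumed to be available as a list of aligned tuples, one per segment.

```python
import random


def split_parallel(segments, dev_size, test_size, seed=0):
    """Split aligned segments into train, dev and test sets.

    `segments` is a list of tuples, one per aligned segment, for
    example (english, french, german). Splitting by index keeps the
    languages of each segment together.
    """
    if dev_size + test_size > len(segments):
        raise ValueError("dev and test sets are larger than the corpus")

    # Shuffle the indices with a fixed seed so the split is reproducible.
    indices = list(range(len(segments)))
    random.Random(seed).shuffle(indices)
    dev_idx = set(indices[:dev_size])
    test_idx = set(indices[dev_size:dev_size + test_size])

    train, dev, test = [], [], []
    for i, segment in enumerate(segments):
        if i in dev_idx:
            dev.append(segment)
        elif i in test_idx:
            test.append(segment)
        else:
            train.append(segment)
    return train, dev, test
```

Everything that is not held out stays in the training set, in its original order. A real split would probably also need to keep all claims of one patent in the same set, which this sketch does not do.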
Comments
A first corpus is available
A part of the MAREC corpus has been obtained from IRF under a personal research license. The data corresponds to the corpus given to participants of CLEF-IP '10 (http://www.ir-facility.org/research/evaluation/clef-ip-10).
Some details of the ongoing work:
I've been gathering the Patents corpus, and it may be useful for you to have more examples. I have several versions now:
Here are the claims for Patents in a simple XML format. All of them are classified as A61P.
http://www.lsi.upc.edu/~cristinae/dwnld/patents000000.en
A clean version in raw text format (this file covers all the patents, so more than the XML file, but it has fewer claims because only those well aligned in 3 languages are kept)
http://www.lsi.upc.edu/~cristinae/dwnld/patsA61P.train.en
And finally, I've also computed a high-order language model. It's 2GB, so tell me if you want it, because I have to think about where to host it. In the meantime, here is a file where you can see the most frequent 10-grams (with numbers substituted by XX)
http://www.lsi.upc.edu/~cristinae/dwnld/10grams.en.csv
and the same for 5-grams
http://www.lsi.upc.edu/~cristinae/dwnld/5grams.en.csv
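For anyone who wants to produce a similar frequency list from the raw text file, here is a minimal sketch. It is not the script used to build the files above: the whitespace tokenization, the number pattern, and the function names `normalize` and `count_ngrams` are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: a "number" is a run of digits, optionally with internal
# dots or commas (e.g. 10, 1.5, 2,000). Each one is replaced by XX.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def normalize(line):
    """Replace numbers by XX and split the line on whitespace."""
    return NUMBER.sub("XX", line).split()


def count_ngrams(lines, n):
    """Count all n-grams of order n, line by line."""
    counts = Counter()
    for line in lines:
        tokens = normalize(line)
        # An n-gram never crosses a line boundary.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

Calling `count_ngrams(lines, 10).most_common(100)` on the lines of the training file would then give the 100 most frequent 10-grams. Substituting the numbers first means that claims differing only in a dosage or a claim number are counted as the same n-gram.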
All this is for English, but I can generate the same for French and German, and for any n-gram order you prefer!
Olga, I don't know whether Sebastià got the Maths corpus, but if the information in the language model is useful for building the grammars, we can see whether it can be used there too.