Patents

MOLTO's Open Day
Barcelona, 23rd May 2013


Goals



Extension of MOLTO's MT approach to deal with non-restricted language


Exploration and development of translation engines specialised for patents


Patents MT&IR prototype, translation tools





Division of work




[Figure: relations]


WP5, Statistical and Robust MT


By GF's robustness

  • Robust parsing


By hybrid SMT-GF techniques

  • GF grammar for patents
  • SMT trained on patents
  • SMT decoding on top


Translation Engines


Translation Engines



[Figure: systems]


SMT for biomedical patents


Standard SMT system with

  • Corpus: MAREC corpus (biomedical domain, EN-DE-FR)
    • In-domain tokeniser
  • Language model: 5-gram interpolated Kneser-Ney discounting, SRILM Toolkit
  • Alignments: GIZA++ Toolkit
  • Translation model: Moses package
  • Weights optimization: MERT against BLEU
  • Decoder: Moses
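
The tuning criterion named above, BLEU, can be illustrated with a minimal sketch. It assumes a single reference, whitespace tokenisation, no smoothing and sentence-level scoring, whereas MERT in Moses optimises a corpus-level score. The function names are illustrative and are not part of Moses or SRILM.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clipped matches: an n-gram counts at most as often as in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precision += math.log(matched / total) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hyp) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)
```

Without smoothing, any sentence that has no matching 4-gram scores 0, which is one reason sentence-level BLEU is a rough guide only.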


GF for biomedical patents






1. Lexicon




2. Grammar


GF for biomedical patents

[Figure: chunking]


Domain-specific grammar


Built on the Resource Grammar


Extensions

  • 4 categories for chunk parsing and 3 for agreement
  • 8 functions that deal with agreement and chunk parsing
  • 9 extensions to the resource grammar dealing with general-purpose constructions
  • 13 constructions to deal with claim-specific data
  • 8 structural words typical of the patent domain


Domain-specific base lexicons


Core lexicon

  • Most frequent words, SMT translation, one-to-one correspondence


Static lexicon

  • SMT lexical/translation tables, one-to-many correspondence


Dictionary of compounds (German)

  • SMT tables, dictionary, many-to-one correspondence


Domain-specific runtime lexicons


Safe building

  • Unknown lexical items are looked up in the monolingual dictionaries
  • Two levels of confidence: unambiguous vs. ambiguous


Unsafe building

  • Uses smart paradigms to build items absent in the dictionary
  • Three levels of confidence: unambiguous, ambiguous and unsafe
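
The two building modes can be sketched as follows. This is a toy illustration, not the MOLTO implementation: `classify_unknown` and `guess_paradigm` are hypothetical names, the dictionary is a plain Python mapping, and the suffix rules only stand in for GF's smart paradigms.

```python
def guess_paradigm(word):
    """Toy stand-in for a smart paradigm: guess a category from the suffix."""
    if word.endswith("ly"):
        return (word, "Adv")
    if word.endswith("ed") or word.endswith("ing"):
        return (word, "V")
    return (word, "N")


def classify_unknown(word, dictionary, allow_unsafe=False):
    """Return (analyses, confidence) for a word missing from the base lexicon.

    dictionary maps a word to its list of (lemma, category) analyses.
    Safe building only uses the dictionary; unsafe building also guesses.
    """
    analyses = dictionary.get(word, [])
    if len(analyses) == 1:
        return analyses, "unambiguous"
    if len(analyses) > 1:
        return analyses, "ambiguous"
    if allow_unsafe:
        return [guess_paradigm(word)], "unsafe"
    return [], None


# Made-up dictionary entries, for illustration only.
toy_dictionary = {
    "inhibitor": [("inhibitor", "N")],
    "compound": [("compound", "N"), ("compound", "V")],
}
print(classify_unknown("compound", toy_dictionary))
print(classify_unknown("pyrimidinyl", toy_dictionary, allow_unsafe=True))
```

In safe mode a word absent from the dictionary stays unknown, while unsafe mode always produces an entry at the lowest confidence level.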


Further hybridisation


SMT & GF integration led by GF

  • GF grammar with SMT built lexicon and disambiguation by frequency counts
  • Robust parsing with statistical models for searching the space and for disambiguation
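
Disambiguation by frequency counts can be sketched as below, assuming the SMT alignments are available as a list of (source word, target word) pairs. The function name and the sample pairs are made up for illustration.

```python
from collections import Counter


def build_lexicon(aligned_pairs):
    """Keep, for each source word, its most frequently aligned target word."""
    counts = {}
    for source, target in aligned_pairs:
        counts.setdefault(source, Counter())[target] += 1
    return {source: c.most_common(1)[0][0] for source, c in counts.items()}


# Made-up alignment pairs, for illustration only.
pairs = [("acid", "acide"), ("acid", "acide"), ("acid", "aigre"), ("salt", "sel")]
print(build_lexicon(pairs))
```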


SMT & GF integration led by SMT

  • Additional SMT decoding on top of GF and SMT to choose the best translation options


SMT & GF integration led by SMT


Hard Integration

  • GF phrases are forced to appear
  • SMT complements
  • Top SMT reorders


Soft Integration

  • GF and SMT phrases interact
  • Top SMT reorders and chooses the best option
  • LM plays an important role in choosing
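
The difference between the two modes can be sketched on a toy example. The sketch assumes the source sentence is already cut into segments, that each segment has candidate phrases tagged with their origin, and that a single score stands in for the decoder's model scores (including the LM). It does no reordering, and `choose_phrases` is a hypothetical name.

```python
def choose_phrases(options, mode="soft"):
    """Pick one phrase per source segment.

    options: one list per segment of (origin, target, score) tuples,
             where origin is "GF" or "SMT".
    mode:    "hard" forces a GF phrase wherever GF offers one;
             "soft" lets GF and SMT phrases compete on score.
    """
    chosen = []
    for candidates in options:
        if mode == "hard":
            gf_only = [c for c in candidates if c[0] == "GF"]
            if gf_only:
                candidates = gf_only  # SMT only fills segments GF left open
        chosen.append(max(candidates, key=lambda c: c[2]))
    return chosen


# Made-up candidates and scores, for illustration only.
options = [
    [("GF", "la composition", 0.4), ("SMT", "la composition pharmaceutique", 0.7)],
    [("SMT", "selon la revendication 1", 0.6)],
]
print([c[0] for c in choose_phrases(options, mode="hard")])
print([c[0] for c in choose_phrases(options, mode="soft")])
```

With these scores the hard mode keeps the GF phrase for the first segment, while the soft mode prefers the higher-scoring SMT phrase.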


Hybrid system


Characteristics and options


  • Static vs. dynamic lexicon (two types)
  • Base vs. extended lexicons
  • Single vs. multiple GF translations available
  • Hard vs. soft integration
  • Integration at decoding time vs. tuning


Hybrid system


Number of phrases from each system chosen in the final translation


[Figure: origin of the chosen phrases]


Hybrid system

Automatic evaluation, En2Fr

[Figure: automatic evaluation results, En2Fr]


Hybrid system

Automatic evaluation, En2De

[Figure: automatic evaluation results, En2De]


Hybrid system

Manual evaluation, En2Fr & En2De


TAUS scale evaluation

[4, 3, 2, 1]



[Figure: manual evaluation results]


Hybrid system

Manual evaluation, En2Fr & En2De



Some improvements remain to be made after the manual evaluation. Still,

  • Exact matches with the reference translation were sometimes ranked as 1 or 2
  • Even the published human translations were evaluated as unpublishable
  • In some cases, sentences with very low BLEU scores got a 4


Patent translator usage


Patent translator usage




1. One-click system




2. Web application




3. Offline translation in the retrieval system


WP5, One-click System



Perl script that runs the translator



 molto-server:hybrid$ perl H1PTrad.pl

 Usage: perl H1PTrad.pl -v # -m [runtime|unsafe|demo] <input> [src2trg] 
 -v: verbosity [0,1,2]
 -m: mode [runtime|unsafe|demo]
 input: file to translate
 src2trg: language pair

 Ex: 
 perl H1PTrad.pl -v 1 -m demo /Users/systems/input/patsA61P.test.en en2fr


WP3, Simple Translation Tool


[Figure: Simple Translation Tool in the cloud]


WP7, Patent translation & retrieval


Prototype available in EN, DE and FR

[Figure: prototype]

http://molto-patents.ontotext.com         


WP7, Patent translation & retrieval


Architecture

  • SMT-based pipeline for automatic translation of annotated documents

  • Multilingual document retrieval (query flagship)

  • GF-based querying subsystem for automatic translation of CNL queries to SPARQL (query flagship)

  • User Interface (query flagship)


Prototype Architecture


[Figure: retrieval prototype architecture]


Semantic annotations




[Figure: annotations]


Offline translation of the full dataset



[Figure: translation]



Online API @ http://falkor.lsi.upc.edu/MOLTO/


Retrieved translated patent

English text

[Figure: user interface, English]

French text

[Figure: user interface, French]


Related publications


Publications I

Conferences & Workshops

SUBMITTED

Hybrid Translation for European Biomedical Patents (submitted), Cristina España-Bonet, Ramona Enache, Aarne Ranta, Lluís Màrquez. 5th Workshop on Patent Translation, Machine Translation Summit XIV (MTSummit 2013), 2-6 September 2013, Nice, France

MT Techniques in a Retrieval System of Semantically Enriched Patents (submitted), Meritxell Gonzàlez, Maria Mateva, Ramona Enache, Cristina España-Bonet, Lluís Màrquez. Machine Translation Summit XIV (MTSummit 2013), 2-6 September 2013, Nice, France


Publications II

Conferences & Workshops (published)

PUBLISHED

CNLs for multilingual queries in MOLTO, Olga Caprotti, Milen Chechev, Ramona Enache, Meritxell Gonzàlez, Aarne Ranta, Jordi Saludes. Third Workshop on Controlled Natural Language (CNL 2012), 29–31 August 2012, Zurich, Switzerland

A Hybrid System for Patent Translation, Ramona Enache, Cristina España-Bonet, Aarne Ranta, Lluís Màrquez. Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT12), Trento, Italy, May 28-30, 2012

The Patents Retrieval Prototype in the MOLTO project, Milen Chechev, Meritxell Gonzàlez, Lluís Màrquez, Cristina España-Bonet. World Wide Web Conference 2012, 16th-20th April 2012, Lyon, France

Patent translation within the MOLTO project, Cristina España-Bonet, Ramona Enache, Adam Slaski, Aarne Ranta, Lluís Màrquez, Meritxell Gonzàlez. Proceedings of the 4th Workshop on Patent Translation, MT Summit XIII, Xiamen, China, September 23, 2011.


Publications III

Reports

MOLTO-Patents: recent issues, solutions and perspectives. Laura Tolosi, Maria Mateva. Ontotext AD, September 2012, Bulgaria

WP7 Semantic Infrastructure & Prototype Building. Milen Chechev. Ontotext AD, June 2011, Bulgaria

Towards a RB-SMT Hybrid System for Translating Patent Claims - Results and Perspectives. Ramona Enache and Adam Slaski. University of Gothenburg, May 2011, Internal Report.

Theses

Ramona Enache (PhD, Forthcoming). Frontiers of Multilingual Grammar Development (prelim.). Göteborg : University of Gothenburg.

Ramona Enache (Licentiate, 2011). Automating the development of multilingual grammars. Göteborg : University of Gothenburg.



Extra Slides


Unsafe lexicon building

[Figure: unsafe lexicon building]


German compounds lexicon


nucleotide sequence -> Nucleotidsequenz



Word-to-word GIZA alignments are not enough


Solution adopted:

  • Split compounds, word-to-word mapping, join afterwards
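
The adopted solution can be sketched on the slide's own example. This is a toy version under stated assumptions: the splitter is a greedy longest-match over a known vocabulary (no backtracking, no linking elements such as "-s-"), and the vocabulary and word map are hand-written here rather than learned from alignments. The function names are hypothetical.

```python
def split_compound(word, vocabulary):
    """Greedy longest-match split of a compound into known parts.

    Returns the lowercased word unsplit if no full segmentation is found.
    """
    word = word.lower()
    parts, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocabulary:
                parts.append(word[start:end])
                start = end
                break
        else:
            return [word]  # some stretch of the word is not in the vocabulary
    return parts


def translate_compound(english_phrase, word_map):
    """Map each English word to a German part, then join the parts."""
    parts = [word_map[w] for w in english_phrase.lower().split()]
    return "".join(parts).capitalize()


# Hand-written toy resources, for illustration only.
vocabulary = {"nucleotid", "sequenz"}
word_map = {"nucleotide": "nucleotid", "sequence": "sequenz"}
print(split_compound("Nucleotidsequenz", vocabulary))
print(translate_compound("nucleotide sequence", word_map))
```

Splitting first gives the aligner two German tokens to pair with the two English words; joining afterwards restores the single compound.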
