Patents

MOLTO's Open Day
Barcelona, 23rd May 2013

Goals



Extension of MOLTO's MT approach to deal with non-restricted language


Exploration and development of translation engines specialised for patents


Patent MT & IR prototype and translation tools




Division of work




relations

WP5, Statistical and Robust MT


By GF's robustness

  • Robust parsing


By hybrid SMT-GF techniques

  • GF grammar for patents
  • SMT trained on patents
  • SMT decoding on top

Translation Engines



systems

SMT for biomedical patents


Standard SMT system with

  • Corpus: MAREC corpus (biomedical domain, EN-DE-FR)
    • In-domain tokeniser
  • Language model: 5-gram with interpolated Kneser-Ney discounting, SRILM Toolkit
  • Alignments: GIZA++ Toolkit
  • Translation model: Moses package
  • Weight optimization: MERT against BLEU
  • Decoder: Moses
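
Since MERT tunes the weights against BLEU, a minimal sketch of the metric may help. This is an illustrative, unsmoothed, single-reference, sentence-level version written for these notes. It is not the scoring script used in the project, and the function names are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Unsmoothed single-reference BLEU for one tokenised sentence pair."""
    if not hypothesis:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precision += math.log(matches / total) / max_n
    # Brevity penalty: penalise hypotheses shorter than the reference.
    if len(hypothesis) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(hypothesis))
    return penalty * math.exp(log_precision)
```

In practice BLEU is computed over a whole test set rather than per sentence. The sentence-level form is shown only because it is short.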

GF for biomedical patents






1. Lexicon




2. Grammar

GF for biomedical patents

chunking

Domain-specific grammar


Built on the Resource Grammar


Extensions

  • 4 categories for chunk parsing and 3 for agreement
  • 8 functions that deal with agreement and chunk parsing
  • 9 extensions to the resource grammar dealing with general-purpose constructions
  • 13 constructions to deal with claim-specific data
  • 8 structural words typical of the patent domain

Domain-specific base lexicons


Core lexicon

  • Most frequent words, SMT translation, one-to-one correspondence


Static lexicon

  • SMT lexical/translation tables, one-to-many correspondence


Dictionary of compounds (German)

  • SMT tables, dictionary, many-to-one correspondence
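
The difference between the core (one-to-one) and static (one-to-many) lexicons can be sketched as follows. The table entries, frequencies, thresholds and function names are all hypothetical. They stand in for real SMT lexical tables and do not reproduce the project's extraction scripts.

```python
from collections import defaultdict

# Hypothetical (source, target, probability) rows standing in for an SMT lexical table.
LEX_TABLE = [
    ("compound", "Verbindung", 0.82),
    ("compound", "Zusammensetzung", 0.11),
    ("salt", "Salz", 0.95),
    ("use", "Verwendung", 0.60),
    ("use", "Gebrauch", 0.30),
]
# Hypothetical source-side corpus frequencies.
FREQ = {"compound": 1200, "use": 900, "salt": 40}


def static_lexicon(table, min_prob=0.1):
    """One-to-many: keep every translation above a probability threshold."""
    options = defaultdict(list)
    for src, trg, prob in table:
        if prob >= min_prob:
            options[src].append((trg, prob))
    # Order the translations of each word from most to least probable.
    return {
        src: [trg for trg, _ in sorted(opts, key=lambda o: -o[1])]
        for src, opts in options.items()
    }


def core_lexicon(table, freq, top_k=2):
    """One-to-one: best translation only, for the top_k most frequent words."""
    static = static_lexicon(table, min_prob=0.0)
    frequent = sorted(freq, key=freq.get, reverse=True)[:top_k]
    return {word: static[word][0] for word in frequent if word in static}
```

With the toy data, `core_lexicon(LEX_TABLE, FREQ)` keeps a single translation for "compound" and "use", while `static_lexicon(LEX_TABLE)` keeps both candidates for each.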

Domain-specific runtime lexicons


Safe building

  • Unknown lexical items are looked up in the monolingual dictionaries
  • Two levels of confidence: unambiguous vs. ambiguous


Unsafe building

  • Uses smart paradigms to build items absent in the dictionary
  • Three levels of confidence: unambiguous, ambiguous and unsafe
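
A minimal sketch of the two building modes and their confidence levels. The dictionary contents and the `guess_category` heuristic are hypothetical stand-ins: a real smart paradigm in GF infers a full lexical entry from the word form, which is not modelled here.

```python
from enum import Enum


class Confidence(Enum):
    UNAMBIGUOUS = "unambiguous"
    AMBIGUOUS = "ambiguous"
    UNSAFE = "unsafe"


# Hypothetical monolingual dictionary: word -> possible lexical categories.
DICTIONARY = {"tablet": ["N"], "coating": ["N", "V"]}


def guess_category(word):
    """Crude stand-in for a smart paradigm: guess a category from the suffix."""
    return "V" if word.endswith(("ise", "ize")) else "N"


def build_entry(word, allow_unsafe=False):
    """Return (categories, confidence) for an unknown item, or None if not built."""
    categories = DICTIONARY.get(word)
    if categories:
        # Safe building: the item is attested in the dictionary.
        if len(categories) == 1:
            return categories, Confidence.UNAMBIGUOUS
        return categories, Confidence.AMBIGUOUS
    if allow_unsafe:
        # Unsafe building: the item is absent, so its entry is guessed.
        return [guess_category(word)], Confidence.UNSAFE
    return None
```

Safe building corresponds to `allow_unsafe=False`, which leaves items missing from the dictionary unbuilt. Unsafe building adds the third confidence level for guessed entries.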

Further hybridisation


SMT & GF integration led by GF

  • GF grammar with SMT-built lexicon and disambiguation by frequency counts
  • Robust parsing with statistical models for searching the space and for disambiguation


SMT & GF integration led by SMT

  • Additional SMT decoding on top of GF and SMT to choose the best translation options

SMT & GF integration led by SMT


Hard Integration

  • GF phrases are forced to appear
  • SMT complements
  • Top SMT reorders


Soft Integration

  • GF and SMT phrases interact
  • Top SMT reorders and chooses the best option
  • LM plays an important role in choosing
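
The contrast between hard and soft integration can be illustrated with a toy selection step. Everything here is an assumption made for illustration: the segments, the candidate phrases and the `LM_SCORE` table stand in for real GF output, SMT phrase options and a language model. Reordering by the top SMT decoder is left out.

```python
# Hypothetical language-model scores (log-probability-like, higher is better).
LM_SCORE = {
    "die Verbindung": -1.0,
    "die Zusammensetzung": -2.5,
    "nach": -0.5,
    "laut": -1.2,
}

# Hypothetical candidates per source chunk, from GF and from SMT.
SEGMENTS = [
    {"gf": ["die Zusammensetzung"], "smt": ["die Verbindung"]},
    {"gf": [], "smt": ["nach", "laut"]},
]


def lm_score(phrase):
    """Score a phrase, with a low default for unseen phrases."""
    return LM_SCORE.get(phrase, -10.0)


def integrate(segments, mode):
    """Pick one phrase per chunk under 'hard' or 'soft' integration."""
    output = []
    for seg in segments:
        if mode == "hard" and seg["gf"]:
            candidates = seg["gf"]               # GF phrases are forced to appear
        elif mode == "hard":
            candidates = seg["smt"]              # SMT complements where GF is silent
        else:
            candidates = seg["gf"] + seg["smt"]  # soft: both systems compete
        if candidates:
            output.append(max(candidates, key=lm_score))
    return " ".join(output)
```

With the toy data, hard integration keeps the GF phrase for the first chunk even though the language model scores it lower. Soft integration lets the higher-scoring SMT phrase win.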

Hybrid system


Characteristics and options


  • Static vs. dynamic lexicon (two types)
  • Base vs. extended lexicons
  • Single vs. multiple GF translations available
  • Hard vs. soft integration
  • Integration at decoding time vs. tuning

Hybrid system


Number of phrases from each system chosen for the final translation


origin

Hybrid system

Automatic evaluation, En2Fr

evalFr

Hybrid system

Automatic evaluation, En2De

evalDe

Hybrid system

Manual evaluation, En2Fr & En2De


TAUS scale evaluation

[4, 3, 2, 1]



manualEval

Hybrid system

Manual evaluation, En2Fr & En2De



The manual evaluation shows that some improvements are still needed. Even so:

  • Exact matches with the reference translation were sometimes ranked 1 or 2
  • Even the published human translations were rated as unpublishable
  • In some cases, sentences with very low BLEU scores were ranked 4

Patent translator usage




1. One-click system




2. Web application




3. Offline translation in the retrieval system

WP5, One-click System



Perl script that runs the translator



 molto-server:hybrid$ perl H1PTrad.pl

 Usage: perl H1PTrad.pl -v # -m [runtime|unsafe|demo] <input> [src2trg] 
 -v: verbosity [0,1,2]
 -m: mode [runtime|unsafe|demo]
 input: file to translate
 src2trg: language pair

 Ex: 
 perl H1PTrad.pl -v 1 -m demo /Users/systems/input/patsA61P.test.en en2fr
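
For batch use, the call above could be wrapped in a small helper. This is a sketch that only assembles the argument list from the usage message shown. `build_command` and its defaults are hypothetical and are not part of the distributed script.

```python
MODES = ("runtime", "unsafe", "demo")
VERBOSITY_LEVELS = (0, 1, 2)


def build_command(input_path, mode="demo", verbosity=1, pair=None,
                  script="H1PTrad.pl"):
    """Assemble the argument list for one H1PTrad.pl run."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    if verbosity not in VERBOSITY_LEVELS:
        raise ValueError(f"verbosity must be one of {VERBOSITY_LEVELS}")
    command = ["perl", script, "-v", str(verbosity), "-m", mode, input_path]
    if pair is not None:
        command.append(pair)  # language pair is optional, e.g. "en2fr"
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where the script and its models are installed.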

WP3, Simple Translation Tool


inTheCloud

WP7, Patent translation & retrieval


Prototype available in EN, DE and FR

prototype

http://molto-patents.ontotext.com         

WP7, Patent translation & retrieval


Architecture

  • SMT-based pipeline for automatic translation of annotated documents

  • Multilingual document retrieval (query flagship)

  • GF-based querying subsystem for automatic translation of CNL queries to SPARQL (query flagship)

  • User Interface (query flagship)

Prototype Architecture