Patent Translation

Patent translation

Goals

Explore and build translation engines specialised for patent translation

Integrate the translations into the patent retrieval system

Division of work

relacions

Patents

Patent documents

Meta-information

IPC classification A61P

Specific therapeutic activity of chemical compounds or medical preparations.

IPC

Patent documents

Text

Abstracts and claims

claims

Patent documents

Language

Claims are written in a legalistic style, with a highly specific chemical vocabulary full of compound names.

The use according to claim 7, wherein said cancer diseases comprise bladder, lung, mamma, melanoma and prostate carcinomas.
A compound according to claim 1 wherein it is (2S)-2-[(4S)-4-(2,2-difluorovinyl)-2-oxopyrrolidinyl]butanamide.
The pharmaceutical composition according to claim 1 or 2, wherein said platinum anticancer agent is selected from at least one of the complexes having structures of: ** IMAGE **.

Corpus

Table here

Pre-process

Different processes for the translation engine and the retrieval system

Common step: tokenising
Main difference: mark-up and semantic annotations

Tokenisation

Diagram + example

8-difluoro-2- [ 3-fluoro-4 - [ ( L-lysyl ) amino ] phenyl ] -7-methyl-4H-1-benzopyran-4-one

vs.

8-difluoro-2-[3-fluoro-4-[(L-lysyl)amino]phenyl]-7-methyl-4H-1-benzopyran-4-one
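
The contrast above can be illustrated with a minimal sketch of a tokeniser that keeps compound names whole. The heuristic used to detect a compound (a hyphen, bracket or comma sitting between alphanumeric characters) and the function name `tokenise` are assumptions made for this example, not the rule used in the project.

```python
import re

# Assumed heuristic: a chunk looks like a compound name when a run of
# hyphens, brackets or commas sits between two alphanumeric characters.
COMPOUND_HINT = re.compile(r"[A-Za-z0-9][\-\[\](),]+[A-Za-z0-9]")
# Separates sentence punctuation glued to the end of a chunk.
TRAILING_PUNCT = re.compile(r"^(.*?)([.,;:]*)$")


def tokenise(text):
    """Split text into tokens, leaving compound-like chunks untouched."""
    tokens = []
    for chunk in text.split():
        core, trailing = TRAILING_PUNCT.match(chunk).groups()
        if COMPOUND_HINT.search(core):
            tokens.append(core)  # keep the compound name as one token
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", core))
        tokens.extend(trailing)  # each trailing punctuation mark is a token
    return tokens


compound = ("8-difluoro-2-[3-fluoro-4-[(L-lysyl)amino]phenyl]"
            "-7-methyl-4H-1-benzopyran-4-one")
print(tokenise("A compound according to claim 1 , namely " + compound + "."))
```

With this heuristic ordinary hyphenated words are also kept whole, which is harmless for tokenisation but would need refining in a real system.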

Translation engines

Engines

Plot with SMT, GF => HYBRID

SMT for biomedical patents

Standard SMT system with

Corpus: pre-processed corpus
Language model: 5-gram interpolated Kneser-Ney discounting, SRILM Toolkit
Alignments: GIZA++ Toolkit
Translation model: Moses package
Weight optimisation: MERT against BLEU
Decoder: Moses

Evaluation in the biomedical domain

Syntactic metrics for MT evaluation

MALT dependency parser for English and French
Berkeley parser for German
Similarity is computed as the overlap between the linguistic elements of the reference and those of the candidate.
Linguistic elements can be either lexical items or parser output, such as part-of-speech tags and phrase constituents.
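
As an illustration of the overlap idea, the sketch below compares two bags of linguistic elements (here part-of-speech tags). The formula, shared counts divided by the maximum counts, is one common formulation and is an assumption; the exact definition used in the evaluation is not given here.

```python
from collections import Counter


def overlap(candidate, reference):
    """Overlap between two bags of linguistic elements, in [0, 1]."""
    cand, ref = Counter(candidate), Counter(reference)
    items = set(cand) | set(ref)
    shared = sum(min(cand[i], ref[i]) for i in items)
    total = sum(max(cand[i], ref[i]) for i in items)
    return shared / total if total else 0.0


# Toy part-of-speech sequences; in practice they would come from a parser.
candidate_tags = ["DT", "NN", "VBZ", "DT", "NN"]
reference_tags = ["DT", "NN", "VBZ", "JJ", "NN"]
print(overlap(candidate_tags, reference_tags))
```

The same function works unchanged on lexical items or on constituent labels, since it only needs lists of hashable elements.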

SMT, automatic evaluation

En2Fr & En2De results

Also other language pairs?

GF for biomedical patents

Lexicon
Grammar

Translation by chunking

Methodology

chunking

Lexicon building

Methodology

lexicon

Lexicon building

German lexicon

nucleotide sequence -> Nucleotidsequenz

Word-to-word GIZA++ alignments are not enough

Solution adopted:

Split compounds, word-to-word mapping, join afterwards
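
The split, map and join steps can be sketched as follows. The vocabulary of compound parts and the word pairs are toy data standing in for the real resources (for instance the GIZA++ word alignments), and the monotone one-to-one mapping between English words and German parts is a simplifying assumption of this example.

```python
def split_compound(word, vocabulary):
    """Split a compound into known parts, longest first; None if impossible."""
    lowered = word.lower()
    if not lowered:
        return []
    for end in range(len(lowered), 0, -1):
        head = lowered[:end]
        if head in vocabulary:
            rest = split_compound(lowered[end:], vocabulary)
            if rest is not None:
                return [head] + rest
    return None


def build_entry(english_phrase, german_compound, word_pairs, vocabulary):
    """Return an (English, German) lexicon entry if the parts map word to word."""
    parts = split_compound(german_compound, vocabulary)
    english = english_phrase.lower().split()
    if parts is None or len(parts) != len(english):
        return None
    if all(word_pairs.get(e) == p for e, p in zip(english, parts)):
        # Join the parts again to recover the compound for the lexicon.
        return (english_phrase, "".join(parts).capitalize())
    return None


vocabulary = {"nucleotid", "sequenz"}                               # toy data
word_pairs = {"nucleotide": "nucleotid", "sequence": "sequenz"}     # toy data
print(build_entry("nucleotide sequence", "Nucleotidsequenz", word_pairs, vocabulary))
```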

Lexicon building

Static vs. Runtime lexicons

RAMONA something here, please

Construction?

Lexicons for French and German

Sizes and sources for static, safe, unsafe, parse, noparse

RAMONA, please

French concrete grammar

Specific issues

NPs and AdvP are mapped into GF categories and linearised
VP, RelP and AdjP are linked to a NP in order to be linearised
Disambiguation of multiple linearisations by frequency counts in the corpus
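
A minimal sketch of the frequency-based disambiguation: among several linearisations of the same chunk, keep the one seen most often in the target side of the corpus. The helper names and the toy sentences are assumptions for illustration only.

```python
def count_phrases(candidates, target_sentences):
    """Count how often each candidate occurs in the target-side sentences."""
    return {c: sum(s.count(c) for s in target_sentences) for c in candidates}


def pick_linearisation(candidates, corpus_counts):
    """Return the most frequent candidate; ties keep the first one listed."""
    if not candidates:
        raise ValueError("no linearisation to choose from")
    return max(candidates, key=lambda c: corpus_counts.get(c, 0))


# Toy target-side corpus and two competing linearisations.
corpus = [
    "une composition pharmaceutique comprenant",
    "la composition pharmaceutique selon",
    "composition de pharmacie",
]
candidates = ["composition de pharmacie", "composition pharmaceutique"]
counts = count_phrases(candidates, corpus)
print(pick_linearisation(candidates, counts))
```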

French concrete grammar

Table with % of chunks translated

I need to choose only the representative systems (3?)

German concrete grammar

Specific issues

Nominalisation

immunising the mouse -> das Immunisieren von der Maus

Gerund translated into infinitive + preposition (+ article)

Relative clauses

Pharmaceutical composition comprising an aqueous solution

Gerund and participle clauses are not common in German

They are replaced by a relative clause during chunking

German concrete grammar

Table with % of chunks translated

As before I need to choose only the representative systems (3?)

GF, automatic evaluation

Evaluation with lexical and syntactic metrics

1008 fragments from the MAREC test set

...

GF, automatic evaluation for En2Fr

GFEn2FR

GF, automatic evaluation for En2De

GFEn2De

GF, robust parsing with patents

Robust parsing applied to patents

Pre-process and cleaning

From

The use of claim 23 , wherein the amount of said composition is from 100 mg to 800 mg of ibuprofen .

To

the use of claim 2 3 wherein the amount of said composition is from 1 0 0 mg to 8 0 0 mg of ibuprofen

Pre-process necessary for parsing
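
The From/To pair above suggests three operations: lowercasing, dropping punctuation tokens, and splitting numbers into single digits. The sketch below reproduces that example under these assumptions; the real cleaning step may do more. The length check reflects the limit of fewer than 26 tokens, assumed here to apply to the cleaned text.

```python
import re


def clean_for_parsing(sentence):
    """Lowercase, drop punctuation-only tokens and split numbers into digits."""
    tokens = []
    for token in sentence.lower().split():
        if token.isdigit():
            tokens.extend(token)          # "23" becomes "2", "3"
        elif re.fullmatch(r"\W+", token):
            continue                      # punctuation-only token is dropped
        else:
            tokens.append(token)
    return " ".join(tokens)


def fits_parser(cleaned, max_tokens=25):
    """True if the cleaned fragment is short enough for the parser."""
    return len(cleaned.split()) <= max_tokens


source = ("The use of claim 23 , wherein the amount of said composition "
          "is from 100 mg to 800 mg of ibuprofen .")
cleaned = clean_for_parsing(source)
print(cleaned)
print(fits_parser(cleaned))
```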

GF, robust parsing with patents

Parsing

With C Runtime, parseEng, DictEng, ExtraLex
Advantages: robustness
Disadvantages: cleaning and length (<26 tokens)

Linearisation

With parseGer, DictGer, ExtraLexGer

Use of generic resources (parseEng, DictEng, parseGer, DictGer) and domain lexicons (ExtraLex, ExtraLexGer)

GF, robust parsing evaluation

Experiment

MAREC test set, 1008 fragments

Cleaning: 537 fragments
Properly linearised: 98 fragments
Evaluation with lexical and syntactic metrics
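
To put the counts in proportion, the following sketch computes the corresponding rates. It assumes that 537 is the number of fragments remaining after cleaning and that the 98 properly linearised fragments are a subset of those; the slide does not state this explicitly.

```python
def rate(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


total, cleaned, linearised = 1008, 537, 98

print(f"cleaned / total:      {rate(cleaned, total):.1f}%")
print(f"linearised / cleaned: {rate(linearised, cleaned):.1f}%")
print(f"linearised / total:   {rate(linearised, total):.1f}%")
```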

GF, robust parsing evaluation

RobustEn2De

Further hybridisation

SMT & GF integration led by GF

GF grammar with SMT built lexicon and disambiguation by frequency counts
Robust parsing with statistical models for searching the space and for disambiguation

Further hybridisation

SMT & GF integration led by SMT

Additional SMT decoding on top of GF and SMT to choose the best translation options

Hard Integration -- GF phrases are forced to appear -- SMT complements -- top SMT reorders
Soft Integration -- GF and SMT phrases interact -- top SMT reorders and chooses the best option -- LM plays an important role in choosing
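
The difference between the two strategies can be shown with a toy selection step. Each source span has translation options from GF and from SMT with a score that stands in for the language model and decoder scores; the data structure, the scores and the function names are all hypothetical, and the reordering done by the top SMT decoder is not modelled.

```python
def choose(options, mode):
    """Pick one (system, phrase, score) option for a single source span."""
    if mode == "hard":
        # GF phrases are forced whenever GF offers one; SMT fills the gaps.
        gf_options = [o for o in options if o[0] == "GF"]
        pool = gf_options or options
    elif mode == "soft":
        # GF and SMT phrases compete; the best-scoring option wins.
        pool = options
    else:
        raise ValueError(f"unknown mode: {mode}")
    return max(pool, key=lambda o: o[2])


def translate(spans, mode):
    """Translate a sequence of spans by choosing one option per span."""
    return [choose(options, mode)[1] for options in spans]


spans = [
    [("GF", "composition pharmaceutique", -2.0),
     ("SMT", "la composition pharmaceutique", -1.5)],
    [("SMT", "comprenant", -0.7)],
]
print(translate(spans, "hard"))
print(translate(spans, "soft"))
```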

Further hybridisation

SMT & GF integration led by SMT

Integration only at decoding time: the integration, either soft or hard, is applied on the test set only
MERT with GF: the final decoder weights are also obtained with the integration applied on the development set

Hybrid system

Final system

Characteristics and options

static vs. dynamic lexicon (two types)
base vs. extended lexicons
single vs. multiple GF translations available
hard vs. soft integration
integration at decoding time vs. tuning

Hybrid system

Number of phrases from each system chosen in the final translation

origin

Hybrid system

Automatic evaluation En2Fr

Table with the best systems

Hybrid system

Automatic evaluation En2De

Table with the best systems

Manual evaluation

Setup

Experiment definition

JUSSI, after the evaluation

Manual evaluation

Results

Table?

JUSSI, after the evaluation

Manual evaluation

Conclusions

JUSSI, after the evaluation

Patent translator usage

One-click system
Offline translation in the retrieval system
Translation tools?
Webservice?

One-click system

Perl script that runs the translator

 csmisc14:hybrid cristina$ perl H1PTrad.pl 

 Usage: perl H1PTrad.pl -v # -m [runtime|unsafe|demo] <input> [src2trg]
 -v: verbosity [0,1,2]
 -m: mode [runtime|unsafe|demo]
 input: file to translate
 src2trg: language pair

 Ex: perl H1PTrad.pl -v 1 -m demo /Users/systems/input/patsA61P.test.en en2fr
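
For batch use, the call can be wrapped in a small helper that validates the options listed in the usage message before launching the script. This is a sketch: the wrapper and its function names are hypothetical, and it assumes `perl` and `H1PTrad.pl` are reachable from the working directory.

```python
import subprocess

MODES = ("runtime", "unsafe", "demo")


def build_command(input_file, pair, mode="demo", verbosity=1,
                  script="H1PTrad.pl"):
    """Build the argument list following the usage message of the script."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    if verbosity not in (0, 1, 2):
        raise ValueError("verbosity must be 0, 1 or 2")
    return ["perl", script, "-v", str(verbosity), "-m", mode, input_file, pair]


def run_translator(input_file, pair, **options):
    """Run the translator and raise if the script exits with an error."""
    return subprocess.run(build_command(input_file, pair, **options), check=True)


print(build_command("/Users/systems/input/patsA61P.test.en", "en2fr"))
```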

Patent translation & retrieval

Architecture

retrieval

SMT-based pipeline for automatic translation of annotated documents.
Multilingual document retrieval, discussed in the query flagship.
GF-based querying subsystem for automatic translation of CNL queries to SPARQL. Further discussed in the query flagship.
User Interface, shown as the case study in the query flagship.

Patent translation & retrieval

Dataset

7,705 documents, dated 2010 to 2012, downloaded from the EPO website
4,485 of them have claims, description and/or abstracts in English, the selected language to annotate the documents

Language	Documents	Claims	Descriptions	Abstracts
English	4,485	62,638	3,832	2,518
German	2,047	32,007	192	80
French	2,011	31,487	130	44

Patent translation & retrieval

Semantic annotations and UTF-8 encoding

```
The use of a compound of the formula:

or isomers i.e. geometric, optical, entianomeric, diasteriomeric, epimeric, stereoisomeric, tautomeric, conformational, or anomeric forms, salts, solvates and chemically protected forms thereof, in the preparation of a medicament for inhibiting the activity of PARP, wherein: A and B together represent a fused aromatic ring, optionally substituted with one or more substituent groups selected from halo, nitro, hydroxy, ether, thiol, thioether, amino, C$_{1-7}$ alkyl, C$_{3-20}$ heterocyclyl and C$_{5-20}$ aryl;

R$_C$ is -CH$_2$-R$_L$, where R$_L$ is a C$_{5-20}$ aryl group, optionally substituted with one or more substituent groups selected from C$_{1-7}$ alkyl, C$_{5-20}$ aryl, C$_{3-20}$ heterocyclyl, halo, hydroxy, ether, nitro, cyano, acyl, carboxy, ester, amido, amino, sulfonamido, acylamido, ureido, acyloxy, thiol, thioether, sulfoxide and sulfone; and

R$_N$ is hydrogen.
```

Patent translation & retrieval

Offline translation of the full dataset

translation

Cleaning and markup
Text extraction, tokenisation and segmentation
Translation: retraining of a new SMT system using UTF-8 encoding
Post-processing, XML formatting and merging (EN, DE, FR)

Patent translation & retrieval

Online API

http://falkor.lsi.upc.edu/MOLTO/
Allows uploading a single file, which should contain English text and annotations.
It returns the same document with the English sections translated into German and French.

Patent translation & retrieval

The patents retrieval prototype

http://molto-patents.ontotext.com
The interface is available in EN, DE and FR

Patent translation & retrieval

English text

ui-en

French text

ui-fr

Webservice?

Something here?

Translator tools

screenshot if integrated

Attachment	Size
AND_Q_DOC-en.png	114.08 KB
AND_Q_DOC-fr.png	102.52 KB
architectureWP7.png	352.55 KB
chunking.png	20.01 KB
GFEn2De.png	66.08 KB
GFEn2Fr.png	68.34 KB
HphrasesEn2Fr.png	79.29 KB
lexicon.png	22.21 KB
patXML1.png	173.17 KB
patXML2.png	90.13 KB
process.png	140.27 KB
relacions.jpg	22.99 KB
RobustEn2De.png	77.91 KB
