First experiments for patent translation from English to German and Bulgarian by using the robust GF parser

I recently made a major improvement to the efficiency of the robust GF parser, and now I am eager to try it out on the patent corpus. The first thing to notice is that although the parser is now usable even for nontrivial sentences, it still fails with an "out of memory" error on some long sentences. The borderline is somewhere around a sentence length of 20 tokens. There is still room for improvement and I know what must be done, but before that I want to run a pilot experiment in translation.
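
Until the memory problem is fixed, one pragmatic option for a pilot run is to set aside the sentences that are likely to be too long. The following is only a minimal sketch in Python, not part of GF: the function name `split_by_length` is made up for illustration, the input is assumed to be whitespace-tokenized already, and the default threshold of 20 is just the rough borderline mentioned above, so it should be treated as a tunable guess.

```python
def split_by_length(sentences, max_tokens=20):
    """Split sentences into (short, long) by whitespace token count.

    max_tokens=20 is an assumption taken from the rough borderline
    observed with the parser; it is not a hard limit of the parser.
    """
    short, long_ = [], []
    for sentence in sentences:
        n_tokens = len(sentence.split())
        if n_tokens <= max_tokens:
            short.append(sentence)
        else:
            long_.append(sentence)
    return short, long_


claims = [
    "a use according to claim 3 , in which R represents a hydrogen atom",
    " ".join(["token"] * 25),  # synthetic 25-token sentence
]
short, long_ = split_by_length(claims)
print(len(short), "sentence(s) to parse,", len(long_), "set aside")
```

Since the borderline is only approximate, some sentences above the threshold may still parse, so it may be worth trying the set-aside sentences separately rather than discarding them.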

The first step is to compile the parsing grammars for English, Bulgarian and German into one .pgf file. The Bulgarian grammar is loaded with a large dictionary, partly bootstrapped from Apertium and later extended by hand. Interestingly, the original source was the English<->Macedonian dictionary, which with some effort was converted to English<->Bulgarian. The German grammar has a dictionary bootstrapped from Wiktionary and extended with lexica taken from the grammar that Ramona and Cristina used in the previous machine translation experiments. So far everything was easy, and the outcome is a monster grammar with a .pgf file of 14 MB. Loading the grammar is possible, but the standard GF shell (in Haskell) needs about 400 MB of virtual memory to load it. The C parser is of course not that hungry and needs only 200 MB.

Now we can parse claims in English with the robust parser in the C runtime, but the C runtime still cannot linearize partial trees. The workaround is to use the C runtime for parsing and the Haskell runtime for linearization. The following four claims were translated in this way; each is shown in English, then in Bulgarian, then in German. Those who can read German or Bulgarian will notice that the translations are not perfect, but I still think they are pretty good for a first try.

for the manufacture of a medicament for the treatment or prevention of cachexia in a mammal
за производството от медикамент за третирането или предпазване от ?32 в бозайник
für das Produkt zur ein Medikament für die Behandlung oder Prävention zur ?32 in einem Säugetier

a use according to claim 2 , in which R represents a hydrogen atom , a fluorine atom , a chlorine atom or a methyl group
употреба според твърдение 2 в което R представя [CompoundCN] , [CompoundCN] , [CompoundCN] или [CompoundCN]
eine Anwendung gemäß Anspruch einigen 2 , in dem R einen [CompoundCN] repräsentiert , einem [CompoundCN] , einem [CompoundCN] oder einem [CompoundCN]

a use according to claim 3 , in which R represents a hydrogen atom
употреба според твърдение 3 в което R представя [CompoundCN]
eine Anwendung gemäß Anspruch einigen 3 , in dem R einen [CompoundCN] repräsentiert

a use according to claim 5 , in which R1 represents an amino group or an acetylamino group
употреба според твърдение 5 в която R1 представя ?37 група или ?46 група
eine Anwendung gemäß Anspruch einigen 5 , in der R1 eine ?37 Gruppe oder eine ?46 Gruppe repräsentiert
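
The gaps in the translations above can be counted mechanically, which gives a rough idea of where the lexicon or grammar is incomplete. The sketch below is a small Python helper, not part of GF; the name `gap_report` is hypothetical. It assumes that tokens of the form `?32` stand for parts of the partial tree that were left unresolved, and that bracketed names such as `[CompoundCN]` stand for abstract functions with no linearization in the target language. Both readings are my interpretation of the output shown here.

```python
import re

# Assumption: "?N" marks an unresolved part of a partial tree.
META = re.compile(r"^\?\d+$")
# Assumption: "[Name]" marks a function without a linearization
# in the target language.
MISSING = re.compile(r"^\[\w+\]$")


def gap_report(translation):
    """Count placeholder tokens in a whitespace-tokenized translation."""
    tokens = translation.split()
    return {
        "tokens": len(tokens),
        "metavariables": [t for t in tokens if META.match(t)],
        "missing_linearizations": [t for t in tokens if MISSING.match(t)],
    }


report = gap_report(
    "употреба според твърдение 5 в която R1 представя ?37 група или ?46 група"
)
print(report)
```

Running the same function over the German output would make it easy to compare the two target languages claim by claim.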

The views expressed in this blog are personal and do not in any way reflect the view of the MOLTO Consortium