WP5 Statistical and robust translation - M18
Summary of progress
M18 is the date by which Milestone S5 (First prototypes of the baseline combination models) should be achieved. The baseline systems for this workpackage, as described in Task 5.4, include a statistical machine translation (SMT) system trained on patent data and the GF multilingual translation system with a specific grammar for patents.
The SMT system was mainly developed in the previous six months and was already reported in the First Year Report. In the following section we explain the most significant results that have been accomplished with respect to the GF system.
For Task 5.5, we have started work towards the hybrid system. Parts of the GF system, such as the lexicon building, already make use of statistical components. In addition, the methodology for combining SMT and GF alignments has been established and is waiting to be applied to the patents domain.
The work done for these tasks has recently been published at the "MT Summit XIII 4th Workshop on Patent Translation" under the title "Patent translation within the MOLTO project".
At the time of writing this report, Deliverable D5.1, "Description of the final collection of corpora", corresponding to Tasks 5.1 and 5.2, has been submitted as a regular publication. It is a public document accessible from the MOLTO web page.
Highlights
A first implementation of the English-to-French patent translator with GF is available. The translation process is divided into three modules: generic pre-processing, on-line lexicon building, and the patents grammar.
The generic pre-processing consists of a purpose-built tokeniser that deals with compound nouns, phrases separated by hyphens, chemical compounds, etc. The Stanford POS-tagger is used for named entity recognition, and a number recogniser has been developed. Once tagged, chemical compounds can be translated independently by the compounds grammar. This grammar is at an early stage of development within this workpackage.
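The kind of tokenisation described above can be illustrated with a minimal Python sketch. It is not the project's tokeniser: the pattern, the group names and the `tokenise` function are assumptions made for illustration only, and the sketch simply keeps hyphenated phrases together and labels numbers separately from ordinary words.

```python
import re

# Hypothetical token pattern; alternatives are tried left to right,
# so hyphenated phrases are matched before numbers and plain words.
TOKEN_RE = re.compile(
    r"""
    (?P<compound>\w+(?:-\w+)+)        # hyphenated phrase, e.g. 2-methyl-propane
  | (?P<number>\d+(?:[.,]\d+)*\b)     # number, e.g. 10, 2.5 or 1,000
  | (?P<word>\w+)                     # plain word
  | (?P<punct>[^\w\s])                # single punctuation mark
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Return a list of (token, kind) pairs for the given text."""
    return [(m.group(0), m.lastgroup) for m in TOKEN_RE.finditer(text)]
```

A real pre-processor would need much richer rules for chemical nomenclature; the point here is only that compound-like units are kept as single tokens so that a later module can treat them independently.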
The second module is devoted to lexicon building. The GF library multilingual lexicon is extended with nouns, adjectives, verbs and adverbs. The abstract syntax for these parts of speech (PoS) is created from the claims in English; words are lemmatised, and noise and ambiguities are corrected manually. The appropriate inflection is generated using the implemented GF paradigms and the English dictionary of the GF library, English being the starting language. Base forms are then translated into French and their inflection is generated in the same way. This process will be extended to other languages later in the project.
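As a rough illustration of this step, the following Python sketch emits GF lexicon entries for one lemma. It assumes the smart paradigms `mkN`, `mkA`, `mkV` and `mkAdv` of the GF resource grammar library and a `lemma_PoS` naming scheme for abstract functions; the function `gf_entries` and its output layout are hypothetical and do not reproduce the project's actual scripts.

```python
# Mapping from GF lexical category to the paradigm function assumed
# to build its inflection table from a base form.
PARADIGM = {"N": "mkN", "A": "mkA", "V": "mkV", "Adv": "mkAdv"}


def gf_entries(lemma_en, pos, lemma_fr):
    """Build abstract and concrete GF lexicon lines for one lemma.

    lemma_en: English base form, taken from the claims
    pos:      one of N, A, V, Adv
    lemma_fr: French translation of the base form
    """
    if pos not in PARADIGM:
        raise ValueError(f"unsupported part of speech: {pos}")
    # Identifier of the abstract function, e.g. device_N (assumed scheme).
    ident = f"{lemma_en.replace('-', '_').replace(' ', '_')}_{pos}"
    oper = PARADIGM[pos]
    return {
        "abstract": f"fun {ident} : {pos} ;",
        "eng": f'lin {ident} = {oper} "{lemma_en}" ;',
        "fre": f'lin {ident} = {oper} "{lemma_fr}" ;',
    }
```

For example, `gf_entries("device", "N", "dispositif")` yields one abstract line and one linearisation line per language. Irregular words would need extra forms passed to the paradigm, which this sketch does not handle.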
Finally, the core of the translator is the patents grammar. The GF resource grammar has been extended with functions that implement constructions occurring in patent claims. The grammar is also at an early stage: it currently produces a large number of ambiguities, and its coverage is around 15% on complete sentences. This figure can rise to 60% when dealing with chunks instead of full sentences.
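Coverage here can be read as the fraction of input units that the grammar parses. The Python sketch below shows that computation under this assumption; `coverage` and the `can_parse` callback are hypothetical names, and the callback stands in for a call to the GF parser, which is not shown.

```python
def coverage(units, can_parse):
    """Fraction of units (sentences or chunks) accepted by the parser.

    units:     iterable of strings to be parsed
    can_parse: function returning True when a unit gets at least one parse
    """
    units = list(units)
    if not units:
        return 0.0
    parsed = sum(1 for u in units if can_parse(u))
    return parsed / len(units)
```

The same function can be applied once to full sentences and once to their chunks, which is the comparison behind the two figures quoted above.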
Deviations from Annex I
This workpackage is closely related to WP7. The delay in the patents corpus from WP7 has required reordering some tasks within WP5. This explains why work on Task 5.5 has been done in place of parts of Task 5.4, which will be finished in the coming months. Also, because of the delay in the approval of the data, Deliverable 5.1 may be updated soon.