WP5 Statistical and robust translation - M24

olga.caprotti

Multilingual Online Translation

Summary of progress

Milestone MS7 (First prototypes of hybrid combination: the methods are implemented and evaluated on a specific test set) was achieved in M24.

The work of the fourth semester corresponds to the last three tasks of the WP (T5.4 Baseline systems, T5.5 Hybrid Models and T5.6 Systems evaluation, see http://www.molto-project.eu/workplan/statistical-and-robust-translation). The baseline systems have been improved by extending the GF translator, which can now handle chunks and therefore has wider coverage (Task 5.4).

For Task 5.5 we have implemented two kinds of hybrid models, which we call Soft and Hard integration. The following section outlines their main characteristics.

Finally, for Task 5.6 both the baselines and the hybrid systems have been evaluated with a variety of lexical metrics and compared with generic, publicly available translators such as Google and Bing. A manual evaluation has also been carried out to compare the pure SMT translator with the hybrid system that the automatic evaluation ranked as most promising.
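As an illustration of what a lexical metric measures, the following minimal sketch computes a clipped n-gram precision between a system output and a reference translation, in the style of the modified precision used by BLEU. The report does not name the metrics that were actually used, so the choice of metric, the whitespace tokenization and the function names are assumptions made only for this example.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each hypothesis n-gram is credited at most as many times as it
    appears in the reference (clipping), so repeating a correct word
    does not inflate the score. Tokenization is plain whitespace
    splitting, which is an assumption of this sketch.
    """
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total
```

A full metric would combine several n-gram orders and add a length penalty; this sketch shows only the word-overlap core that lexical metrics share.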

The work done for these tasks has been submitted to the 16th Annual Conference of the European Association for Machine Translation (EAMT 2012). The submitted paper, titled "A Hybrid System for Patent Translation", can be found on the MOLTO web page.

At the time of writing this report, deliverable D5.2 (Description and evaluation of the combination prototypes) is being submitted as a regular publication. It is a public document accessible from the MOLTO web page.

Highlights

Two kinds of hybrid translators for patents have been developed. The final systems are not just a combination of two different engines: the subsystems themselves also mix different components. We have developed a domain-specific GF translator that uses an in-domain SMT system to build its lexicon, and an SMT system sits on top of it to translate the phrases not covered by the grammar.

In the previous report we showed that the GF grammar-based system alone could not parse most patent sentences. Consequently, the current translation system uses GF to translate patent chunks and assembles the results in a later phase. As explained in D5.2, this requires several modifications to the GF baseline itself.

To gain robustness in the final system, the output of the GF translator is used as a priori information for a higher-level SMT system. The SMT baseline is fed with GF phrases, which are integrated in two different ways. In the first, which we call "Hard Integration", phrases that have a GF translation are forced to be translated that way. The system can reorder the chunks and translates the chunks that GF left untranslated, but there is no interaction between GF phrases and pure SMT phrases. In the second, "Soft Integration", phrases that have a GF translation are added to the translation table with a certain probability, so that the phrases coming from the two systems interact.
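The difference between the two integrations can be sketched with a toy example. This is not the project's implementation: the phrase tables, their probabilities, the function names and the gf_prob parameter are all hypothetical, and a real decoder would also score reordering and a language model, which are left out here.

```python
from typing import Dict, List

# Toy data, invented for illustration only (not project resources).
GF_PHRASES: Dict[str, str] = {
    "la présente invention": "the present invention",
}
SMT_TABLE: Dict[str, Dict[str, float]] = {
    "la présente invention": {"this invention": 0.6, "the present invention": 0.4},
    "concerne": {"relates to": 0.7, "concerns": 0.3},
}


def best(options: Dict[str, float]) -> str:
    """Return the candidate translation with the highest probability."""
    return max(options, key=lambda target: options[target])


def hard_integration(chunks: List[str]) -> List[str]:
    """GF translations are forced; SMT only fills the chunks GF does not cover."""
    output = []
    for chunk in chunks:
        if chunk in GF_PHRASES:
            output.append(GF_PHRASES[chunk])       # forced, no competition
        elif chunk in SMT_TABLE:
            output.append(best(SMT_TABLE[chunk]))  # pure SMT fallback
        else:
            output.append(chunk)                   # unknown chunk passes through
    return output


def soft_integration(chunks: List[str], gf_prob: float) -> List[str]:
    """GF translations join the SMT candidates with probability gf_prob."""
    output = []
    for chunk in chunks:
        options = dict(SMT_TABLE.get(chunk, {}))
        if chunk in GF_PHRASES:
            gf_target = GF_PHRASES[chunk]
            # The GF phrase competes with the SMT phrases instead of replacing them.
            options[gf_target] = max(options.get(gf_target, 0.0), gf_prob)
        output.append(best(options) if options else chunk)
    return output
```

With these toy tables, hard_integration always outputs "the present invention" for the first chunk, whereas soft_integration outputs it only when gf_prob exceeds the score of the competing SMT phrase. That is the sense in which the phrases of the two systems interact in the soft setting.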

The hybrids exploit the high coverage of statistical translators and the high precision of GF to deal with specific issues of the language. At present the grammar handles gender and number agreement, agreement between chunks, and reordering within chunks. Although the cases where these problems arise are not very numerous, both the manual and the automatic evaluations consistently prefer the hybrid system over the two individual translators. In the near future we plan to widen the range of issues addressed by the grammar. We will also introduce modifications to the GF translator that use SMT components, as well as new ways of combining phrases.

Use of resources for Period 2

Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern
UGOT | 0 | 0 | 1 (R. Enache) | 0
UPC | 7.75 (L. Màrquez, L. Padró, H. Rodríguez, D. Farwell) | 11.25 (C. España, M. Gonzalez, X. Carreras) | 0 | 0
UHEL | 0 | 0 | 0 | 0
OntoText | 0 | 0 | 0 | 0

Deviations from Annex I

The final hybrid translators have been developed for the French-English language pair. We also aim to include German, so the concrete syntax for German will be completed in the coming months. We plan to complete this task in May, and it does not affect any other task of the project. The systems in the three languages will be available for the final evaluation.