Statistical and Robust Translation

ID: 5
Leader: cristina.españa
Timeline: October 2010 - August 2012

Use of resources

Node      Budgeted  Period 1  Period 2  Period 3 (est)
UGOT      9         0         RE:0.7    X
UHEL      3         0         X         X
UPC       38        0         19        X
Ontotext
BI
UZH

Objectives

The goal is to develop translation methods that complement the grammar-based methods of WP3, extending their coverage and quality in unconstrained text translation. The focus is on techniques for combining GF-based and statistical machine translation (SMT). The WP7 case study on translating patent text is the natural scenario for testing the techniques developed in this package. Existing WP7 corpora will be used to adapt the SMT and grammar-based systems to the Patents domain. This research will cover at least three of the project's languages.

Description of work

The work in this package is organized in three main lines:

  1. Extend the GF domain grammar for the Patents domain developed in WP7 by introducing probabilistic predictions.

  2. Adapt a state-of-the-art SMT system to the Patents domain, by using in-domain multilingual corpora provided by WP7 and synthetic aligned corpora generated in a controlled environment by the GF grammar from (1). All corpora used for domain adaptation will have to be pre-processed with linguistic analyzers.

  3. Develop combination approaches to integrate grammar-based and statistical MT models in a hybrid MT system. At least four variants will be studied: (i) (baseline) a cascade of independent MT systems; (ii) (hard integration) GF partial output is fixed in a regular SMT decoding (Moses to be used); (iii) (soft integration I) GF partial output, in the form of phrase pairs, is integrated as a discriminative probability feature model in a phrase-based SMT system (Moses to be used); (iv) (soft integration II) GF partial output, in the form of tree fragment pairs, is integrated as a discriminative probability model in a syntax-based SMT system.
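As an illustration of variant (ii), one possible way to fix GF partial output during decoding is Moses' XML input markup, where a source span carries a `translation` attribute and the decoder is run with `-xml-input exclusive`. The sketch below only prepares such an input line. The function name `mark_up`, the `gf` tag name, and the span format `(start, end, translation)` are assumptions made for this example and are not part of the work plan.

```python
from xml.sax.saxutils import escape, quoteattr


def mark_up(tokens, spans):
    """Wrap source spans that have a GF translation in XML markup.

    tokens: list of source-side tokens.
    spans:  list of (start, end, translation) tuples, with `end` exclusive,
            giving the GF partial translations (hypothetical format).
    Returns one input line for a decoder that accepts XML-annotated input.
    """
    out = []
    pos = 0
    for start, end, translation in sorted(spans):
        if not (pos <= start < end <= len(tokens)):
            raise ValueError("spans must be non-empty, in range and non-overlapping")
        # Tokens not covered by GF are left for the SMT system to translate.
        out.extend(escape(t) for t in tokens[pos:start])
        covered = " ".join(escape(t) for t in tokens[start:end])
        # The tag name is arbitrary; the `translation` attribute carries the fixed output.
        out.append(f"<gf translation={quoteattr(translation)}>{covered}</gf>")
        pos = end
    out.extend(escape(t) for t in tokens[pos:])
    return " ".join(out)


if __name__ == "__main__":
    source = "das ist ein kleines haus".split()
    print(mark_up(source, [(2, 5, "a small house")]))
```

With the `exclusive` setting, the decoder uses only the supplied translation for the marked span, which corresponds to the "hard" reading of the integration. The inclusive mode, where the supplied translation competes with phrase-table entries, is closer in spirit to the soft variants.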

The partners will contribute as follows. UGOT will work on the domain GF grammar probabilities and the generation of synthetic corpora for SMT adaptation. UPC will lead the package, provide the SMT technology (phrase-based and syntax-based), coordinate the corpora compilation and alignment, and develop the combined MT models. Mxw will act as a corpora provider for training and adapting the SMT systems. UHEL will work on the usability aspects of the combined system, which are preparatory for WP3.