Statistical and Robust Translation

ID:

Leader: cristina.españa

Workplan wiki:

Timeline: October 2010 - August 2012

Use of resources

Node	Budgeted	Period 1	Period 2	Period 3 (est)
UGOT	9	0	RE:0.7	X
UHEL	3	0	X	X
UPC	38	0	19	X
Ontotext
BI
UZH

Objectives

The goal is to develop translation methods that complement the grammar-based methods of WP3, extending their coverage and quality in unconstrained text translation. The focus is on techniques for combining GF-based and statistical machine translation (SMT). The WP7 case study on translating patent text is the natural scenario for testing the techniques developed in this package. Existing corpora for WP7 will be used to adapt the SMT and grammar-based systems to the patents domain. This research will be conducted on at least three of the project's languages.

Description of work

The work in this package is organized in three main lines:

1. Extend the GF domain grammar for the patents domain developed in WP7 by introducing probabilistic predictions.
2. Adapt a state-of-the-art SMT system to the patents domain, using in-domain multilingual corpora provided by WP7 and synthetic aligned corpora generated in a controlled environment by the GF grammar from (1). All corpora used for domain adaptation will have to be pre-processed with linguistic analyzers.
3. Develop combination approaches to integrate grammar-based and statistical MT models in a hybrid MT system. At least four variants will be studied: (i) baseline: a cascade of independent MT systems; (ii) hard integration: GF partial output is fixed during regular SMT decoding (Moses to be used); (iii) soft integration I: GF partial output, in the form of phrase pairs, is integrated as a discriminative probability feature model in a phrase-based SMT system (Moses to be used); (iv) soft integration II: GF partial output, in the form of tree fragment pairs, is integrated as a discriminative probability model in a syntax-based SMT system (system to be determined).
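
To make variants (i) and (iii) more concrete, the following is a minimal Python sketch, not the project's implementation. It assumes a toy in-memory phrase table and does not call GF or Moses; all function names, the "tm" and "gf" feature names, the 0.9/0.1 feature values, and the example phrase pairs are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Dict, Optional, Set, Tuple

PhrasePair = Tuple[str, str]   # (source phrase, target phrase)
Features = Dict[str, float]    # feature name -> probability-like value


def cascade_translate(
    sentence: str,
    grammar_translate: Callable[[str], Optional[str]],
    smt_translate: Callable[[str], str],
) -> str:
    """Variant (i): use the grammar-based output if there is one, else fall back to SMT."""
    result = grammar_translate(sentence)
    return result if result is not None else smt_translate(sentence)


def add_gf_feature(
    phrase_table: Dict[PhrasePair, Features],
    gf_pairs: Set[PhrasePair],
    supported: float = 0.9,    # arbitrary illustrative value
    unsupported: float = 0.1,  # arbitrary illustrative value
) -> Dict[PhrasePair, Features]:
    """Variant (iii): attach a 'gf' feature marking pairs the grammar also produced."""
    return {
        pair: {**feats, "gf": supported if pair in gf_pairs else unsupported}
        for pair, feats in phrase_table.items()
    }


def loglinear_score(features: Features, weights: Dict[str, float]) -> float:
    """Weighted sum of log feature values; features without a weight are ignored."""
    return sum(weights.get(name, 0.0) * math.log(value)
               for name, value in features.items())


def best_translation(
    source: str,
    phrase_table: Dict[PhrasePair, Features],
    weights: Dict[str, float],
) -> str:
    """Pick the highest-scoring target phrase for a single source phrase."""
    candidates = {tgt: loglinear_score(feats, weights)
                  for (src, tgt), feats in phrase_table.items() if src == source}
    return max(candidates, key=candidates.get)


# Toy data (invented): the SMT table slightly prefers the out-of-domain sense.
toy_table = {
    ("claim", "revendication"): {"tm": 0.4},
    ("claim", "réclamation"): {"tm": 0.6},
}
gf_pairs = {("claim", "revendication")}  # pair assumed to come from the GF grammar

baseline = best_translation("claim", toy_table, {"tm": 1.0})
hybrid = best_translation("claim", add_gf_feature(toy_table, gf_pairs),
                          {"tm": 1.0, "gf": 1.0})
```

In this toy setting the extra feature shifts the choice toward the pair supported by the grammar without forbidding the alternative, which is the difference between soft integration and the hard integration of variant (ii). In a real system the feature weights would be tuned rather than fixed by hand.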

The partners will contribute as follows. UGOT will work on the domain GF grammar probabilities and the generation of synthetic corpora for SMT adaptation. UPC will lead the package, provide the SMT technology (phrase-based and syntax-based), coordinate the corpora compilation and alignment, and develop the combined MT models. Mxw will act as a corpora provider for training and adapting the SMT systems. UHEL will work on the usability aspects of the combined system, which are preparatory for WP3.

Tasks

ID	Task	Status	Timeframe
5.1	Parallel corpus compilation in Patents domain	Ongoing	Sep 2010 - Dec 2010
5.2	Out-of-domain corpus	Completed
5.3	Robust Parsing	Ongoing
5.4	Baseline systems	Ongoing
5.5	Hybrid Models	Ongoing
5.6	Systems evaluation	Ongoing
