WP5: Statistical and robust translation
WP5 requirements
Objectives
[From DoW]
The goal is to develop translation methods that complement the grammar-based methods of WP3 to extend their coverage and quality in unconstrained text translation. The focus will be placed on techniques for combining GF-based and statistical machine translation. The WP7 case study on translating patent texts is the natural scenario to test the techniques developed in this package. Existing corpora for WP7 will be used to adapt SMT and grammar-based systems to the Patents domain. This research will be conducted on a variety of languages of the project (at least three).
Deliverables
Del. no | Del. title | Nature | Date
D 5.1 | Description of the final collection of corpora | RP | M18
D 5.2 | Description and evaluation of the combination prototypes | RP | M24
D 5.3 | WP5 final report: statistical and robust MT | RP, Main | M30
Description of work
[10 Robust and statistical translation methods in DoW]
The concrete objectives in this proposal around robust and statistical MT are:
- Extend the grammar-based approach by introducing probabilistic information and confidence scored predictions.
- Construct a GF domain grammar and a domain-adapted state-of-the-art SMT system for the Patents use case.
- Develop combination schemes to integrate grammar-based and statistical MT systems in a hybrid approach.
- Fulfill the previous objectives on a variety of language pairs of the project (covering three languages at least).
Most of the objectives depend on the Patents corpus. Even the languages of study depend on the data that the new partner provides. To compensate for the resulting delay, both in WP5 and especially in WP7, we started working here on hybrid approaches. The methodology now is to develop hybrid methods independently of the domain and data sets used, so that they can later be adapted to patents.
Bilingual corpora are needed to create the necessary resources for training/adapting statistical MT systems and to extend the grammar-based paradigm with statistical information (the first two objectives above). We will compile and annotate large general-purpose bilingual and monolingual corpora for training basic SMT systems. This compilation will rely on publicly available corpora and resources for MT (e.g., the multilingual corpus with transcriptions of European Parliament sessions).
Domain-specific corpora will be needed to adapt the general-purpose SMT system to the concrete domain of application in this project (Patents case study). These corpora will come from the compilation to be made in WP7, led by Mxw.
We already have the European Parliament corpus compiled and annotated for English and Spanish. The final languages will probably be English, German, and Spanish or French; as soon as this is confirmed, the final general-purpose corpus can be easily compiled. The depth of the annotation will depend on the concrete languages and the available linguistic processors.
Combination of grammar-based and statistical paradigms is a novel and active research line in MT. (...) We plan to explore several instantiations of the fallback approach, from simple to complex:
• Independent combination: in this case, the combination is set as a cascade of independent processors. When grammar-based MT does not produce a complete translation, the SMT system is used to translate the input sentence. This external combination will be set as the baseline for the rest of the combination schemes.
• Construction of a hybrid system based on both paradigms. In this case, a more ambitious approach will be followed, which consists of constructing a truly hybrid system that incorporates an inference procedure able to deal with multiple proposed fragment translations, coming from grammar-based and SMT systems. Again we envision several variants:
  • Fix translation phrases produced by the partial GF analyses in the SMT search. In this variant we assume that the partial translations given by GF are correct, so we can fix them and let SMT fill the remaining gaps and do the appropriate reordering. This hard combination is easy to apply but not very flexible.
  • Use translation phrase pairs produced by the partial GF analyses, together with their probabilities, to form an extra feature model for the Moses decoder (probability of the target sentence given the source).
  • Use tree fragment pairs produced by the partial GF analyses, together with their probabilities, to feed a syntax-based SMT model, such as the one by Carreras and Collins (2009). In this case the search process to produce the most probable translation is a probabilistic parsing scheme.
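The independent combination (the cascade baseline) can be illustrated with a minimal sketch. It assumes that each system is wrapped as a function returning a translation, or None when no complete translation is available. The names `cascade_translate`, `toy_gf` and `toy_smt` are hypothetical and only stand in for the real GF and SMT systems.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interface: a translator maps a source sentence to a target
# sentence, or returns None when it cannot produce a complete translation.
Translator = Callable[[str], Optional[str]]


def cascade_translate(sentence: str,
                      gf_translate: Translator,
                      smt_translate: Translator) -> Tuple[str, str]:
    """Try the grammar-based system first and fall back to SMT.

    Returns the translation together with the name of the system used.
    """
    gf_output = gf_translate(sentence)
    if gf_output is not None:
        return gf_output, "GF"
    smt_output = smt_translate(sentence)
    if smt_output is None:
        raise ValueError("neither system produced a translation")
    return smt_output, "SMT"


def toy_gf(sentence: str) -> Optional[str]:
    # Toy stand-in for GF: only sentences covered by the "grammar" translate.
    covered = {"the house is red": "la casa es roja"}
    return covered.get(sentence)


def toy_smt(sentence: str) -> Optional[str]:
    # Toy stand-in for SMT: word-by-word lookup, unknown words are copied.
    lexicon = {"the": "la", "house": "casa", "is": "es", "big": "grande"}
    return " ".join(lexicon.get(word, word) for word in sentence.split())


if __name__ == "__main__":
    print(cascade_translate("the house is red", toy_gf, toy_smt))
    print(cascade_translate("the house is big", toy_gf, toy_smt))
```

Recording which system produced each output makes it possible to report the GF coverage of the cascade alongside the automatic metrics.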
The previous text describes the hybrid MT systems we plan to include. The baseline is clear. In fact, one can define three baselines: a raw GF system, a raw SMT system and the naïve combination of both. Regarding real hybrid systems there is much more to explore. Here we list four approaches to be pursued:
Hard integration. Force fixed GF translations within an SMT system.
Soft integration I. Led by SMT. GF partial output, as phrase pairs, is integrated as a discriminative probability feature model in a phrase-based SMT system.
Soft integration II. Led by SMT. GF partial output, as tree fragment pairs, is integrated as a discriminative probability model in a syntax-based SMT system.
Soft integration III. Led by GF. Complement the GF translation structure with SMT options and perform a statistical search to find the final translation.
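As an illustration of the hard integration, the sketch below wraps the source spans translated by GF in XML markup carrying the fixed translation, in the style of the XML input markup that the Moses decoder can read (to our understanding, enabled with the `-xml-input exclusive` option). The function name `mark_fixed_spans`, the tag name `gf` and the span format are assumptions made for this example only.

```python
from typing import List, Sequence, Tuple
from xml.sax.saxutils import escape, quoteattr

# Hypothetical span format: (start, end, translation), with token indices
# and an exclusive end, describing a source fragment translated by GF.
Span = Tuple[int, int, str]


def mark_fixed_spans(tokens: Sequence[str], gf_spans: Sequence[Span]) -> str:
    """Annotate source tokens so that GF translations are forced.

    Tokens outside the spans are left for the SMT system to translate.
    """
    pieces: List[str] = []
    position = 0
    for start, end, translation in sorted(gf_spans):
        if start < position or end > len(tokens) or start >= end:
            raise ValueError("spans must be non-empty, in range and non-overlapping")
        # Plain tokens before the span (XML special characters are escaped).
        pieces.extend(escape(token) for token in tokens[position:start])
        source_phrase = escape(" ".join(tokens[start:end]))
        pieces.append(
            f"<gf translation={quoteattr(translation)}>{source_phrase}</gf>"
        )
        position = end
    pieces.extend(escape(token) for token in tokens[position:])
    return " ".join(pieces)


if __name__ == "__main__":
    source = "the dark house is big".split()
    print(mark_fixed_spans(source, [(1, 3, "casa oscura")]))
```

With this kind of input the decoder is expected to keep the marked translations fixed and to translate and reorder only the remaining words, which is why this variant is easy to apply but not very flexible.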
At the moment, we are able to obtain phrases and alignments from a GF-generated synthetic corpus. This is a first step for the hard integration of both paradigms, and also for the soft integration methods led by SMT. We are currently going deeper into the latter, as it is a domain-independent study.
In the evaluation process, these families of methods will be compared to the baseline(s) introduced above according to several automatic metrics.
WP5 evaluation
WP5 will have its own internal evaluation, complementary to that of WP9. Since statistical methods need fast and frequent evaluations, most of the evaluation within the package will be automatic. For that, one needs to define the corpora and the set of automatic metrics to work with.
Corpora
Statistical methods are linked to patents data. This is the quasi-open domain where the hybridization is going to be tested. The languages of the corpus are not yet completely defined, but judging from other work on patents, they will probably be English, German, and French or Spanish.
Besides the large training corpus, we need at least two smaller data sets, one for development purposes and another one for testing. The order of magnitude of these sets is usually around 1,000 aligned segments or sentences. We expect to reach this size, but the final amount will depend on the available data.
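A minimal sketch of such a partition is given below, assuming the aligned corpus is available as a list of (source, target) pairs. The function name `split_corpus`, the default sizes and the random sampling strategy are assumptions made for illustration; the actual split will depend on the data finally available.

```python
import random
from typing import List, Sequence, Tuple

SentencePair = Tuple[str, str]


def split_corpus(pairs: Sequence[SentencePair],
                 dev_size: int = 1000,
                 test_size: int = 1000,
                 seed: int = 0
                 ) -> Tuple[List[SentencePair], List[SentencePair], List[SentencePair]]:
    """Hold out disjoint development and test sets; the rest is training data."""
    if dev_size + test_size >= len(pairs):
        raise ValueError("corpus too small for the requested held-out sets")
    shuffled = list(pairs)
    # A fixed seed keeps the partition reproducible across experiments.
    random.Random(seed).shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


if __name__ == "__main__":
    toy_corpus = [(f"source {i}", f"target {i}") for i in range(50)]
    train, dev, test = split_corpus(toy_corpus, dev_size=5, test_size=5)
    print(len(train), len(dev), len(test))
```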
Metrics
BLEU (Papineni et al. 2002) is the de facto metric used in most machine translation evaluation. We plan to use it together with other lexical metrics such as WER or NIST in the development process of the statistical and hybrid systems.
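To make one of these lexical metrics concrete, the sketch below computes WER for a single segment as the word-level edit distance (substitutions, deletions and insertions) divided by the reference length. It assumes whitespace tokenization and a single reference; the function name `word_error_rate` is ours, and in practice an established evaluation tool would be used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("the reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word left untranslated
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower values are better, and the score can exceed 1 when the hypothesis is much longer than the reference.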
Lexical metrics have the advantage of being language-independent, since most of them are based on n-gram matching. However, they are not able to capture all the aspects of a language, and they have been shown not to always correlate well with human judgements. So, whenever possible, it is good practice to include syntactic and/or semantic metrics as well.
The IQmt package [1] provides tools for (S)MT translation quality evaluation. For a few languages, it also provides metrics for this deeper syntactic and semantic analysis. At the moment, the package supports English and Spanish, but other languages are planned to be included soon. We will use IQmt for our evaluation on the supported language pairs.
[1] http://www.lsi.upc.es/~nlp/IQMT/