WP5 Statistical and Robust Translation - M6
WP5 is planned to span Month 7 to Month 30, but it is being affected by the delay in the Patents data. As a result, some work is already under way, which we detail in the following.
Work towards Milestone MS7 (Month 24)
MS7: First prototypes of hybrid combination models.
Most of the objectives of the work package depend on the compilation of the Patents corpus. Even the languages of study depend on the data that the new partner provides. To compensate for the resulting delay, both in WP5 and especially in WP7, we have started working on hybrid approaches here. The methodology is now to develop hybrid methods independently of the domain and data sets used, so that they can later be adapted to patents.
At the moment, we are able to obtain phrases and alignments from a GF-generated synthetic corpus. This is a first step towards the hard integration of both paradigms, and also towards the soft integration methods led by SMT. We are currently looking more deeply into the latter, as it is a domain-independent study.
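As an illustration only, the sketch below shows how phrase pairs can be read off a word-aligned sentence pair using the standard consistency criterion from phrase-based SMT. It is not the project's actual pipeline: the function name `extract_phrases`, the `max_len` parameter, and the toy sentence pair and alignment are assumptions made for this example, and the sketch keeps only the minimal target span for each source span (it does not extend phrases over unaligned words).

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Return phrase pairs that are consistent with the word alignment.

    src, tgt   -- lists of tokens
    alignment  -- set of (source_index, target_index) links
    max_len    -- hypothetical cap on phrase length, in tokens
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency check: no target word inside [j1, j2]
            # may be linked to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]),
                       " ".join(tgt[j1:j2 + 1])))
    return pairs


# Toy example (invented for illustration, not taken from the GF corpus).
src = ["the", "red", "house"]
tgt = ["la", "casa", "roja"]
alignment = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrases(src, tgt, alignment)
```

In this toy case the pair ("red house", "casa roja") is extracted, whereas ("the red", "la casa") is not, because "casa" is aligned to "house", which lies outside the source span.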
Work towards Deliverable D51 (Month 18)
D51: Description of the final collection of corpora.
Bilingual corpora are needed to create the resources for training and adapting statistical MT systems and to extend the grammar-based paradigm with statistical information. We will compile and annotate large general-purpose bilingual and monolingual corpora for training basic SMT systems. At the moment, we have compiled and annotated the European Parliament corpus for English and Spanish. The final languages will probably be English, German, and either Spanish or French; as soon as this is confirmed, the final general-purpose corpus can be compiled easily. The depth of the annotation will depend on the specific languages and the available linguistic processors.
In addition, domain-specific corpora will be needed to adapt the general-purpose SMT system to the domain of application in this project (Patents case study, WP7). We cannot build the final corpus yet, but some MOLTO members have joined the IRF, so a set of Patents data is available for individual research purposes. This has allowed us to compile a preliminary parallel corpus, on which we can shortly start to build a domain GF grammar and to develop a first pure SMT domain-adapted translator.
Attachment | Size |
---|---|
ProgressReport_WP5.odt | 34.6 KB |