Contract No.: | FP7-ICT-247914 |
---|---|
Project full title: | MOLTO - Multilingual Online Translation |
Deliverable: | D9.1. MOLTO test criteria, methods and schedule |
Security (distribution level): | Confidential |
Contractual date of delivery: | M7 |
Actual date of delivery: | October 2010 |
Type: | Report |
Status & version: | Draft (evolving document) |
Author(s): | L. Carlson et al. |
Task responsible: | UHEL |
Other contributors: |
Abstract
The present paper is the summary of deliverable D 9.1 as of M6. Workpackage materials can be found at the UHEL MOLTO website (https://kitwiki.csc.fi/twiki/bin/view/MOLTO/WebHome). This document also links to the MOLTO official website (http://www.molto-project.eu/).
(The official MOLTO website is the prime place for coordinating the project as (long as) material on it is uncluttered, reliable and up to date. For local work, informal project communication and creative planning, the UHEL MOLTO website is open to all MOLTO partners.)
This paper is structured as an introduction followed by sections per workpackage. The WPs are divided into the front end WPs (WP3 and the use cases) and the back end ones (WPs 2, 4, 5). For each WP we survey the promises from the DoW and the ongoing work, derive requirements from them, and follow with evaluation plans or recommendations. Text in brackets refers to sources. Action points are in boldface.
The wealth of cited content aims to bring the different strands of documented work, planned or in progress, together, in order to get an updated view of the ongoing MOLTO process, and thus to cover the bases for making the tool and use case WP requirements meet. We take what the technology offers as the base and scale user expectations from that.
We go over the later WP9 tasks first:
D 9.1 is to define the requirements for both the generic tools and the case studies in a coherent way that can lead to maximal synergy between work packages. To do this, we need to detail the project plan and schedule. This then implies the main outline of the evaluation schedule.
The MOLTO dependency chart only shows dependencies for WP 9 with the use cases WP 6-8 plus the dissemination WP 10. The boldfaced bits above entail that there are dependencies to the tools workpackages as well.
By the MOLTO timetable, WPs 2, 4, and 9 (tools, ontology, req/eval) started at once. Translation tools WP3 and use case WPs 5-6 start at M7 (Varna). The patents use case WP7 has not started because the patent partner dropped out.
By the DoW, MOLTO aims to have working prototypes along the way. So far, each partner has been providing their own demos. Progressively, there will be more need for integration; WP3 in particular will use most of the rest as components. In the best case, integration can be just plugging in APIs, with at most local bilateral negotiation between a provider and a user. But to ensure this, we must agree in time on what the APIs will provide.
As suggested in the DoW text (but not spelled out in the schedule), specification/version checkpoints should be agreed more often between the tools WPs. At Varna, we get the first update of the tools and ontology workpackages. We should get together to fix times and expected contents for the remaining internal checkpoints as well. It would help to add checkpoint dates plus time dependencies to the above schedule (turning it into a proper Gantt chart --- the “Gantt chart” in the DoW is more like a PERT chart). It also helps to be clear about just what capabilities each release is planned to offer. Proposals for what to insert into the project schedule are made along the way below.
Checkpoints can be constructed from the deliverables list and the milestones table.
The deliverables list implies these checkpoints with implications to the evaluation timetable:
Milestone MS3 may need updating relative to the deliverables list. No important deliverables are scheduled between M6 and M12 that would motivate a demonstrator there. A more appropriate place for the next version of the translation tool (after the Phrasebook) is after M18. M18 should make ontology interoperability available, and along with it, new lexical tools.
Having firmed up the schedule somewhat, we go through the WP 9.1 tasks boldfaced in the WP9 statement of purpose.
[From DoW] The work will start with collecting user requirements for the grammar development IDE (WP2), translation tools (WP3), and the use cases (WP6-8). We will define the evaluation criteria and schedule in synchrony with the WP plans (D9.1). We will define and collect corpora including diagnostic and evaluation sets, the former, to improve translation quality on the way, and the latter to evaluate final results.
We have not been able to do much interviewing here, because the patent user partner (WP7) is missing and the two others have not started their WPs yet. We do not yet have real end users in the use cases. In the mathematics case, the end users could be math teaching platform developers; in the patent case, patent office staff; in the museum case, museum workers. These are content professionals with more than average technical facility.
The use cases were scheduled as follows.
This problem was implicit in the original timetable, which expected WP9 to work on the use cases before the use case WPs started working. This was noted at the kickoff meeting, where it was agreed that this task would be rescheduled as necessary.
Pending user input, we decided to derive requirements from MOLTO's promises and compare them to the tools resources. The promises made by MOLTO from DoW are summarised below.
[DoW 5]
The single most important S&T innovation of MOLTO will be a mature system for multilingual on-line translation, scalable to new languages and new application domains. The single most important tangible product of MOLTO is a software toolkit, available via the MOLTO website. The toolkit is a family of open-source software products:
A helpful list of quality dimensions relevant to MOLTO evaluation can be derived from the DoW list of links between the main objectives and the tasks in the WPs:
Here are some measurable expected outcomes. Most of them are directly applicable as testable quantitative evaluation measures. How many test rounds we can run is another matter, given the need for fresh test subjects.
Feature | Current | Projected | Remarks |
---|---|---|---|
Languages | up to 7 | up to 15 | languages treated simultaneously |
Domain size | 100’s of words | 1000’s of words | 4 domains with substantial applications (“substantial” not quantified here) |
Robustness | none | open-text capability | translation quality: “complete” or “useful” on the TAUS scale (Translation Automation Users Society) |
Development per domain | months | days | |
Development per language | days | hours | |
Learning (grammarians) | weeks | days | |
Learning (authors) | days | hours | source authoring: the MOLTO tool for writing translatable controlled text can be learned in less than one hour, the speed of writing translatable controlled text is in the same order of magnitude as writing unlimited plain text |
The figure of 18 grammar library languages is the minimum number of languages we expect to be available at the end of MOLTO. The range of 3 to 15 is the number of languages actually implemented in MOLTO’s domain grammars (3 in WP7, 15 in WP6 and WP8).
The measurements of all these features are performed within WP9 in connection to the project milestones. The advisory group will confirm the adequacy and accuracy of the measurements.
The objects of evaluation – even the translated texts – vary considerably per WP. We detail some criteria per WP below. Evaluation criteria and methods have been collected on the UHEL MOLTO website (esp. https://kitwiki.csc.fi/twiki/bin/view/MOLTO/EvaluationCookbook).
Not much could be done here yet. We do not have patent corpora. The mathematicians have yet to collect their word problems. We received a small museum text corpus from Gothenburg (approx. 25,000 words in Swedish, with a set of 9 short passages translated into English, presumably by non-native speakers).
We have translated parts of this corpus both manually and using MT as test material for BLEU evaluation. A pilot comparing BLEU scores on this material to a manual error analysis is under way.
A small test GF grammar for a sample of the corpus has been written (link). It has helped make the requirements on grammar-ontology interoperability (below) more concrete.
We have also fetched the usual EU multilingual corpora on our test platform (hippu.csc.fi).
We have found time to install an evaluation platform, collect and test standard issue translation quality evaluation tools, to develop forthcoming MOLTO lexicon tools, to learn GF and develop ideas about the ontology to grammar interface. The IQmt evaluation platform was tested on a small sample of machine and human translated text (English into Finnish) (see https://kitwiki.csc.fi/twiki/bin/view/MOLTO/EvaluationCookbook).
UHEL also took part in the MOLTO phrasebook task, a demo for translating touristic phrases between 14 European languages: Bulgarian, Catalan, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Romanian, Spanish, Swedish. This experiment presents one way to evaluate the effort required for adding new language versions (more on this below).
We divide the rest of the paper by WPs into the front end: the translation tool, the use cases and associated lingware (ontologies and grammars), and the back end: the translation system (WPs 2, 4, 5), presented in this order. We also try to form an idea of what the WPs are currently doing, to see how they construe their tasks. Information about this (at least task titles) was found on the MOLTO website.
The MOLTO workflow is a break with tradition in the professional translation business as well as at the consumer end, in that it merges the roles of content author and translator. In professional translation, a document is authored at the source and the translator's work on the source is read-only. At the consumer end, MT is largely used for gisting from unknown languages into familiar ones.
The main impact is expected to be on how the possibilities of translation are viewed in general. The field is currently dominated by open-domain browsing-quality tools (Google translate and Systran), and domain-specific high-quality translation is considered expensive and cumbersome.
MOLTO will change this view by making it radically easier to provide high-quality translation on its scope of application—that is, where the content has enough semantic structure—and it will also widen this scope to new domains. Socioeconomically, this will make web content more available in different languages, including interactive web pages.
At the end of MOLTO, the technology will be illustrated in case studies that involve up to 15 languages with a vocabulary of up to 2,000 special terms (in addition to basic vocabulary provided by the resource grammar).
The generic tools developed in MOLTO will moreover make it possible for third parties to create such translation systems with very little effort. Creating a translation system for a new language covering an unlimited set of documents in a domain will be as smooth (in terms of skill and effort) as creating an individual translation of one document.
(The last sentence sounds like a tall order. But probably it just points out that once MOLTO has been primed for one text it can translate any number of (sufficiently) similar ones.)
The MOLTO change of roles will also entail a change of scenarios.
The translator's new role (parallel to WP3: Translator's tools) will be designed and described in the D9.1 deliverable. Most current translator's workbench software treats the original text as a read-only source. The tools to be developed within WP3 (+ 2) will lead towards a more mutable role for the source text. The translation process will resemble structured document editing or multilingual authoring more than transformation from a fixed source to a number of target languages.
Since the MOLTO scenario implies major differences from the received translation workflow and from the current roles and requirements of translation clients, translators, revisors, etc., MOLTO is not likely to impact the translation business at large in the near future. Instead, it has its chances in entering and creating new workflows, in particular in multilingual web publishing. Multilingual websites are currently developed by crowdsourcing translation with tools borrowed from the software localization business (links). MOLTO could complement or replace this workflow with its new role cast of a content producer or technical editor who generates multilingual content from a single language source. Applications may include multilingual Wikipedia articles, e-commerce sites, medical treatment recommendations, tourist phrasebooks, social media, SMS.
The introductory scenario of this proposal is the multilingual Wiki system presented in (Meza Moreno and Bringert 2008). In this system, users can add and modify reviews of restaurants in three languages (English, Spanish, and Swedish). Any change made in any of the languages gets automatically translated to the other languages.
As for CAT in general, the advantages of MOLTO can be particularly clear in versioning of already existing sites.
We next review user requirements by type of user and the expected expertise of each. The cast of roles around MOLTO can include at least these:
• Author
• Editor
• Translator
• Checker
• Ontologist
• Terminologist
• Grammarian
• Engineer
So far, all of these roles are merged. Different use scenarios may separate some and merge others. Peculiar to MOLTO is the merge of the author/editor/translator roles. In the MOLTO scenario, the editor-translators cannot be expected to know (all) the target language(s). The target checker(s) and terminologist(s)-grammarian(s) are likely to be different from them, possibly a widely distributed crowd.
The translator's tool serves primarily for author/editor/translator/checker roles. It links to TF which serves ontologist/terminologist roles (and connects them to the former). Presumably, the Grammar IDE supports the last four roles on the above list.
The author is likely to be some sort of an expert on the subject matter, but not necessarily an expert on ontology work. The editor, if separate from the author, could be less of a subject expert but possibly more of an ontologist. How much of a difference there need be between these roles depends on the cleverness of the MOLTO tools.
Say an author types away and MOLTO counters with questions caused by the underlying ontology (of the type: do you mean this or that?). Unless the author agrees with the ontology, he may be hard put to answer, while an editor/ontologist (familiar with the ontology and/or the way MOLTO works) may know how to proceed – to choose the right thing, or to realize that the right alternative is missing and how to fix it.
Analogous comments can be made of the relations between author, translator, checker and terminologist. It is all very well for the author to immediately see translations in umpteen languages he does not know. He has no way of knowing whether they are correct (unless MOLTO provides some way for him to check – say back translation with paraphrase?). Also, concrete grammars may ask awkward questions (of the type do you mean male or female, familiar or polite?). To get things right, the author would need to know whether one should be familiar or polite in language N. Here, he needs (to be) a translator or native checker. Considerations like this need to be taken into account in WP3 requirements analysis.
The following lengthy quote from DoW recaps the main ingredients of the translation tools made available to WP3 by WP2.
[9 Translator’s tools in DoW]
For the translator’s tools, there are three different use cases:
• restricted source
  • production of source in the first place
  • modifying source produced earlier
• unrestricted source
Working with a restricted source language recognizable by a GF grammar is straightforward for the translating tool to cope with, except when there is ambiguity in the text. The real challenge is to help the author to keep inside the restricted language. This help is provided by predictive parsing, a technique recently developed for GF (Angelov 2009). Incremental parsing yields word predictions, which guide the author in a way similar to the T9 method in mobile phones. The difference from T9 is, however, that GF’s word prediction is sensitive to the grammatical context. Thus it does not suggest all existing words, but only those words that are grammatically correct in the context.
Predictive parsing is a good way to help users produce translatable content in the first place. When modifying the content later, e.g. in a wiki, it may not be optimal ... This is where another utility of the abstract syntax comes in: [syntax editing]. In the abstract syntax tree, all that is changed is the noun, and the regenerated concrete syntax string automatically obeys all the agreement rules. This functionality is implemented in the GF syntax editor (Khegai & al. 2003).
The predictive parser of GF does not try to resolve ambiguities, but simply returns all alternatives in the parse chart. This is not always a problem, since it may be the case that the target language has exactly the same ambiguity, and then it remains hidden in the translation. In practice this happens often in closely related languages. But if the ambiguity makes a difference in translation, it has to be resolved. There are two complementary approaches: using statistical models for ranking or using manual disambiguation. … For users less versed in abstract syntax, however, a better choice is to show the ambiguities as different translation results. Then the user just has to select the right alternatives. The choice is propagated back in the abstract syntax, which has the cumulative effect that a similar ambiguity in a third language gets fixed as well. This turns out to be very useful in a collaborative environment such as Wikipedia.
Both predictive parsing and syntax editing are core functionalities of GF and work for all multilingual grammars. While the MOLTO project will exploit these functionalities with new grammars, it will also develop them into tools fitting better into users’ work flows. Thus the tools will not require the installation of specific GF software: they will work as plug-ins to ordinary tools such as web browsers, text editors, and professional translators’ tools such as SDL and WordFast.
The snapshot in Figure 2 is from an actual web-based translation prototype using GF. It shows a slot in an HTML page, built by using JavaScript via the Google Web Toolkit (Bringert & al. 2009). The translation is performed in a server, which is called via HTTP. Also client-side translators, with similar user interfaces, can be built by converting the whole GF grammar to JavaScript (Meza Moreno and Bringert 2008).
To deal with unrestricted legacy input, such as in the patent case study, predictive parsing and syntax editing are not enough. The translator will then be given two alternatives: to extend the grammars, or to use statistical translation.
For grammar extension, some functionalities of the grammar writer’s tools are made available to the translator—in particular, lexicon extension (to cope with unknown words) and example-based grammar writing (to cope with unknown syntactic structures). In statistical translation, the worst-case solution is to fall back to phrase-based statistical translation. In MOLTO, we will study ways to specialize this to translation in limited domains, so that the quality is higher than in general-purpose phrase-based translation. We will also study other methods to help translators with unexpected input.
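The predictive parsing described in the quote above is exposed by the GF web service, so a translation tool can fetch completions over HTTP. A minimal sketch, assuming a local `gf -server` instance with a compiled Phrasebook.pgf; the port, endpoint and parameter names follow our understanding of the GF server conventions and should be checked against the installed version:

```python
# Sketch: ask a running GF server (gf -server) for grammatical
# next-word predictions, in the spirit of predictive parsing.
# The URL scheme and parameters are assumptions to be verified
# against the GF version actually deployed.
import urllib.parse
import urllib.request

def complete(prefix, grammar="Phrasebook.pgf", lang="PhrasebookEng"):
    params = urllib.parse.urlencode({
        "command": "complete",   # completion via incremental parsing
        "from": lang,            # concrete syntax the author writes in
        "input": prefix,         # the text typed so far
        "limit": 10,             # at most ten suggestions
    })
    url = "http://localhost:41296/grammars/%s?%s" % (grammar, params)
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

# Unlike T9, only words that are grammatical in context are suggested.
print(complete("this fish is "))
```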
WP3 has its main deliverables at months 18, 24 and 30.
Del. no | Del. title | Nature | Date
---|---|---|---
D 3.1 | MOLTO translation tools API | P | M18
D 3.2 | MOLTO translation tools prototype | P | M24
D 3.3 | MOLTO translation tools / workflow manual | RP, Main | M30
[WP3 in DoW]
The standard working method in current translation tools is to work on the source and translation as a bilingual text. Translation suggestions, sought from a TM (translation memory) based on similarity or generated by an MT system, are presented for the user to choose from and edit manually. The MOLTO translator tool extends this with two additional constrained-language authoring modes, a robust statistical machine translation (UPC) mode, plus vocabulary and grammar extension tools (UGOT), including: (i) a mode for authoring source text, where context-sensitive word completion helps in creating translatable content; (ii) a mode for editing source text using a syntax editor, where structural changes to the document can be performed by manipulating abstract syntax trees; (iii) back-up by robust statistical translation for out-of-grammar input, as developed in WP5; (iv) support for on-the-fly extension by the translator using a multilingual ontology-based lexicon builder; and (v) example-based grammar writing based on the results of WP2.
The WP will build an API (D3.1, UHEL) and a Web-based translator tool (D3.2, by Ontotext and UGOT). The design will allow the usage of the API as a plug-in (UHEL) to professional translation memory tools such as SDL and WordFast. We will apply UHEL’s ContentFactory for distributed repository system and a collaborative workflow for multilingual terminology.
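Since D3.1 has not been delivered yet, we can only guess at the shape of that API. A hypothetical sketch, in which every name is our own invention, of the plug-in interface the editor modes (i)-(v) might call:

```python
# Hypothetical sketch of a WP3 translation tools plug-in API (D3.1).
# None of these names are fixed anywhere; they only illustrate how
# editor modes could be decoupled from the GF/SMT/TM back ends.
from abc import ABC, abstractmethod

class TranslationBackend(ABC):
    """One back end behind the editor: GF, SMT fallback, or TM lookup."""

    @abstractmethod
    def complete(self, prefix: str, source_lang: str) -> list[str]:
        """Context-sensitive word completion for constrained authoring."""

    @abstractmethod
    def translate(self, text: str, source_lang: str,
                  target_langs: list[str]) -> dict[str, str]:
        """Translate in-grammar input into all target languages at once."""

    @abstractmethod
    def add_term(self, term: str, translations: dict[str, str]) -> None:
        """On-the-fly lexicon extension, to be synchronized with TF."""
```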
This is what we say about the eventual translation platform in DoW (section numbering 1.2.5 seems a random error):
1.2.5 Multilingual services
MOLTO will provide a unique platform for multilingual document management, satisfying the five desired features listed in Section 1.1. [?] It will enable truly collaborative creation and maintenance of content, where input provided in any language of the system is immediately ported to the other languages, and versions in different languages are thereby kept in synchrony. This idea has had previous applications in GF (Dymetman & al. 2000, Khegai & al. 2003, Meza Moreno and Bringert 2008). In MOLTO, it will be developed into a technology that can be readily applied by non-experts in GF to any domain that allows for an ontology-based interlingua.
The methodology will be tested on three substantial domains of application: mathematics teaching material, patents, and museum object descriptions. These case studies are varied enough to show the generalisability of the MOLTO technology, and also extensive enough to produce useful prototypes for end users of translations: mathematics students, intellectual property researchers, and visitors to museums. End users will have access in their own languages to information that may be originally produced in other languages.
This does not actually say that all three use cases use one and the same platform (unless 'unique' means just one). It is not even certain that they want the same features. The mathematicians are likely to need a math editing tool and perhaps access to a computational algebra solver. Patent translators may need access to patent corpora and databases. Museum people may need to work with images. Future MOLTO users may have their own favourite platforms with such facilities in place.
Rather, the WP3 translation tools deliverable should be a set of plugins usable in many different platforms, in turn variously using the common GF back-end plugins listed above.
Still, we need a flagship demonstrator for the project. The flagship demonstrator should be a generic web editing platform. Minimally, it can be an extension of the existing GF web translation demo. In the best case, it could be installed as a set of plugins to some existing web platform like Mediawiki, Drupal and/or some open source CAT tool(s).
The demonstrator should be able to have at least the following plugins:
• GF translation editor (including autocompletion and syntax editing)
• GF grammar IDE
• TF ontology/lexicon manager
• Ontotext ontology tools (if separate from above)
• SMT translator (if separate from above)
• TM (translation memory)
The TM on the list is a stand-in for tools to support non-constrained editing. (It appears that some use cases will need to mix GF translation with manual (CAT or SMT supported) translation.)
All or parts of some existing web translation/localization platform(s) could be taken as starting point. Or conversely, some existing CAT tool components could be plugged into ours. (The latter plan may now seem more promising.)
Translator’s tools promised by WP2 include
• text input + prediction (= autocompletion from grammar)
• syntax editor for modification
• disambiguation
• on the fly extension
The MOLTO workflow and role play must be spelled out in the grammar tool manual (D 2.3) and the MOLTO translation tools / workflow manual (D 3.3). We should start writing these manuals now, to fix and share our ideas about the user interfaces.
The main claims to fame of MOLTO are to produce high automatic translation quality, particularly in view of faithfulness, into multiple languages from one pre-editable source, and, as a way to that, practically (i.e. economically) feasible multilingual online translation editing with a minimum of training:
[DoW]
The expertise needed for using the translation system will be minimal, due to the guidance provided by MOLTO.
Feature | Current | Projected
---|---|---
Learning (authors) | days | hours
These claims should then be among the items to evaluate.
Quantified evaluation of translation tool features makes sense starting with the translation tool prototype developed in WP3 (M24). The tests can be developed and calibrated on the initial demonstrator at M18.
We distinguish below between evaluating the translation result and evaluating the translation process.
3a. Evaluating the translation result
We argue below that there is little sense for WP9 to quantitatively measure MOLTO translation quality with standard MT eval tools except at the end of MOLTO (D 9.2). On the way there, WPs (in particular the GF grammar and SMT WPs) should institute their own progress evaluation schedules. They may then outsource translation quality evaluations to WP9 when appropriate. What we want to avoid is an externally imposed evaluation drill during WP work which can produce skewed results and cause useless delays on the way.
We have created a UHEL MOLTO TWiki website to coordinate our workpackages internally (link). The website is open for other MOLTO partners as well.
We have installed standard SMT evaluation tools (hippu.csc.fi). A pilot study on measuring translation fidelity has been conducted in a PhD project associated with MOLTO (Maarit Koponen).
This is what MOLTO promised in the DoW about translation quality assessment:
To measure the quality of MOLTO translations, we compare them to
(i) statistical and symbolic machine translation (Google, SYSTRAN); and
(ii) human professional translation.
We will use both
• automatic metrics (IQmt and BLEU; see section 1.2.8 for details (???)) and
• TAUS quality criteria (Translation Automation Users Society).
As MOLTO is focused on information-faithful grammatically correct translation in special domains, TAUS results will probably be more important.
Given MOLTO’s symbolic, grammar-based interlingual approach, scalability, portability and usability are important quality criteria.
These criteria are quantified in (D9.1) and reported in the final evaluation (D9.2).
In addition to the WP deliverables, there will be continuous evaluation and monitoring with internal status reports according to the schedule defined in D9.1.
The criteria (scalability, portability, and usability) mean that MOLTO should have wider coverage, be easier to extend and need less expertise than similar (symbolic, grammar-based, interlingual) solutions heretofore.
[12 Translation quality]
We will compare the results of MOLTO to other translation tools, by using both automatic metrics (BLEU, Bilingual Evaluation Understudy, Papineni & al. 2002) and, in particular, the human evaluation of “utility”, as defined by TAUS. The comparison is performed with the freely available general-purpose tools Google translate and Systran. While the comparison is “unfair” in the sense that MOLTO is working with special-purpose domain grammars, we want to perform measurements that confirm that MOLTO’s quality really is essentially better. Comparisons with domain-specific systems will be performed as well, if any such systems can be found. Domain-specific translation systems are still rare and/or not publicly available.
Regarding automatic metrics for MT, the usage of lexical n-gram based metrics (WER, PER, BLEU, NIST, ROUGE, etc.) represents the usual practice in the last decade. However, recent studies showing some limitations of lexical metrics at capturing certain kinds of linguistic improvements and at making appropriate rankings of heterogeneous MT systems (Callison-Burch et al. 2006; Callison-Burch et al. 2007; Callison-Burch et al. 2008; Giménez 2008) have fostered research on more sophisticated metrics, which can combine several aspects of syntactic and semantic information. The IQmt suite, developed by the UPC team, is one of the examples in this direction (Giménez and Amigó 2006; Giménez and Màrquez 2008). In IQmt, a number of automatic metrics for MT, which exploit linguistic information from morphology to semantics, are available for the English language and will be extended to other languages (e.g., Spanish) soon. These metrics are able to capture more subtle improvements in translation and show high correlation with human assessments (Giménez and Màrquez 2008; Callison-Burch et al. 2008). We plan to use IQmt in the development cycle whenever possible. For languages not covered in IQmt, we will rely on BLEU (Papineni et al. 2002).
Regarding human evaluation, the TAUS method is the more appropriate one for the MOLTO tasks, since we are aiming for reliable rendering of information. It consists of inspection of a significant number of source/target segments to determine the effectiveness of information transfer. The evaluator first reads the target sentence, then reads the source to determine whether additional information was added or misunderstandings identified.
The scoring method is as follows:
4. Complete: All of the information in the source was available from the target; reading the source did not add to information or understanding.
3. Useful: The information in the target was correct and clear, but reading the source added some additional information or understanding.
2. Marginal: The information in the target was correct, but reading the source provided significant additions or clarifications.
1. Poor: The information in the target was unclear and/or incorrect; reading the source would be necessary for understanding.
We aim to reach “complete” scores in mathematics and museum translation, and “useful” scores in patent translation.
Dimensions not mentioned in the TAUS scoring are “grammaticality” and “naturalness” of the produced text. The grammar-based method of MOLTO will by definition guarantee grammaticality; failures in this will be fixed by fixing the grammars. Some naturalness will be achieved in the sense of “idiomaticity”: the compile-time transfer technique presented in Section 1.2.3 will guarantee that forms of expression which are idiomatic for the domain are followed. The higher levels of text fluency reachable by Natural Language Generation techniques such as aggregation and referring expression selection have been studied in some earlier GF projects, such as (Burke and Johannisson 2005). Some of these techniques will be applied in the mathematics and cultural heritage case studies, but the main focus is just on rendering information correctly. On all these measures, we expect to achieve significant improvements in comparison to the available translation tools, when dealing with in-grammar input.
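Aggregating TAUS-style judgments is mechanical. A small sketch of the bookkeeping, where the numeric targets encode our reading of the aims quoted above (Complete = 4 for mathematics and museum, Useful = 3 for patents); the judgment data are invented:

```python
# Sketch: tabulate TAUS-style utility judgments (4 = Complete ... 1 = Poor)
# per use case and check them against the targets quoted above.
from statistics import mean

TARGETS = {"mathematics": 4, "museum": 4, "patents": 3}

def report(judgments: dict[str, list[int]]) -> None:
    for use_case, scores in judgments.items():
        avg = mean(scores)
        target = TARGETS[use_case]
        status = "meets target" if avg >= target else "below target"
        print(f"{use_case}: mean {avg:.2f} (target {target}) -> {status}")

report({
    "mathematics": [4, 4, 3, 4],
    "museum": [4, 3, 4, 4],
    "patents": [3, 3, 2, 4],
})
```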
Applying BLEU and similar methods, which compare MT output to human model translations, promises to be laborious in the case of MOLTO, because we have a large number of less common target languages and lack use case related corpora. Though we do not yet know exactly what corpora we shall have access to, they are not likely to provide a wealth of (preferably multiple parallel) human model translations for comparison in the special domains we have:
• We expect the mathematics WP to involve a small number (tens or hundreds) of short (one-paragraph) examples.
• The museum corpus (at least so far) is not much larger (25K words in all). The largest subset is Swedish only.
• We do not know yet what to expect from the patent partner.
The main difficulty for automatic comparison measures is ambiguity in natural languages: usually there is more than one correct translation for a source sentence; there are ambiguities in the choice of synonyms as well as in the order of words. Allowance for free variation through synonymy and paraphrase (free translation in general) is made by using more comparison text. For instance, the NIST evaluation campaign uses four parallel translations (into the same language) of texts on the order of 15-20K words.
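Where we do fall back on BLEU, off-the-shelf implementations and multiple references are the standard remedy for this variation. A minimal sketch with NLTK's BLEU implementation; the sentences are invented:

```python
# Minimal BLEU sketch with NLTK; the sentences are invented examples.
# Several reference translations can be supplied to allow for
# legitimate variation in word choice and order.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "the museum object is made of wood".split(),
    "the artefact in the museum is wooden".split(),
]
hypothesis = "the museum artefact is made of wood".split()

# Smoothing avoids zero scores on short segments with no 4-gram match.
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print("BLEU: %.3f" % score)
```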
What is more to the point, BLEU results are not likely to prove MOLTO's strengths, because they are not sensitive to fidelity, being in this respect like the n-gram SMT methods they simplify. Preliminary tests to this effect have been conducted by Maarit Koponen (links).
BLEU and similar tests have been developed in the context of SMT and for the assimilation (gisting) scenario. Most of the weight in BLEU or WER like measures comes from matched words and shorter n-grams. These measures point in the right direction as long as translation quality is low (as long as long distance dependencies and fidelity do not matter).
The distinction between fluency and fidelity in human evaluation measures is not made for automatic evaluation measures. Each such measure is considered to judge the overall quality of a candidate sentence or system, rather than the quality with respect to certain aspects. Leusch (link) shows that some measures have preferences for certain aspects – the unigram PER correlates with adequacy to a higher degree than the bigram PER, whereas the reverse holds for fluency – but the observation remains to be exploited.
To evaluate fidelity as well as fluency, more grammar sensitive measures are needed. In smaller use cases, human evaluation is likely to be the cost effective solution (link). An innovative approach suggested by work in Koponen (to appear) would be to develop the MOLTO evaluation methodology using MOLTO's own technology. The idea is to use simplified (MOLTO or other) parsing grammars to test fidelity and domain ontologies to test fluency.
Fidelity (preservation of grammatical relations) would be gauged by using simplified grammars to parse summaries of text and comparing MOLTO translations of summaries with summaries of translations. The assumption is (like it implicitly is in BLEU) that the translator is more reliable with shorter bits (and there are more of them).
Acceptability of lexical variation in the target text would be checked (not against parallel human translations but) against multilingual domain ontologies (e.g., use vessel or boat instead of ship).
Note the analogy here to BLEU's use of n-grams as a simplification of SMT methods to compare SMT to human targets. Work developing these ideas is in progress in a PhD project associated to MOLTO (Koponen to appear). The planned GF/SMT hybrid system is interesting here. It suggests analogous ideas for hybridizing statistical and grammar based evaluation measures.
At the evaluation phase towards the end of MOLTO, a comparison of (say) the patent case output to competing methods using generic tools like the SMT evaluation tools and TAUS criteria is worth doing, and has been promised in the DoW. On the way there, however, we prefer developing and applying MOLTO specific evaluation methods.
UHEL needs to synchronise evaluation plans with the SMT workpackage.
3b. Evaluating the translation process
WP9 aims to set requirements on and evaluate the MOLTO translation workflow from the beginning. We argue below that evaluating the translation workflow and translator productivity is particularly important in MOLTO. For related work in other projects, see https://kitwiki.csc.fi/twiki/bin/view/MOLTO/EvaluationCookbook. Our initial proposals follow below.
The MOLTO pre-editing strategy lets an author or technical editor modify the text, the translator enrich the vocabulary, and the grammarians perfect the grammar until the translation result is acceptable. Therefore the success criterion for the MOLTO approach must be how much effort it takes to get a translation from the initial state to a break-even point (as defined by the use case). A translation can always be made better with more work on the tool, but the crux is whether the result repays the effort. The DoW sets these quantitative expectations on source editing:
1. source authoring: the MOLTO tool for writing translatable controlled text can be learned in less than one hour, the speed of writing translatable controlled text is in the same order of magnitude as writing unlimited plain text
“Of the same order” mathematically means that writing with MOLTO is not ten times slower than writing without it. We should clock this.
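A sketch of the clocking, with invented numbers; the factor-of-ten threshold is the reading just given, not a DoW figure:

```python
# Sketch: check the "same order of magnitude" claim by comparing
# writing speeds (words per minute) with and without the MOLTO tool.

def words_per_minute(word_count: int, seconds: float) -> float:
    return word_count / (seconds / 60.0)

plain = words_per_minute(word_count=300, seconds=600)   # unlimited plain text
molto = words_per_minute(word_count=300, seconds=2400)  # controlled text in MOLTO

slowdown = plain / molto
print(f"slowdown factor: {slowdown:.1f}x")
print("within an order of magnitude" if slowdown < 10 else "more than 10x slower")
```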
We pick up this discussion again under WP2 in connection with measuring the vocabulary and grammar extension effort.
The description of this case study in the DoW and on the MOLTO website makes apparent that the math use case demonstrator is not so much a translation editor as a natural language front end to computer algebra.
Leader: jordi.saludes
Timeline: July, 2010 - May, 2012
The ultimate goal of this package is to have a multilingual dialog system able to help the math student in solving word problems.
The UPC team, being a main actor in the past development of GF mathematical grammars and having ample experience in mathematics teaching, will be in charge of the tasks in this work package with help from UGOT and UHEL on technical aspects of GF and translator’s tools, along with Ontotext on ontology representation and handling. We will start by compiling examples of word problems. In parallel, we will take the mathematical multilingual GF library which was developed in the framework of the WebALT project and organize the existing code into modules, remove redundancies and format them in a way acceptable for enhancement by way of the grammar developer’s and translator’s tools of work packages 2 and 3 (D6.1). The next step will be writing a GF grammar for commanding a generic computer algebra system (CAS) by natural language imperative sentences and integrating it into a component (D6.2) to transform the commands issued to the CAS (maybe as a browser plug-in). For the final deliverable (D6.3), we will use the outcome of work package 4 to add small ontologies describing the word problems: we will end with a multilingual system able to engage the student in a dialog about the progress being made in solving the problem. It will also help in performing the necessary computations.
The impression is confirmed by an email from Jordi Saludes:
"The simplest implementation will be a terminal-based question/answer
system like ELIZA, but focused on solving word problems. It will start by
giving the statement of the problem, then it will do computations for the
student/user, list unknowns, list relations between unknowns, state the
progress of the resolution and, maybe, give hints.
We are thinking about the kind of word problems which require solving a system
of (typically two) linear equations. In Spain these are addressed to first or
second year high school students."
On the way to the demonstrator, the plan is to devise small ontologies describing math word problems and verbalise them using the MOLTO platform and WebAlt project math GF grammars. These phases of the work can be evaluated on the lines indicated under WP2-3. Since the corpus is small, manual quality evaluation using TAUS criteria is appropriate. We need to buy TAUS criteria if we are not getting them from the patent partner.
ID | | Task leader | Status | New comments
---|---|---|---|---
6.0 | | | Hold |
6.1 | | | Planned |
6.2 | | | Planned |
6.3 | | | Ongoing |
6.4 | | | Planned |
6.5 | | | Planned |
6.6 | | | Planned |
6.7 | | | Planned |
6.8 | | | Planned |

ID | | Due date | Dissemination level | Nature | Publication
---|---|---|---|---|---
D6.1 | | 1 June, 2011 | Public | Prototype |
The description of this use case is on hold pending a new partner. There is another EU project about translating patents, PLuTO. One way to assess MOLTO could be to compare our results to theirs.
PLuTO will develop a rapid solution for patent search and translation by integrating a number of existing components and adapting them to the relevant domains and languages. CNGL bring to the target platform a state-of-the-art translation engine, MaTrEx, which exploits hybrid statistical, example-based and hierarchical techniques and has demonstrated high quality translation performance in a number of recent evaluation campaigns. ESTeam contributes a comprehensive translation software environment to the project, including server-based, multi-layered, multi-domain translation memory technology. Information retrieval expertise is provided by the IRF which also provides access to its data on patent search use-cases and a large scale, multilingual patent repository. PLuTO will also exploit the use-case holistic machine translation expertise of Cross Language, who have significant experience in the evaluation of machine translation, while WON will be directly involved in all phases of development, providing valuable user feedback. The consortium also intends to collaborate closely with the European Patent Office in order to profit from their experience in this area.
WP No: 8 | Leader: UGOT | Start: M13 | End: M30
WP Title: Case Study: Cultural Heritage
The objective is to build an ontology-based multilingual grammar for museum information, starting from a CRM ontology for artefacts at Gothenburg City Museum[1], using tools from WP4 and WP2. The grammar will enable descriptions of museum objects and answering queries over them, covering 15 languages for baseline functionality and 5 languages with more complete coverage. We will moreover build a prototype of a cross-language retrieval and representation system to be tested with objects in the museum, and automatically generate Wikipedia articles for museum artefacts in the 5 languages with extensive coverage.
The work starts with a study of the existing categorizations and metadata schemas adopted by the museum, as well as a corpus of texts in the current documentation which describe these objects (D8.1, UGOT and Ontotext). We will transform the CRM model into an ontology, aligning it with the upper-level one in the base knowledge set (WP4), and model the museum object metadata as a domain specific knowledge base. Through the interoperability engine from WP4 and the IDE from WP2, we will semi-automatically create the translation grammar and further extend it (D8.2, UGOT, UHEL, UPC, Ontotext). The final result will be an online system enabling museum (virtual) visitors to use their language of preference to search for artefacts through semantic (structured) and natural language queries and examine information about them. We will also automatically generate a set of articles in the Wikipedia format describing museum artefacts in the 5 languages with extensive grammar coverage (D8.3, UGOT, Ontotext).
Del. no | Del. title | Nature | Date
---|---|---|---
D 8.1 | Ontology and corpus study of the cultural heritage domain | O | M18
D 8.2 | Multilingual grammar for museum object descriptions | P | M24
D 8.3 | Translation and retrieval system for museum object descriptions | P, Main | M30
The CIDOC Conceptual Reference Model (CRM) is a high-level ontology to enable information integration for cultural heritage data and their correlation with library and archive information. The CIDOC CRM is now in the process of becoming an ISO standard.
The CIDOC CRM analyses the common conceptualizations behind data and metadata structures to support data transformation, mediation and merging. It is property-centric, in contrast to terminological systems. It is now in a very stable form, and contains 80 classes and 130 properties, both arranged in multiple isA hierarchies.
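To make the structured-query side of the planned retrieval system (D8.3) concrete: a sketch of the kind of query it might answer over a CRM-modelled knowledge base. E22 Man-Made Object and P102 has title are genuine CRM terms, but the namespace URI, file name and data set here are illustrative assumptions:

```python
# Sketch: a structured (SPARQL) query over a CRM-modelled museum
# knowledge base. The ontology file and its contents are hypothetical.
from rdflib import Graph

g = Graph()
g.parse("museum.ttl", format="turtle")  # hypothetical knowledge base export

QUERY = """
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
SELECT ?object ?title WHERE {
  ?object a crm:E22_Man-Made_Object ;
          crm:P102_has_title ?title .
}
"""
for row in g.query(QUERY):
    print(row.object, row.title)
```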
The Semantic Computing Research Group (SeCo, Eero Hyvönen) has an ontology for the museum domain (MAO), used for describing content such as museum items. MAO is ontologically mapped to the Finnish General Upper Ontology YSO and has been created as part of the FinnONTO project. The most important application of MAO is the semantic portal for Finnish culture, Kulttuurisampo. SeCo is specialised in indexing websites with ontologies. They are currently translating their ontologies into Finnish and Swedish.
To be completed...
The deliverables promised from WP2:
ID | | Due date | Dissemination level | Nature | Publication
---|---|---|---|---|---
D2.1 | | 1 March, 2011 | Public | Prototype |
D2.2 | | 1 September, 2011 | Public | Prototype |
D2.3 | | 1 March, 2012 | Public | Regular Publication |
[this comes from the MOLTO website:]
The objective is to develop a tool for building domain-specific grammar-based multilingual translators. This tool will be accessible to users who have expertise in the domain of translation but only limited knowledge of the GF formalism or linguistics. The tool will integrate ontologies with GF grammars to help in building an abstract syntax. For the concrete syntax, the tool will enable simultaneous work on an unlimited number of languages and the addition of new languages to a system. It will also provide linguistic resources for at least 15 languages, among which at least 12 are official languages of the EU.
The top-level user tool is an IDE (Integrated Development Environment) for the GF grammar compiler. This IDE provides a test bench and a project management system. It is built on top of three more general techniques: the GF Grammar Compiler API (Application Programmer’s Interface), the GF-Ontology mapping (from WP4), and the GF Resource Grammar Library. The API is a set of functions used for compiling grammars from scratch and also for extending grammars on the fly. The Library is a set of wide-coverage grammars, which is maintained by an open source project outside MOLTO but will, via MOLTO efforts, be made accessible to programmers at lower levels of linguistic expertise. Thus we rely on the available GF resource grammar library and its documentation, available through digitalgrammars.com/gf/lib. The API is also used in WP3, as a tool for limited grammar extension, mostly with lexical information but also for example-based grammar writing. UGOT designs the APIs and the IDE, coordinates work on grammars of individual languages, and compiles the documentation. UHEL contributes to terminology management and work on individual languages. UPC contributes to work on individual languages. Ontotext works on the Ontology-Grammar interface and contributes to the ontology-related part of the IDE.
Here we try to make a bit clearer what the functionalities of the WP2 tools are, and how they relate to the translator's tool.
We surmise that the grammar compiler's IDE is meant primarily for grammarian/engineer roles, i.e. for extending the system to new domains and languages. But it may contain facilities or components which are also relevant for the translation tool. In many scenarios, we must allow the translator to extend the system, i.e. switch to some of the last four roles. Just how the translation tool is linked to the grammar IDE needs specifying.
What the average user can do to fix the translation depends on how user friendly we can get. Minimally, a translator only supplies a missing translation on the fly, and all necessary adaptation is handled by the system. Maximally, an ontology or grammar needs extending as a separate chore by hand, using the grammar IDE.
An author/editor/translator can be expected to translate with the given lingware. The next level of involvement is extending the translation. This may cause entries or rules to be added to a text, company, or domain specific ontology/lexicon/grammar. If the tool is used in an organization, roles may be distributed to different people and questions of division of labor and quality control (as addressed in TF) already arise.
For it is not only, or even primarily, a question of being able to change the grammar technically, but of managing the changes. A change in the source may cause improvement in some languages, deterioration in others. The author can't possibly check the repercussions in all languages. Assume each user site makes its own local changes. How many different versions of MOLTO lingware will there be? One for each website maintained with MOLTO? How can sites share problems and solutions? A picture of a MOLTO community not unlike the one envisaged for multilingual ontology management in TF starts to form. The challenge is analogous to ontology evolution. There are hundreds of small university ontologies in Swoogle. Quality can be created in the crowd, but there must be an organisation for it (cf. Wikipedia).
The MOLTO workflow and role play must be spelled out in the grammar tool manual (D 2.3) and the MOLTO translation tools / workflow manual (D 3.3). We should start writing these manuals now, to fix and share our ideas about the user interfaces.
The way disambiguation now works is that translation of a vague source against a finer grained target generates the alternative translations with disambiguating metatext to help choose the intended meaning. (Try "I love you" in http://www.grammaticalframework.org/demos/phrasebook/; compare Boitet et al.'s 1993 dialogue-based MT system Lidia, e.g. http://www.springerlink.com/content/kn8029t181090028/.)
This facility could link to the ontology as a source of disambiguating metatext, either from meta comments or directly verbalised from the ontology.
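A runtime-agnostic sketch of this selection loop; `parse_all`, `linearize` and `ask_user` are placeholders for whatever GF runtime and UI the tool ends up using, not a MOLTO API:

```python
# Sketch of manual disambiguation: ambiguity is surfaced only when it
# makes a difference in the target language. The chosen abstract tree
# can then be propagated to fix the same ambiguity in other languages.

def disambiguate(source, parse_all, linearize, ask_user):
    trees = parse_all(source)        # all alternatives from the parse chart
    variants = {}                    # target string -> abstract syntax tree
    for tree in trees:
        variants.setdefault(linearize(tree), tree)
    if len(variants) == 1:
        # The ambiguity (if any) stays hidden in the translation.
        string = next(iter(variants))
    else:
        # Show the alternatives, possibly with disambiguating metatext
        # verbalised from the ontology, and let the user pick one.
        string = ask_user(sorted(variants))
    return string, variants[string]
```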
Some of the GF 3.2 features, like parse ranking and example-based grammar generation, have consequences for front end design, as enabling technology.
[11 Productivity and usability]
Our case studies should show that it is possible to build a completely functional high-quality translation system for a new application in a matter of months—for small domains in just days.
The effort to create a system dynamically applicable to an unlimited number of documents will be essentially the same as the effort it currently takes to manually translate a set of static documents.
The expertise needed for producing a translation system will be low, essentially amounting to the skills of an average programmer who has practical knowledge of the targeted language and of the idiomatic vocabulary and syntax of the domain of translation.
1. localization of systems: the MOLTO tool for adding a language to a system can be learned in less than one day, and the speed of its use is in the same order of magnitude as translating an example text where all the domain concepts occur
The role requirements for extending the system remain quite high, not because of the requirements on the individual skills, but because it is less common to find their combination in one person.
The user requirements entail an important evaluation criterion: the guidance provided by MOLTO. It should also lead to system requirements, like online help, examples, profiling capabilities.
One part of MOLTO adaptivity is meant to come from the grammar IDE. Another part should come from ontologies. While the former helps extending GF “internally”, the latter should allow bringing in semantics and vocabulary from OWL ontologies. We discuss these two parts in this order.
[8 Grammar engineering for new languages in DoW]
In the MOLTO project, grammar engineering in GF will be further improved in two ways:
• An IDE (Integrated Development Environment), helping programmers to use the RGL and manage large projects.
• Example-Based Grammar Writing, making it possible to bootstrap a grammar from a set of example translations.
The former tool is a standard component of any library-based software engineering methodology. The latter technique uses the large-coverage RGL for parsing translation examples, which leads to translation rule suggestions.
The task of building a new language resource from scratch is currently described in http://grammaticalframework.org/doc/gf-lrec-2010.pdf. As this is largely a one-shot language engineering task outside of MOLTO (MOLTO was supposed to have its basic lingware done ahead of time), it should not call for evaluation here.
Building a multilingual application for a given abstract domain grammar by way of applying and extending concrete resource grammars can use a lighter process. The proposed example-based grammar writing process is described in the Phrasebook deliverable (http://www.molto-project.eu/node/1040). The tentative conclusions were:
• The grammarian need not be a native speaker of the language. For many languages, the grammarian need not even know the language, native informants are enough. However, evaluation by native speakers is necessary.
• Correct and idiomatic translations are possible.
• A typical development time was 2-3 person working days per language.
• Google translate helps in bootstrapping grammars, but must be checked. In particular, we found it unreliable for morphologically rich languages.
• Resource grammars should give some more support, e.g. higher-level access to constructions like negative expressions, and large-scale morphological lexica.
Effort and Cost
Based on this case study, we roughly estimated the effort used in constructing the necessary sources for each new language and compiled the following summarizing chart.
Language | Language skills | GF skills | Informed development | Informed testing | Impact of external tools | RGL Changes | Overall effort
---|---|---|---|---|---|---|---
Bulgarian | ### | ### | - | - | ? | # | ##
Catalan | ### | ### | - | - | ? | # | #
Danish | - | ### | + | + | ## | # | ##
Dutch | - | ### | + | + | ## | # | ##
English | ## | ### | - | + | - | - | #
Finnish | ### | ### | - | - | ? | # | ##
French | ## | ### | - | + | ? | # | #
German | # | ### | + | + | ## | ## | ###
Italian | ### | # | - | - | ? | ## | ##
Norwegian | # | ### | + | - | ## | # | ##
Polish | ### | ### | + | + | # | # | ##
Romanian | ### | ### | - | - | # | ### | ###
Spanish | ## | # | - | - | ? | - | ##
Swedish | ## | ### | - | + | ? | - | ##
The phrasebook deliverable is one simple example of what can be done to evaluate the grammar workpackage's promises. The results from the Phrasebook experiment may be positively biased because the test subjects were very well qualified. But this and similar tests can be repeated with more “ordinary people”, and changes in the figures followed as the grammar IDE develops.
It could be instructive to repeat the exact same test with different subjects and compare the solutions, to see how much creativity was involved in them. The less variation there is, the better the chances to automate the process. Even failing that, analysis of the variant solutions could help suggest guidelines and best practices for the manual. Possible variation here also raises the issue of managing changes in a community of users.
Ontotext contributions to MOLTO through WP4 are
• Semantic infrastructure
• Ontology-grammar interoperability
The semantic infrastructure in MOLTO will also act as a central multi-paradigm index for (i) conceptual models—upper-level and domain ontologies; (ii) knowledge bases; (iii) content and metadata as needed by the use cases (mathematical problems, patents, museum artefact descriptions); and provide NL-based and semantic (structured) retrieval on top of all modalities of the data modelled.
In addition to the traditional triple model for describing individual facts,
<subject, predicate, object>
the semantic infrastructure will build on quintuple-based facts,
<subject, predicate, object, named graph, triple set>
The infrastructure will include: an inference engine (TRREE), a semantic database (OWLIM), a semantic data integration framework (ORDI) and a multi-paradigm semantic retrieval engine, all of which are previous work, resulting from private (Ontotext) and public funding (TAO, TripCom). This approach will enable MOLTO’s baseline and use case driven knowledge modelling with the necessary expressivity of metadata-about-metadata descriptions for provenance of the diverse sources of structured knowledge (upper-level, domain specific and derived (from grammars) ontologies; thesauri; domain knowledge bases; content and its metadata).
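Standard RDF tooling already covers the fourth element, the named graph; a sketch with rdflib quads (the fifth element, the triple set, is Ontotext-specific and has no direct counterpart here; the namespace and data are invented):

```python
# Sketch: facts with provenance recorded via named graphs (quads).
# The fifth element of MOLTO's quintuples, the triple set, is
# Ontotext-specific and is not modelled by standard RDF tools.
from rdflib import Dataset, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")
ds = Dataset()

# Each named graph records where a batch of facts came from,
# e.g. a domain ontology vs. facts derived from grammars.
museum = ds.graph(URIRef("http://example.org/graphs/museum-kb"))
museum.add((EX.item42, EX.madeOf, Literal("wood")))

for s, p, o, g in ds.quads((None, None, None, None)):
    print(s, p, o, "in", g)
```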
From Ontotext webpages, we can guess that the infrastructure builds on the following technologies:
• KIM is a platform for semantic annotation, search, and analysis
• OWLIM is the most scalable RDF database with OWL inference
• PROTON is a top ontology developed by Ontotext.
Milestone MS2 says the knowledge representation infrastructure is opened for retrieval access to partners at M6. The infrastructure deliverable D4.1 is due at M8.
[7 Grammar-ontology interoperability for translation and retrieval in DoW]
At the time of the TALK project, an emerging topic was the derivation of dialogue system grammars from OWL ontologies. A prototype tool for extracting GF abstract syntax modules from OWL ontologies was built by Peter Ljunglöf at UGOT. The tool was implemented as a plug-in to the Protégé system for building OWL ontologies, and was intended to help programmers with an OWL background to build GF grammars. Even though the tool remained a prototype within the TALK project, it can be seen as a proof of concept for the more mature tools to be built in the MOLTO project.
A direct way to map between ontologies and GF abstract grammars is a mapping between OWL and GF syntaxes.
In slightly simplified terms, the OWL-to-GF mapping translates OWL’s classes to GF’s categories and OWL’s properties to GF’s functions that return propositions. As a running example in this and the next section, we will use the class of integers and the two-place property of being divisible (“x is divisible by y”). The correspondences are as follows:
Class(pp:integer ...)      <==>  cat integer ;

ObjectProperty(pp:div      <==>  fun div : integer -> integer -> prop ;
    domain(pp:integer)
    range(pp:integer))
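The following is a minimal Python sketch of this syntax-directed mapping, assuming a toy in-memory representation of the OWL declarations; the actual OWL-to-GF tool operates on full OWL models and is naturally more elaborate.

```python
# Toy in-memory form of the OWL declarations from the running example.
classes = ["integer"]                             # Class(pp:integer ...)
properties = [("div", ["integer", "integer"])]    # ObjectProperty(pp:div ...)

def to_gf_abstract(classes, properties):
    lines = []
    for c in classes:
        # Each OWL class becomes a GF category.
        lines.append("cat %s ;" % c)
    for name, args in properties:
        # Each OWL property becomes a GF function returning a proposition.
        arrow = " -> ".join(args + ["prop"])
        lines.append("fun %s : %s ;" % (name, arrow))
    return "\n".join(lines)

print(to_gf_abstract(classes, properties))
# cat integer ;
# fun div : integer -> integer -> prop ;
```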
Less syntax-directed mappings may be more useful, depending on what information is relevant to pass between the two formalisms. The mapping is then also less generic, as it depends on the intended use and interpretation of the ontology. The mapping through SPARQL queries below is one example; a mapping via TF (TermFactory, see below) could be another.
The GF-Protégé plug-in brings us to the development cost problem of translation systems. We have noticed that in the GF setting, building a multilingual translation system is equivalent to building a multilingual GF grammar, which in turn consists of two kinds of components (a minimal usage sketch follows the list):
• a language-independent abstract syntax, giving the semantic model via which translation is performed;
• for each language, a concrete syntax mapping abstract syntax trees to strings in that language.
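With both kinds of components compiled into a single grammar, translation amounts to parsing with one concrete syntax and linearizing with another. A minimal sketch using the GF runtime's Python bindings, assuming a compiled Phrasebook.pgf; the grammar file and language names are illustrative:

```python
import pgf

# Load a compiled multilingual grammar (abstract + concrete syntaxes).
grammar = pgf.readPGF("Phrasebook.pgf")
source = grammar.languages["PhrasebookEng"]   # English concrete syntax
target = grammar.languages["PhrasebookSwe"]   # Swedish concrete syntax

sentence = "this fish is delicious"
prob, tree = next(source.parse(sentence))     # string -> abstract syntax tree
print(target.linearize(tree))                 # tree -> target-language string
```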
In MOLTO, GF abstract syntax can also be derived from sources other than OWL (e.g. from OpenMath in the mathematical case study) or even written from scratch and then possibly translated into OWL ontologies, if the inference capabilities of OWL reasoning engines are desired. The CRM ontology (Conceptual Reference Model) used in the museum case study is already available in OWL.
MOLTO’s ontology-grammar interoperability engine will thus help in the construction of the abstract syntax by automatically or semi-automatically deriving it from an existing ontology. The mechanical translation between GF trees and OWL representations then forms the basis of using GF for translation in the Semantic Web context, where huge data sets become available in RDF and OWL through initiatives like Linked Open Data (LOD).
The interoperability between GF and ontologies will also provide humans with natural ways of interaction with knowledge based systems in multiple languages, expressing their need for information in NL and receiving the matching knowledge expressed in NL as well:
Human -> NL -> GF -> ontology -> GF -> NL -> Human
providing an entirely new dimension to the usability of semantics-based retrieval systems, and opening extensive structured bodies of knowledge in human understandable ways.
Note that the OWL-to-GF mapping also opens GF to wider human input: OWL ontologies are written by humans (at present, at least, by many more humans than GF grammars are).
The MOLTO website details what is going to be delivered first by way of ontology-GF interoperability: the first round uses a GF grammar to translate NL questions into the SPARQL query language (http://www.molto-project.eu/node/987).
The ontology-GF mapping here is an NL interface to PROTON ontologies, parsing (fixed) NL into (fixed) GF trees and transforming the trees into SPARQL queries that run on the ontology DB.
Indirectly, this does define a mapping between (certain) GF trees and RDF models, using SPARQL in the middle. SPARQL is not RDF, but a SPARQL query does retrieve an RDF model from a given dataset; the resulting model, of course, depends on the dataset. With an OWL reasoner thrown in, we can get OWL query results.
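As a rough illustration of the middle step, here is a hedged Python sketch compiling one kind of question tree into a SPARQL query. The tree encoding, function names and namespace are hypothetical, not those of the actual WP4 prototype.

```python
# Hypothetical sketch of the GF-tree-to-SPARQL step. A parsed question
# tree (a nested tuple standing in for a GF abstract syntax tree) is
# compiled into a SPARQL SELECT query; all names are invented.

PREFIX = "PREFIX onto: <http://example.org/proton#>\n"  # hypothetical namespace

def tree_to_sparql(tree):
    fun, arg = tree
    if fun == "QWhichInstances":          # e.g. "which museums are there?"
        return PREFIX + "SELECT ?x WHERE { ?x a onto:%s }" % arg
    raise ValueError("unsupported question tree: %r" % (tree,))

print(tree_to_sparql(("QWhichInstances", "Museum")))
```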
What WP3 had in mind is a tool that translates between OWL models and GF grammars, i.e. converts OWL ontology content into GF abstract syntax. According to the MOLTO presentation slides (http://www.molto-project.eu/node/1008), this tool is the next one forthcoming.
This was confirmed by email from Petar (https://kitwiki.csc.fi/twiki/bin/view/MOLTO/MoltoOntologyEvaluationPlanWP4).
The translation tools WP3 will consider using the TermFactory (TF) multilingual ontology model and tools as middleware between (non-linguistic) ontologies and GF grammars. The idea is to (semi-)automatically match or bridge third-party ontologies to TF, a platform for collaborative development of ontology-based multilingual terminology. It then remains to define an automatic conversion between TF and GF.
The Varna meeting should adjudicate between WP3 and WP4 here.
A concrete subtask that arises here is to define an interface between the knowledge representation infrastructure (due Nov 2010) and TF (finished in the ContentFactory project at the end of 2010).
Since the aims relate more to the use cases and framework development than to enhancing the performance of existing technologies, the evaluation done during the project will be more qualitative than quantitative.
The evaluation of these features should reflect and demonstrate the multiple possibilities of GF that are gained through interoperation with external ontologies. The evaluation of progress will exploit proof-of-concept demos and plans for further development. For further discussion, see https://kitwiki.csc.fi/twiki/bin/view/MOLTO/MoltoOntologyEvaluationPlanD91
[From DoW]
The goal is to develop translation methods that complement the grammar-based methods of WP3, to extend their coverage and quality in unconstrained text translation. The focus will be placed on techniques for combining GF-based and statistical machine translation. The WP7 case study on translating patent texts is the natural scenario for testing the techniques developed in this package. Existing corpora for WP7 will be used to adapt SMT and grammar-based systems to the patents domain. This research will be conducted on a variety of the project's languages (at least three).
Del. no | Del. title | Nature | Date |
---|---|---|---|
D 5.1 | Description of the final collection of corpora | RP | M18 |
D 5.2 | Description and evaluation of the combination prototypes | RP | M24 |
D 5.3 | WP5 final report: statistical and robust MT | RP, Main | M30 |
[10 Robust and statistical translation methods in DoW]
The concrete objectives in this proposal around robust and statistical MT are spelled out in the DoW.
Most of the objectives depend on the Patents corpus; even the languages of study depend on the data that the new partner provides. To compensate for the resulting delay, in WP5 and mainly in WP7, we have started working here on hybrid approaches. The methodology now is to develop hybrid methods independently of the domain and data sets used, so that they can later be adapted to patents.
Bilingual corpora are needed to create the necessary resources for training/adapting statistical MT systems and to extend the grammar-based paradigm with statistical information (objectives 1 and 2). We will compile and annotate large general-purpose bilingual and monolingual corpora for training basic SMT systems. This compilation will rely on publicly available corpora and resources for MT (e.g., the multilingual corpus with transcriptions of European Parliament sessions).
Domain-specific corpora will be needed to adapt the general-purpose SMT system to the concrete domain of application in this project (the Patents case study). These corpora will come from the compilation to be made in WP7, led by Mxw.
We already have the European Parliament corpus compiled and annotated for English and Spanish. The languages will probably be English, German, and Spanish or French; as soon as this is confirmed, the final general-purpose corpus can easily be compiled. The depth of annotation will depend on the concrete languages and the available linguistic processors.
Combination of grammar-based and statistical paradigms is a novel and active research line in MT. (...) We plan to explore several instantiations of the fallback approach, from simple to complex:
• Independent combination: here the combination is set up as a cascade of independent processors. When grammar-based MT does not produce a complete translation, the SMT system is used to translate the input sentence. This external combination will be set as the baseline for the other combination schemes (a minimal sketch follows this list).
• Construction of a hybrid system based on both paradigms. In this case a more ambitious approach will be followed: constructing a truly hybrid system that incorporates an inference procedure able to deal with multiple proposed fragment translations, coming from the grammar-based and SMT systems. Again, we envision several variants:
• Fix the translation phrases produced by the partial GF analyses in the SMT search. In this variant we assume that the partial translations given by GF are correct, so we can fix them and let SMT fill the remaining gaps and do the appropriate reordering. This hard combination is easy to apply but not very flexible.
• Use translation phrase pairs produced by the partial GF analyses, together with their probabilities, to form an extra feature model for the Moses decoder (probability of the target sentence given the source).
• Use tree fragment pairs produced by the partial GF analyses, together with their probabilities, to feed a syntax-based SMT model, such as the one by Carreras and Collins (2009). In this case the search process producing the most probable translation is a probabilistic parsing scheme.
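The sketch below illustrates the independent (fallback) combination from the first item above. Both component systems are stubs, and the toy phrase standing in for GF coverage is invented.

```python
# Minimal sketch of the fallback cascade: try the grammar-based system
# first, fall back to SMT when GF cannot produce a complete translation.

_GF_DEMO = {"how much is the fish": "vad kostar fisken"}  # toy coverage

def gf_translate(sentence):
    """Complete GF translation, or None when the input is out of grammar."""
    return _GF_DEMO.get(sentence)

def smt_translate(sentence):
    """Statistical fallback; always produces some translation (stubbed)."""
    return "<smt translation of: %s>" % sentence

def cascade_translate(sentence):
    gf = gf_translate(sentence)
    if gf is not None:
        return gf, "grammar-based"
    return smt_translate(sentence), "smt-fallback"

print(cascade_translate("how much is the fish"))
print(cascade_translate("an arbitrary out-of-grammar sentence"))
```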
The previous text describes the hybrid MT systems we are considering. The baseline is clear; in fact, one can define three baselines: a raw GF system, a raw SMT system, and the naïve combination of the two. Regarding truly hybrid systems there is much more to explore. Here we list four approaches to be pursued:
Hard integration. Force fixed GF translations within an SMT system.
Soft integration I. Led by SMT. GF partial output, as phrase pairs, is integrated as a discriminative probability feature model in a phrase-based SMT system.
Soft integration II. Led by SMT. GF partial output, as tree fragment pairs, is integrated as a discriminative probability model in a syntax-based SMT system.
Soft integration III. Led by GF. Complement the GF translation structure with SMT options and perform a statistical search to find the final translation.
At the moment, we are able to obtain phrases and alignments from a GF-generated synthetic corpus. This is a first step towards the hard integration of the two paradigms, and also towards the soft integration methods led by SMT. We are currently going deeper into the latter, as they constitute a domain-independent study.
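As one concrete route for the SMT-led soft integration, the GF-derived phrase pairs could be dumped in the Moses phrase-table text format so the decoder can use them as an additional model. A hedged sketch, with invented pairs and probabilities:

```python
# Sketch: write GF-derived phrase pairs, with a probability feature,
# as Moses phrase-table lines ("src ||| tgt ||| scores"). The pairs
# and scores below are invented, not real WP5 data.

gf_phrase_pairs = [
    ("the fish", "fisken", 0.9),
    ("is delicious", "är utsökt", 0.8),
]

with open("gf-phrase-table", "w", encoding="utf-8") as out:
    for src, tgt, prob in gf_phrase_pairs:
        out.write("%s ||| %s ||| %f\n" % (src, tgt, prob))
```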
In the evaluation process, these families of methods will be compared to the baseline(s) introduced above according to several automatic metrics.
WP5 is going to have its own internal evaluation complementary to that of WP9. Since statistical methods need fast and frequent evaluation, most of the evaluation within the package will be automatic. For that, one needs to define the corpora and the set of automatic metrics to work with.
Statistical methods are linked to the patents data. This is the quasi-open domain where the hybridization is going to be tested. The languages of the corpus are still not completely defined, but judging from other work with patents, we expect they will be English, German, and French or Spanish.
Besides the large training corpus, we need at least two smaller data sets, one for development purposes and another one for testing. The order of magnitude of these sets is usually around 1,000 aligned segments or sentences. We expect to reach this size, but the final amount will depend on the available data.
BLEU (Papineni et al. 2002) is the de facto standard metric in machine translation evaluation. We plan to use it together with other lexical metrics such as WER or NIST in the development of the statistical and hybrid systems.
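For reference, here is a minimal single-reference implementation of corpus-level BLEU as defined by Papineni et al. (clipped n-gram precisions up to 4-grams combined with a brevity penalty); in practice we would of course use a standard toolkit implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            # Clipping: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches or 0 in totals:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)

# Toy check: all n-grams match, so the score (~0.78) is determined by
# the brevity penalty alone.
print(corpus_bleu(["the fish is delicious"],
                  ["the fish is delicious today"]))
```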
Lexical metrics have the advantage of being language-independent, since most of them are based on n-gram matching. However, they are not able to capture all aspects of a language, and they have been shown not to always correlate well with human judgements. So, whenever possible, it is good practice to include syntactic and/or semantic metrics as well.
The IQmt package1 provides tools for (S)MT translation quality evaluation. For a few languages, it provides metrics for this deeper analysis. At the moment the package supports English and Spanish, but other languages are planned to be included soon. We will use IQmt for our evaluation on the supported language pairs.
1http://www.lsi.upc.es/~nlp/IQMT/