Contract No.: | FP7-ICT-247914 |
---|---|
Project full title: | MOLTO - Multilingual Online Translation |
Deliverable: | D1.5 Periodic Management Report T24 |
Security (distribution level): | Confidential |
Contractual date of delivery: | M24 |
Actual date of delivery: | 5 Apr 2012 |
Type: | Report |
Status & version: | Final |
Author(s): | O. Caprotti et al. |
Task responsible: | UGOT |
Other contributors: | All |
Progress report for the fourth semester of the MOLTO project lifetime, 1 Sep 2011 - 29 Feb 2012.
The MOLTO project (Multilingual Online Translation) started on March 1, 2010 and will run for 39 months. It promises to develop a set of tools for translating texts between multiple languages in real time with high quality. MOLTO will use multilingual grammars based on semantic interlinguas and statistical machine translation to simplify the production of multilingual documents without sacrificing quality. The interlinguas are based on domain semantics and are equipped with reversible generation functions: translation is obtained as the composition of parsing the source language and generating the target language. An implementation of this technology is provided by GF, Grammatical Framework. GF technologies in MOLTO are complemented by the use of ontologies, such as those used in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data. GF has been applied in several small-to-medium size domains, typically targeting up to ten languages, but MOLTO will scale this up in terms of productivity and applicability.
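The idea of translation as parsing composed with generation can be illustrated with a toy sketch. This is not GF code: the `Is` tree type, the two-language lexicon and the sentence templates are all hypothetical, and parsing is done by brute-force search over the tiny abstract syntax, which only works because the example is so small.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Is:
    """Abstract syntax (interlingua) node: 'this <item> is <quality>'."""
    item: str
    quality: str


# Hypothetical abstract vocabulary shared by all languages.
ITEMS = ["wine", "cheese"]
QUALITIES = ["fresh", "expensive"]

# Hypothetical concrete syntaxes: one lexicon and one template per language.
LEXICON = {
    "eng": {"wine": "wine", "cheese": "cheese",
            "fresh": "fresh", "expensive": "expensive"},
    "ita": {"wine": "vino", "cheese": "formaggio",
            "fresh": "fresco", "expensive": "caro"},
}
TEMPLATE = {"eng": "this {item} is {quality}",
            "ita": "questo {item} è {quality}"}


def linearize(tree: Is, lang: str) -> str:
    """Generate a sentence in `lang` from an abstract tree."""
    words = LEXICON[lang]
    return TEMPLATE[lang].format(item=words[tree.item],
                                 quality=words[tree.quality])


def parse(sentence: str, lang: str) -> Is:
    """Invert `linearize` by searching all trees of the toy abstract syntax."""
    for item, quality in product(ITEMS, QUALITIES):
        tree = Is(item, quality)
        if linearize(tree, lang) == sentence:
            return tree
    raise ValueError(f"not in the grammar: {sentence!r}")


def translate(sentence: str, source: str, target: str) -> str:
    """Translation = generation in the target after parsing the source."""
    return linearize(parse(sentence, source), target)
```

Adding a language in this sketch means adding one lexicon and one template; the abstract trees, and therefore every existing language pair, stay untouched.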
A part of the scale-up is to increase the size of domains and the number of languages. A more substantial part is to make the technology accessible to domain experts without GF expertise and minimize the effort needed for building a translator. Ideally, with the tools produced by MOLTO, this can be done by just extending a lexicon and writing a set of example sentences.
MOLTO is committed to dealing with 15 languages, which includes 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is on-going work on at least Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, Swahili, Tswana, and Turkish.
While tools like Systran (Babelfish) and Google Translate are designed for consumers of information, MOLTO will mainly target the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating its web pages automatically without the fear that the message will change. Third-party translation tools, possibly integrated in the browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecision. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (roughly 10 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its website, then it might be committed to it. There is a well-known trade-off in machine translation: one cannot reach full coverage and full precision at the same time. In this trade-off, Systran and Google have opted for coverage whereas MOLTO opts for precision in domains with a well-understood language.
The MOLTO Enlarged EU proposal adds two countries (Switzerland and The Netherlands) and two work packages. The Semantic Wiki work package builds a system that integrates the functionalities of MOLTO tools with a collaborative environment, where users can create content in different languages, and all edits become immediately visible in all languages, via automatic semantic-based translation. The Interactive Knowledge-Based System work package puts MOLTO technology to use in an enterprise environment, for the localization of end-user oriented systems to new languages and the generation of high-quality explanations in natural language. Noteworthy in this work package is the fact that translation grammars are constructed in house by Be Informed's non-expert staff without the intervention of grammar specialists.
MOLTO technology will be released as open-source libraries, which can be plugged into standard translation tools and web pages and thereby fit into standard workflows. It will be demonstrated in web-based demos and applied in three case studies: mathematical exercises in 15 languages, patent data in at least 3 languages, and museum object descriptions in 15 languages.
The results achieved during the first 24 months of the project have been demonstrated during the 4th Project Meeting. They include:
A detailed list with short abstracts is available at http://www.molto-project.eu/content/molto-4th-project-meeting-demos.
In the past semester we reported:
The expected final product of MOLTO is a software toolkit made available via the MOLTO website. It will consist of a family of open-source software flagship products:
These tools will be portable to different platforms as well as generally portable to new domains and languages. By the end of the project, MOLTO expects to have grammar resource libraries for 18 languages, whereas MOLTO use cases will target between 3 and 15 languages.
The main societal impact of MOLTO will be its contribution to a new perception of the possibilities of machine translation, moving away from the idea that domain-specific high-quality translation is expensive and cumbersome. MOLTO tools will change this view by radically lowering the effort needed to provide high-quality scoped translation for applications where the content has enough semantic structure.
This section describes the progress of each workpackage and discusses changes to the workplan, if necessary.
The work during the fourth semester has proceeded in parallel in all workpackages leading to a number of demonstrative prototypes.
In WP2, the work concentrated on finishing the Cloud-based Editor and the Eclipse Plugin for GF. In WP3, the design of an integrated architecture for the translation tools led to the adoption of the third-party platform GlobalSight as the translators' workflow management framework in which to integrate the MOLTO tools.
WP4 has come to its conclusion, delivering a prototype hosted on Ontotext's website that demonstrates GF-OWL interoperability.
The first hybrid models for the statistical and robust translation promised in WP5 have been implemented and evaluated on a specific test set; the results are available in a deliverable and will form the basis for the next developments. The MGL developed in the mathematics use case has been equipped with a command line interface to the Sage suite of Computer Algebra Systems, thus providing a natural language dialog with a computation system.
The Patent MT and Retrieval Beta Prototype was first released in M21, and the final version of D7.1 has been delivered.
In the cultural heritage museum study, the ad-hoc ontology facts, stored in the knowledge representation infrastructure delivered by WP4, can be queried in natural language in 5 languages.
The milestones for the period have all been achieved as described later in the report.
This section provides a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.
For each work-package, except project management, which is reported in Section 2.3, the work-package leader provides the following information:
Moreover, if applicable:
This WP has delivered the GF grammar development infrastructure in anticipated ways, resulting in two IDEs (a cloud-based one and an Eclipse-based one) and a faster grammar compiler. The WP and its last deliverable have been extended in time to allow for interaction with the MOLTO-Enlarged EU, which was delayed. The skeleton of the final deliverable has been discussed during the 4th Project Meeting.
GF 3.3.3: faster compilation of grammars, permitting on-the-fly changes of running translation systems.
GF Cloud-Based IDE: an IDE for beginners, as well as for on-the-fly changes of running translation systems. New features in this year:
GF-Eclipse plugin: an IDE for power users, with features such as
GF Resource Grammar Library has 7 new languages since March 2012: Hindi, Latvian, Nepali, Persian, Punjabi, Sindhi, and Thai. Some MOLTO applications (e.g. the Phrasebook and the Math library) are ported to some of these languages.
RGL support of lexicon building was evaluated in the article by Détrez and Ranta, Smart Paradigms and the Predictability and Complexity of Inflectional Morphology, to appear in EACL 2012.
As a tutorial and reference for GF, a book has been published: Ranta, Grammatical Framework - Programming with Multilingual Grammars, CSLI, Stanford, 2011.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 5 (R. Enache) | 6 (J. Camilleri) |
UPC | 1 (J. Saludes) | 1 (C. España) | 0 | 0 |
UHEL | 0 | 0 | 0 | 5 (L. Alanko) |
OntoText | 0 | 0 | 0 | 4,2(M.Chechev,M.Damova,K.Krustev) |
We moved Deliverable 2.3, "User Manual and Best Practices", to Month 27 (due 20 June 2012). The reason is that we want to include the experience from the new kind of users from Be Informed, and the start of the MOLTO enlargement was delayed.
The work done during the last year relates to the promise of WP3: to combine MOLTO tools with traditional CAT tools. As described in the appendix D9.1A, MOLTO tools would be used to translate, in real time and into multiple languages, some formulaic parts of a more complex document type, such as descriptions of chemical formulas in a patent. The rest of the document would be translated with more traditional methods. We have chosen the translation management system GlobalSight to combine the workflows.
We have been modifying the editor released in MS3, adding term management and user authentication. We have also been developing a term search; currently it is a separate component, but we are planning to attach it to the editor. The search can be tested at http://tfs.cc/molto_term_editor./editor_sparql.html.
The term management platform TermFactory (TF), a related project run by Lauri Carlson, is under development. The plan is to connect TF to the editor in order to allow on-the-fly user extensions of the lexicon of the grammar. The work done in WP2 by UGOT is in synergy with our WP: they have been developing ways to change the GF grammar without full recompilation, and thus significantly faster.
As for publications, a master's thesis called "Ontology-based lexicon management in a multilingual translation system", written within the project, will be finished during Spring 2012.
As a part of MS8 (due September 2012), GlobalSight is now running on our server at http://tfs.cc/globalsight/.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 4 (A. Slaski, S. Virk, N. Frolov) |
UPC | 0,25 (L. Màrquez) | 1,75 (M. Gonzàlez) | 0 | 0 |
UHEL | 2 (L. Carlson) | 0 | 0 | 5 (I. Listenmaa), 6 (J. Shen), 6 (C. Li) |
OntoText | 0 | 0 | 0 | 4,1(M.Damova, M.Chechev, S.Enev) |
This WP has delivered two-way interoperability between natural language and ontologies. The prototype was built and made publicly available at http://molto.ontotext.com. The prototype integrates the infrastructure for knowledge modeling, semantic indexing and retrieval with tools for NL queries to the semantic repository and verbalization of the results.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 0 |
UPC | 0 | 0 | 0 | 0 |
UHEL | 0 | 0 | 0 | 0 |
OntoText | 4(B.Popov) | 0 | 0 | 10,1(M.Chechev,P.Mitankin,M.Nozchev,F.Alexiev,A.Ilchev,I.Kabakov,K.Krustev,V.Zhikov) |
The milestone MS7 has been achieved in M24 (First prototypes of hybrid combination: The methods are implemented and evaluated on a specific test set).
The work of the fourth semester corresponds to the three last tasks of the WP (T5.4 Baseline systems, T5.5 Hybrid Models and T5.6 Systems evaluation, see http://www.molto-project.eu/workplan/statistical-and-robust-translation). The baseline systems have been improved by extending the GF translator. Now the translator is able to deal with chunks so that the coverage has been widened (Task 5.4).
For Task 5.5 we have implemented two kinds of hybrid models, which we call Soft and Hard integration. The following section outlines their main characteristics.
Finally, for Task 5.6 both the baselines and the hybrid systems have been evaluated using a variety of lexical metrics and compared with generic publicly available translators such as Google and Bing. A manual evaluation has also been carried out in order to compare the pure SMT translator with the hybrid system that the automatic evaluation ranked as most promising.
The work done for these tasks has been submitted to the 16th Annual Conference of the European Association for Machine Translation (EAMT 2012), and the submitted paper, titled "A Hybrid System for Patent Translation", can be found on the MOLTO web page.
At the time of writing this report, the deliverable D5.2 Description and evaluation of the combination prototypes is being submitted as a regular publication. It is a public document accessible from the MOLTO web page.
Two kinds of hybrid translators for patents have been developed. The final systems are not only a combination of two different engines but the subsystems also mix different components. We have developed a GF translator for the specific domain that uses an in-domain SMT system to build the lexicon; an SMT system is on top of it to translate those phrases not covered by the grammar.
In the previous report we showed that the GF grammar-based system alone could not parse most patent sentences. Consequently, the current translation system aims at using GF for translating patent chunks and assembling the results in a later phase. As explained in D5.2, this implies several modifications to the GF baseline itself.
To gain robustness in the final system, the output of the GF translator is used as a priori information for a higher level SMT system. The SMT baseline is fed with phrases which are integrated in two different ways. First, what we call "Hard Integration", phrases with GF translation are forced to be translated this way. The system can reorder the chunks and translates the untranslated chunks, but there is no interaction between GF and pure SMT phrases. Second, in the "Soft Integration" system, phrases with GF translation are included in the translation table with a certain probability so that the phrases coming from the two systems interact.
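The difference between the two integration modes can be sketched on a toy phrase table. This is an illustration under stated assumptions, not the WP5 implementation: the table layout (a dictionary from source phrase to scored target candidates), the function names and the default `gf_prob` value are all hypothetical, and D5.2 remains the reference for how the real systems are built.

```python
# A toy phrase table: source phrase -> list of (target phrase, probability).
PhraseTable = dict[str, list[tuple[str, float]]]


def hard_integration(smt_table: PhraseTable,
                     gf_phrases: dict[str, str]) -> PhraseTable:
    """Force the GF translation: it becomes the only candidate for its phrase,
    so GF phrases and SMT phrases never compete."""
    table = {src: list(cands) for src, cands in smt_table.items()}
    for src, gf_target in gf_phrases.items():
        table[src] = [(gf_target, 1.0)]
    return table


def soft_integration(smt_table: PhraseTable,
                     gf_phrases: dict[str, str],
                     gf_prob: float = 0.5) -> PhraseTable:
    """Add the GF translation as one more scored candidate, so that it
    competes with the SMT candidates during decoding."""
    table = {src: list(cands) for src, cands in smt_table.items()}
    for src, gf_target in gf_phrases.items():
        # Drop an identical SMT candidate, then add the GF one with gf_prob.
        cands = [c for c in table.get(src, []) if c[0] != gf_target]
        cands.append((gf_target, gf_prob))
        table[src] = cands
    return table


# Hypothetical data: the SMT candidates get number agreement wrong.
smt = {"les composés": [("the compound", 0.7), ("compound", 0.3)],
       "selon": [("according to", 0.9)]}
gf = {"les composés": "the compounds"}
```

With these inputs, hard integration leaves a single candidate for "les composés", whereas soft integration leaves three and lets the decoder choose among them; phrases without a GF translation are unchanged in both cases.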
The hybrids exploit the high coverage of statistical translators and the high precision of GF to deal with specific issues of the language. At this moment the grammar tackles agreement in gender and number, agreement between chunks, and reordering within the chunks. Although the cases where these problems apply are not extremely numerous, both manual and automatic evaluations consistently prefer the hybrid system over the two individual translators. In the near future we plan to widen the number of issues approached by the grammar. Also, modifications with SMT components to the GF translator and new kinds of combination of phrases will be introduced.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 1 (R. Enache) | 0 |
UPC | 7.75 (L. Màrquez, L. Padró, H. Rodríguez, D. Farwell) | 11.25 (C. España, M. Gonzalez, X. Carreras) | 0 | 0 |
UHEL | 0 | 0 | 0 | 0 |
OntoText | 0 | 0 | 0 | 0 |
The final hybrid translators have been developed for the French-English language pair. We also aim at including German, so in the following months the concrete syntax for German will be completed. We plan to complete the task in May and it does not affect any other tasks of the project. The systems in the three languages will be available for the final evaluation.
Deliverable D6.2 has been released as tagged SVN content publicly available at svn://molto-project.eu/tags/D6.2. Bug fixing and some more features may continue to be developed in the head branch.
With respect to M18, we added an upper layer to the MGL library to support commands issued to a Computer Algebra System (CAS) and to render the answers in the natural languages as text or speech using actual concrete syntaxes for 3 languages: English, Spanish and German.
We developed software components to interact with a CAS (Sage), either externally using the HTTP protocol or inside the Sage shell and notebook interfaces. Furthermore, we developed a testing procedure to assist in regression tests for the tool.
Developed a prototype to engage Sage in a dialog using natural language that runs on Linux and Mac OSX. The system assists command composing by providing autocompletion and gives spoken output on demand.
Developed a Sage interface to issue commands to a Sage process from the native Sage shell or notebook. In Linux it provides autocompletion using the native shell mechanism for it.
The dialog prototype has been demonstrated at DEIMS12.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 0 |
UPC | 6 (J. Saludes) | 0 | 0 | 6 (A. Ribó Mor) |
UHEL | 0 | 0 | 0 | 0 |
OntoText | 0 | 0 | 0 | 0 |
During this period, WP7 has taken a step forward in the development of the prototype: the Patent MT and Retrieval Beta Prototype was first released in M21, and the final version of D7.1 has been delivered.
Due to the incremental development of the prototype, most of the tasks span till M27, when the final prototype must be delivered. The following lines describe the progress of these tasks:
In relation to Task 7.2, the EPO provided a parallel corpus of patents, of which only 66 patents belong to the biomedical domain. We downloaded an alternative corpus of 7,705 documents directly from their website (i.e. publicly available). The following summarizes the content of these documents: 4,274 of the 7,705 documents have claims (6M lines), and 2,058 of those are trilingual (3M lines); 2,116 documents have claims written only in English, 66 only in German (260K lines), and 34 only in French (88K lines). There are no files with any other combination of languages.
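The corpus breakdown above can be checked with a few lines of arithmetic. The variable names are illustrative; the figures are the ones quoted in the paragraph.

```python
# Figures quoted in the report for the downloaded patent corpus.
total_documents = 7705
documents_with_claims = 4274

claims_by_language = {
    "trilingual": 2058,
    "english_only": 2116,
    "german_only": 66,
    "french_only": 34,
}

# The four language groups should account for every document with claims,
# since no other combination of languages occurs.
accounted_for = sum(claims_by_language.values())
documents_without_claims = total_documents - documents_with_claims

# Share of each language group among the documents that have claims.
shares = {group: count / documents_with_claims
          for group, count in claims_by_language.items()}
```

The four groups add up to exactly 4,274, which agrees with the statement that no other language combination occurs, and leaves 3,431 documents without claims.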
Regarding Task 7.4 and Task 7.5, the ontologies, indexes, databases and retrieval engines have been set up for the specific domain and using the patent documents described above. The semantic annotation process is carried out by a GATE pipeline on the English texts. We are working to export the annotations during the translation process in order to be able to show the annotations also in the French and German texts.
As for Task 7.3 and Task 7.6, the grammar development and the SMT adaptation to the domain are being carried out jointly with the WP5 tasks. The grammars have been developed for English and French, and will subsequently be developed for German as well.
Finally, regarding Task 7.7, the interface allows accessing the system in three different ways: the controlled language, SPARQL and terms in the index. In the future we will include free text and a combination of it with the controlled language.
Since M21 there has been a fully functional version of the prototype at http://molto-patents.ontotext.com/. The demo allows querying the system in English and French. The patents in the database have original text in English, French and German.
The retrieval system can be queried in three different ways. The NL-based interface allows the user to query the system in English and French using written natural language. The SPARQL interface, more suitable for advanced users, allows the user to browse the repository accurately using SPARQL queries. The keyword-based visual browsing interface uses the RelFinder tool, in which the user can search for keywords using the autocomplete functionality. The results from the RelFinder search are visualised as graphs.
The visualization of the results displays the list of classes from the ontologies that match the query and the list of patent documents indexed under the matching criteria. The interface also provides a link to access the semantically annotated documents and the original patents. The interface that shows the annotated documents highlights in the text the words that are related to any semantic item. Colors are assigned according to the type of semantic annotation. The right side of the page gives the list of semantic types and colors that are present in the text.
A paper about the Patent retrieval system was accepted at WWW2012 Conference, to be held in April.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 3 (R. Enache) | 3 (A. Slaski) |
UPC | 0 | 7,5 (M. Gonzàlez, C. España) | 0 | 0 |
UHEL | 0 | 0 | 0 | |
OntoText | 0 | 0 | 0 | 8,8 (M.Chechev, M.Damova, V.Zhikov, I.Kabakov) |
Broadly speaking, we are achieving the objectives of the WP7 tasks within the timeframe. However, due to several issues related to the gathering of the corpora, the databases of the retrieval system do not yet include automatic translations of the patent documents, but only real translations. The issue directly affects the annotation process of Task 7.5, but it does not imply a delay for the whole prototype. The estimate is that the automatic translations and annotations will be included in the final prototype.
The work package started with data collection, proceeded with developing the ontology interface, and has lately focused on the baseline translator. The translator covers only five languages so far, but will be extended soon. The ontology interface will permit multilingual queries about museum objects exploiting the MOLTO Knowledge Representation Infrastructure. It also makes this case study an example of multilingual ontology verbalization.
Ontology and corpus study (D8.1).
Grammars for translation and multilingual NLG for painting descriptions in five languages: English, Finnish, French, Italian, Swedish. This was built in a modular way that is easy to extend to new languages, which we will do soon.
Ontology verbalization in a generic way. The same languages will be usable as in translation, but aren't yet.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 1 (R. Enache) | 0 |
UPC | 0 | 0 | 0 | 0 |
UHEL | 0 | 0 | 0 | 0 |
OntoText | 0 | 0 | 0 | 5,9 (M.Chechev, M.Damova, K.Krustev, V.Zhikov) |
We have not been able to proceed at the planned pace, and would like to have an extension of the WP time. We would like to extend this WP and its last deliverable 8.3 till Month 36. We have not been able to use the person months as planned (as can be seen from the Use of resources). One of the planned key persons, Dana Dannélls at UGOT, will be able to join MOLTO full-time a few months later than originally planned, probably October/November 2012.
This WP is working on collecting evaluation plans from each site.
An extended D9.1E Evaluation plan has been written.
Progress evaluation has mainly been carried out by each site during development. It would be a good idea to collect these evaluations more systematically.
For the SMT/hybrid patent case, automatic measures (BLEU and others; see the UPC slides for examples) are probably the ones mainly used.
In developing the GF grammars, informants (native speakers of the relevant languages) have been used during the grammar writing process to check and correct output. The informants have been given output to read and have informed the developer if sentences are correct or if not, how they should be corrected.
Moving forward, the final evaluations will need to cover the usability of the tools as well as the quality of the output. The WP9 review slides give some examples of user communities that might be mobilized for usability evaluation and of platforms that could be used. One open issue in mobilizing evaluators is that they need some motivation to use the tools.
For output quality, final evaluations will likely involve both automatic and manual methods. For automatic methods, UPC's Asiya evaluation kit offers some syntactically and semantically oriented metrics in addition to the purely lexical ones like BLEU, but only for a couple of languages. As all automatic metrics rely on comparison to gold standard human translations, these need to be obtained for the test sets, if they are to be used.
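To make the dependence on gold standard translations concrete, here is a minimal sketch of a BLEU-style score for one sentence against one reference. It is a simplification written for illustration, not the Asiya toolkit and not a replacement for it: real BLEU is computed over a whole test set, usually with several references and with smoothing, and the function names below are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) multiplied by the brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish hypotheses shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

A hypothesis identical to its reference scores 1, and a hypothesis that shares no words with the reference scores 0. The score says nothing about a translation for which no reference exists, which is why gold standard translations must be obtained for every test set.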
Manual evaluation methods, on the other hand, require humans to do the evaluations. For the patent case, evaluators need to have sufficient understanding of the material to be able to assess whether the translations are correct or not, particularly since we expect one of the strengths of the GF hybrid to be in correctly handling long formulae. Therefore plans have been made to hire professional patent translators of the languages in question to do the evaluation, which is expected to take place in June. Since Google now also provides patent translations, these will be used as a point of comparison. The TAUS scale, fluency etc. could be used in this case.
For the museum case, one manual evaluation approach is to produce museum descriptions in various languages that combine the simpler rules, e.g. "Painter painted Painting in City in Year on Canvas", and then have native speakers check the individual relations involved (Who painted? What did they paint? Where? When? etc.) and combine these into a measure of the overall fidelity. For this, evaluators do not necessarily need to be museum experts; any native speaker of the language in question should do. An interesting description of such an approach is given in http://www.cs.ust.hk/~dekai/library/WU_Dekai/LoWu_Acl2011.pdf. Other measures, such as fluency or the TAUS fitness scale, could also be used.
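One way the per-relation judgements could be combined is sketched below. The report does not fix how the combination is done, so the averaging scheme here is an assumption, as are the function names and the sample judgements: each sentence is scored by the fraction of its relations that the native speaker judged to be preserved, and the overall fidelity is the mean over sentences.

```python
def sentence_fidelity(judgements: dict[str, bool]) -> float:
    """Fraction of the checked relations (who, what, where, ...) judged correct."""
    if not judgements:
        raise ValueError("no relations were judged for this sentence")
    return sum(judgements.values()) / len(judgements)


def overall_fidelity(all_judgements: list[dict[str, bool]]) -> float:
    """Mean sentence fidelity over all evaluated descriptions (assumed scheme)."""
    if not all_judgements:
        raise ValueError("no sentences were evaluated")
    scores = [sentence_fidelity(j) for j in all_judgements]
    return sum(scores) / len(scores)


# Hypothetical judgements from one native speaker for two generated descriptions.
sample = [
    {"who": True, "what": True, "where": False, "when": True},
    {"who": True, "what": True, "where": True, "when": True},
]
```

For the sample above the two sentences score 0.75 and 1.0, giving an overall fidelity of 0.875. A weighted scheme, for instance one that counts "who" and "what" more heavily, would be an equally plausible choice.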
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 0 |
UPC | 2,16 (L. Màrquez, L. Padró, D. Farwell) | 0,5 (C. España) | 0 | 0 |
UHEL | 0.5 (L. Carlson) | 0 | 9 (S. Nyrkkö) | 0 |
OntoText | 0 | 0 | 0 | 2(M.Chechev, K.Krustev) |
UZH | 0 | 0 | 0 | 0 |
BI | 0 | 0 | 0 | 0 |
A few events have been organized by MOLTO and some are in the making. The young researchers in the Consortium have published a number of papers at international venues on initial results of the project. Project meetings have taken place in Helsinki and in Zürich, with the extra MOLTO-EEU kickoff meeting in Gothenburg. GF tutorials and tutorials on MOLTO work have been delivered on various occasions, most prominently during the GF Summer School: Frontiers of Multilingual Technologies in August 2011.
At UGOT, R. Enache and S. Virk have passed their licentiate, a step towards their PhD, by publishing work in connection with MOLTO. Moreover, D. Dannélls has also defended her PhD seminar by discussing work done in the natural language analysis of the cultural heritage domain. Finally, K. Angelov obtained his PhD at Chalmers with a thesis on the inner workings of GF, much of which benefits WP2 and WP3.
The list of publications, archived on the MOLTO website (http://www.molto-project.eu/biblio?sort=year&order=asc), follows here below.
Controlled Language for Everyday Use: the MOLTO Phrasebook, Ranta, Aarne, Enache Ramona, and Détrez Grégoire, Controlled Natural Languages Workshop (CNL 2010), Marettimo, Italy, (2011)
The GF mathematics library, Saludes, Jordi, and Xambó Sebastian, THedu'11, (2011)
Grammatical Framework: Programming with Multilingual Grammars, Ranta, Aarne, CSLI Studies in Computational Linguistics, Stanford, p.350, (2011)
MOLTO Enlarged EU Annex I - Description of Work, Consortium, MOLTO , (2011)
MOLTO poster presented at EAMT Conference (European Association for Machine Translation) 2011, Leuven, Ranta, Aarne, and Enache Ramona, (2011) - also presented at META-FORUM by Listenmaa, Inari in Budapest, 2011.
Typeful Ontologies with Direct Multilingual Verbalization, Angelov, Krasimir A., and Enache Ramona, Controlled Natural Languages Workshop (CNL 2010), Marettimo, Italy, (2011)
Typeful Ontologies with Direct Multilingual Verbalization poster, presented at the Google Anita Borg retreat, June 2011, Zurich, Enache, Ramona, (2011)
The GF Mathematics Library, Saludes, Jordi, and Xambó Sebastian, Proceedings First Workshop on CTP Components for Educational Software (THedu'11), 02/2012, Volume Electronic Proceedings in Theoretical Computer Science, Number 79, Wrocław, Poland, p.102–110, (2011)
D1.3A Advisory Board Report, Hall, Keith, and Pulman Stephen, 03/2011, Number D1.3A, Gothenburg, (2011)
MOLTO - Multilingual On-line Translation - Annual Report 2010-2011, Caprotti, Olga, España-Bonet Cristina, and Alanko Lauri, 03/2011, Gothenburg, (2011) - Published on cordis.eu.
A Framework for Improved Access to Museum Databases in the Semantic Web, Dannélls, Dana, Damova Mariana, Enache Ramona, and Chechev Milen , RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING, 09/2011, Hissar, Bulgaria, (2011)
Hybrid Machine Translation Guided by a Rule–Based System, España-Bonet, Cristina, Labaka Gorka, Díaz De Ilarraza Arantza, Màrquez Lluís, and Sarasola Kepa , Machine Translation Summit, 09/2011, Xiamen, China, p.554-561, (2011)
The painting ontology, Dannélls, Dana, CIDOC 2011 conference, 09/2011, (2011)
Patent translation within the MOLTO project, España-Bonet, Cristina, Enache Ramona, Slaski Adam, Ranta Aarne, Màrquez Lluís, and Gonzàlez Meritxell, Workshop on Patent Translation, MT Summit XIII, 09/2011, p.70-78, (2011)
Reason-able View of Linked Data for Cultural Heritage, Damova, Mariana, and Dannélls Dana, The Third International Conference on SOFTWARE, SERVICES & SEMANTIC TECHNOLOGIES (S3T), 09/2011, Bourgas, Bulgaria, (2011)
Deep evaluation of hybrid architectures: simple metrics correlated with human judgments, Labaka, Gorka, Díaz De Ilarraza Arantza, España-Bonet Cristina, Sarasola Kepa, and Màrquez Lluís, International Workshop on Using Linguistic Information for Hybrid Machine Translation, 11/2011, Barcelona, Spain, p.50-57, (2011)
The Patents Retrieval Prototype in the MOLTO project, Chechev, Milen, Gonzàlez Meritxell, Màrquez Lluís, and España-Bonet Cristina, WWW2012 Conference, Lyon, France, (2012)
MOLTO - Multilingual On-Line Translation, Ranta, Aarne, Talk given at Xerox Research Centre Europe, Grenoble, 19 January 2012, 01/2012, (2012)
Using GF in multimodal assistants for mathematics, Archambault, Dominique, Caprotti Olga, Ranta Aarne, and Jordi Saludes, 02/2012, Digitization and E-Inclusion in Mathematics and Science 2012, (2012)
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 4 (O. Caprotti) | 0 | 0 | 0 |
UPC | 1 (S. Xambo) | 0 | 0 | 0 |
UHEL | 0 | 0 | 0 | 0 |
OntoText | 1 (B.Popov) | 0 | 0 | 2,6 (M.Chechev, M.Damova) |
UZH | 0 | 0 | 0 | 0 |
BI | 0 | 0 | 0 | 0 |
None to report.
During the first 3 months of our participation in the MOLTO project we completed an initial integration of the GF-provided services (mainly translation and look-ahead editing) into AceWiki.
We implemented a new Java front-end to the GF Webservice, and use it to connect to the GF services from AceWiki. The existing AceWiki user interface was extended to allow easy switching between different languages and to present with each sentence its GF-provided analysis (translations into other languages, word alignment diagrams, GF syntax trees, etc.). The AceWiki storage format was changed to one based on GF abstract trees (which are language-neutral).
The other main part of our work dealt with the implementation of the ACE grammar in GF. We tested the recall and precision of an existing implementation (Angelov and Ranta, 2009), which targets an earlier version of ACE, and found that some changes are needed to make it compatible with the latest version of ACE. More importantly, we decided to focus on a grammar of the subset of ACE that is used in the current AceWiki, and have started work on it.
We also experimented with taking the content of an existing AceWiki demo wiki (domain Geography) and using it to pre-populate the new GF-based AceWiki.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 0 |
UHEL | 0 | 0 | 0 | 0 |
UZH | 0 | 2 (K. Kaljurand) | 0 | 0 |
None to report, besides the delayed start as explained later in this document.
In the first 2 months we started the Adoption phase as described in the DoW for WP12. We have focused our efforts on the requirements for the verbalization component (D12.1). We distinguish 4 categories of relevant requirements.
We presented a requirements draft to our partners in March 2012.
At the kickoff hosted by UGOT we held a first round of discussions with the UGOT team to draw up the specification for migrating Be Informed's current explanation prototype to GF.
Next Steps
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 0 |
BI | 0,25 (J. van Grondelle, J. van Aart) | 0 | 0 | 0,25 (H. ter Horst) |
WP12 runs longer than indicated in the Gantt chart in the Annex; its duration is, however, correctly listed under 3.3.3. We planned for a duration of 15 months; D12.2, for instance, is projected for March 2013.
Deliverables of MOLTO are listed and linked for download on the web page http://www.molto-project.eu/workplan/deliverables.
Below is a summary of deliverables due until the fourth semester.
The milestones so far have been achieved to different degrees of completion, either by a deliverable or by an online prototype. In particular, MS3 is the translation editor available at http://www.grammaticalframework.org:41296/editor/#translate, which is being integrated in the Translator's Tools due in September 2012. The grammar-ontology interoperability (MS4) has been documented in D4.3; however, it has been requested that more details be made explicit. Concerning MS7, the methods implemented by the hybrid combination prototypes have been evaluated on a specific test set, as reported in D5.4.
ID | Title | Due date |
---|---|---|
MS1 | 15 Languages in RGL | 1 September, 2010 |
MS2 | Knowledge Representation Infrastructure | 1 September, 2010 |
MS3 | Web-based translation tool | 1 March, 2011 |
MS4 | Grammar-ontology interoperability | 1 October, 2011 |
MS5 | First prototypes of the cascade-based combination models | 1 October, 2011 |
MS6 | Grammar tool complete | 1 March, 2012 |
MS7 | First prototypes of hybrid combination models | 1 March, 2012 |
The third semester of the project saw the enlargement of the Consortium by two new partners, University of Zürich and Be Informed. While the start of the MOLTO-EEU project enlargement was originally scheduled for September 2011, and accounted for synchronicity of the deliverables with ongoing workpackages, the actual kickoff only happened in January 2012. Consequently, the end date of the project has been shifted to 31 May 2013, and the main deliverable of Workpackage 2 has been shifted by 3 months to take into account the feedback from the new use cases added by the enlargement.
The following inconsistencies have been noticed in the revised Annex:
However, due to the delay in start, both WP11 and WP12 will now be ongoing in the period M22-M37, as in the chart below. Notice also the changes affecting WP2 and WP8.
The following actions were taken as a result of the review report quoted here below.
Some observations, comments and remarks, raised and discussed at the review meeting, follow. These should be addressed in the respective deliverable(s) as well as in the planning for the next period.
Rule extraction (from lexical databases, ontologies, text examples) needs to be specified in detail and a concrete schedule should be included in the updated workplan (D1.1).
The workplan is maintained online using a dynamically generated list of tasks entered by the workpackage leaders. It is available, if logged in, under http://www.molto-project.eu/workplan, with the tasks under http://www.molto-project.eu/workplan/tasks. It is the responsibility of each workpackage leader to actively use this tool and document ongoing work in it.
The topic of rule extraction will be detailed in the last, main deliverable of WP2.
Concerning the integration of the TermFactory (TF) and Knowledge Representation Infrastructure (KRI), it seems that there are overlaps between these tools. The partners must clarify which functions of these tools will be used in the case studies in order to exploit complementarities of the tools and avoid overlaps.
TermFactory is not only a component of MOLTO but also stand-alone software, which needs some functionality of its own for technical purposes. Excessive development of overlapping functions will be avoided through co-operation and planning with the KRI developers. It is indeed relevant for evaluation and reporting that the case studies describe which tool provides each functionality.
Critical issues with respect to the semi-automatic creation of abstract grammars from ontologies, as well as deriving ontologies from grammars, are still to be clarified. Concrete steps to handle these issues need to be specified in detail and a schedule should be included in the updated workplan (D1.1).
As part of the prototype for D4.3, an abstract grammar and a concrete English grammar, both built automatically from an ontology, have been integrated. They are used to verbalize the results from the semantic repository. We experimented with and discussed a similar approach for automatically building a query grammar from the semantic repository, but the GF query grammar provided by UGOT was selected as the better tool because of its expressive power and its ability to generate better natural language. The query grammar has different types of question templates and can easily be ported to a new domain with minor modifications to the abstract and concrete grammars. Mapping rules were selected as the best semi-automated approach for connecting the abstract grammar to SPARQL. The mapping rules make it possible to write general transformation rules, but also to fine-tune specific cases. The rules currently in use are general enough to be applied to new domains with a ported GF query grammar, and this will be demonstrated in the WP7 and WP8 prototypes.
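The idea of mapping rules between abstract trees and SPARQL can be illustrated with a minimal sketch. It is not the rule format used in the D4.3 prototype: the function names (`QAll`, `QWith`), the `ex:` vocabulary, and the rule tables are all hypothetical and chosen only for illustration. General rules are templates indexed by abstract function name, and an optional table of specific rules overrides them for individual trees.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Tree:
    """Abstract syntax tree node: a function name applied to subtrees."""
    fun: str
    args: Tuple["Tree", ...] = ()


# Hypothetical general rules: one SPARQL template per abstract function.
# Slots {0}, {1}, ... are filled with the translations of the arguments.
# The ex: prefix is assumed to be declared elsewhere in the query.
GENERAL_RULES: Dict[str, str] = {
    "QAll": "SELECT ?x WHERE {{ ?x a {0} . }}",
    "QWith": "SELECT ?x WHERE {{ ?x a {0} . ?x {1} {2} . }}",
}

# Hypothetical domain lexicon: leaf functions map to ontology terms.
LEXICON: Dict[str, str] = {
    "Painting": "ex:Painting",
    "createdBy": "ex:createdBy",
    "Rembrandt": "ex:Rembrandt",
}

# Hypothetical fine-tuned rules for specific trees, tried before general ones.
SPECIFIC_RULES: Dict[Tree, str] = {}


def to_sparql(tree: Tree) -> str:
    """Translate an abstract tree to SPARQL text using the rule tables."""
    if tree in SPECIFIC_RULES:
        return SPECIFIC_RULES[tree]
    if not tree.args and tree.fun in LEXICON:
        return LEXICON[tree.fun]
    template = GENERAL_RULES[tree.fun]  # KeyError if no rule covers the tree
    return template.format(*(to_sparql(arg) for arg in tree.args))


# Abstract tree for a question such as "which paintings were created by Rembrandt"
query = Tree("QWith", (Tree("Painting"), Tree("createdBy"), Tree("Rembrandt")))
print(to_sparql(query))
```

Porting such a scheme to a new domain would, under these assumptions, mostly mean replacing the lexicon, while the general rules stay unchanged.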
Current description of work in WP6 lacks details on the prototype multilingual dialog system to be developed. Including an example dialog and specifications of this prototype in a new version of deliverable D9.1 is recommended.
WP7 work should focus on the major issues examined in MOLTO, especially in relation to the grammar – ontology interoperability rather than chemical compound splitting. Specific scenarios are needed for the exploitation of MOLTO tools in this case study. It is recommended to include such scenarios in a new version of deliverable D9.1.
Specific scenarios are needed for the exploitation of MOLTO tools in the case study on cultural heritage (WP8) which just started. It is recommended to include such scenarios in a new version of deliverable D9.1.
Use cases are listed at http://www.molto-project.eu/workplan/usecases and include two scenarios for WP8 and two for WP7. The specific use case scenarios for WP7 are described in UC-71 and UC-72. Details about them are given in Section 2 of D7.1.
UC-71 focuses on grammar-ontology interoperability. User queries, written in CNL (controlled natural language), are used to query the information retrieval system.
UC-72 focuses on high-quality machine translation of patent documents. It uses an SMT baseline system to translate a large dataset and populate the retrieval databases. In order to study the impact of hybrid systems on translation quality, a smaller dataset will be translated using the hybrid system developed in WP5.
The way the project’s web site is structured, although it contains the necessary content, affects its readability in some cases.
We have added a direct navigation link to Sites and People, and a quick link to the public deliverables list. Publications can be tagged by workpackage or event, thus making the selection of publications by tag easier.
The deliverables on the workplan (D1.1) and the dissemination plan (D10.1) should be regularly updated (at the beginning of 2nd and 3rd year).
We have kept an updated list of deliverables with administrator's view at http://www.molto-project.eu/workplan/deliverables and quick links at http://www.molto-project.eu/view/biblio/deliverables. The dissemination plan is kept up to date on the wiki page http://www.molto-project.eu/wiki/living-deliverables/d101-dissemination-.... We have now added a section summarizing exploitation plans.
Taking into account the numerous endeavors undertaken in the translation domain, both research and commercial, the market segment addressed by MOLTO should be identified with maximum precision. The specific case studies should also be taken into account in this effort. It is suggested that careful planning is initiated as early as possible and not later than the next reporting period.
The addition of the new partner BI will open extra markets for the tools of MOLTO. We have also started to look into usage of constrained natural languages in software localization, in social networks and in specific mathematical domains.
Official tables on the usage of resources are available for yearly reporting in Forms C.
Here we give a rough estimate of the person-months contributed by each node. Note that the figures listed previously do not include management months, hence totals may differ.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 9 (O. Caprotti, A. Ranta) | 0 | 10 (R. Enache) | 12 (J. Camilleri, A. Slaski, S. Virk) |
UPC | 19.16 (J. Saludes, L. Màrquez, L. Padró, H. Rodríguez, D. Farwell) | 22 (C. España, X. Carreras, M. Gonzalez) | 0 | 6 (A. Ribó Mor) |
UHEL | 2,5 (L. Carlson) | 0 | 7 (S. Nyrkkö) | 5 (L. Alanko), 5 (I. Listenmaa), 12 (J. Shen, C. Li) |
Ontotext | 6 (B.Popov, S.Karagova) | 0 | 0 | 36 (P.Mitankin, M.Nozchev, F.Alexiev, A.Ilchev, I.Kabakov, K.Krustev, M.Damova, M.Chechev, V.Zhikov, S.Enev) |
UZH | 0 | 2 (K. Kaljurand) | 0 | 0 |
BI | 0,25 (J.van Aart, J. van Grondelle) | 0 | 0 | 0,25 (H. ter Horst) |
We found a typo in the table for WP7 in the new Annex I for MOLTO EEU. The person-months must be the same as in the previous DoW (Version number: 3 Revision 1 (21/01/2011)), namely, in the WP7 description on p. 31, PMs: UGOT 12, UPC 15, and Ontotext 15 (and not Ontotext 0).
I, as scientific representative of the coordinator of this project and in line with the obligations as stated in Article II.2.3 of the Grant Agreement declare that:
1. The attached periodic report represents an accurate description of the work carried out in this project for this reporting period;
2. The project (tick as appropriate):
3. The public website, if applicable:
4. To my best knowledge, the financial statements which are being submitted as part of this report are in line with the actual work carried out and are consistent with the report on the resources used for the project (section 3.4) and if applicable with the certificate on financial statement.
5. All beneficiaries, in particular non-profit public bodies, secondary and higher education establishments, research organisations and SMEs, have declared to have verified their legal status. Any changes have been reported under section 3.2.3 (Project Management) in accordance with Article II.3.f of the Grant Agreement.
Name of scientific representative of the Coordinator:
Aarne Ranta
....................................................................
Date: 26/4/2012