This section provides a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.
For each work-package, except project management, which is reported in Section 2.3, the work-package leader provides the following information:
Moreover, if applicable:
This WP has delivered the GF grammar development infrastructure as anticipated, resulting in two IDEs (one cloud-based and one Eclipse-based) and a faster grammar compiler. The WP and its last deliverable have been extended in time to allow for interaction with the MOLTO-Enlarged EU, which was delayed. The skeleton of the final deliverable was discussed during the 4th Project Meeting.
GF 3.3.3: faster compilation of grammars, permitting on-the-fly changes of running translation systems.
GF Cloud-Based IDE: an IDE for beginners, as well as for on-the-fly changes of running translation systems. New features in this year:
GF-Eclipse plugin: an IDE for power users, with features such as
The GF Resource Grammar Library has 7 new languages since March 2012: Hindi, Latvian, Nepali, Persian, Punjabi, Sindhi, and Thai. Some MOLTO applications (e.g. the Phrasebook and the Math library) have been ported to some of these languages.
RGL support of lexicon building was evaluated in the article by Détrez and Ranta, Smart Paradigms and the Predictability and Complexity of Inflectional Morphology, to appear in EACL 2012.
As a tutorial and reference for GF, a book has been published: Ranta, Grammatical Framework - Programming with Multilingual Grammars, CSLI, Stanford, 2011.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 5 (R. Enache) | 6 (J. Camilleri) |
UPC | 1 (J. Saludes) | 1 (C. España) | 0 | 0 |
UHEL | 0 | 0 | 0 | 5 (L. Alanko) |
OntoText | 0 | 0 | 0 | 4,2 (M. Chechev, M. Damova, K. Krustev) |
We moved Deliverable 2.3, "User Manual and Best Practices", to Month 27 (due 20 June 2012). The reason is that we want to include the experience from the new kind of users from Be Informed, and the start of the MOLTO enlargement was delayed.
The work done during the last year relates to the promise of WP3: to combine MOLTO tools with traditional CAT tools. As described in the appendix D9.1A, MOLTO tools would be used to translate, in real time and into multiple languages, some formulaic parts of a more complex document type, such as descriptions of chemical formulas in a patent. The rest of the document would be translated with more traditional methods. We have chosen the translation management system GlobalSight to combine the workflows.
We have been modifying the editor released in MS3, adding term management and user authentication. We have also been developing a term search; currently it is a separate component, but we plan to attach it to the editor. The search can be tested at http://tfs.cc/molto_term_editor./editor_sparql.html.
The term management platform TermFactory (TF), a related project run by Lauri Carlson, is under development. The plan is to connect TF to the editor in order to allow on-the-fly user extensions of the lexicon of the grammar. The work done in WP2 by UGOT is in synergy with our WP: they have been developing ways to change the GF grammar without full recompilation, and thus significantly faster.
As for publications, a master's thesis called "Ontology-based lexicon management in a multilingual translation system", written within the project, will be finished during Spring 2012.
As a part of MS8 (due September 2012), GlobalSight is now running on our server at http://tfs.cc/globalsight/.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 4 (A. Slaski, S. Virk, N. Frolov) |
UPC | 0,25 (L. Màrquez) | 1,75 (M. Gonzàlez) | 0 | 0 |
UHEL | 2 (L. Carlson) | 0 | 0 | 5 (I. Listenmaa), 6 (J. Shen), 6 (C. Li) |
OntoText | 0 | 0 | 0 | 4,1 (M. Damova, M. Chechev, S. Enev) |
This WP has delivered two-way interoperability between natural language and ontologies. The prototype was built and made publicly available at http://molto.ontotext.com. The prototype integrates the infrastructure for knowledge modeling, semantic indexing and retrieval with tools for NL queries to the semantic repository and verbalization of the results.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 0 |
UPC | 0 | 0 | 0 | 0 |
UHEL | 0 | 0 | 0 | 0 |
OntoText | 4 (B. Popov) | 0 | 0 | 10,1 (M. Chechev, P. Mitankin, M. Nozchev, F. Alexiev, A. Ilchev, I. Kabakov, K. Krustev, V. Zhikov) |
The milestone MS7 has been achieved in M24 (First prototypes of hybrid combination: The methods are implemented and evaluated on a specific test set).
The work of the fourth semester corresponds to the last three tasks of the WP (T5.4 Baseline systems, T5.5 Hybrid Models and T5.6 Systems evaluation, see http://www.molto-project.eu/workplan/statistical-and-robust-translation). The baseline systems have been improved by extending the GF translator. The translator can now handle chunks, which widens its coverage (Task 5.4).
For Task 5.5 we have implemented two kinds of hybrid models, which we call Soft and Hard integrations. The following section outlines their main characteristics.
Finally, for Task 5.6 both the baselines and the hybrid systems have been evaluated using a variety of lexical metrics and compared with publicly available generic translators such as Google and Bing. A manual evaluation has also been carried out to compare the pure SMT translator with the hybrid system that was most promising according to the automatic evaluation.
The work done for these tasks has been submitted to the 16th Annual Conference of the European Association for Machine Translation (EAMT 2012), and the submitted paper, titled "A Hybrid System for Patent Translation", can be found on the MOLTO web page.
At the time of writing this report, the deliverable D5.2 Description and evaluation of the combination prototypes is being submitted as a regular publication. It is a public document accessible from the MOLTO web page.
Two kinds of hybrid translators for patents have been developed. The final systems are not only a combination of two different engines but the subsystems also mix different components. We have developed a GF translator for the specific domain that uses an in-domain SMT system to build the lexicon; an SMT system is on top of it to translate those phrases not covered by the grammar.
In the previous report we showed that the GF grammar-based system alone could not parse most patent sentences. Consequently, the current translation system aims at using GF for translating patent chunks and assembling the results in a later phase. As explained in D5.2, this implies several modifications to the GF baseline itself.
To gain robustness in the final system, the output of the GF translator is used as a priori information for a higher-level SMT system. The SMT baseline is fed with phrases, which are integrated in two different ways. In the first, which we call "Hard Integration", phrases with a GF translation are forced to be translated this way. The system can reorder the chunks and translate the untranslated chunks, but there is no interaction between GF and pure SMT phrases. In the second, the "Soft Integration" system, phrases with a GF translation are included in the translation table with a certain probability, so that the phrases coming from the two systems interact.
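The difference between the two integrations can be illustrated with a toy phrase table. The following Python sketch is not the implementation used in the project: the function names, the table layout and the `gf_weight` value are assumptions made for illustration only, and decoding and chunk reordering are not modelled.

```python
from typing import Dict

# Toy phrase table (hypothetical layout): source phrase -> {target phrase: probability}.
PhraseTable = Dict[str, Dict[str, float]]


def hard_integration(smt_table: PhraseTable, gf_phrases: Dict[str, str]) -> PhraseTable:
    """Force the GF translation: SMT candidates for GF-covered phrases are dropped."""
    merged = {src: dict(cands) for src, cands in smt_table.items()}
    for src, gf_target in gf_phrases.items():
        merged[src] = {gf_target: 1.0}
    return merged


def soft_integration(
    smt_table: PhraseTable, gf_phrases: Dict[str, str], gf_weight: float = 0.5
) -> PhraseTable:
    """Add the GF translation as one more candidate that competes with SMT candidates.

    gf_weight is an assumed share of probability mass given to the GF candidate;
    the existing SMT candidates are scaled down to share the remainder.
    """
    merged = {src: dict(cands) for src, cands in smt_table.items()}
    for src, gf_target in gf_phrases.items():
        cands = merged.get(src, {})
        if not cands:
            # No SMT candidate exists, so the GF translation is the only option.
            merged[src] = {gf_target: 1.0}
            continue
        scaled = {tgt: p * (1.0 - gf_weight) for tgt, p in cands.items()}
        scaled[gf_target] = scaled.get(gf_target, 0.0) + gf_weight
        merged[src] = scaled
    return merged


# Usage with made-up entries.
smt = {"la composition": {"the composition": 0.7, "composition": 0.3}}
gf = {"la composition": "the composition"}
hard_table = hard_integration(smt, gf)
soft_table = soft_integration(smt, gf, gf_weight=0.5)
```

In the hard variant the SMT alternatives disappear for every phrase that GF covers, whereas in the soft variant they remain available with reduced probability, which is what allows phrases from the two systems to interact.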
The hybrids exploit the high coverage of statistical translators and the high precision of GF to deal with specific issues of the language. At this moment the grammar tackles agreement in gender and number, agreement between chunks, and reordering within the chunks. Although the cases where these problems apply are not extremely numerous, both manual and automatic evaluations consistently prefer the hybrid system over the two individual translators. In the near future we plan to widen the range of issues handled by the grammar. In addition, modifications to the GF translator using SMT components and new kinds of phrase combination will be introduced.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 1 (R. Enache) | 0 |
UPC | 7.75 (L. Màrquez, L. Padró, H. Rodríguez, D. Farwell) | 11.25 (C. España, M. Gonzàlez, X. Carreras) | 0 | 0 |
UHEL | 0 | 0 | 0 | 0 |
OntoText | 0 | 0 | 0 | 0 |
The final hybrid translators have been developed for the French-English language pair. We also aim at including German, so in the following months the concrete syntax for German will be completed. We plan to complete the task in May and it does not affect any other tasks of the project. The systems in the three languages will be available for the final evaluation.
Deliverable D6.2 has been released as tagged SVN content publicly available at svn://molto-project.eu/tags/D6.2. Bug fixing and some more features may continue to be developed in the head branch.
With respect to M18, we added an upper layer to the MGL library to support commands issued to a Computer Algebra System (CAS) and to render the answers in the natural languages as text or speech using actual concrete syntaxes for 3 languages: English, Spanish and German.
We developed software components to interact with a CAS (Sage), either externally using the HTTP protocol or inside the Sage shell and notebook interfaces. Furthermore, we developed a testing procedure to assist in regression tests for the tool.
Developed a prototype to engage Sage in a dialog using natural language that runs on Linux and Mac OS X. The system assists command composition by providing autocompletion and gives spoken output on demand.
Developed a Sage interface to issue commands to a Sage process from the native Sage shell or notebook. On Linux it provides autocompletion through the native shell mechanism.
The dialog prototype has been demonstrated at DEIMS12.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 0 |
UPC | 6 (J. Saludes) | 0 | 0 | 6 (A. Ribó Mor) |
UHEL | 0 | 0 | 0 | 0 |
OntoText | 0 | 0 | 0 | 0 |
During this period, WP7 has taken a step forward in the development of the prototype: the Patent MT and Retrieval Beta Prototype was first released in M21, and the final version of D7.1 has been delivered.
Due to the incremental development of the prototype, most of the tasks run until M27, when the final prototype must be delivered. The progress on the individual tasks is described below:
In relation to Task 7.2, the EPO provided a parallel corpus of patents, of which only 66 patents belong to the biomedical domain. We downloaded an alternative corpus of 7,705 documents directly from their website (i.e. publicly available). The content of these documents can be summarized as follows: 4,274 of the 7,705 documents have claims (6M lines), and 2,058 of these are trilingual (3M lines). 2,116 documents have claims written only in English, 66 have claims only in German (260K lines), and 34 only in French (88K lines). There are no files with any other combination of languages.
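The breakdown above is internally consistent, as the short Python check below shows. The variable names are illustrative; the figures are those quoted in the text, and the number of documents without claims is simply derived from them.

```python
# Figures quoted in the corpus description above.
total_documents = 7705
documents_with_claims = 4274

# Documents with claims, split by the languages in which the claims are written.
claims_breakdown = {
    "trilingual": 2058,
    "english_only": 2116,
    "german_only": 66,
    "french_only": 34,
}

# The four groups account for every document with claims,
# which matches the statement that no other language combination occurs.
assert sum(claims_breakdown.values()) == documents_with_claims

# Derived figure: documents in the download that contain no claims at all.
documents_without_claims = total_documents - documents_with_claims
print(documents_without_claims)
```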
Regarding Task 7.4 and Task 7.5, the ontologies, indexes, databases and retrieval engines have been set up for the specific domain and using the patent documents described above. The semantic annotation process is carried out by a GATE pipeline on the English texts. We are working to export the annotations during the translation process in order to be able to show the annotations also in the French and German texts.
As for Task 7.3 and Task 7.6, grammar development and SMT adaptation to the domain are being carried out jointly with the WP5 tasks. The grammars have been developed for English and French, and will subsequently be developed for German as well.
Finally, regarding Task 7.7, the interface allows accessing the system in three different ways: the controlled language, SPARQL and terms in the index. In the future we will include free text and a combination of it with the controlled language.
Since M21 a fully functional version of the prototype has been available at http://molto-patents.ontotext.com/. The demo allows querying the system in English and French. The patents in the database have original text in English, French and German.
The retrieval system can be queried in three different ways. The NL-based interface allows the user to query the system in English and French using written natural language. The SPARQL interface, more suitable for advanced users, allows the repository to be browsed precisely using SPARQL queries. The keyword-based visual browsing interface uses the RelFinder tool, in which the user can search for keywords using the autocomplete functionality. The results from the RelFinder search are visualised as graphs.
The visualization of the results displays the list of classes from the ontologies that match the query and the list of patent documents indexed under the matching criteria. The interface also provides a link to the semantically annotated documents and the original patents. The interface that shows the annotated documents highlights in the text the words that are related to any semantic item. Colors are assigned according to the semantic annotation type. The right side of the page lists the semantic types and colors that are present in the text.
A paper about the Patent retrieval system was accepted at WWW2012 Conference, to be held in April.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 3 (R. Enache) | 3 (A. Slaski) |
UPC | 0 | 7,5 (M. Gonzàlez, C. España) | 0 | 0 |
UHEL | 0 | 0 | 0 | |
OntoText | 0 | 0 | 0 | 8,8 (M.Chechev, M.Damova, V.Zhikov, I.Kabakov) |
In general, we are achieving the objectives related to WP7 tasks within the timeframe. However, due to several issues related to the gathering of the corpora, the databases of the retrieval system do not yet include automatic translations of the patent documents but only real translations. The issue directly affects the annotation process of Task 7.5, but it does not imply a delay for the whole prototype. We estimate that the automatic translations and annotations will be included in the final prototype.
The work package started with data collection, proceeded with developing the ontology interface, and has lately focused on the baseline translator. The translator covers only five languages so far, but will be extended soon. The ontology interface will permit multilingual queries about museum objects, exploiting the MOLTO Knowledge Representation Infrastructure. It also makes this case study an example of multilingual ontology verbalization.
Ontology and corpus study (D8.1).
Grammars for translation and multilingual NLG for painting descriptions in five languages: English, Finnish, French, Italian, Swedish. This was built in a modular way that is easy to extend to new languages, which we will do soon.
Ontology verbalization in a generic way. The same languages as in translation will be supported, but are not yet.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 1 (R. Enache) | 0 |
UPC | 0 | 0 | 0 | 0 |
UHEL | 0 | 0 | 0 | 0 |
OntoText | 0 | 0 | 0 | 5,9 (M.Chechev, M.Damova, K.Krustev, V.Zhikov) |
We have not been able to proceed at the planned pace, and would like to have an extension of the WP time. We would like to extend this WP and its last deliverable 8.3 till Month 36. We have not been able to use the person months as planned (as can be seen from the Use of resources). One of the planned key persons, Dana Dannélls at UGOT, will be able to join MOLTO full-time a few months later than originally planned, probably October/November 2012.
This WP is working on collecting evaluation plans from each site.
An extended D9.1E Evaluation plan has been written.
Progress evaluation has mainly been carried out by each site during development. It would be a good idea to collect these results more systematically.
For the SMT/hybrid patent case, automatic measures (BLEU, but also others; see the UPC slides for examples) are probably the main ones used.
In developing the GF grammars, informants (native speakers of the relevant languages) have been used during the grammar writing process to check and correct output. The informants have been given output to read and have informed the developer if sentences are correct or if not, how they should be corrected.
Moving forward, the final evaluations will need to cover the usability of the tools as well as the quality of the output. The WP9 review slides give some examples of user communities that might be mobilized for usability evaluation and of the platforms that could be used. One point under discussion is that evaluators need some motivation to use the tools.
For output quality, final evaluations will likely involve both automatic and manual methods. For automatic methods, UPC's Asiya evaluation kit offers some syntactically and semantically oriented metrics in addition to the purely lexical ones like BLEU, but only for a couple of languages. As all automatic metrics rely on comparison to gold standard human translations, these need to be obtained for the test sets, if they are to be used.
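To make the dependence on reference translations concrete, the sketch below computes clipped unigram precision, one of the ingredients of BLEU. It is a simplified illustration with whitespace tokenisation and a hypothetical function name, not the Asiya implementation and not a full BLEU score.

```python
from collections import Counter
from typing import List


def clipped_unigram_precision(candidate: str, references: List[str]) -> float:
    """Share of candidate tokens that also occur in a reference, with clipping.

    Each candidate token is credited at most as many times as it appears
    in the single reference where it is most frequent.
    """
    candidate_counts = Counter(candidate.split())
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0

    # For every token, the maximum count observed in any one reference.
    max_reference_counts: Counter = Counter()
    for reference in references:
        for token, count in Counter(reference.split()).items():
            max_reference_counts[token] = max(max_reference_counts[token], count)

    clipped = sum(
        min(count, max_reference_counts[token])
        for token, count in candidate_counts.items()
    )
    return clipped / total


# Usage: a degenerate candidate is penalised because "the" is clipped to 2 matches.
score = clipped_unigram_precision(
    "the the the the the the the", ["the cat is on the mat"]
)
```

Without a gold standard translation the function has nothing to compare against, which is why reference translations have to be obtained for every test set.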
Manual evaluation methods, on the other hand, require humans to do the evaluations. For the patent case, evaluators need to have sufficient understanding of the material to be able to assess whether the translations are correct or not, particularly since we expect one of the strengths of the GF hybrid to be in correctly handling long formulae. Therefore plans have been made to hire professional patent translators of the languages in question, with the evaluation expected to take place in June. Since Google now also provides patent translations, these will be used as a point of comparison. The TAUS scale, fluency etc. could be used in this case.
For the museum case, one manual evaluation approach was to produce museum descriptions in various languages that combine the simpler rules, e.g. "Painter painted Painting in City in Year on Canvas", and then have native speakers check the individual relations involved (Who painted? What did they paint? Where? When? etc.) and combine these into a measure of the overall fidelity. For this, evaluators do not necessarily need to be museum experts; any native speaker of the language in question should do. An interesting description of such an approach is given in http://www.cs.ust.hk/~dekai/library/WU_Dekai/LoWu_Acl2011.pdf. Other measures, such as fluency or the TAUS fitness scale, could also be used.
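One possible way of combining the per-relation judgments into a fidelity measure is sketched below in Python. The scoring scheme (fraction of relations judged correct, averaged over descriptions) and all names are assumptions made for illustration; the text does not fix a particular formula.

```python
from typing import Dict, List


def description_fidelity(judgments: Dict[str, bool]) -> float:
    """Fraction of relations (who, what, where, when, ...) judged correct."""
    if not judgments:
        raise ValueError("at least one relation judgment is required")
    return sum(judgments.values()) / len(judgments)


def overall_fidelity(descriptions: List[Dict[str, bool]]) -> float:
    """Unweighted mean of the per-description fidelity scores."""
    if not descriptions:
        raise ValueError("at least one description is required")
    return sum(description_fidelity(d) for d in descriptions) / len(descriptions)


# Usage with made-up judgments from a native speaker for two descriptions.
first = {"who": True, "what": True, "where": False, "when": True}
second = {"who": True, "what": True, "where": True, "when": True}
fidelity = overall_fidelity([first, second])
```

An unweighted mean treats every relation as equally important; weighting some relations more heavily would be an equally plausible choice.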
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 0 |
UPC | 2,16 (L. Màrquez, L. Padró, D. Farwell) | 0,5 (C. España) | 0 | 0 |
UHEL | 0.5 (L. Carlson) | 0 | 9 (S. Nyrkkö) | 0 |
OntoText | 0 | 0 | 0 | 2 (M. Chechev, K. Krustev) |
UZH | 0 | 0 | 0 | 0 |
BI | 0 | 0 | 0 | 0 |
A few events have been organized by MOLTO and some are in the making. The young researchers in the Consortium have published a number of papers at international venues on initial results of the project. Project meetings have taken place in Helsinki and in Zürich, with the extra MOLTO-EEU kickoff meeting in Gothenburg. GF tutorials and tutorials on MOLTO work have been delivered on various occasions, most prominently during the GF Summer School: Frontiers of Multilingual Technologies in August 2011.
At UGOT, R. Enache and S. Virk have passed their licentiate, a step towards their PhD, by publishing work in connection with MOLTO. Moreover, D. Dannélls has also defended her PhD seminar, discussing work on natural language analysis in the cultural heritage domain. Finally, K. Angelov obtained his PhD at Chalmers with a thesis on the inner workings of GF, much of which benefits WP2 and WP3.
The list of publications, archived on the MOLTO website (http://www.molto-project.eu/biblio?sort=year&order=asc), follows here below.
Controlled Language for Everyday Use: the MOLTO Phrasebook, Ranta, Aarne, Enache Ramona, and Détrez Grégoire, Controlled Natural Languages Workshop (CNL 2010), Marettimo, Italy, (2011)
The GF mathematics library, Saludes, Jordi, and Xambó Sebastian, THedu'11, (2011)
Grammatical Framework: Programming with Multilingual Grammars, Ranta, Aarne, CSLI Studies in Computational Linguistics, Stanford, p.350, (2011)
MOLTO Enlarged EU Annex I - Description of Work, Consortium, MOLTO , (2011)
MOLTO poster presented at EAMT Conference (European Association for Machine Translation) 2011, Leuven, Ranta, Aarne, and Enache Ramona, (2011) - also presented at META-FORUM by Listenmaa, Inari in Budapest, 2011.
Typeful Ontologies with Direct Multilingual Verbalization, Angelov, Krasimir A., and Enache Ramona, Controlled Natural Languages Workshop (CNL 2010), Marettimo, Italy, (2011)
Typeful Ontologies with Direct Multilingual Verbalization poster, presented at the Google Anita Borg retreat, June 2011, Zurich, Enache, Ramona, (2011)
The GF Mathematics Library, Saludes, Jordi, and Xambó Sebastian, Proceedings First Workshop on CTP Components for Educational Software (THedu'11), 02/2012, Volume Electronic Proceedings in Theoretical Computer Science, Number 79, Wrocław, Poland, p.102–110, (2011)
D1.3A Advisory Board Report, Hall, Keith, and Pulman Stephen, 03/2011, Number D1.3A, Gothenburg, (2011)
MOLTO - Multilingual On-line Translation - Annual Report 2010-2011, Caprotti, Olga, España-Bonet Cristina, and Alanko Lauri, 03/2011, Gothenburg, (2011) - Published on cordis.eu.
A Framework for Improved Access to Museum Databases in the Semantic Web, Dannélls, Dana, Damova Mariana, Enache Ramona, and Chechev Milen , RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING, 09/2011, Hissar, Bulgaria, (2011)
Hybrid Machine Translation Guided by a Rule–Based System, España-Bonet, Cristina, Labaka Gorka, Díaz De Ilarraza Arantza, Màrquez Lluís, and Sarasola Kepa , Machine Translation Summit, 09/2011, Xiamen, China, p.554-561, (2011)
The painting ontology, Dannélls, Dana, CIDOC 2011 conference, 09/2011, (2011)
Patent translation within the MOLTO project, España-Bonet, Cristina, Enache Ramona, Slaski Adam, Ranta Aarne, Màrquez Lluís, and Gonzàlez Meritxell, Workshop on Patent Translation, MT Summit XIII, 09/2011, p.70-78, (2011)
Reason-able View of Linked Data for Cultural Heritage, Damova, Mariana, and Dannélls Dana, The Third International Conference on SOFTWARE, SERVICES & SEMANTIC TECHNOLOGIES (S3T), 09/2011, Bourgas, Bulgaria, (2011)
Deep evaluation of hybrid architectures: simple metrics correlated with human judgments, Labaka, Gorka, Díaz De Ilarraza Arantza, España-Bonet Cristina, Sarasola Kepa, and Màrquez Lluís, International Workshop on Using Linguistic Information for Hybrid Machine Translation, 11/2011, Barcelona, Spain, p.50-57, (2011)
The Patents Retrieval Prototype in the MOLTO project, Chechev, Milen, Gonzàlez Meritxell, Màrquez Lluís, and España-Bonet Cristina, WWW2012 Conference, Lyon, France, (2012)
MOLTO - Multilingual On-Line Translation, Ranta, Aarne, Talk given at Xerox Research Centre Europe, Grenoble, 19 January 2012, 01/2012, (2012)
Using GF in multimodal assistants for mathematics, Archambault, Dominique, Caprotti Olga, Ranta Aarne, and Jordi Saludes, 02/2012, Digitization and E-Inclusion in Mathematics and Science 2012, (2012)
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 4 (O. Caprotti) | 0 | 0 | 0 |
UPC | 1 (S. Xambo) | 0 | 0 | 0 |
UHEL | 0 | 0 | 0 | 0 |
OntoText | 1 (B. Popov) | 0 | 0 | 2,6 (M. Chechev, M. Damova) |
UZH | 0 | 0 | 0 | 0 |
BI | 0 | 0 | 0 | 0 |
None to report.
During the first 3 months of our participation in the MOLTO project we completed an initial integration of the GF-provided services (mainly translation and look-ahead editing) into AceWiki.
We implemented a new Java front-end to the GF Webservice, and use it to connect to the GF services from AceWiki. The existing AceWiki user interface was extended to allow easy switching between different languages and to present, with each sentence, its GF-provided analysis (translations into other languages, word alignment diagrams, GF syntax trees, etc.). The AceWiki storage format was changed to one based on GF abstract trees (which are language-neutral).
The other main part of our work dealt with the implementation of the ACE grammar in GF. We tested the recall and precision of an existing implementation (Angelov and Ranta, 2009), which targets an earlier version of ACE, and found that some changes need to be introduced to make it compatible with the latest version of ACE. More importantly, we decided to focus on, and started work on, a grammar of the subset of ACE that is used in the current AceWiki.
We also experimented with taking the content of an existing AceWiki demo wiki (domain Geography) and using it to pre-populate the new GF-based AceWiki.
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 0 |
UHEL | 0 | 0 | 0 | 0 |
UZH | 0 | 2 (K. Kaljurand) | 0 | 0 |
None to report, besides delayed start as explained later in this document.
In the first 2 months we started with the Adoption phase as described in the DoW for WP12. We have focused our efforts on the requirements for the verbalization component (D12.1). We distinguish 4 categories of relevant requirements.
We presented a requirements draft to our partners in March 2012.
At the kickoff hosted by UGOT we did a first round with the UGOT people to draw up the specification for migrating Be Informed's current explanation prototype to GF.
Next Steps
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 0 |
BI | 0,25 (J. van Grondelle, J. van Aart) | 0 | 0 | 0,25 (H. ter Horst) |
WP12 runs longer than indicated in the Gantt chart appearing in the Annex; however, the duration is correctly listed under 3.3.3. We planned for a duration of 15 months; D12.2, for instance, is projected for March 2013.