D1.4 Periodic Management Report T18


Contract No.: FP7-ICT-247914
Project full title: MOLTO - Multilingual Online Translation
Deliverable: D1.4 Periodic Management Report T18
Security (distribution level): Confidential
Contractual date of delivery: M18
Actual date of delivery: 1 Oct 2011 (expected)
Type: Report
Status & version: Final
Author(s): O. Caprotti et al.
Task responsible: UGOT
Other contributors: All


Abstract

Progress report for the third semester of the MOLTO project lifetime, 1 Mar 2011 - 30 Sep 2011.

1. Publishable summary

1.1 Project context and objectives

The MOLTO project (Multilingual Online Translation) started on March 1, 2010 and will run for 36 months. It promises to develop a set of tools for translating texts between multiple languages in real time with high quality. MOLTO will use multilingual grammars based on semantic interlinguas and statistical machine translation to simplify the production of multilingual documents without sacrificing quality. The interlinguas are based on domain semantics and are equipped with reversible generation functions: translation is obtained by composing the parsing of the source language with the generation of the target language. An implementation of this technology is provided by GF [2], the Grammatical Framework. GF technologies in MOLTO are complemented by the use of ontologies, such as those used in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data.
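The composition of parsing and generation can be illustrated with a toy sketch. It is not GF code: the tree format, the `CONCRETE` tables, and the function names `parse`, `linearize`, and `translate` are assumptions made for this illustration, and they cover a single sentence pattern in English and Swedish.

```python
import re

# Toy concrete syntaxes for one abstract function, Costs(item, amount).
# The table layout and the vocabulary are invented for this sketch.
CONCRETE = {
    "eng": {"items": {"Book": "the book", "Lamp": "the lamp"},
            "costs": "costs", "euros": "euros"},
    "swe": {"items": {"Book": "boken", "Lamp": "lampan"},
            "costs": "kostar", "euros": "euro"},
}


def linearize(tree, lang):
    """Generate a sentence in `lang` from an abstract tree."""
    _, item, amount = tree
    c = CONCRETE[lang]
    return f"{c['items'][item]} {c['costs']} {amount} {c['euros']}"


def parse(sentence, lang):
    """Recover the abstract tree from a sentence in `lang`."""
    c = CONCRETE[lang]
    pattern = rf"(.+) {re.escape(c['costs'])} (\d+) {re.escape(c['euros'])}"
    match = re.fullmatch(pattern, sentence)
    if match is None:
        raise ValueError(f"cannot parse: {sentence!r}")
    for item, surface in c["items"].items():
        if surface == match.group(1):
            return ("Costs", item, int(match.group(2)))
    raise ValueError(f"unknown item in: {sentence!r}")


def translate(sentence, source, target):
    """Translation = parsing the source composed with generating the target."""
    return linearize(parse(sentence, source), target)


print(translate("the book costs 100 euros", "eng", "swe"))
```

Because the amount and the currency live in the shared abstract tree, they cannot drift between languages, which is the kind of precision the interlingua approach is meant to give within a limited domain.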

MOLTO is committed to dealing with 15 languages, which include 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is on-going work on at least Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, Swahili, Tswana, and Turkish.

While tools like Systran (Babelfish) and Google Translate are designed for consumers of information, MOLTO will mainly target the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating its web pages automatically without fear that the message will change. Third-party translation tools, possibly integrated in browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecision. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (which equals 10 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its website, then it might be committed to it. There is a well-known trade-off in machine translation: one cannot reach full coverage and full precision at the same time. In this trade-off, Systran and Google have opted for coverage whereas MOLTO opts for precision in domains with a well-understood language.

Three such domains will be considered during the MOLTO project: mathematical exercises, biomedical patents, and museum object descriptions. The MOLTO tools however will be applicable to other domains as well. Examples of such domains could be e-commerce sites, Wikipedia articles, contracts, business letters, user manuals, and software localization.

1.2 Main results achieved so far

The results achieved during the first 18 months of the project are:

  • the first web services demonstrations highlighting the fundamental features of the MOLTO system tools: high-quality multilinguality and NLG multilingual query interface to semantic data;
  • the mathematical grammar library, which allows simple mathematical problems to be expressed in more than 10 languages, although not all at the same level of quality;
  • an online GF grammar editor for assisting authors of GF application grammars when editing in the cloud, also including the option to carry out example-based grammar generation;
  • GF plugins in Python and C to call library functions;
  • APIs for GF Grammar compilation and for the MOLTO translator's tools;
  • a prototype demonstrating GF grammar-ontology interoperability;
  • a description of the cultural heritage ontology for museum artifacts and use case scenarios for the application grammar.

1.3 Expected final results and their potential impact and use

The expected final product of MOLTO is a software toolkit made available via the MOLTO website. It will consist of a family of open-source software products:

  • a grammar development tool, available as an IDE and an API, to enable the use as a plugin to web browsers, translation tools, etc, for easy construction and improvement of translation systems and the integration of ontologies with grammars
  • a translator’s tool, available as an API and some interfaces in web browsers and translation tools
  • a grammar library for linguistic resources
  • application grammar libraries for the domains of mathematics, patents, and cultural heritage

These tools will be portable to different platforms as well as generally portable to new domains and languages. By the end of the project, MOLTO expects to have grammar resource libraries for 18 languages, whereas MOLTO use cases will target between 3 and 15 languages.

The main societal impact of MOLTO will be its contribution to a new perception of the possibilities of machine translation, moving away from the idea that domain-specific high-quality translation is expensive and cumbersome. MOLTO tools will change this view by radically lowering the effort needed to provide high-quality scoped translation for applications where the content has enough semantic structure.

2. Core of the report

2.1 Project objectives for the period

This semester marks the half-lifetime of the project, a point at which all work-packages are under development and the first integrations ought to take place. In particular, the initial prototypes are being delivered. They include the APIs for WP2 and WP3, the GF grammar IDE, the grammar-ontology interoperability prototype (which allows natural language generation from an ontology and translation of natural language queries to SPARQL), the GF grammar for simple mathematical exercises, and information extraction. The main integrations are taking place between the GF grammar tools developed in WP2 and the translator's workbench developed in WP3, and also between the museum ontology created by WP8 and the interoperability described above and carried out in WP4.

2.2 Work progress and achievements during M12-M18

This section provides a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.

For each work-package, except project management, which is reported in Section 2.3, the work-package leader provides the following information:

  • A summary of progress towards objectives and details for each task;
  • Highlights of clearly significant results;
  • If applicable, the reasons for deviations from Annex I and their impact on other tasks as well as on available resources and planning;
  • If applicable, the reasons for failing to achieve critical objectives and/or not being on schedule with remarks on the impact on other tasks as well as on available resources and planning;
  • A statement on the use of resources, in particular highlighting and explaining deviations between actual and planned person-months per work package and per beneficiary in Annex 1 (Description of Work);
  • If applicable, corrective actions.

WP2 Grammar developer’s tools - M18

Highlights

The period M13-M18 has been very active in the development of grammar development tools and also in their documentation and dissemination. Thus we can report the following new software:

  • The GF Grammar Web IDE was released at http://www.grammaticalframework.org/demos/gfse/ and has undergone a few upgrades. It is one of the two systems to be reported in D2.2.
  • The GF Shell Reference http://www.grammaticalframework.org/doc/gf-shell-reference.html was published and will be upgraded semi-automatically whenever new functionalities appear.
  • The GF Resource Grammar Library got three new complete languages: Punjabi, Nepali, and Persian. Partial implementations of Latvian and Swahili were published.
  • The Eclipse plug-in to GF was started and is operational with basic functionality, as reported in D2.2.
  • The example-based grammar writing feature is available as a shell program and also in the Web IDE, as reported in D2.2.

Ongoing work

We are working on further development of the two IDEs for GF and of the example-based grammar writing method. We are also gathering material for Deliverable D2.3, Best Practices, due M24.

WP3 Translator's tools - M18

During the reporting period, work has progressed on the following fronts.

  1. The MOLTO translation editor demo prototype by Krasimir Angelov, built with the Google Web Toolkit, was installed under Eclipse with the Apache web server and FastCGI. The complete development and testing cycle inside Eclipse is now available. The installation instructions are in the MOLTO UHEL Wiki at XXX.

A new tab embedding a treegrid editor for editing term equivalents, implemented with the ExtJS JavaScript library, has been added to the editor. One version is able to query Ontotext FactForge for term candidates. It remains to implement a full search and edit back end.

  2. A design and implementation plan for an extended Translation Tools API has been drawn up and submitted as Deliverable 3.1 (The Translation Tools API). Most parts of the API already exist as code.

A development environment for the open source translation management system GlobalSight has been installed for adapting parts of the system for the MOLTO translation tools.

It remains to develop the glue to connect the existing parts together. Some extensions to the grammar development API have been listed in a requirements section in the deliverable.

  3. A C language runtime for parsing and generating with PGF files has been written by Lauri Alanko. The first release will be out in Oct 2011.

  4. A first version of an ontology/terminology acquisition toolkit for lexical resources management by Seppo Nyrkkö was demonstrated at the Helsinki project meeting.

WP4 Knowledge engineering - M18

WP4's main task is to research the possibilities for interoperability between grammars, written in GF, and ontologies and to build a prototype demonstrating it.

During the period M12-M18 the work concentrated on refactoring and bug fixing of the prototype built earlier, on experimenting with bigger data sets, and on extending the functionality of the prototype. The main points can be summarized as follows:

  • the infrastructure has been updated with the latest versions of the semantic repository and tools - this was necessary because support for SPARQL 1.1 queries comes with the newest version of the semantic repository Owlim;
  • the set of ontologies has been extended with dbpedia, geonames and other databases - this task is important for re-usability of the prototype in the use cases WP6-WP8;
  • the requirements for the verbalization of the results have been analyzed;
  • a verbalization of the results has been implemented - the current implementation automatically generates a GF abstract representation from the results, and this representation is matched against the GF abstract grammar and GF concrete grammar, which are also built automatically from the ontology;
  • 4 more languages have been added to the prototype - the current list of languages for the prototype is: English, Finnish, Swedish, French, German and Italian;
  • the prototype GUI has been refactored and internationalized;
  • the prototype has been tested and bug-fixed for Deliverable D4.3.
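The two directions of the interoperability, from a query to SPARQL and from query results back to text, can be sketched as follows. This is a minimal illustration and not the prototype's code: the tree format, the `ex:` namespace, the `locatedIn` property, and the hand-written English template are all assumptions, whereas the prototype builds its GF grammars automatically from the ontology.

```python
# Hypothetical namespace; the prototype's real ontologies are not shown here.
PREFIX = "PREFIX ex: <http://example.org/onto#>"

# Hand-written plural forms stand in for a generated concrete grammar.
PLURAL = {"Museum": "museums", "Painting": "paintings"}


def to_sparql(tree):
    """Map an abstract query tree such as ("LocatedIn", "Museum", "Sweden")
    to a SPARQL SELECT query over the hypothetical ex: vocabulary."""
    kind, cls, place = tree
    if kind != "LocatedIn":
        raise ValueError(f"unsupported query type: {kind}")
    return (
        f"{PREFIX}\n"
        f"SELECT ?x WHERE {{ ?x a ex:{cls} ; ex:locatedIn ex:{place} . }}"
    )


def verbalize(tree, names):
    """Render the result names of a query as one English sentence."""
    _, cls, place = tree
    if not names:
        return f"There are no {PLURAL[cls]} located in {place}."
    if len(names) == 1:
        listing = names[0]
    else:
        listing = ", ".join(names[:-1]) + " and " + names[-1]
    return f"The following {PLURAL[cls]} are located in {place}: {listing}."


query = ("LocatedIn", "Museum", "Sweden")
print(to_sparql(query))
print(verbalize(query, ["Nationalmuseum", "Vasamuseet"]))
```

Keeping the query as an abstract tree means the same tree can drive both the SPARQL generation and the verbalization, so adding a language only requires a new set of templates or, in the real system, a new concrete grammar.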

WP5 Statistical and robust translation - M18

Summary of progress

M18 is the date by which Milestone S5 (First prototypes of the baseline combination models) should be achieved. The baseline systems for this workpackage, as described in Task 5.4, include a statistical machine translation (SMT) system trained with patents data, and the GF multilingual translation with a specific grammar for patents.

The SMT system was mainly developed in the previous six months and was already reported in the First Year Report. In the following section we explain the most significant results that have been accomplished with respect to the GF system.

For Task 5.5, we have started the work towards the hybrid system. Parts of the GF system, such as the lexicon building, already make use of statistical components. In addition, the methodology for combining SMT and GF alignments has been established and is waiting to be applied to the patents domain.

The work done for these tasks has been recently published in the "MT Summit XIII 4th Workshop on Patent Translation" with the title "Patent translation within the MOLTO project".

At the time of writing this report, Deliverable D5.1, Description of the final collection of corpora, corresponding to Tasks 5.1 and 5.2, has been submitted as a regular publication. It is a public document accessible from the MOLTO web page.

Highlights

A first implementation of the English-to-French patent translator with GF is available. The translation process can be divided according to the action of three modules: a generic pre-processing, the on-line lexicon building, and the patents grammar.

The generic pre-processing consists of a purpose-built tokeniser that deals with compound nouns, phrases separated by hyphens, chemical compounds, etc. The Stanford POS-tagger is used for named entity recognition, and a recogniser of numbers has been developed. Once tagged, chemical compounds can be translated independently by the compounds grammar. This grammar is at an early stage of development within this workpackage.
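The kind of tokenisation described here can be sketched with a single regular expression. The report does not give the actual rules, so the four token classes and their patterns below are assumptions chosen for illustration; in particular, the chemical-name pattern only recognises names that start with numeric locants.

```python
import re

# Alternatives are tried in order, so the more specific patterns come first.
TOKEN = re.compile(r"""
    (?P<CHEM>\d+(?:,\d+)*-[A-Za-z][A-Za-z-]*)   # e.g. 2,4-dichlorophenol
  | (?P<NUM>\d+(?:\.\d+)?)                      # e.g. 100 or 0.5
  | (?P<WORD>[A-Za-z]+(?:-[A-Za-z]+)*)          # word or hyphenated phrase
  | (?P<PUNCT>\S)                               # any other visible character
""", re.VERBOSE)


def tokenize(text):
    """Split text into (kind, token) pairs, keeping hyphenated phrases,
    numbers, and locant-prefixed chemical names as single tokens."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


for kind, token in tokenize("A water-soluble salt of 2,4-dichlorophenol, 0.5 mg."):
    print(kind, token)
```

Tagging each token with its class at this stage is what would let a later step route chemical names to a separate grammar, as the text describes for the compounds grammar.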

The second module is devoted to lexicon building. The multilingual lexicon of the GF library is extended with nouns, adjectives, verbs and adverbs. The abstract syntax for these parts of speech is created from the claims in English; words are lemmatised, and noise and ambiguities are corrected manually. The appropriate inflection is generated using the implemented GF paradigms and the English dictionary of the GF library, English being the starting language. Base forms are then translated into French and the inflection is generated in the same way. This process will be extended to other languages later in the project.
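As a rough sketch of the last step, the script below turns a list of lemma pairs into GF lexicon source lines. The input format and the helper names are hypothetical, and the sketch assumes that the one-argument smart paradigms (`mkN`, `mkA`, `mkV`, `mkAdv`) of the GF resource library are enough for each word; irregular words would need more forms and manual checking.

```python
# Category -> smart paradigm of the GF resource grammar library.
PARADIGM = {"N": "mkN", "A": "mkA", "V": "mkV", "Adv": "mkAdv"}


def gf_identifier(lemma, cat):
    """Build an abstract function name such as device_N from a lemma."""
    return lemma.replace("-", "_").replace(" ", "_") + "_" + cat


def lexicon_entries(lemmas):
    """lemmas: iterable of (english_lemma, french_lemma, category).
    Returns three lists of GF source lines: abstract, English, French."""
    abstract, english, french = [], [], []
    for eng, fre, cat in lemmas:
        fun = gf_identifier(eng, cat)
        mk = PARADIGM[cat]
        abstract.append(f"fun {fun} : {cat} ;")
        english.append(f'lin {fun} = {mk} "{eng}" ;')
        french.append(f'lin {fun} = {mk} "{fre}" ;')
    return abstract, english, french


abstract, english, french = lexicon_entries([
    ("device", "dispositif", "N"),
    ("soluble", "soluble", "A"),
])
print("\n".join(abstract + english + french))
```

The generated lines would still have to be wrapped in the appropriate abstract and concrete module headers before GF can compile them.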

Finally, the core of the translator is the patents grammar. The GF resource grammar has been extended with functions that implement constructions that occur in patent claims. The grammar is also in its first stages: it currently has a huge number of ambiguities, and its coverage is around 15% on complete sentences. This figure can increase up to 60% when dealing with chunks instead of full sentences.

Deviations from Annex I

This workpackage is tightly related to WP7. The delay in obtaining the patents corpus in WP7 has forced a reordering of some tasks within WP5. This explains why work on Task 5.5 has replaced parts of Task 5.4, which will be finished in the coming months. Also because of the delay in the approval of the data, Deliverable 5.1 may be updated soon.

WP6 Case study: mathematics - M18

Summary of progress

Deliverable D6.1 has been released as a tagged SVN repository available at svn://molto-project.eu/tags/D6.1, although bug fixing may continue in the head branch.

With respect to the T6 progress report, we increased the number of compiled languages from 7 to 13 and checked for correctness and fluency in 3 languages.

Dissemination activities took place at CADE-23 and its satellite workshop THedu'11.

Highlights

  • Refactoring from the WebALT code to the modular form compatible with GF 3.2 is complete.

  • The library compiles for the following languages: Bulgarian, Catalan, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, Swedish and Urdu plus an artificial language (LaTeX).

  • A demo based on the minibar demo, but including mathematical output for LaTeX is up and running at http://www.grammaticalframework.org/demos/minibar/mathbar.html

  • Testing for correctness and fluency has been done for English, German and Spanish. This amounted to:

    1. Extracting a sample of the grammar productions to form a treebank to be used for quality testing and as a regression tool.
    2. Asking speakers fluent in the mathematical vernacular of these three languages to provide correct renderings of the treebank entries.
    3. Implementing the differences in the grammar library.
  • The results of Deliverable D6.1 have been presented at THedu'11.
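Using the treebank as a regression tool amounts to comparing the current output of the grammar with the renderings approved by the fluent speakers. The sketch below shows that comparison under stated assumptions: the treebank is a plain mapping from tree strings to approved renderings, and `toy_linearize` is a stand-in for a real call into the GF runtime, with a deliberate mistake so that the check has something to report.

```python
def regression_report(treebank, linearize):
    """Return (tree, expected, actual) for every treebank entry whose
    current linearization differs from the approved rendering."""
    failures = []
    for tree, expected in treebank.items():
        actual = linearize(tree)
        if actual != expected:
            failures.append((tree, expected, actual))
    return failures


# Hypothetical gold standard for English, as approved by a fluent speaker.
TREEBANK_ENG = {
    "Sum 2 3": "the sum of 2 and 3",
    "Product 2 3": "the product of 2 and 3",
}


def toy_linearize(tree):
    """Stand-in for the grammar; it wrongly uses 'times' for products."""
    op, left, right = tree.split()
    if op == "Sum":
        return f"the sum of {left} and {right}"
    return f"{left} times {right}"


for tree, expected, actual in regression_report(TREEBANK_ENG, toy_linearize):
    print(f"{tree}: expected {expected!r}, got {actual!r}")
```

Running the same report after each change to the grammar library shows whether a fix for one language or construction has altered a rendering that had already been approved.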

WP7 Case study: patents - M18

Summary of progress

WP7 is due to provide a beta prototype in the forthcoming M21: D7.1 Patent MT and Retrieval Prototype Beta. Although the WP started with a certain delay, the objectives for each task have been accomplished, mainly in the last months.

The initial tasks include the definition of the architecture for the prototype (Task 7.7) and of the use case scenarios (Task 7.1). We mainly consider two scenarios: the multilingual retrieval of biomedical patents and the online translation of patent claims and abstracts.

In relation to Task 7.4 and Task 7.5, we started the work towards the selection of ontologies to deal with the biomedical domain and the extraction of FDA terms, drugs and measurement related models. Besides, we also have created an ontology to capture the structure of patent documents. In order to query the system, we have defined the structure and topics of the queries available to the user, detailed below. Finally, we have implemented a first version of the grammar that recognizes such queries.

Regarding Task 7.2, we recently obtained the official license for using the EPO corpus. Earlier, we had been working with provisional data in order to create the first version of the GF grammars (Task 7.3) and to test the SMT system with this data (Task 7.6).

Highlights

The architecture of the multilingual patents retrieval system is based on Exopatent, a working KRI platform from OntoText. The platform allows several search options, including NL queries. We have defined 131 query examples across 21 query topics in relation to the biomedical domain. The grammar developed to process the queries covers about 600 queries in English and 500 in French.

Patents in the retrieval engine are being annotated following the two main ontologies selected for the domain.

The patents translation system is tightly related to WP5. The recent work in this task includes the development of the patents grammar and the extension of the GF Resource Grammar with functions implementing constructions that occur in patent claims. Generally speaking, the coverage of the grammar is unsatisfactory, which strengthens the case for the use of statistical techniques.

Deviations from Annex I

This workpackage has suffered a delay due to the lack of a proper license for the data corpus. Nonetheless, we are achieving the objectives related to WP7 tasks. Similarly, tasks in WP5 are being rescheduled. As we recently received the approval for the use of the data, we expect to speed up some tasks that were waiting for the data, such as the baseline system (Task 5.4) and the annotation of the patents and the generation of the database (Task 7.5).

Use of resources

The use of resources has been as planned for the tasks performed so far and described above, with no notable deviations.

WP8 Case study: cultural heritage - M18

Summary of progress

Dana Dannels from the Linguistics Department of UGOT and Mariana from Ontotext have had a Skype meeting every month to summarize what has been accomplished by WP8 and WP4 and to coordinate the work on tasks. Dana mainly worked with Ramona Enache, a GF expert, to share ideas and discuss the work in progress.

The objectives for this workpackage in the period are ....

Task ... is completed, is undergoing.....

Significant results

  • Empirical study of existing metadata schemata adopted by museums in Sweden resulted in the creation of an ad-hoc ontology supporting compatibility to a variety of CH data schemata
  • Study of syntactic structures and patterns for discourse generation
  • The well known standard CIDOC-CRM has been implemented in GF.
  • The application specific ontology that uses ontology standards was implemented in GF.
  • The grammar implementation of the ontologies has been tested for verbalizing the ontology axioms.

We have developed a prototype for generating natural language descriptions using discourse patterns in English and Swedish, and written two scientific papers about our work; both have been peer reviewed. One was presented at a WWW conference and one at a CH workshop.

WP9 User requirements and evaluation - M18

Summary of progress

The work done during months 12-18 was preliminary work targeting D9.2, the final evaluation.

Highlights

We have gained much input from Maarit Koponen's review on post-editing analysis and quality measurement in MT evaluation, also presented during the Open Day at the Third Project Meeting in Helsinki.

For designing the evaluation of translator's tools, we have studied different translation management systems that are common in the translation industry. We have selected GlobalSight, an open-source platform, for a closer study.

We have also set up a MOLTO Content Factory server, which provides collaborative term voting and term validation. These features will be used in the evaluation of terminology work. The MOLTO server already has a URL, but currently UHEL security measures make it http-accessible only via UHEL's VPN. UHEL is working to find a solution, compatible with the university security policies, for opening up the server to the MOLTO Consortium. The server base URL is "http://tfs.cc/" and the mediawiki content is hosted at "http://tfs.cc/wiki", as demonstrated during the MOLTO Open Day.

There is an ongoing discussion about collaboration with local entrepreneurs who are researching pre-editing and pre-validation of machine translatable documents. The research is focused on MT quality and its evaluation metrics.

Deviations from Annex I

This work package is tightly related to other work packages, including the dissemination work package. Due to the changes in the Patents Case Study (WP7) we are reviewing the related material for evaluation purposes. We are going to announce updates to the earlier evaluation plan (D9.1) as needed.

Use of resources

The use of resources follows the earlier plans.

WP10 Dissemination and exploitation - M18

The objectives of this WP are to:

  • (i) create a MOLTO community of researchers and commercial partners;
  • (ii) make the technology popular and easy to understand through light-weight online demos;
  • (iii) apply the results commercially and ensure their sustainability over time through synergetic partnerships with the industry.

To address (i), we have connected the RSS feed, which publishes updates from the MOLTO website, to the Twitter feed http://www.molto-project.eu/moltoproject. This will further distribute the MOLTO news feed to mobile devices, alongside the project's presence on LinkedIn.

To address (ii), we have organized a number of events geared at publicizing the core technologies employed in the project. MOLTO partners from UPC and UGOT organized a GF Summer School in Barcelona on 15-26 August 2011. The program comprised a tutorial week and an advanced week with specific topics, including work which is being carried out as part of the MOLTO workplan. In particular, J. Saludes presented the evaluation of WP6, T. Hallgren introduced web application programming for GF, R. Enache showed the work on the GF-ontology inter-relation and on WP8, and C. Espana presented the results of WP5. The web site of the school archives the presentations, the discussions (in particular the future work suggestions resulting from the panel discussion) and the calendar of the lectures. Furthermore, A. Ranta delivered a GF tutorial during CADE-23, Grammatical Framework: A Hands-On Introduction, and J. Saludes presented The GF Mathematics Library (joint work with S. Xambó) during the CADE-23 satellite workshop "CTP Components for Educational Software".

UHEL arranged the 3rd MOLTO Project meeting in Helsinki, 31 August - 2 September 2011.

To address (iii), we will now enlarge the MOLTO Consortium by two new partners, one of which, Be Informed, is a commercial partner interested in exploiting the MOLTO tools in its products.

The list of publications can be obtained from the MOLTO website, ordered by year (most recent first), http://www.molto-project.eu/biblio?sort=year&order=desc.

Publications

The GF book appeared in April 2011 and is expected to help new developers to get started with MOLTO tools: Aarne Ranta, Grammatical Framework: Programming with Multilingual Grammars, CSLI Publications, Stanford, 2011, 340 pp. http://www.grammaticalframework.org/gf-book/

Aarne Ranta, Translating between Language and Logic: What is Easy and What is Difficult. In N. Bjørner and V. Sofronie-Stokkermans (eds), Automated Deduction - CADE-23 Proceedings, LNCS/LNAI 6803, Springer, Heidelberg, 2011, pp. 5-25 (invited talk mentions work done in WP6).

Events

Computational Morphology. A course in European Master's Programme in Language and Communication Technologies 2011, University of Malta, 22-30 March 2011. http://www.cse.chalmers.se/~aarne/computationalmorphology/

Computational Syntax. A course in the Masters Programme in Language Technology, University of Gothenburg, 11 April - 31 May, 2011. http://www.cse.chalmers.se/~aarne/computationalsyntax/

Grammatical Framework: A Hands-On Introduction. Tutorial at CADE-23, Wroclaw, 1 August 2011. http://www.grammaticalframework.org/gf-tutorial-cade-2011/

Second GF Summer School: Frontiers of Multilingual Technology. Barcelona, 15-26 August, 2011. http://school.grammaticalframework.org/

2.3 Project management during M12-M18

The main task of the management workpackage for the period has been to finalize the enlargement of the consortium, which was proposed last January. The new partners, University of Zurich (UZH) and the company Be Informed (BI), will lead two new workpackages directly applying the MOLTO tools to their existing core technologies. During the negotiation phase, a new workplan was submitted and the budget was recalculated.

Payment for Period 1 arrived on the last day of August and was distributed to the consortium early in September after redistribution of the Matrixware budget.

Monthly meetings of the Steering Committee were held regularly on conference calls and recorded on the wiki pages, http://www.molto-project.eu/wiki/minutes.

3. Deliverables and milestones tables

Deliverables of MOLTO are listed and linked for download on the web page http://www.molto-project.eu/workplan/deliverables.

Below is a summary of deliverables due in the third semester.

The only milestone due for the period is MS3, the web-based translation tool, which has been interpreted as an online editor interface to a GF application grammar and made available at http://www.molto-project.eu/node/1063.

ID | Title | Due date | Dissemination level | Nature | Publication
D2.1 | GF Grammar Compiler API | 1 March, 2011 | Public | Prototype | D2.1 GF Grammar Compiler API
D1.3 | Periodic management report 2 | 1 April, 2011 | Consortium | Report | D1.3A Advisory Board Report
D4.2 | Data Models, Alignment Methodology, Tools and Documentation | 1 May, 2011 | Public | Regular Publication | D4.2. Data models and alignments
D2.2 | Grammar IDE | 1 September, 2011 | Public | Prototype | D2.2 Grammar IDE
D3.1 | MOLTO translation tools API | 1 September, 2011 | Public | Prototype | D 3.1. The Translation Tools API
D4.3 | Grammar-Ontology Interoperability | 1 September, 2011 | Public | Prototype | D4.3 Grammar ontology interoperability
D5.1 | Description of the final collection of corpora | 1 September, 2011 | Public | Regular Publication | D5.1. Description of the final collection of corpora
D6.1 | Simple drill grammar library | 1 September, 2011 | Public | Prototype | Simple drill grammar library
D8.1 | Ontology and corpus study of the cultural heritage domain | 1 September, 2011 | Public | Other | Ontology and corpus study of the cultural heritage domain
D1.4 | Periodic management report 3 | 1 October, 2011 | Consortium | Report | D1.4 Periodic Management Report T18

4. Use of the resources

Tables on the usage of resources are not available for midterm reporting; however, we have a rough estimate of person-months for each node.

Node      Professor  PhD  PhD Student  Research Engineer
UGOT      1          3    9            0
UPC       10         12   0            0
UHEL      2          0    4            9
OntoText  0          0    0            23.75

5. Financial statements

Not available for midterm reporting.