2.2 Work progress and achievements during the period

Please provide a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.

For each work package, except project management, which will be reported in section 2.3, please provide the following information:

  • A summary of progress towards objectives and details for each task;
  • Highlight clearly significant results;
  • If applicable, explain the reasons for deviations from Annex I and their impact on other tasks as well as on available resources and planning;
  • If applicable, explain the reasons for failing to achieve critical objectives and/or not being on schedule and explain the impact on other tasks as well as on available resources and planning (the explanations should be coherent with the declaration by the project coordinator);
  • A statement on the use of resources, in particular highlighting and explaining deviations between actual and planned person-months per work package and per beneficiary in Annex 1 (Description of Work);
  • If applicable, propose corrective actions.

WP2 Grammar Developer’s Tools - Month 6

The Grammarian's Tools comprise tools for using the GF grammar compiler and the Resource Grammar Library. In the first 6 months of MOLTO, we have worked on consolidating the compiler and the Library API, and on experimenting with the example-based grammar writing technique.

Clearly significant results include:

  • Milestone 1, 15 languages in the library. Due September 2010; reached December 2009.
  • Workflow for example-based grammar writing and estimated engineering effort, reported as part of D10.2.
  • GF plugin for Python NLTK.
  • GF syntax highlighting plugin for the Xcode programming environment.
  • Integrating probabilities with GF grammars.
  • Release of GF 3.1.6 in April 2010; GF 3.2 forthcoming before end of 2010.

There were no deviations from Annex I, and the use of resources was as planned.

WP3 Translator's Tools - M6

N/A

WP4 Knowledge Engineering - Month 6

During the first period we clarified the knowledge representation infrastructure needs of the case studies and software tools in MOLTO. We have also circulated a questionnaire describing the structured data sets which are expected to be of benefit for the project. Based on this information, we proceeded with deploying the knowledge representation infrastructure, which is now in place and accessible to the partners. It will be further described in D4.1 Knowledge Representation Infrastructure.

The second major direction during this period was grammar-to-ontology interoperability, which is a challenging problem. For this we have chosen a quasi-exhaustive knowledge base of important named entities in the world and some relations between them. It is encoded according to PROTON, a basic upper-level ontology with about 300 classes of named entities. The first goal set for this interoperability was the transformation of questions expressed in natural language into a formal query language, SPARQL. For this purpose, and on the basis of the ontology and the entities in the knowledge base, we have manually created a corpus of 500 sentences. This corpus is being used for the development of the GF grammars that handle the natural language questions, and also for evaluating the coverage of the grammars over this language space. Once an initial grammar handling questions to the knowledge base had been developed for a subset of English, we created a transformation function that renders GF sentence trees as SPARQL queries. To show these initial results, we have developed a natural-language search interface over the knowledge base, with automatic suggestion of possible continuations of the questions, which is featured on the MOLTO website. The results of these questions are one- or two-dimensional tables of entities, where each row is an individual “answer”.
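To make the idea of rendering sentence trees as SPARQL queries more concrete, the following is a minimal sketch in Python. It is not the MOLTO transformation function: the tree constructors (`QWhich`, `Class`, `LocatedIn`, `OwnedBy`, `Entity`), the predicate names and the `ex:` namespace are all hypothetical placeholders chosen for illustration, and the real grammar and the PROTON vocabulary will differ.

```python
from typing import Tuple, Union

# A toy abstract syntax tree: either a leaf string or
# a tuple (constructor, child, ...). All constructor names are hypothetical.
Tree = Union[str, Tuple]

# Placeholder namespace for illustration only; not the real PROTON URI.
PREFIX = "PREFIX ex: <http://example.org/onto#>"

# Hypothetical mapping from tree constructors to ontology predicates.
PREDICATES = {"LocatedIn": "locatedIn", "OwnedBy": "ownedBy"}


def property_pattern(tree: Tree) -> str:
    """Turn a (relation, (Entity, name)) subtree into one triple pattern."""
    constructor, entity = tree
    if constructor not in PREDICATES:
        raise ValueError(f"unsupported relation: {constructor}")
    return f"?x ex:{PREDICATES[constructor]} ex:{entity[1]} ."


def tree_to_sparql(tree: Tree) -> str:
    """Render a toy 'which <class> <relation> <entity>' tree as SPARQL."""
    constructor, *children = tree
    if constructor != "QWhich":
        raise ValueError(f"unsupported question type: {constructor}")
    class_tree, property_tree = children
    patterns = [f"?x a ex:{class_tree[1]} .", property_pattern(property_tree)]
    body = "\n  ".join(patterns)
    return f"{PREFIX}\nSELECT ?x WHERE {{\n  {body}\n}}"


if __name__ == "__main__":
    # Roughly: "which companies are located in Sweden"
    question = ("QWhich", ("Class", "Company"),
                ("LocatedIn", ("Entity", "Sweden")))
    print(tree_to_sparql(question))
```

In this sketch each question tree yields a single-variable query, so the result would be a one-dimensional table of entities; a two-dimensional result would need a second selected variable.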

Effort spent by Ontotext in WP4: 7.5 PMs. Effort of the other participant, UGOT, is still to be reported (Aarne will talk directly to Olga about this).

Attachment: MOLTO.WP4_.M6.doc (94 KB)

WP5 Statistical and Robust Translation - M6

WP5 is planned to span Month 7 to Month 30, but it is affected by the delay in obtaining the Patents data. Some work is nevertheless already under way, as detailed in the following.

Work towards Milestone MS7 (Month 24)

MS7: First prototypes of hybrid combination models.

Most of the objectives of the package depend on the compilation of the Patents corpus. Even the languages of study depend on the data that the new partner provides. To compensate for the resulting delay, both in WP5 and especially in WP7, we have started working here on hybrid approaches. The methodology now is to develop hybrid methods independently of the domain and data sets used, so that they can later be adapted to patents.

At the moment, we are able to obtain phrases and alignments from a GF-generated synthetic corpus. This is a first step for the hard integration of both paradigms, and also for the soft integration methods led by SMT. We are currently going deeper into the latter, as it is a domain independent study.

Work towards Deliverable D51 (Month 18)

D51: Description of the final collection of corpora.

Bilingual corpora are needed to create the necessary resources for training/adapting statistical MT systems and to extend the grammar-based paradigm with statistical information. We will compile and annotate large general-purpose bilingual and monolingual corpora for training basic SMT systems. At the moment, we have compiled and annotated the European Parliament corpus for English and Spanish. The final languages will probably be English, German, and Spanish or French; as soon as this is confirmed, the final general-purpose corpus can easily be compiled. The depth of the annotation will depend on the concrete languages and the available linguistic processors.

On the other hand, domain-specific corpora will be needed to adapt the general-purpose SMT system to the concrete domain of application in this project (Patents case study, WP7). We cannot build the final corpus yet, but some of the MOLTO members have joined the IRF, so that a set of Patents data is available for individual research purposes. This has allowed us to compile a preliminary parallel corpus, on which we can shortly start to build a domain GF grammar and to develop a first pure SMT domain-adapted translator.

Attachment: ProgressReport_WP5.odt (34.6 KB)

WP6 Case Study: Mathematics - Month 6

Working towards deliverable D6.1:

  1. Refactor prior code (WebALT grammars) into a separate module for each OpenMath Content Dictionary (CD).
  2. Adapt said code to work with current GF resource libraries (3.1)
  3. Test compilation of OpenMath layer for: English, Catalan, French, Italian, Spanish, German, Swedish.

Clearly significant results include:

  • OpenMath layer of D6.1 compiles correctly for said languages (English, Catalan, French, Italian, Spanish, German, Swedish)

WP6 was brought forward to start in Month 5 (instead of Month 7) to buy time for WP5, which will be delayed due to lack of data.

Attachment: MOLTO.WP6_.M6.doc (23 KB)

WP7 Case Study: Patents - M6

WP7 was scheduled to start in Month 4, but the WP leader site, Matrixware, left the MOLTO Consortium during Month 3. We have been negotiating with replacement partners and expect the negotiations to be concluded before November 2010 (Month 9 of MOLTO). We then expect to start WP7 no later than January 2011 (Month 11).

Although the delay amounts to several months, it need not imply great changes in the actual work. The original reason to start in Month 4 was to give the Matrixware site something to work on, since they were not highly involved in the other WPs. The new partner is expected to get started immediately, and the WP will also profit from the fact that some other MOLTO tools have become available (grammarian's tools from WP2 and grammar-statistics combination from WP5).

The actual work plan for WP7 may change in accordance with the preferences of the new partner. This will happen within the limits of the budget originally allocated to this WP.

WP8 Case Study: Cultural Heritage - M6

WP8 will start in Month 12, so no work can be reported yet.

WP9 User Requirements and Evaluation - M6

Report pending (to be provided by Lauri).

WP10 Dissemination and Exploitation - M6

The stated objectives of this work package are to:

  1. create a MOLTO community of researchers and commercial partners;
  2. make the technology popular and easy to understand through light-weight online demos;
  3. apply the results commercially and ensure their sustainability over time through synergetic partnerships with the industry.

The first task has been to set up the website for MOLTO, with information about MOLTO’s technology and potential (D10.2, UGOT and Ontotext) aimed at researchers, industry and users. Bibliographic information on GF, on SMT and on knowledge retrieval is kept up to date and includes tutorial presentations delivered during the MOLTO workshops. The website includes a News section with frequent informal posts on internal progress and plans, which encourages community contributions in the form of comments. Lighter news items are published through the MOLTO Twitter feed. A specific section is devoted to Frequently Asked Questions and can be collaboratively maintained by the MOLTO partners.

This work package was responsible for two deliverables during the first six months:

  1. Dissemination plan with monitoring and assessment,
  2. MOLTO web service, first version.

The dissemination plan can be accessed on the consortium-restricted pages at http://www.molto-project.eu/wiki/d10.1 and will be amended during the project's lifetime if needed. The project has been presented at several meetings and international events, most notably LREC2010, EAMT2010, and ACL2010.

The first version of the MOLTO web service consists of an online demonstration of a multilingual travel phrasebook, described online in Deliverable D10.2 at http://www.molto-project.eu/wiki/d10.2.