Living Deliverables

Living deliverables are the online drafts of the project's deliverable documents. Use the cover page from the template deliverable and enter the administrative data for the deliverable. Once editing is finished and the document is ready for its final version, produce the PDF from the top cover page using the print icon, and create a Biblio item to archive the version of the document to be delivered. Both the Biblio item and the link to the corresponding living deliverable should be entered in the administrative view for the project's deliverables.

Once this is done, the table of deliverables is filled in automatically by the system.

D1.7A Advisory report


Contract No.: FP7-ICT-247914 and FP7-ICT-288317
Project full title: MOLTO-EEU Multilingual Online Translation
Deliverable: D1.7A Advisory report
Security (distribution level): Consortium
Contractual date of delivery: M39
Actual date of delivery: 27 May 2013
Type: Report
Status & version: Final
Author(s): Stephen Pulman (University of Oxford), Keith Hall (Google Research)
Task responsible: UHEL
Other contributors:


Abstract

This annex to Deliverable D1.7 is the final report written by the MOLTO Advisory Panel after it attended the last project meeting in Barcelona on 23 May 2013.

Advisory Panel Report

The original aims of the MOLTO project were to use the GF approach to provide high-quality translations of texts in limited domains in real time, enabling an author to produce versions of a text in multiple languages simultaneously. These aims also included expanding coverage to new languages, enhancing lexical and other resources, and developing frameworks and training that make grammar development simpler and more productive. Other aims included an investigation of the role of controlled languages in interaction with ontologies and other types of reasoning and knowledge representation systems, and the exploration of hybrid approaches to machine translation that combine the precision of GF-based translation with the coverage and robustness of statistical machine translation methods. The project also hoped, via its industrial partners, to demonstrate that GF applications could be of commercial value. Three case studies were envisaged: mathematical exercises (using the Sage framework) in 15 languages, patent data in at least 3 languages, and museum object descriptions in 15 languages.

Over the course of the MOLTO project, much progress has been made toward the above goals. One of the challenges in building and maintaining hand-built grammar systems is managing grammar development. The MOLTO project delivers a new set of development tools, ranging from a cloud-based grammar editor to an integrated development environment plugin (Eclipse). The Grammatical Framework summer school continues to train new members of the community in developing grammars using these tools. The development of a new translation system takes a matter of days; adding a new language to a system takes hours (once a resource grammar for the language exists). The most labor-intensive part of the work is developing a resource grammar, which takes on the order of 3 to 9 months; once it is completed, it is easily exploited through the MOLTO tools, and existing systems based on MOLTO technology can use the new language (for translation, generation, or information access).

In order to ease the development of multilingual grammars (and domain-specific resource grammars), MOLTO delivers tools and techniques for expanding lexicons. One approach is to utilize translations of WordNet that maintain links across the lexical entries. This allows the grammar developer to write sense-disambiguated translation lexicons.

In the final year of the project, a robust parser for Grammatical Framework grammars was developed. This statistical parser bridges the gap between brittle hand-coded grammars and data-driven statistical parsers: it is capable of generating parse fragments even when a complete analysis is not available under the defined grammar. Performance is competitive with widely used systems such as the Stanford parser.

MOLTO explored a variety of applications of GF's rich grammar formalism along with the development tools. One focus was multilingual information access: the Ontotext ontologies, Cultural Heritage retrieval and verbalization, and ACEWiki inference. Another focus was to expand the translation capabilities of GF by exploring the integration of GF with statistical machine translation techniques. Leveraging the strong syntactic typing of GF, the GF/SMT translation system was able to perform state-of-the-art translation for patents.

One of the industrial partners, Be Informed, has successfully deployed systems based on MOLTO technologies to model its information and generate customer-facing documents in multiple languages. The other industrial partner, Ontotext, has detailed plans to include MOLTO technologies in its product line.

We were particularly impressed by the effort devoted to evaluation, especially of translation, but also of other MOLTO applications such as business logic modelling. For translation, the original promise of the MOLTO project was to provide high-quality, precise translations within limited domains. The various tests carried out with human judges largely seem to confirm that this goal has been achieved: whereas existing commercial systems provide wider coverage than the MOLTO tools, the quality of their results is not as high. Similarly, the comparison carried out by Be Informed of the MOLTO tools against their existing solution (Velocity) seems to confirm the MOLTO tools' superiority.

Overall we believe the project team are to be congratulated on what they have achieved over the course of the project: we regard the project as having successfully accomplished all of the goals it originally set for itself.

DX.2 Annual Public Report


Contract No.: FP7-ICT-247914
Project full title: MOLTO - Multilingual Online Translation
Deliverable: DX.2 Annual public report
Security (distribution level): Public
Contractual date of delivery: M24
Actual date of delivery: 15 March 2012
Type: Report
Status & version: Final
Author(s): O. Caprotti et al.
Task responsible: UGOT
Other contributors:


Abstract

Annual report on activities carried out in the framework of the MOLTO EU project. This report is designed for Web publishing and addresses a broad public outside the consortium. It documents the main results obtained by the MOLTO project during its first two years of activity and promotes the objectives of the project.

MOLTO’s goal is to develop a suite of tools for translating texts between multiple languages in real time with high quality. MOLTO uses domain-specific semantic grammars and ontology-based interlinguas implemented in GF (Grammatical Framework), a grammar formalism where multiple languages are related by a common abstract syntax. Until now, GF has been applied in several small-to-medium-size domains, typically targeting up to ten languages, but during MOLTO we will scale this up in terms of productivity and applicability by increasing the size of domains and the number of languages.

MOLTO aims to make its technology accessible to domain experts who lack GF expertise so that building a multilingual application will amount to just extending a lexicon and writing a set of example sentences. The most research-intensive parts of MOLTO are the two-way interoperability between ontology standards (such as OWL and RDF) and GF grammars and the extension of rule-based translation by statistical methods. The OWL-GF interoperability enables multilingual natural language based interaction with machine-readable knowledge while the statistical methods add robustness to the system when desired. MOLTO technology is released as open-source libraries for third-party translation tools and web pages and thereby fits into standard workflows.

1. Project Objective

The EU project MOLTO - Multilingual Online Translation started on March 1, 2010 and will run until June 2013. The Consortium, comprising the universities of Gothenburg and Helsinki, the Polytechnic University of Catalonia in Barcelona, and the Bulgarian industrial partner Ontotext, has been enlarged by the addition of the University of Zurich and the Dutch company Be Informed.

MOLTO's multilingual translation tools use multilingual grammars based on semantic interlinguas, together with statistical machine translation, to simplify the production of multilingual documents without sacrificing quality. The interlinguas are designed to model domain semantics and are equipped with reversible generation functions: translation is obtained as a composition of parsing the source language and generating the target language.

An implementation of this technology is already available in the Grammatical Framework, GF. As a result of the MOLTO project's work, GF technologies are complemented by the use of ontologies, viewed as formalisms employed by the semantic web for capturing structural relations, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from linguistic data.
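In terms of the GF Python bindings documented later in this document (see the GF python bindings section under WP2 of the work plan), the parse-then-generate composition can be sketched as follows. This is an illustrative sketch only, reusing the bindings' API (gf.read_pgf, parse, linearize) and the small Query.pgf example grammar from the bindings documentation, not production MOLTO code:

 import gf  # GF Python bindings (see WP2 below)

 pgf = gf.read_pgf("Query.pgf")    # load a compiled multilingual grammar
 eng, spa = pgf.languages()        # concrete syntaxes: English, Spanish

 def translate(sentence, source, target):
     # Translation as composition: parse the sentence in the source
     # language to an abstract syntax tree, then linearize that tree
     # in the target language.
     trees = pgf.parse(source, sentence)
     return pgf.linearize(target, trees[0]) if trees else None

 print(translate("is 42 prime", eng, spa))   # -> '42 es primo'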

MOLTO is committed to dealing with 15 languages: 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is constant ongoing work on creating new resource grammars, in particular for Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, and Swahili. The coverage and accuracy of the GF resource grammar library varies among the different languages and is documented on the GF website.

When comparing MOLTO to popular translation tools such as Systran (Babelfish) and Google Translate, the main difference is the intended user: those tools target consumers of information, whereas MOLTO targets producers of information.
By targeting producers of information, MOLTO can handle well the scenarios in which language is constrained: e-commerce sites, where products are often described with repeated linguistic expressions; Wikipedia articles; contracts; business letters; user manuals; and software localization. Even social networks often display common phrases ("Happy birthday!", "I like it", "The hotel is located ....", "Your reservation is confirmed"). Ideally, MOLTO tools will enable publishers of websites to add multilinguality with little effort and, most importantly, with the assurance that the meaning of the message conveyed stays unaltered across languages. MOLTO is also working on a multilingual semantic wiki.

There is a well-known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage, whereas MOLTO opts for precision in domains with a well-understood, codified language, either because it is of a technical nature or because of common everyday usage.
The domains considered during the MOLTO project show a range of features of constrained natural languages: mathematical exercises and biomedical patents employ a technical and sophisticated jargon, whereas museum object descriptions use a language accessible to anybody.

2. Results

The expected final product of MOLTO is an open-source software suite of tools comprising a grammar development environment, an application programming interface and environment to assist the translators' workflow, and sample application grammar libraries for the domains of mathematical word problems, biomedical patents, and cultural artefacts.

Translation systems in MOLTO rely on multilingual grammars written in the GF programming language. Until now, the development environments available to GF grammarians consisted of a generic text editor, such as Emacs, used in combination with the GF interactive command shell and the online GF documentation. This is a simple and effective environment for the experienced grammar developer. To better support less experienced grammar developers, one of the goals of the MOLTO project is to create an integrated development environment for grammar development. The GF Simple Editor (by Thomas Hallgren), an initial prototype of a web-based grammar development environment that offers the same core functionality as the traditional environment, is now available at http://www.grammaticalframework.org/demos/gfse. Its main features include grammar editing, grammar compilation, error detection, testing, and visualization. Moreover, it enables the creation of web-based translation systems without the installation of any software, as it uses web services to carry out compilation and interpretation tasks, and thus gives novice and occasional users quick access to GF. The intended scenario for this editor is fast testing and prototyping of example grammars in tutorial settings, for instance when teaching and demonstrating GF.

A different, more sophisticated high-level integrated development environment is based on the Eclipse platform and is specifically tailored to GF grammar writing. The GF Eclipse plugin (by John Camilleri) currently features real-time syntax checking, automatic code formatting, import-aware auto-complete suggestions, cross-reference resolution, inline contextual documentation, "New Module" wizards, external library browsing, launch shortcuts to the GF shell, and a visual tool for running treebank test suites. These new, powerful, time-saving development tools are aimed at new users and GF veterans alike. The plugin is available online at http://www.grammaticalframework.org/eclipse/ and at http://www.molto-project.eu/wiki/gf-eclipse-plugin.

Controlled natural languages are restricted subsets of natural languages, normally used in technical domains. The purpose of these languages is to reduce the complexity of natural language and to eliminate ambiguity. Their users are experts within their domain and are trained to use them.

The MOLTO Phrasebook (by Aarne Ranta et al.) is one such controlled natural language, whose domain is that of touristic phrases. It covers greetings and travel phrases such as "this fish is delicious" and "how far is the airport from the hotel" in 17 languages. The translations show the kind of quality that can be hoped for when using a GF grammar that can handle disambiguation in conveying gender and politeness, for instance from English to Italian. It is available both on the web at http://www.grammaticalframework.org/demos/phrasebook/ and as a stand-alone, offline Android application, PhraseDroid, from http://tinyurl.com/7tyzvfd. Screenshots of the mobile application are shown in the image on the side.

A different kind of controlled natural language is one used to command an interactive software system, for instance a computational engine such as Sage. The GFSage application (by Jordi Saludes) is a command-line tool able to take commands in natural language, have them executed by Sage, and render the answers in natural language as well. The image on the side shows the web interface of Sage augmented by the MOLTO natural language command module. Note that this application demonstrates how a MOLTO library can add multimodality to a system originally developed with keyboard input/output as its user interface. In fact, by piping the results to a speech engine, one can have the results read aloud, thus increasing the accessibility of computational systems for the visually impaired. The natural language interface relies on the Mathematical Grammar Library, which can be tested at http://www.grammaticalframework.org/demos/minibar/mathbar.html; documentation on the GFSage module is available as deliverable http://tinyurl.com/78bh4ap from the MOLTO wiki at http://www.molto-project.eu/wiki/d62-prototype-comanding-cas.
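The idea behind commanding a CAS can be sketched with the same Python bindings documented later in this document. The command grammar name, the tree shapes, and the mapping below are invented for illustration; this is not the actual GFSage implementation:

 import gf

 pgf = gf.read_pgf("Commands.pgf")   # hypothetical command grammar
 eng = pgf.languages()[0]

 def tree_to_cas(t):
     # Map an abstract command tree to a CAS expression string,
     # e.g. Factor (Number 42) -> "factor(42)".
     parts = t.unapply()
     if not isinstance(parts, list):      # literal leaf, e.g. the int 42
         return str(parts)
     head, args = parts[0], parts[1:]
     if str(head) == "Number":            # Number just wraps a literal
         return tree_to_cas(args[0])
     return "%s(%s)" % (str(head).lower(),
                        ", ".join(tree_to_cas(a) for a in args))

 t = pgf.parse(eng, "factor 42")[0]       # hypothetical sentence and tree
 cmd = tree_to_cas(t)                     # -> "factor(42)", handed to Sage;
                                          # the answer is verbalized back with GF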

To demonstrate the MOLTO Knowledge Reasoning Infrastructure, the patent retrieval prototype (by Milen Chechev from Ontotext, in collaboration with the UPC and UGOT teams), at http://molto-patents.ontotext.com, shows examples of natural language queries over a set of patents in the pharmaceutical domain. Users can ask questions in French and English, such as 'what are the active ingredients of "AMPICILLIN"' or 'que sont les formes posologiques de "AMPICILLIN"'. The system is still under development: at present the online interface allows users to browse the retrieved patents and returns the semantic annotations that explain why a particular patent matched the user's criteria. Similar technology for knowledge retrieval is also being applied in the case of cultural heritage, namely to descriptions of artefacts from the museum of Gothenburg, in order to allow multilingual query and retrieval. For this task, an ad-hoc ontology has been created; its preliminary GF application grammar can be tested by selecting "Painting.pgf" at http://www.grammaticalframework.org/demos/minibar/minibar.html.
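As a rough picture of what verbalizing retrieved knowledge involves (a toy sketch only; the lexicon entries and templates below are invented, and the project's actual verbalizers are GF grammars, not Python dictionaries):

 # Toy verbalizer: render an RDF-style triple in two languages from one
 # abstract pattern. All entries below are invented for illustration.
 LEXICON = {
     "eng": {"hasIngredient": "{s} contains {o}", "Ampicillin": "ampicillin"},
     "fre": {"hasIngredient": "{s} contient {o}", "Ampicillin": "l'ampicilline"},
 }

 def verbalize(triple, lang):
     s, p, o = triple
     lex = LEXICON[lang]
     # The predicate supplies the sentence template; subject and object
     # are looked up in the language-specific lexicon.
     return lex[p].format(s=lex.get(s, s), o=lex.get(o, o))

 print(verbalize(("DrugX", "hasIngredient", "Ampicillin"), "eng"))
 print(verbalize(("DrugX", "hasIngredient", "Ampicillin"), "fre"))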

The MOLTO translation environment is being developed (by UHEL with contributions from UGOT) as a customization of the GlobalSight translation system (www.globalsight.com). The aim is to be able to embed the MOLTO translation tools, which are designed with a focus only on translation, into a third-party translation platform. GlobalSight is an open-source translation management platform that provides the infrastructure needed in a professional translation workflow. More specifically, a MOLTO translation editor will be available alongside the conventional editors, characterized by the possibility of fetching terms from the FactForge ontology via the TermFactory database and of importing and exporting terms in TermFactory. Terminology work is also supported by OntoR, an ontology extraction system (by Seppo Nyrkkö) implemented as a semi-supervised machine learning process, in which candidates for new term dictionary entries are found in a given text by finding "closest matches" in previously known _ontologies_ (i.e. hierarchical vocabularies and term structures, usually industry- or domain-specific). A corpus-harvested new term can be _aligned_ with its closest matches in a pre-existing term ontology: the new term's functional and semantic environment is analyzed, and the extracted feature variables are compared to those of previously known terms. The user is given supervision control to decide the best alignment match and thus refine the ontology incrementally. These tools are not yet ready for distribution, but a preview can be seen during the project meetings' open days.
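The "closest match" step of OntoR can be pictured as a nearest-neighbour search over term feature vectors. The features and similarity measure below are schematic assumptions, not OntoR's actual model:

 import math

 def cosine(u, v):
     # Cosine similarity between two feature vectors.
     dot = sum(a * b for a, b in zip(u, v))
     nu = math.sqrt(sum(a * a for a in u))
     nv = math.sqrt(sum(b * b for b in v))
     return dot / (nu * nv) if nu and nv else 0.0

 # Invented data: vectors summarizing each term's functional and
 # semantic environment in the corpus or ontology.
 known_terms = {"ingredient": [0.9, 0.1, 0.3], "dosage form": [0.2, 0.8, 0.5]}
 new_term = [0.85, 0.15, 0.25]     # harvested from the corpus

 # Rank ontology terms by similarity; a human supervisor then confirms
 # the best alignment, refining the ontology incrementally.
 ranked = sorted(known_terms, key=lambda k: cosine(new_term, known_terms[k]),
                 reverse=True)
 print(ranked[0])                  # best alignment candidate: 'ingredient'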

3. Dissemination

The main dissemination venues for the results of MOLTO are the MOLTO website and the project meetings. The website at www.molto-project.eu makes all the project’s results available and advertises news, deliverables, and events organized by the partners. It also archives all MOLTO publications, both those delivered at international meetings and those presented at internal workshops. The MOLTO news updates are posted as an RSS feed suitable for aggregation by interested portals, and are also distributed via the MOLTO Twitter feed and the MOLTO LinkedIn group.

MOLTO sponsored the GF Summer School 2011, Frontiers of Multilingual Technologies, held August 15-26, 2011 and hosted by UPC in Barcelona, Spain. The two-week program included lectures ranging from "Getting started with GF" to "GF application development" and "Resource grammar development", and was attended by around 20 participants from around the world. The use case studies of MOLTO were amply presented by members of the Consortium. On August 1, 2011, Aarne Ranta was invited to give a tutorial on GF during CADE, the 23rd International Conference on Automated Deduction, in Wroclaw, Poland. The lecture material "Grammatical Framework: A Hands-On Introduction" is online. At the same meeting, Jordi Saludes presented the Mathematical Grammar Library during the affiliated workshop THedu'11, Computer Theorem Proving Components for Educational Software. "A Framework for Improved Access to Museum Databases in the Semantic Web" was presented at Recent Advances in Natural Language Processing (RANLP 2011) in September 2011 in Hissar, Bulgaria. Similar work, "Reason-able View of Linked Data for Cultural Heritage", was presented at the Third International Conference on Software, Services & Semantic Technologies, also in September 2011, in Bourgas, Bulgaria. The MOLTO project was presented by Olga Caprotti at Tsukuba University and at the meeting "Digitization and E-Inclusion in Mathematics and Science 2012" (DEIMS2012) in February 2012 in Tokyo, Japan.
Two demonstrations of MOLTO prototypes on query and retrieval in the cultural heritage and in the patent domains have been accepted for presentation at the European Track of the World Wide Web 2012 conference. A paper on GF, "Smart Paradigms and the Predictability and Complexity of Inflectional Morphology", will also be presented at the conference of the European Association for Computational Linguistics in April 2012.

The list of conference papers funded by MOLTO can be retrieved under Publications on the website.

MOLTO project meetings always include an open day with a program of presentations aimed at a general audience. The most recent MOLTO open days took place in Gothenburg on March 9, 2011 during the second project meeting, in Helsinki on September 2, 2011 during the third project meeting, and in Gothenburg on January 12, 2012 for the MOLTO-EEU kick-off meeting.

4. Forthcoming

The project is looking forward to the final development phase, especially the addition of the new case studies, which will bring feedback to existing tools and ongoing work. In terms of events sponsored by MOLTO, the Third International Workshop on Free/Open-source Rule-based Machine Translation will take place in Gothenburg, Sweden, on 13-15 June 2012. The chair of the meeting is the MOLTO coordinator A. Ranta, and local organization is managed by the MOLTO project manager. The fifth MOLTO project meeting will take place in September 2012 in the Netherlands, in cooperation with the MONNET project. Stay tuned by subscribing to the MOLTO RSS feed or following us on Twitter.

DX.3 Final Project Report


Contract No.: FP7-ICT-
Project full title: MOLTO - Multilingual Online Translation
Deliverable: DX.3
Security (distribution level): Public
Contractual date of delivery: M41
Actual date of delivery:
Type: Report
Status & version: Draft
Author(s): Aarne Ranta
Task responsible: UGOT
Other contributors: All partners


Abstract

This final report summarizes the work carried out and the results obtained under grant agreement FP7-ICT-247914 and its enlargement 288317. It is also intended as a means for the public to assess the output of the MOLTO project.

1. Final publishable summary report

This section must be of suitable quality to enable direct publication by the Commission and should preferably not exceed 40 pages. This report should address a wide audience, including the general public.

The publishable summary has to include 5 distinct parts described below:

  • An executive summary (not exceeding 1 page).
  • A summary description of project context and objectives (not exceeding 4 pages).
  • A description of the main S&T results/foregrounds (not exceeding 25 pages).
  • The potential impact (including the socio-economic impact and the wider societal implications of the project so far) and the main dissemination activities and exploitation of results (not exceeding 10 pages).
  • The address of the project public website, if applicable, as well as relevant contact details.

Furthermore, the project logo, diagrams or photographs illustrating and promoting the work of the project (including videos, etc.), as well as the list of all beneficiaries with the corresponding contact names, can be submitted without any restriction.

2. Use and dissemination of foreground

A plan for use and dissemination of foreground (including socio-economic impact and target groups for the results of the research) shall be established at the end of the project. It should, where appropriate, be an update of the initial plan in Annex I for use and dissemination of foreground and be consistent with the report on societal implications on the use and dissemination of foreground (section 4.3 – H). The plan should consist of:

  • Section A

This section should describe the dissemination measures, including any scientific publications relating to foreground. Its content will be made available in the public domain thus demonstrating the added-value and positive impact of the project on the European Union.

  • Section B

This section should specify the exploitable foreground and provide the plans for exploitation. All these data can be public or confidential; the report must clearly mark non-publishable (confidential) parts that will be treated as such by the Commission. Information under Section B that is not marked as confidential will be made available in the public domain thus demonstrating the added-value and positive impact of the project on the European Union.

2.1 Section A

This section includes two templates:

  • Template A1: List of all scientific (peer reviewed) publications relating to the foreground of the project.

  • Template A2: List of all dissemination activities (publications, conferences, workshops, web sites/applications, press releases, flyers, articles published in the popular press, videos, media briefings, presentations, exhibitions, thesis, interviews, films, TV clips, posters).

These tables are cumulative, which means that they should always show all publications and activities from the beginning until after the end of the project. Updates are possible at any time.

2.2 Section B (Confidential or public: confidential information to be marked clearly)

Part B1

The applications for patents, trademarks, registered designs, etc. shall be listed according to the template B1 provided hereafter.

The list should specify at least one unique identifier, e.g. a European Patent application reference. For patent applications, contributions to standards should be specified, if applicable. This table is cumulative, which means that it should always show all applications from the beginning until after the end of the project.

Part B2

Please complete the table hereafter:

| Type of Exploitable Foreground | Description of exploitable foreground | Confidential | Foreseen embargo date | Exploitable product(s) or measure(s) | Sector(s) of application | Timetable, commercial or any other use | Patents or other IPR exploitation (licences) | Owner and Other Beneficiary(s) involved |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| General advancement of knowledge, Commercial exploitation of R&D results, Exploitation of R&D results via standards, exploitation of results through EU policies, exploitation of results through (social) innovation | Ex: New superconductive Nb-Ti alloy | YES/NO | dd/mm/yyyy | MRI equipment | 1. Medical 2. Industrial inspection (sector type (NACE nomenclature): http://ec.europa.eu/competition/mergers/cases/index/nace_all.html) | 2008-2010 | A materials patent is planned for 2006 | Beneficiary X (owner); Beneficiary Y, Beneficiary Z; possible licensing to equipment manufacturer ABC |

In addition to the table, please provide a text to explain the exploitable foreground, in particular:

  • Its purpose
  • How the foreground might be exploited, when and by whom
  • IPR exploitable measures taken or intended
  • Further research necessary, if any
  • Potential/expected impact (quantify where possible)

3. Report on societal implications

Replies to the following questions will assist the Commission to obtain statistics and indicators on societal and socio-economic issues addressed by projects. The questions are arranged in a number of key themes. As well as producing certain statistics, the replies will also help identify those projects that have shown a real engagement with wider societal issues, and thereby identify interesting approaches to these issues and best practices. The replies for individual projects will not be made public.

Sample cover page


Contract No.: FP7-ICT-
Project full title: MOLTO - Multilingual Online Translation
Deliverable:
Security (distribution level):
Contractual date of delivery:
Actual date of delivery:
Type:
Status & version:
Author(s):
Task responsible:
Other contributors:


Abstract

D1.1 Work plan for MOLTO


Contract No.: FP7-ICT-247914
Project full title: MOLTO - Multilingual Online Translation
Deliverable: D1.1. Work plan for MOLTO
Security (distribution level): Confidential
Contractual date of delivery: M1
Actual date of delivery: 1 April 2010
Type: Report
Status & version: Final (evolving document)
Author(s): A. Ranta et al.
Task responsible: UGOT
Other contributors:



Abstract

Detailed work plan for internal use of the consortium.

This is an evolving description of the work plan of MOLTO, divided in work packages and in tasks. The document is meant to track what the MOLTO Consortium is planning to do, what it has completed so far and the status of the ongoing research. It is the responsibility of the work package leader to enter tasks and to keep them up to date so as to reflect the work done by the group.


WP1: Management

Detailed workplan for WP1

A number of management tasks are assigned to the coordinator, e.g.:

  • collecting information from partners;
  • reviewing and submitting information on the progress of the project, as well as reports and other deliverables, to the EC;
  • preparing meetings;
  • proposing decisions and preparing the agenda of the SG;
  • chairing the SG meetings and monitoring the implementation of decisions taken at the meetings;
  • presenting the results of the consortium and serving as the secretary in the meetings;
  • administering the EC financial contribution and fulfilling other financial tasks;
  • maintaining the project's website, etc.

According to the Grant Agreement, Annex II, management of the consortium activities includes:

  • maintenance of the consortium agreement, if it is obligatory,
  • the overall legal, ethical, financial and administrative management including, for each of the beneficiaries, the obtaining of the certificates on the financial statements and on the methodology and costs relating to financial audits and technical reviews,
  • implementation of competitive calls by the consortium for the participation of new beneficiaries, where required by Annex I of this grant agreement,
  • any other management activities foreseen by the annexes, except coordination of research and technological development activities.

Associate tasks to workpackages

15 Mar 2010, Europe/Stockholm
ID: 1.1
Workpackage: Management
Task leader: aarne.ranta
Assignees: olga.caprotti
Status: Completed
Timeframe: Mar 2010 - Apr 2010
Completed on: 25 March, 2010 - 23:00

In order to get an overview of the workpackage:

  • add a view of the associated tasks
  • add a view of the deliverables

Create content type "deliverable"

15 Mar 2010, Europe/Stockholm
ID: 1.2
Workpackage: Management
Task leader: aarne.ranta
Assignees: olga.caprotti
Status: Completed

Create an admin content type "deliverable" to collect the information on due deliverables, so that they can be tracked on the calendar and in the workpackage's description.

See list of all deliverables

Revision of the management report

ID: 1.3
Workpackage: Management
Task leader: aarne.ranta
Assignees: aarne.ranta, emilia.rung, olga.caprotti
Relevant Deliverables: Periodic management report 1
Status: Completed
Timeframe: May 2011 - Jul 2011
Completed on: 21 June, 2011 - 18:00

The commission requested the following:

Session: 5.2
Submitted on: Apr 26, 2011 2:40:36 PM
Verified on: May 3, 2011 2:33:02 PM

You are kindly requested to clarify the issues raised in this letter and submit a revised periodic report and Forms C through NEF at the latest on 16th of May. Should you require more time, please contact us. However, should we not have heard from you by the deadline we will proceed with the information at hand. Please note that in such case, this may lead to all or part of the costs being rejected.

Please note that according to Article II.5 of the grant agreement, the period for the execution of the payment has been suspended pending receipt of the additional information and the revised Periodic Report through NEF.

Please clarify the following points in the revised periodic report, and revise the Forms C if necessary:

  • Reports – general: Please add a list of conferences and meetings you attended, including who participated, venue and for what purpose.
  • Please provide in more detail the activities carried out by each partner according to the template which can be found on EC/Cordis home page (http://cordis.europa.eu/fp7/find-doc_en.html).
  • Beneficiary 1 – There is a discrepancy between average personnel cost compared to the budget. Could you please clarify why average personnel costs are more than 15% higher than budgeted?
  • Beneficiary 2 – There is a marked difference between the use of MM (compared to budget) and funding for personnel in the 1st period. The average personnel cost is lower than budgeted for. Could you please explain these discrepancies?
  • Beneficiary 3 – Please justify coffee breaks under subcontracting considering that you anticipated as subcontracting only the auditing costs (p. 45 of the Annex I of the Grant Agreement).
  • Beneficiary 4 - Please specify more accurately the use of MM per WP. From your explanation it is not clear enough how many MM you used. Could you please correct the table in the Management of the use of resources part of the report, where the total costs seem to appear under indirect costs, while the item indirect costs differs from that reported in the Form C.
Attachments:
  • MOLTO (247914) _ Periodic Report and Cost Claim submission in NEF.pdf (77.56 KB)
  • Cost Claim overview MOLTO (247914) 2011-06-22.pdf (74.59 KB)
  • D1 3R (amended).docx (232.74 KB)

MOLTO-enlarged negotiation

ID: 1.4
Workpackage: Management
Task leader: aarne.ranta
Assignees: olga.caprotti
Status: Ongoing
Timeframe: Jul 2011 - Aug 2011

Negotiation Mandate

  1. Proposal number: 288317; Acronym: MOLTO-Enlarged EU
  2. Strategic objective/theme:
  3. Project Officer (to whom all documents must be returned): Mr. Michel BROCHARD, Tel: 33912, European Commission, e-mail: Michel.Brochard@ec.europa.eu, DG INFSO - E 01, Office: EUFO - 02/270, L-2920 Luxembourg
  4. Date and time of first negotiation meeting: 13/07/2011 at 10:00 AM. Address for the first negotiation meeting: 10 rue de R. Stumper, L-2920 Luxembourg
  5. EC financial contribution: maximum financial EC contribution 600.000,00 € (euro)
  6. Duration of the project: 18 months
  7. Change of technical content: forthcoming
  8. Timetable for negotiation:
    
    • 13/07/2011 Negotiation meeting in Luxembourg
    • 05/08/2011 Deadline for the first version of the description of work (Annex I) and GPF
    • 31/08/2011 End of negotiation

Accompanying letter

The negotiating Project Officer (PO) is Mr. BROCHARD Michel. The full contact references are detailed in the "Negotiation Mandate".

Please note that the negotiation must be successfully concluded by 31/08/2011.

In case this deadline is not met, the Commission reserves the right to cancel negotiations and any subsequent offer for a project grant agreement. We also would like to draw your attention to the fact that negotiations may be terminated, or the negotiation mandate modified, if so required following the results of the consultation with other departments within the Commission.

Please note that in accordance with the legislation in force, the coordinator is obliged to deposit any pre-financing received from the Commission on an interest-bearing bank account. If you do not comply with this obligation, your participation as coordinator may not be accepted.

The negotiation process is supported by an on-line tool called NEF which you will need to use to submit data that is necessary for the grant agreement.

NEF will also provide access to the Legal & Financial Validation form (LFV lite). The LFV lite provides an overview of the status of the legal and financial data of the partners of your project, and indicates those partners for whom legal and/or financial data is missing. If the legal and/or financial data of one or more partners is flagged as needed in the LFV lite, or is incorrect, new legal and/or financial documents must be submitted for the partner(s) concerned. Additionally, the Commission can also request documents and information regarding the operational capacity of the consortium and beneficiaries to achieve the objectives and expected results of the project.

The detailed explanations for accessing NEF will be sent shortly in a separate e-mail. Further guidance is available on-line at the following address: http://ec.europa.eu/research/negotiation/

You should have already received the Evaluation Summary Report (ESR) in the info letter email. If not, please contact the negotiating Project Officer.

The negotiation guidance notes and the most recent templates for the Description of Work (Annex I to the Grant Agreement) are available at: Nef Annex 1 - Concept. Other useful information on Framework Programme 7 is available at http://cordis.europa.eu/fp7/find-doc_en.html and includes:

  • documents referring to the negotiation guidance notes,
  • the model Grant Agreement and special conditions,
  • the guide to financial issues,
  • the checklist for a consortium agreement for FP7 projects, and
  • the guide to intellectual property rules for FP7.

This letter should not be regarded under any circumstances as a formal commitment by the Commission to give financial support as this depends, in particular, on the satisfactory conclusion of negotiations and the completion of the formal selection process. Should you have any queries about the above, please do not hesitate to contact the negotiating Project Officer.


Aarne's task list

The main issue to solve is the budget cut, which of course is the usual thing to happen. We will get 600k instead of the 712k we applied for. My suggestion is that we cut all WPs and sites in proportion, so we don't need to change the work description too much.

The realistic goal is that the work will begin on 1 September. Even this needs some effort from us:

  • nominate one contact person (for UZH and BI) who is available at least periodically through the negotiations and authorized (by the rest of the site) to make decisions
  • finalize the work description
  • adjust the budget
  • provide the list of persons (mainly an issue for UZH and BI) as well as relations to other ongoing EC projects
  • extend the MOLTO consortium agreement and get it signed
  • we should work as if the deadline was 26 August; if we seem to fail, we can probably ask the EC for extension
Attachments:
  • comments_MOLTO_Ext.pdf (90.52 KB)

Followup on review requests

ID: 1.5
Workpackage: Management
Task leader: olga.caprotti
Assignees: aarne.ranta, borislav.popov, jordi.saludes, lauri.carlson
Status: Assigned
Timeframe: Sep 2011

Please address the reviewers' remarks by the end of September 2011!

MOLTO - financial reporting period 2

ID: 1.6
Workpackage: Management
Task leader: aarne.ranta
Assignees: kristina.orban....
Status: Ongoing
Timeframe: Mar 2012 - Apr 2012

Soon it is time for the reporting of period 2 (01/03/2011 – 29/02/2012) of the project MOLTO.

You have to send me:

  • two copies of the Form C (Financial Statement). The Form C is to be submitted both electronically (via the Participant Portal) and as a paper copy. After I have submitted the Form C to the European Commission, you can print it out and sign it. I'll inform you when the Forms C are ready for signatures.

This year you can complete the Use of Resources directly in NEF when completing the Form C. You will have to write short explanations of the costs: the number of person months, travel costs (who travelled where and for which purpose/meeting), consumables, etc. All the costs must be related to a work package.

  • A Certificate on Financial Statement (CFS) is necessary if your claimed EC contribution is equal to or more than 375 000€. Observe that although the threshold is established on the basis of the EC contribution, the CFS must certify all eligible costs.

The deadline for submitting your financial statement in the Participant Portal, as well as for sending me the Use of Resources by e-mail, is 1 April 2012.

The signed Financial Statement and the CFS (if applicable) have to be submitted to me in paper copies. Please send the originals by courier to address below.

To access the project via the Participant Portal, click on the following link: http://ec.europa.eu/research/participants/portal/

To log into the Participant Portal you need to have an account. If you don't have an account yet follow the 'register' link and instructions on the Participant Portal main page.

Once logged in with the account associated with your email address, the list of the projects you are involved in will appear under the 'My Projects' tab. The project MOLTO (247914) will appear under tab “Active”. By selecting “FR” on that line you will gain access to the Form C.

Do not hesitate to contact me if you have any questions. Kristina

Kristina Orbán Meunier

UNIVERSITY OF GOTHENBURG Research and Innovation Services

Erik Dahlbergsgatan 11B Box 100, 405 30 Göteborg, Sweden Tel +46 31 786 6466

mobile +46 766 229466

kristina.orban.meunier@gu.se

www.gu.se/researchinnovation

WP2: Grammar Developer’s Tools

The grammar developer's tools are divided into two kinds of tasks:

  • GF grammar compiler API

  • actual tools implemented by using the API

The workplan for the first six months concerns mostly the API, with the main actual tool being the GF Shell, a line-based grammar development tool. It is a powerful tool since it enables scripting, but it is not integrated with other working environments. The most important other environment will be web-based access to the grammar compiler.

Note that most discussions on GF are public at http://code.google.com/p/grammatical-framework/.

Here follows the work plan, with tasks assigned to sites and approximate months.

Improving the Resource Grammar Library API and its documentation

ID: 2.10
Task leader: aarne.ranta
Assignees: ramona.enache
Relevant Deliverables: GF Grammar Compiler API
Status: Ongoing
Timeframe: Mar 2010

Documentation of GF is hosted on Google Code at http://code.google.com/p/grammatical-framework/

There is a wiki cover page for the Resource Grammar Library API and an online version at http://www.grammaticalframework.org/compiler-api/.

Designing the API and writing its documentation

ID: 2.5
Task leader: aarne.ranta
Assignees: krasimir.angelov
Status: Completed
Timeframe: Mar 2010 - Aug 2010
Completed on: 1 October, 2011 (All day)

The GF API design will take into account the following requirements:

  • programming environments, e.g. Eclipse, XCode, NotePad++, Web, etc.
  • standard formats for I/O
  • ....

The documentation is being hosted at the GF website.

Example-based grammar writing

ID: 2.2
Task leader: ramona.enache
Assignees: aarne.ranta
Relevant Deliverables: Grammar IDE
Status: Ongoing
Timeframe: Jul 2010 - Sep 2010

What we mean by example-based grammar writing.

Current status is proof of concept: it is possible to load an example-based grammar and to compile it.

Need to do: - ....

GF runtime in C

ID: 2.11
Task leader: lauri.alanko
Assignees: jordi.saludes, krasimir.angelov, lauri.alanko
Status: Ongoing
Timeframe: Apr 2010

Overview

The runtime is the part of the GF system that implements parsing and linearization of texts based on a PGF grammar that has been produced by the GF compiler.

The standard GF runtime is written in Haskell like the rest of the system. Unfortunately this results in a large memory footprint and possibly also portability problems, which preclude its use in certain applications.

The goal of the current task is to reimplement the GF runtime as a pure C library. This C library can then hopefully be used in some situations where the Haskell-based runtime would be unwieldy.

Status

Preview versions of the implementation, libpgf, are available from the project home page. This is also where up-to-date documentation can be found.

Morphology server and its API

ID: 2.3
Task leader: aarne.ranta
Assignees: aarne.ranta, krasimir.angelov
Relevant Deliverables: Grammar IDE
Status: Planned
Timeframe: Aug 2010 - Oct 2010

The compiler API must be used by the morphology server.

Plugin to Python NLTK

ID: 2.8
Task leader: jordi.saludes
Assignees: jordi.saludes
Status: Completed

Develop a Python plugin for GF (based on the planned C plugin) and connect it to relevant parts of the Natural Language Toolkit (http://www.nltk.org/); a sketch of one possible integration follows the subtasks below.

Subtasks

2.8.1 Develop python bindings to gf.

2.8.2 nltk integration.
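One possible shape for the NLTK integration, sketched on top of the bindings documented below (it reuses their unapply method together with nltk.Tree; this is an assumption about the integration, not the delivered plugin code):

 import gf
 from nltk import Tree

 def to_nltk_tree(t):
     # Recursively convert a GF abstract tree into an nltk.Tree,
     # using unapply() as shown in the bindings examples below.
     parts = t.unapply()
     if not isinstance(parts, list):      # literal leaf such as 42
         return str(parts)
     head, args = parts[0], parts[1:]
     return Tree(str(head), [to_nltk_tree(a) for a in args])

 pgf = gf.read_pgf("Query.pgf")
 eng = pgf.languages()[0]
 t = pgf.parse(eng, "is 42 even")[0]
 print(to_nltk_tree(t))                   # (Even (Number 42))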

GF python bindings

Using the GF python bindings

This is how to use some of the functionalities of the GF shell inside Python.

Installation

Due to some ghc glitch, it only builds on Linux.

You'll need the source distribution of GF, ghc, and the Python development files [1]. Then go to the python bindings folder and build it:

 cd GF/contrib/py-bindings
 make

It will build a shared library (gf.so) that you can import and use in Python as shown below.

Testing installation

To test if it works correctly, type:

 python -m doctest example.rst

Examples

Loading a pgf file

First you must import the library:

% import gf

then load a PGF file, like this tiny example:

% pgf = gf.read_pgf("Query.pgf")

We could ask for the supported languages:

% pgf.languages()
[QueryEng, QuerySpa]

The start category of the PGF module is:

% pgf.startcat()
Question

Parsing and linearizing

Let us save the languages for later:

% eng,spa = pgf.languages()

These are opaque objects, not strings:

% type(eng)
<type 'gf.lang'>

and must be used when parsing:

% pgf.parse(eng, "is 42 prime") 
[Prime (Number 42)]

Yes, I know it should have a '?' at the end, but there is no support for other lexers at this time.

Notice that parsing returns a list of gf trees. Let's save it and linearize it in Spanish:

% t = pgf.parse(eng, "is 42 prime")
% pgf.linearize(spa, t[0])
'42 es primo'

(which it is not, but there is a '?' lacking at the end, remember?)

Getting parsing completions

One of the good things about the GF shell is that it suggests which tokens can continue the line you are composing.

We have this in the bindings too. Suppose we have no idea how to start:

% pgf.complete(eng, "")
['is']

so there is only one sensible thing to put in. Let's continue:

% pgf.complete(eng, "is ")
[]

It is important to note the blank space at the end; otherwise we get the same token again:

% pgf.complete(eng, "is")
['is']

But how come nothing is suggested after "is "? At this point a literal integer is expected, so GF would have to present an infinite list of alternatives. I cannot blame it for refusing to do so.

% pgf.complete(eng, "is 42 ")
['even', 'odd', 'prime']

Good. I will go for 'even', just to be in the safe side:

% pgf.complete(eng, "is 42 even ")
[]

Nothing again, but this time the phrase is complete. Let us check it by parsing:

% pgf.parse(eng, "is 42 even")
[Even (Number 42)]

Deconstructing gf trees

We store the last result and ask for its type:

% t = pgf.parse(eng, "is 42 even")[0]
% type(t)
<type 'gf.tree'>

What's inside this tree? We use unapply for that:

% t.unapply()
[Even, Number 42]

This method returns a list with the head of the fun judgement and its arguments:

% map(type, _)
[<type 'gf.cid'>, <type 'gf.expr'>]

Notice that the argument is again a tree (gf.tree or gf.expr; it is all the same here).

% t.unapply()[1]
Number 42

We will repeat the trick with it now:

% t.unapply()[1].unapply()
[Number, 42]

and again, the same structure shows up:

% map(type, _)
[<type 'gf.cid'>, <type 'gf.expr'>]

One more time, just to get to the bottom of it:

% t.unapply()[1].unapply()[1].unapply()
42

but now it is an actual number:

% type(_)
<type 'int'>

We have ended up with a fully decomposed fun judgement.


  [1] In Ubuntu I got them by installing the package python-all-dev.

Refactoring the grammar compiler code base (to improve reusability)

ID: 2.1
Task leader: aarne.ranta
Assignees: krasimir.angelov
Status: Assigned
Timeframe: Mar 2010 - Jul 2010

A slightly better description, with links to relevant software, documentation, etc., will eventually be added here.

Release of GF 3.2

ID: 2.4
Task leader: aarne.ranta
Assignees: aarne.ranta, krasimir.angelov
Status: Completed
Timeframe: Mar 2010 - Dec 2010
Completed on: 15 December, 2012 (All day)

Major features:

  • pgf format is updated and documented in the wiki
  • completed runtime type checker for dependent types
  • parsing with dependent types
  • on-line parsing
  • type-error reporting
  • exhaustive generation of ASTs (also via lambda prolog)
  • probabilities in the abstract syntax
  • random generation guided by probability
  • parse results ranked by probability
  • example based grammar generation (extra script)

New languages:

  • Urdu, complete resource grammar
  • Turkish, complete morphology
  • Amharic, complete resource grammar
  • Punjabi, complete morphology

Web-based grammar development environment (version 1)

ID: 2.6
Task leader: aarne.ranta
Assignees: krasimir.angelov, thomas.hallgren
Relevant Deliverables: Grammar IDE
Dependencies: Release of GF 3.2
Status: Ongoing
Timeframe: Aug 2010 - Mar 2011

Prototype

Web-based tools for grammarians: http://www.grammaticalframework.org/demos/gfse/

Ongoing work at http://cloud.grammaticalframework.org.

Similar work

Look into online IDE platforms, like Kodingen and CodeRun.

There is work on Ajax-based code editors, e.g. Ymacs, which could be useful since there is already a GF mode for Emacs (where?).

The emacs mode can now be found in http://www.grammaticalframework.org/src/tools/gf.el (note by Aarne)

There is also a Mozilla project, Bespin, to build a web-based editor extensible by javascript.

Also check Orc, yet another online IDE for a new language, using CodeMirror as its editor.

Integrating probabilities in GF and PGF

ID: 2.8
Task leader: aarne.ranta
Assignees: aarne.ranta
Status: Planned
Timeframe: Oct 2010 - Dec 2010

Design and integrate probabilistic features into GF and PGF.

Extend planning here.

Integration with ontology tools

ID: 2.9
Task leader: aarne.ranta
Assignees: aarne.ranta, borislav.popov, lauri.carlson
Status: Planned

Final phase of the work planned in this workpackage. Exact scheduling to be defined.

On-line extension of PGF with new words

ID: 2.7
Task leader: aarne.ranta
Assignees: krasimir.angelov
Status: Planned
Timeframe: Aug 2010 - Jan 2011

Adding the possibility to dynamically add new words to lexicons "linked" in compiled grammars.

WP3: Translator's Tools

To be entered for M7 - M30.

WP4: Knowledge Engineering

  • knowledge representation infrastructure (D4.1, by Ontotext);
  • aligned semantic models and instance bases (D4.2, by Ontotext and UHEL);
  • two-way grammar-ontology and NL (Natural Language) to ontology interoperability (D4.3, by Ontotext and UGOT).
M1-M6 plan

  • all partners send informal answers to the questions above & anything additional that has been missed in the questions (questionnaire) [/node/896]
  • decide on the best server location (on site / rented)
  • Ontotext to provide a description of the planned initial infrastructure for KE
  • Ontotext to describe more about the LOD data sets and give the LDSR link
  • UHEL to indicate their ideas on the KE WP, data sets and participation
  • provide examples of entity descriptions in WKB
  • first attempts at NL queries based on the WKB; put up examples of the entity description constructs and then decide next steps with UGOT/UHEL
  • discuss feasibility of creating a NL interface for SPARQL (or a subset)
  • provide examples of entity descriptions in WKB
  • KE infrastructure in place
  • decide on initial data sets with the partners; consider one or many instances
  • KE infrastructure loaded with initial data sets
  • navigation and search web UI over KEI
  • results of the first experiments on ontology-to-grammar interoperability
  • M4.1 - Knowledge representation and retrieval infrastructure provided to the consortium
  • report for M1-M6
  • plan for M6-M12

M6-M12 plan

  • analyze the needs of autocompletion and implement a scalable solution for it
  • work on the transformation from natural language to SPARQL
  • report for M6-M12
  • plan for M12-M18

M12-M18 plan

  • experiment with different ontologies
  • write D4.2
  • work on verbalization of the results from the semantic repository search
  • finalize the D4.3 prototype
  • report for M12-M18
  • plan for M18-M24

M18-M24 plan

  • support the work of other workpackages in the sphere of grammar-ontology interoperability
  • report for M18-M24

Process & General Considerations

  • we'd like all info to be available on the wiki, incl. notes, decisions, status of the components, docs
  • SVN
  • consider a backlog for the entire project / each WP
  • consider periodic development/delivery cycles
  • consider regular telecons/Skype conferences - at least once a month
  • make an application ideas backlog on the wiki

Appendix to D4.3 Grammar ontology interoperability

ID: 4.1A
Workpackage: Knowledge Engineering
Task leader: milen.chechev
Assignees: milen.chechev
Relevant Deliverables: Grammar-Ontology Interoperability
Status: Completed
Timeframe: Mar 2012 - Apr 2012
Completed on: 31 May, 2012 (All day)

Add child pages to the living deliverable following the instructions given in the abstract.

http://www.molto-project.eu/wiki/living-deliverables/d43a-appendix-gramm...

See the deliverable at:

http://www.molto-project.eu/sites/default/files/D4.3.pdf

Knowledge Engineering Infrastructure in place

According to the plan at http://www.molto-project.eu/node/858, the Knowledge Engineering Infrastructure has been released. It is accessible here. We have imported an exemplary initial data set containing information about various persons, organizations, and locations.

To execute a SPARQL query against the data set, click "SPARQL Query" and try, for example, the following query (the backslashes only keep the wiki from turning the URIs into links; omit them when running the query):

 prefix rdf: <\http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 prefix rdfs: <\http://www.w3.org/2000/01/rdf-schema#>
 prefix prt: <\http://proton.semanticweb.org/2006/05/protont#>
 select distinct ?l
 where { ?s rdf:type prt:Organization ; rdfs:label ?l . }

It should return the names of all organizations stored in the data set.
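For programmatic access, the same query can be sent to the repository's SPARQL endpoint, for instance with the Python SPARQLWrapper library; the endpoint URL below is a placeholder for the actual address linked above:

 from SPARQLWrapper import SPARQLWrapper, JSON

 endpoint = SPARQLWrapper("http://example.org/sparql")  # placeholder URL
 endpoint.setQuery("""
     prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
     prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
     prefix prt: <http://proton.semanticweb.org/2006/05/protont#>
     select distinct ?l where { ?s rdf:type prt:Organization ; rdfs:label ?l . }
 """)
 endpoint.setReturnFormat(JSON)

 # Print the label of every organization in the data set.
 for row in endpoint.query().convert()["results"]["bindings"]:
     print(row["l"]["value"])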

The Knowledge Engineering Infrastructure can be extended with new data sets as they become available; see http://www.molto-project.eu/node/858, http://www.molto-project.eu/node/896, and http://www.molto-project.eu/node/948.

WP5: Statistical and Robust Translation

A more detailed task description will be added here.

WP6: Case Study Mathematics

Mathematical grammars developed using GF for the WebALT project (eContent 22253) allow us to generate simple multilingual drills for high school students and university freshmen. These grammars will be the starting point for extending coverage to word problems: problems that require the student to first model a situation and then manipulate the mathematical model to obtain a solution.

The UPC team, having been a main actor in the past development of GF mathematical grammars and having ample experience in mathematics teaching, will be in charge of the tasks in this work package, with help from UGOT on technical aspects of GF and possibly from Ontotext on ontology representation and handling.

State of the art on mathematical and ontological reasoning

ID: 6.2
Workpackage: Case Study: Mathematics
Task leader: jordi.saludes
Assignees: jordi.saludes
Status: Planned

It will be required to reason about equations and statements proposed by the student, so we will need to review to what extent an automatic reasoner can deal with student input of this sort, and how the system behavior can be designed to degrade gracefully in order to keep the student interaction going.

    Make WebALT grammars modular

    0
    ID: 
    6.3
    Workpackage: 
    Case Study: Mathematics
    Task leader: 
    jordi.saludes
    Assignees: 
    jordi.saludes
    Assignees: 
    Sebastian Xambo
    Relevant Deliverables: 
    Simple drill grammar library
    Status: 
    Ongoing
    Timeframe: 
    Jul 2010

    In the framework of the WebALT project, a GF grammar library was developed for generating simple mathematical drills in a variety of languages. The legal status of this library has recently changed to LGPL, making it suitable as the starting point for the language services demanded by this work package. To achieve a better degree of interchangeability, the existing code must be organized into modules, redundancies removed, and the modules laid out in a way that allows easy lexicon enhancement using the grammar developer's tools of work package 2 (WP2).

    Commanding GF library

    0
    ID: 
    6.4
    Workpackage: 
    Case Study: Mathematics
    Task leader: 
    jordi.saludes
    Assignees: 
    jordi.saludes
    Status: 
    Planned

    Write a GF grammar for commanding a generic computer algebra system (CAS) with natural language imperative sentences, with concrete grammars adapted to the CAS at hand. Depends on work package 2 (WP2).

    Module for driving a CAS by natural language commands

    0
    ID: 
    6.5
    Workpackage: 
    Case Study: Mathematics
    Task leader: 
    jordi.saludes
    Assignees: 
    jordi.saludes
    Assignees: 
    Sebastian Xambo
    Dependencies: 
    Commanding GF library
    Status: 
    Planned

    Integrate the commanding library into a component that transforms the issued natural language commands into instructions for the CAS.

    Objects and properties GF library

    0
    ID: 
    6.6
    Workpackage: 
    Case Study: Mathematics
    Task leader: 
    jordi.saludes
    Assignees: 
    jordi.saludes
    Status: 
    Planned

    A GF grammar library able to generate natural language sentences corresponding to the objects and relations of the word problem. It must be able to parse simple questions related to the word problem domain into predicates. Depends on work package 2 and probably work package 4.

    Module for semi-automatic reasoning

    0
    ID: 
    6.7
    Workpackage: 
    Case Study: Mathematics
    Task leader: 
    jordi.saludes
    Assignees: 
    jordi.saludes
    Status: 
    Planned

    Automated reasoning is needed to assess the soundness of the model proposed by the student and to answer his/her questions. This requires adding small ontologies describing the word problem, including:

    • Data present in the problem statement;
    • Additional world knowledge to make reasoning possible.

    Add State of the Art study here.

    Set theory reasoning of the ducks and rabbits problem

    Some time ago I managed to build a theory supporting the Farm problem in Isabelle/HOL (attached below).

    I wasn't expecting such toil, but the lack of detailed documentation and a wicked simplifier made my life miserable for a whole week.

    Assumptions

    It is based on three sets:

    • Being in the farm (farm)
    • Being a duck
    • Being a rabbit

    and a function is_leg_of : leg → animal. As axioms, we have:

    Ground knowledge axioms

    • Rabbits have 4 legs.
    • Ducks have 2 legs.

    Problem axioms

    • All animals in the farm are rabbits or ducks.
    • There are 100 animals in the farm.
    • There are 260 legs in the farm.

    Extra axioms

    That is, facts that are implicitly known but which need to be stated for Isabelle (with the Main theory) to work:

    • Rabbits and ducks are finite
    • Rabbits and ducks are disjoint

    Variables and equations

    Let R be the number of rabbits in the farm and D the number of ducks in the farm. With the preceding axioms, we were able to produce Isabelle-certified proofs that

    R + D = 100
    

    and

    2*D + 4*R = 260
    

    and then deduce that R=30 and D=70.
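    As a quick sanity check of this arithmetic outside Isabelle, here is a minimal Python sketch using sympy (it verifies the solution, not the proofs):

        # Solve the farm problem's two equations for R and D.
        from sympy import symbols, Eq, solve

        R, D = symbols("R D", integer=True, nonnegative=True)
        print(solve([Eq(R + D, 100), Eq(2*D + 4*R, 260)], [R, D]))
        # -> {D: 70, R: 30}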

    Attachment: Farm.thy (5.67 KB)

    Integration into dialog manager

    0
    ID: 
    6.8
    Workpackage: 
    Case Study: Mathematics
    Task leader: 
    jordi.saludes
    Assignees: 
    jordi.saludes
    Status: 
    Planned

    In particular, objects will be annotated by natural language noun phrases and equations by sentences. These annotations will be parsed into the GF interlingua and will be used whenever language generation related to the problem is needed.

    WP7: Case Study Patents

    The work will start with the provision of user requirements (WP9) and the preparation of a parallel patent corpus (EPO) to fuel the training of statistical MT (UPC). In parallel UGOT will work on grammars covering the domain and subsequently, together with UPC, apply the hybrid (WP2, WP5) MT on abstracts and claims. Ontotext will provide semantic infrastructure with loaded existing structured data sets (WP4) from the patent domain (IPC, patent ontology, bio-medical and pharmaceutical knowledge bases, e.g. LLD). Based on the use case requirements, Ontotext will build a prototype (D7.1, D7.2) exposing multiple cross-lingual retrieval paradigms and MT of patent sections. The accuracy will be regularly evaluated through both automatic (e.g. BLEU scoring) and human based (e.g. TAUS) means (WP9).

    Task List

    The work package is split into 9 major tasks as follows:

    • Task 7.1 User Requirements and Scenarios (Task Lead: UPC)
    • Task 7.2 Patent corpora (Task Lead: UPC)
    • Task 7.3 Grammars for the patent domain (Task Lead: UGOT)
    • Task 7.4 Ontologies and document indexation (Task Lead: Ontotext)
    • Task 7.5 Prototype (Task Lead: Ontotext)
    • Task 7.6 SMT and Hybrid MT (Task Lead: UPC)
    • Task 7.7 Prototype (user interface) (Task Lead: Ontotext)
    • Task 7.8 Human evaluation (Task Lead: TBD)
    • Task 7.9 Patent Case Study: Final Report (Task Lead: UPC)


    Month 10-15 plan

    • Task 7.2 starts in M10 and is due to provide a first set of corpora at the end of M16. Final revision depends on the availability of the EPO data.
    • Task 7.3 starts in M10 and is due to provide a preliminary report at the end of M16.

    Month 16-21 plan

    • Task 7.1 starts at M15 and is due to provide a preliminary version at the beginning of M17.
    • Task 7.3 will produce a more complete report by the beginning of M19.
    • Task 7.4 starts at M16 and is due to provide a description of the type of queries at the end of M16.
    • Task 7.5 starts at M16 and is due to provide a description of the Prototype architecture at the end of M16.
    • Task 7.6 starts along with WP5 and will produce a SMT baseline for the Patents prototype.
    • D7.1 deadline is M21.

    User Requirements

    1 Jun 2010
    1 Jul 2010
    Europe/Stockholm
    ID: 
    7.1
    Workpackage: 
    Case Study: Patents
    Assignees: 
    aarne.ranta
    Assignees: 
    meritxell.gonzalez
    Status: 
    Completed
    Timeframe: 
    May 2011 - Oct 2011

    The patents case study comprises two basic scenarios: the online patent retrieval and the patent translation. In this prototype we tackle these two scenarios separately, as shown in Figure 1, even though they can be viewed as a unique multilingual patent retrieval paradigm. In the future, we plan to study how to automate the reciprocal inputs between the two processes, i.e. the annotation of translations and the translation of semantically annotated documents.

    From a general perspective, two user roles may be defined in this case study: end-users looking for information related to the patents and editors adding new patent documents to a hypothetical repository.

    Details are given in D7.1.

    Patent Corpora

    1 Jun 2010 15:17
    Europe/Vienna
    ID: 
    7.2
    Workpackage: 
    Case Study: Patents
    Task leader: 
    meritxell.gonzalez
    Assignees: 
    cristina.españa
    Assignees: 
    lluis.marquez
    Status: 
    Completed
    Timeframe: 
    Jun 2011 - Oct 2012
    Completed on: 
    30 November, 2012 - 23:00

    Determining and gathering bilingual and monolingual corpora for the patent case study.

    • The SMT system is trained with the MAREC corpus (WP5).
    • The EPO dataset is used for testing purposes (WP5).
    • The www-EPO dataset will be used to fill the retrieval databases (WP7).

    Grammars for the patent domain

    1 Aug 2010
    Europe/Stockholm
    ID: 
    7.3
    Workpackage: 
    Case Study: Patents
    Task leader: 
    ramona.enache
    Assignees: 
    aarne.ranta
    Assignees: 
    ramona.enache
    Status: 
    Ongoing
    Timeframe: 
    Jan 2011 - Nov 2012

    There are two subtasks here:

    • Grammars for translation of the patent documents.
    • Grammars for online translation of CNL queries.

    Ontologies and Document Indexation

    0
    ID: 
    7.4
    Workpackage: 
    Case Study: Patents
    Task leader: 
    meritxell.gonzalez
    Assignees: 
    borislav.popov
    Assignees: 
    mariana.damova
    Status: 
    Ongoing
    Timeframe: 
    Jun 2011 - Oct 2012

    Developing an ontology capturing the structure of patent documents, and indexing the patent documents according to the semantic knowledge.

    Patents Retrieval System

    1 Jul 2010
    Europe/Vienna
    ID: 
    7.5
    Workpackage: 
    Case Study: Patents
    Task leader: 
    lluis.marquez
    Assignees: 
    borislav.popov
    Assignees: 
    milen.chechev
    Assignees: 
    petar
    Relevant Deliverables: 
    Patent Case Study Final Report
    Relevant Deliverables: 
    Patent MT and Retrieval Prototype
    Relevant Deliverables: 
    Patent MT and Retrieval Prototype Beta
    Dependencies: 
    Patent Corpora
    Status: 
    Completed
    Timeframe: 
    Jun 2011 - Dec 2012

    Contact @UPC: Lluis and Cristina

    DEPENDENCIES:

    • TASK 1, 2, 3 and 4
    • WP4. Knowledge Engineering
    • TASK 8 (for final version of prototype)

    Participants:

    • Ontotext,
    • UGOT,
    • UPC

    Contact point @Ontotext: Borislav Popov

    DEADLINES: Beta = M21; Final = M27

    Machine Translation Systems

    22 Mar 2010
    Europe/Vienna
    ID: 
    7.6
    Workpackage: 
    Case Study: Patents
    Assignees: 
    aarne.ranta
    Assignees: 
    cristina.españa
    Assignees: 
    lluis.marquez
    Assignees: 
    meritxell.gonzalez
    Assignees: 
    ramona.enache
    Relevant Deliverables: 
    Patent Case Study Final Report
    Relevant Deliverables: 
    Patent MT and Retrieval Prototype
    Relevant Deliverables: 
    Patent MT and Retrieval Prototype Beta
    Status: 
    Completed
    Timeframe: 
    Jan 2012 - Dec 2012
    Completed on: 
    11 January, 2013 (All day)

    Contact @UPC: Lluis and Cristina

    DEPENDENCIES:

    • TASK 2, 3
    • WP5. A baseline of the WP5 system will be integrated in the prototype.

    Patent abstracts and claims are translated using the baseline of the hybrid system.

    Prototype (User Interface)

    1 Dec 2010
    31 Oct 2011
    Europe/Vienna
    ID: 
    7.7
    Workpackage: 
    Case Study: Patents
    Task leader: 
    borislav.popov
    Assignees: 
    borislav.popov
    Assignees: 
    cristina.españa
    Assignees: 
    lluis.marquez
    Assignees: 
    meritxell.gonzalez
    Assignees: 
    milen.chechev
    Relevant Deliverables: 
    Patent MT and Retrieval Prototype
    Relevant Deliverables: 
    Patent MT and Retrieval Prototype Beta
    Dependencies: 
    Machine Translation Systems
    Dependencies: 
    Patents Retrieval System
    Status: 
    Completed
    Timeframe: 
    Jun 2011 - Sep 2012

    DEPENDENCIES:

    • TASK 1
    • TASK 8 (for final version of prototype)

    Participants:

    • Ontotext,
    • UGOT,
    • UPC

    Contact point @Ontotext: Borislav Popov

    DEADLINES: Beta = M21; Final = M27

    Evaluations

    1 Jun 2011
    30 Jun 2012
    Europe/Vienna
    ID: 
    7.8
    Workpackage: 
    Case Study: Patents
    Assignees: 
    aarne.ranta
    Assignees: 
    maria.mateva
    Assignees: 
    meritxell.gonzalez
    Relevant Deliverables: 
    Patent Case Study Final Report
    Status: 
    Planned

    DEPENDENCIES:

    • TASK 5

    Note: Deadlines have been delayed 3 months due to the WP delay.

    DEADLINE: M31 (to allow for final report)

    Subtasks

    • Preparation starts M19 (at the very latest)
    • Hiring translators
    • Producing guidelines for translators
    • Full evaluation starts at latest M28
      • Evaluation will make use of the TAUS criteria

    TAUS Evaluation Criteria:

    • Excellent (4):
      • Accurately transfers all info; correct terminology, correct grammar. Understanding not improved by reading the source text.
    • Good (3):
      • Contains minor mistakes; would not need to refer to source text to correct the mistakes.
    • Medium (2):
      • Significant errors in output. Would need to read the source text to correct the errors.
    • Poor (1):
      • Serious errors in output. Would need to read the source text to understand the output. Would probably need to retranslate from scratch.

    WP8: Case Study: Cultural Heritage

    The work is started by a study of the existing categorizations and metadata schemas adopted by the museum, as well as a corpus of texts in the current documentation which describe these objects (D8.1, UGOT and Ontotext). We will transform the CIDOC-CRM model into an ontology aligning it with the upper-level one in the base knowledge set (WP4) and modeling the museum object metadata as a domain specific knowledge base. Through the interoperability engine from WP4 and the IDE from WP2, we will semi-automatically create the translation grammar and further extend it (D8.2, UGOT, UHEL, UPC, Ontotext). The final result will be an online system enabling museum (virtual) visitors to use their language of preference to search for artefacts through semantic (structured) and natural language queries and examine information about them. We will also automatically generate a set of articles in the Wikipedia format describing museum artefacts in the 5 languages with extensive grammar coverage (D8.3, UGOT, Ontotext).

    Links to Swedish museum databases that use the Carlotta system, which is built upon the CIDOC-CRM model:

  • http://samsok.kmmuseum.se/
  • http://carlotta.gotlib.goteborg.se/pls/carlotta/welcome
  • http://www.tremil.se/pls/hborg/rigby.welcome
  • http://collections.smvk.se/pls/em/rigby.SokEnkel

    WP9: User Requirements and Evaluation

    Requirements for work

    The work will start with collecting user requirements for the grammar development IDE (WP2), translation tools (WP3), and the use cases (WP6-8).

    We will define the evaluation criteria and schedule in synchrony with the WP plans (D9.1). We will define and collect corpora, including diagnostic and evaluation sets: the former to improve translation quality along the way, the latter to evaluate the final results.

    Corpus definitions

    • Each corpus available for MOLTO will be described by the providing project members.
    • The corpora will be split for development, diagnostic and evaluation use.
    • Contact persons will be named for questions and requests for each corpus.
    • Storage places and access protocols for obtaining the specified corpus data will be defined.

    Description of end-user workflow

    The translator's new role (parallel to WP3: Translator's tools) will be designed and described in the D9.1 deliverable. Most current translator's workbench software treats the original text as a read-only source. The tools to be developed within WP3 (and WP2) will give the source text a more mutable role. The translation process will resemble structured document editing or multilingual authoring more than a transformation from a fixed source into a number of target languages.

    We will only provide a basic infrastructure API for external translation workbenches and keep an eye on the "new multilingual translator's workflow".

    Introduction of WP liaison persons and other contacts

    For each work package, the liaison contact information and work progress will be kept up-to-date on the MOLTO web site. Our liaison person Mirka Hyvärinen will be in contact with other project members.

    Access to UHEL's internal working wiki "MOLTO kitwiki" will also be granted to other project members upon request.

    Evaluation of results

    Evaluation aims at both quality and usability aspects. UHEL will develop usability tests for the end-user human translator. The MOLTO-based translation workflow may differ from the traditional translator's workflow. This will be discussed in the D9.1 evaluation plan.

    To measure the quality of MOLTO translations, we compare them to (i) statistical and symbolic machine translation (Google, SYSTRAN); and (ii) human professional translation. We will use both automatic metrics (IQmt and BLEU; see section 1.2.8 for details) and the TAUS quality criteria (Translation Automation User Society). As MOLTO is focused on information-faithful, grammatically correct translation in special domains, the TAUS results will probably be more important.
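    For illustration only, a minimal Python sketch computing sentence-level BLEU with NLTK; the example sentences are invented, and the actual evaluation will use the corpus-level setup defined in D9.1.

        # Toy BLEU computation; reference and hypothesis are made-up examples.
        from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

        reference = "the patent claims a pharmaceutical composition".split()
        hypothesis = "the patent claims one pharmaceutical composition".split()

        smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
        print(round(sentence_bleu([reference], hypothesis, smoothing_function=smooth), 3))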

    Given MOLTO's symbolic, grammar-based interlingual approach, scalability, portability and usability are important quality criteria for the translation results. For the translator's tools, user-friendliness will be a major aspect of the evaluation. These criteria are quantified in (D9.1) and reported in the final evaluation (D9.2).

    In addition to the WP deliverables, there will be continuous evaluation and monitoring with internal status reports according to the schedule defined in D9.1.

    WP10: Dissemination and Exploitation

    The work plan for this work package is still to be defined; its objectives and first activities are reported in the WP10 progress report below.

    Factorize the Food grammar

    15 Apr 2010
    Europe/Stockholm
    ID: 
    10.3
    Task leader: 
    aarne.ranta
    Assignees: 
    aarne.ranta
    Assignees: 
    olga.caprotti
    Status: 
    Completed
    Completed on: 
    26 March, 2010 - 01:00

    Factorize the grammar now used for the demo fridge into modules that isolate the different kinds of phrases, e.g. Comments, Greetings, Questions, etc. Check whether there are ontologies that describe these.

    The factorization can be seen in the phrasebook example under /example/phrasebook.

    MOLTO Phrasebook May Release

    0
    ID: 
    10.4
    Task leader: 
    aarne.ranta
    Assignees: 
    aarne.ranta
    Assignees: 
    krasimir.angelov
    Assignees: 
    olga.caprotti
    Assignees: 
    ramona.enache
    Assignees: 
    thomas.hallgren
    Dependencies: 
    Factorize the Food grammar
    Status: 
    Completed
    Timeframe: 
    Apr 2010 - May 2010
    Completed on: 
    1 June, 2010 (All day)

    The MOLTO Phrasebook is a web application for the traveler; eventually it will also be a phone application (for Android). It consists of frequently used phrases that a foreigner might want to use when abroad.

    Features of May Release

    • fixes and additions in RGL
    • data collection from
      • wikipedia phrasebook
      • wiki page for collection
    • web server
    • web GUI
    • Android GUI
    • structured and customizable release (e.g. choose the language pair)
    • agree on base abstract syntax
    • Android stand-alone
    • complete remaining concretes
      • examples + native informant
    • feedback button
      • current state info
      • spam issue
    • unlexing
    • lexing &+
    • disambiguation
    • deletion
    • history
    • gr by shaking

    demo preview: http://tournesol.cs.chalmers.se/~aarne/phrasebook/phrasebook.html

    Investigate the Internationalization API of Facebook

    0
    ID: 
    10.6
    Task leader: 
    olga.caprotti
    Assignees: 
    johnj.camilleri
    Assignees: 
    Kaarel.Kaljurand
    Assignees: 
    krasimir.angelov
    Assignees: 
    thomas.hallgren
    Relevant Deliverables: 
    GF Grammar Compiler API
    Status: 
    Planned
    Timeframe: 
    Mar 2012 - Apr 2013

    The current GF Grammar Compiler API provides translation services that can be called on the fly. The goal of this task is to find out how to integrate them with an existing API where there is a need for internationalization, for example Facebook's: https://developers.facebook.com/docs/internationalization.

    The image shows how translations are entered manually in the current version. My guess is that we could improve on that.

    Another example is the situation of commonly used sentences, such as "Happy birthday", which we have in our Travel Phrasebook but not in Portuguese; we could friends-source it :) but how? Give them a FB app?

    Love to see some comments on this.

    BTW, I am not partial to FB; you can check any social network of your liking that provides an internationalization API. This is a proof of concept, also looking for CNLs in the wild :)
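    As a rough sketch of the idea, the following hypothetical Python snippet fills an application's string table by calling a GF translation web service; the server URL, grammar name, concrete-syntax names and reply format are all assumptions here, not a documented MOLTO endpoint.

        # Hypothetical: fetch a translation from a GF server for an i18n string table.
        import requests

        GF_SERVER = "http://cloud.grammaticalframework.org/grammars/Phrasebook.pgf"  # assumed

        def gf_translate(text, src, dst):
            """Ask the server to parse `text` in `src` and linearize it in `dst`."""
            params = {"command": "translate", "input": text, "from": src, "to": dst}
            reply = requests.get(GF_SERVER, params=params).json()
            # Assumed reply shape: one record per parse, each with linearizations.
            return reply[0]["translations"][0]["linearizations"][0]["text"]

        # Concrete-syntax names are made up for the example.
        print(gf_translate("Happy birthday", "PhrasebookEng", "PhrasebookPor"))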

    WP11: Multilingual semantic wiki - AceWiki

    Introduction

    The core of WP11 is AceWiki, an existing wiki system which is going to be developed into a multilingual controlled natural language wiki system within the MOLTO project.

    Important links

    The AceWiki homepage (http://attempto.ifi.uzh.ch/acewiki/) contains:

    • demo wikis
    • list of related publications
    • further links

    AceWiki development is hosted on GitHub (https://github.com/AceWiki/AceWiki)

    Tasks

    • Task 11.1: Make the AceWiki design multilingual and implement a small AceWiki engine for multilingual GF grammars
    • Task 11.2: General refactoring of the AceWiki code

    Meetings


    To do

    Phase 1

    AceWiki side:

    • Refactor AceWiki to support different implementations of predictive parsers (at the moment, AceWiki's chart parser is hardwired) [done: Tobias]
    • Extract from the package "aceowl" everything that should be reused in GFAceWiki into a new package (mostly OWL-related stuff) [done: Tobias]
    • Connect AceWiki to GF (via JPGF) [done: Tobias]
    • Change the AceWiki architecture to support multilinguality [done: Tobias]
    • Implement a simple AceWiki language engine for multilingual GF grammars (as an alternative to the current ACE-OWL engine) [done: Tobias]

    GF side:

    Phase 2

    • Implement support for adding/changing/deleting words
    • Implement more languages
    • Choose application domain(s) and build exemplary knowledge base
    • Perform user studies

    Releases

    Release notes: https://raw.github.com/AceWiki/AceWiki/master/CHANGES.txt

    • 2012-01-05: v0.5.2

    Unsorted

    AceWiki as a webservice

    See also https://github.com/yuchangyuan/AceWiki

    See also the thread starting with: https://lists.ifi.uzh.ch/pipermail/attempto/2011-December/000818.html

    • There could potentially be multiple front-ends
      • the current front-end
      • an existing GF front-end
      • Emacs
      • Unix commandline
      • ACE View
      • native Android/iOS app
      • ...
    • REST API
      • makes it easy to push content (an existing lexicon, an existing ACE text) into the wiki
      • should support GET as much as possible
    • Make AceWiki easily deployable on hosting providers such as Google App Engine
      • reasons: speed, reliability, etc.
      • for that to work, AceWiki should be completely in Java (i.e. no use of APELocal); ape.exe would then have to run on a different host (because it is in Prolog)
      • would reasoning performance profit in a major way?

    AceWiki more configurable

    AceWiki Refactoring

    0
    ID: 
    11.2
    Assignees: 
    Kaarel.Kaljurand
    Assignees: 
    Tobias.Kuhn
    Relevant Deliverables: 
    Multilingual semantic wiki
    Status: 
    Completed
    Timeframe: 
    Dec 2011 - Jan 2012
    Completed on: 
    5 January, 2012 (All day)

    General refactoring and clean-up of the AceWiki code.

    Multilingual AceWiki

    0
    ID: 
    11.1
    Assignees: 
    Tobias.Kuhn
    Relevant Deliverables: 
    Multilingual semantic wiki
    Status: 
    Completed
    Timeframe: 
    Dec 2011 - Jan 2012
    Completed on: 
    5 January, 2012 (All day)

    Make the AceWiki design multilingual and implement a small AceWiki engine for multilingual GF grammars.

    D1.2 Progress Report T6


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D1.2. Progress Report
    Security (distribution level): Confidential
    Contractual date of delivery: M7
    Actual date of delivery: 1 Oct 2010
    Type: Report
    Status & version: Draft (evolving document)
    Author(s): A. Ranta et al.
    Task responsible: UGOT
    Other contributors: All


    Abstract

    Progress report for the first semester of the MOLTO project lifetime, 1 Mar - 30 Sep 2010.

    1. Publishable Summary

    This section must be of suitable quality to enable direct publication by the Commission and should preferably not exceed four pages.

    In line with this, diagrams or photographs illustrating and promoting the work of the project, as well as relevant contact details or list of partners can be provided without restriction.


    Project context and objectives

    The project MOLTO - Multilingual Online Translation - started on March 1, 2010 and will run for 36 months. It promises to develop a set of tools for translating texts between multiple languages in real time with high quality. MOLTO will use multilingual grammars based on semantic interlinguas and statistical machine translation to simplify the production of multilingual documents without sacrificing the quality. The interlinguas are based on domain semantics and are equipped with reversible generation functions: translation is obtained as a composition of parsing the source language and generating the target language. An implementation of this technology is provided by GF, Grammatical Framework. GF technologies in MOLTO are complemented by the use of ontologies, such as those used in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data.
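    Schematically, and only as a sketch of the idea rather than MOLTO code, translation via an interlingua is a function composition:

        # Sketch: translation = linearize_target . parse_source
        def make_translator(parse_source, linearize_target):
            """Compose a source-language parser with a target-language linearizer."""
            def translate(sentence):
                tree = parse_source(sentence)       # source text -> interlingua tree
                return linearize_target(tree)       # interlingua tree -> target text
            return translate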

    MOLTO is committed to dealing with 15 languages, which includes 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is on-going work on at least Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, Swahili, Tswana, and Turkish.

    Tools like Systran (Babelfish) and Google Translate are designed for consumers of information, but MOLTO will mainly target the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating their web pages automatically without the fear that the message will change. Third-party translation tools, possibly integrated in the browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecision. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (which equals 10 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its website, then it might be committed to it.

    There is a well known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage whereas MOLTO opts for precision in domains with a well understood language. Three such domains will be considered during the MOLTO project: mathematical exercises, biomedical patents, and museum object descriptions. The MOLTO tools however will be applicable to other domains as well. Examples of such domains could be e-commerce sites, Wikipedia articles, contracts, business letters, user manuals, and software localization.

    Main results achieved so far

    A few results have already been achieved during the first semester of the project's lifetime. Two applications of the MOLTO translation web services are online on the project web pages:

    1. The travel phrasebook translates sentences to 14 different languages and shows some of the major end-user features available to MOLTO users: predictive typing and JavaScript-based GUI. Predictive typing prompts the user with the next available choices mandated by the underlying grammar and offers quasi-incremental translations of intermediate results from words or complete sentences. JavaScript-based GUI using off-the-shelf functions can be readily deployed on any device where a browser is available.
    2. The MOLTO KRI, Knowledge Reasoning Infrastructure, demonstrates the possibility of adding a natural language query language to retrieve answers from an OWL database. In this way, a query like "Give me information about all organizations located in Europe" is interpreted as a machine-understandable SPARQL statement along the lines of:

           SELECT DISTINCT ?organization ?organization_label
           WHERE { ?organization rdf:type prt:Organization .
                   ?organization prt:locatedIn ?organizationloc .
                   ?organizationloc rdfs:label "Europe" .
                   ?organization rdfs:label ?organization_label . }

    On the more technical level, MOLTO released:

    • first version of the Python plugin for GF (based on the planned C plugin). The plugin makes GF primitives available from the Natural Language Toolkit (http://www.nltk.org/), an open source collection of Python modules for research and development in natural language processing and text analytics, with distributions for Windows, Mac OS X and Linux (see the sketch after this list).
    • (version 3.2 of GF, which features updates of the pgf format, complete type checker for dependent types, exhaustive generation of ASTs via lambda prolog, support for probabilities in the abstract syntax, random generation and parse results guided by probability, and example based grammar generation.) DISCUSS WITH AR
    • (Urdu resource grammar library and Turkish morphology) DISCUSS WITH AR
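    As an illustration of the plugin idea, here is a hypothetical usage sketch modeled on the pgf Python runtime binding; the grammar file and concrete-syntax names are made up.

        # Hypothetical: load a compiled grammar and translate by parse + linearize.
        import pgf

        grammar = pgf.readPGF("Foods.pgf")          # compiled grammar (made-up file)
        eng = grammar.languages["FoodsEng"]
        ita = grammar.languages["FoodsIta"]

        # Parse English into an abstract syntax tree, then linearize it in Italian.
        prob, tree = next(eng.parse("this fish is fresh"))
        print(ita.linearize(tree))                  # e.g. "questo pesce è fresco"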

    Expected final results and their potential impact and use

    The expected final product of MOLTO is a software toolkit made available via the MOLTO website. It will consist of a family of open-source software products:

    1. a grammar development tool, available as an IDE and an API, to enable the use as a plug-in to web browsers, translation tools, etc, for easy construction and improvement of translation systems and the integration of ontologies with grammars
    2. a translator’s tool, available as an API and some interfaces in web browsers and translation tools
    3. a grammar library for linguistic resources
    4. a grammar library for the domains of mathematics, patents, and cultural heritage

    These tools will be portable to different platforms as well as generally portable to new domains and languages. By the end of the project, MOLTO expects to have grammar resource libraries for 18 languages, whereas MOLTO use cases will target between 3 and 15 languages.

    The main societal impact of MOLTO will be on contributing to a new perception for the possibilities of machine translation, moving away from the idea that domain-specific high-quality translation is expensive and cumbersome. MOLTO tools will change this view by radically lowering the effort needed to provide high-quality scoped translation for applications where the content has enough semantic structure.

    Public website

    The MOLTO website at http://www.molto-project.eu publishes the results, the news and all information related to the project. In addition, a Twitter feed is also available at http://twitter.com/moltoproject.

    2. Core of the report

    2.1 Project objectives for the period

    The project objectives for the first semester focus on establishing the grounds for cooperation among the partners, hence three deliverables contribute to refine the goals of the project:

    1. setup of the website and definition of the workplan,
    2. definition of the dissemination strategy,
    3. MOLTO test criteria, methods and schedule.

    The first version of the MOLTO web services, due at Month 3, is the major concrete target for the period and demonstrates the technologies underlying the ideas of the project.

    2.2 Work progress and achievements during the period

    Please provide a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.

    For each work package, except project management, which will be reported in section 2.3, please provide the following information:

    • A summary of progress towards objectives and details for each task;
    • Highlight clearly significant results;
    • If applicable, explain the reasons for deviations from Annex I and their impact on other tasks as well as on available resources and planning;
    • If applicable, explain the reasons for failing to achieve critical objectives and/or not being on schedule and explain the impact on other tasks as well as on available resources and planning (the explanations should be coherent with the declaration by the project coordinator) ;
    • a statement on the use of resources, in particular highlighting and explaining deviations between actual and planned person-months per work package and per beneficiary in Annex 1 (Description of Work);
    • If applicable, propose corrective actions.

    WP2 Grammar Developer’s Tools - Month 6

    The Grammarian's Tools include tools for using the GF grammar compiler and the Resource Grammar Library. In the first 6 months of MOLTO, we have worked on consolidating the compiler and the Library API, and also experimenting with the example-based grammar writing technique.

    Clearly significant results include:

    • Milestone 1, 15 languages in the library. Due September 2010; reached December 2009.
    • Workflow for example-based grammar writing and estimated engineering effort: reported as a part of D10.2
    • GF plugin to Python NLTK
    • GF syntax highlight plugin to XCode programming environment.
    • Integrating probabilities with GF grammars.
    • Release of GF 3.1.6 in April 2010; GF 3.2 forthcoming before end of 2010.

    There were no deviations from Annex I, and the use of resources was as planned.

    WP3 Translator's Tools - M6

    na

    WP4 Knowledge Engineering - Month 6

    During the first period we managed to clarify the knowledge representation infrastructure needs of the case studies and software tools in MOLTO. We have also circulated a questionnaire describing the structured data sets which are expected to be of benefit to the project. Based on this information, we proceeded with deploying the knowledge representation infrastructure, which is now in place and accessible to the partners. It will be further described in D4.1 Knowledge Representation Infrastructure.

    The second major direction during this period was the undoubtedly challenging grammar-ontology interoperability. For this we have chosen a quasi-exhaustive knowledge base of important named entities in the world and some relations between them. It is encoded according to PROTON - a basic upper-level ontology with about 300 classes of named entities. The first goal set for this interoperability was the transformation of questions expressed in natural language into a formal query language - SPARQL. For this purpose, and on the basis of the ontology and the entities in the knowledge base, we have manually created a corpus of 500 sentences. This corpus is being used for the development of the GF grammars handling the natural language questions, and also for evaluating the coverage of the grammars over this language space. After an initial grammar handling questions to the knowledge base had been developed for a subset of the English language, we created a transformation function rendering GF sentence trees as SPARQL queries. In order to show these initial results, we have developed a natural language based search interface over the knowledge base, with automatic suggestion of possible continuations of the questions, which is featured on the MOLTO website. The results of these questions are one- or two-dimensional tables of entities, where each row is an individual "answer".

    Effort spent by Ontotext in WP4 – 7.5 PMs; Other participants UGOT: Aarne will talk directly to Olga for this.

    Attachment: MOLTO.WP4_.M6.doc (94 KB)

    WP5 Statistical and Robust Translation - M6

    WP5 is planned to span from Month 7 to Month 30, but it is conditioned by the delay in the patent data. Still, there is already some ongoing work, which we detail in the following.

    Work towards Milestone MS7 (Month 24)

    MS7: First prototypes of hybrid combination models.

    Most of the objectives of the package depend on the compilation of the patent corpus. Even the languages of study depend on the data that the new partner provides. In order to compensate for the delay this causes, both in WP5 and mainly in WP7, we have started working here on hybrid approaches. The methodology now is to develop hybrid methods independently of the domain and data sets used, so that they can later be adapted to patents.

    At the moment, we are able to obtain phrases and alignments from a GF-generated synthetic corpus. This is a first step towards the hard integration of both paradigms, and also towards the soft integration methods led by SMT. We are currently going deeper into the latter, as it is a domain-independent study.

    Work towards Deliverable D51 (Month 18)

    D51 : Description of the final collection of corpora.

    Bilingual corpora are needed to create the necessary resources for training/adapting statistical MT systems and to extend the grammar-based paradigm with statistical information. We will compile and annotate general-purpose large bilingual and monolingual corpora for training basic SMT systems. At the moment, we have compiled and annotated the European Parliament corpus for English and Spanish. Languages will probably finally be English, German, and Spanish or French, so as soon as this is confirmed the final general-purpose corpus can be easily compiled. The depth of the annotation will depend on the concrete languages and the available linguistic processors.

    On the other hand, domain-specific corpora will be needed to adapt the general-purpose SMT system to the concrete domain of application in this project (Patents case study, WP7). We cannot build the final corpus yet, but some of the MOLTO members have joined the IRF, so that a set of patent data is available for individual research purposes. This has allowed us to compile a preliminary parallel corpus on which we can shortly start to build a domain GF grammar and to develop a first pure-SMT domain-adapted translator.

    Attachment: ProgressReport_WP5.odt (34.6 KB)

    WP6 Case Study: Mathematics - Month 6

    Working towards deliverable D6.1:

    1. Refactor prior code (WebALT grammars) into a separate module for each OpenMath Content Dictionary (CD).
    2. Adapt said code to work with current GF resource libraries (3.1)
    3. Test compilation of OpenMath layer for: English, Catalan, French, Italian, Spanish, German, Swedish.

    Clearly significant results include:

    • OpenMath layer of D6.1 compiles correctly for said languages (English, Catalan, French, Italian, Spanish, German, Swedish)

    WP6 was moved ahead to start in Month 5 (instead of Month 7) to buy time for WP5, which will be delayed due to lack of data.

    Attachment: MOLTO.WP6_.M6.doc (23 KB)

    WP7 Case Study: Patents - M6

    WP7 was scheduled to start in Month 4, but the WP leader site, Matrixware, left the MOLTO Consortium during Month 3. We have had negotiations with replacement partners, and expect them to be concluded before November 2010 (Month 9 of MOLTO). We then expect to start WP7 no later than January 2011 (Month 11).

    While the delay amounts to several months, it need not imply great changes in the actual work. The original reason to start in Month 4 was to give the Matrixware site something to work on, since they were not highly involved in the other WPs. The new partner is expected to get started immediately, and the WP will also profit from the fact that some other MOLTO tools have become available in the meantime (grammarian's tools from WP2 and the grammar-statistics combination from WP5).

    The actual work plan for WP7 may change in accordance with the preferences of the new partner. This will happen within the limits of the budget originally allocated to this WP.

    WP8 Case Study: Cultural Heritage - M6

    WP8 will start in Month 12, so no work can be reported yet.

    WP9 User Requirements and Evaluation - M6

    Report pending from UHEL.

    WP10 Dissemination and Exploitation - M6

    The stated objectives of this workpackage are to:

    1. create a MOLTO community of researchers and commercial partners;
    2. make the technology popular and easy to understand through light-weight online demos;
    3. apply the results commercially and ensure their sustainability over time through synergetic partnerships with the industry.

    The first task has been to set up the website for MOLTO, with information about MOLTO's technology and potential (D10.2, UGOT and Ontotext) targeted at research, industry and users. Bibliographic information on GF, on SMT and on knowledge retrieval is kept up-to-date and includes tutorial presentations delivered during the MOLTO workshops. The web site includes a News section with frequent informal posts on internal progress and plans, encouraging community contributions in the form of comments. Lighter newsflash items are published using the MOLTO Twitter feed. A specific section is devoted to Frequently Asked Questions and can be collaboratively maintained by the MOLTO partners.

    This workpackage was responsible for two deliverables during the first semester:

    1. Dissemination plan with monitoring and assessment,
    2. MOLTO web service, first version.

    The dissemination plan can be accessed on the consortium-restricted pages at http://www.molto-project.eu/wiki/d10.1 and will be amended during the project's lifetime if needed. The project has been presented in a few meetings and international events, most notably at LREC2010, EAMT2010, and ACL2010.

    The first version of the MOLTO web service consists of an online demonstration of a multilingual travel phrasebook, described online in Deliverable D10.2 at http://www.molto-project.eu/wiki/d10.2.

    2.3 Project management during the period

    Management tasks carried out during the first semester of MOLTO finalized the administrative and organizational setup of the project. The website for the project is online at http://www.molto-project.eu. The Consortium Agreement had been signed before the Grant Agreement in December 2009. The workplan for MOLTO (Deliverable D1.1) is hosted on the wiki pages on the website.

    The Steering Group of MOLTO, elected during the Kick-Off meeting, presently consists of voting members A. Ranta (UGOT, Chair), J. Saludes (UPC), B. Popov (Onto), and L. Carlson (UH). The Steering group held monthly calls to discuss the project's progress and recorded the minutes on the website. The MOLTO Advisory Board has been established, with members Prof. Stephen Pulman (Computing Laboratory Oxford) and Keith Hall (Google Research Zurich).

    The project had to face a major challenge with the dissolution of the Consortium partner company Matrixware. Upon learning of this, the Coordinator informed the Commission and proceeded to formalize the dismissal of Matrixware, which left the Consortium at the end of Month 2, on April 23, 2010. In order to carry out the tasks set forward in the MOLTO DoW with minor disruption, MOLTO started negotiations with the European Patent Office (EPO) to incorporate it as a new member of the MOLTO Consortium. This process has taken a long time, about 3 months, and we expect to learn their final decision at the end of October. In case of a positive outcome, EPO will step in and we expect few changes to the original workplan. In case of a negative outcome, MOLTO will discuss changing the workplan for Workpackage 7, the Patent Case Study, possibly to a different domain. MOLTO partners have been approached by several interested parties with use case domains that could be suitable testbeds for the tools developed during the project; these will be approached first.

    The original workplan has been slightly modified to cope with changes in the Consortium, mainly by shifting the start of two workpackages. The loss of Matrixware affected the MOLTO activities scheduled for Workpackage 7: Case Study Patents (led by MXW, running from Month 4 to Month 30). The major task that has been put on hold is the preparation of a parallel patent corpus (MXW) to fuel the training of statistical MT (UPC). The work on Workpackage 7 will start as soon as the Consortium situation clarifies. UPC, the most directly affected partner (whose tasks depended on the work of MXW), has begun the work on Workpackage 6: Case Study Mathematics in Month 5 instead of Month 7.

    Two project meetings have been organized, in Barcelona, 8-10 March 2010, and in Varna 10-12 September, 2010. A bilateral meeting, between UH and UGOT, has been organized in Helsinki on 5-6 May 2010.

    3. Deliverables and milestones tables


    Deliverables for Period M1-M6


    List of deliverables accessible to you.

    The admin page is the administrative data related to this deliverable as was planned in the description of work. Use this page, as work package leader, to keep track of changes in the content, scope, or date of the deliverable.

    The wiki page is the collaborative editing platform for the deliverable, when a report, or for the cover document, when a prototype. Please cut and paste the front matter as in sample deliverables when creating a new one.

    The publication is the actual frozen deliverable: it can easily be produced from the wiki page using the print icon and save as pdf, directly from the browser. Unless a publication is linked to the administrative record of the deliverable, it will not appear in the quick listing http://www.molto-project.eu/view/biblio/deliverables.



    Milestones for Period M1-M6


    ID Title Due date
    MS1 15 Languages in RGL 1 September, 2010
    MS2 Knowledge Representation Infrastructure 1 September, 2010

    4. Use of the resources

    Not available for midterm reporting.

    5. Financial statements

    Not available for midterm reporting.

    D1.3 Progress report T12


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D1.3. Progress Report T12
    Security (distribution level): Confidential
    Contractual date of delivery: M12
    Actual date of delivery: 1 Apr 2011
    Type: Report
    Status & version: Draft (evolving document)
    Author(s): A. Ranta et al.
    Task responsible: UGOT
    Other contributors: All


    Abstract

    Progress report for the first year of the MOLTO project lifetime, 1 Mar 2010 - 28 Feb 2011.

    1. Publishable summary


    This section must be of suitable quality to enable direct publication by the Commission and should preferably not exceed four pages.

    The publishable summary has to include all the distinct parts described below:

    • A summary description of project context and objectives,
    • A description of the work performed since the beginning of the project and the main results achieved so far ,
    • The expected final results and their potential impact and use (including the socio-economic impact and the wider societal implications of the project so far),
    • The address of the project public website, if applicable

    In line with this, diagrams or photographs illustrating and promoting the work of the project, as well as relevant contact details or list of partners can be provided without restriction.

    The publishable summary should be updated for each periodic report.


    2. Core of the report for the period

    2.1 Project objectives for the period


    Please provide an overview of the project objectives for the reporting period in question, as included in Annex I to the Grant Agreement. These objectives are required so that this report is a stand-alone document.

    Please include a summary of the recommendations from the previous reviews (if any) and indicate how these have been taken into account.


    2.2 Work progress and achievements during the period


    Please provide a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.

    For each work package, except project management, which will be reported in section 3.2.3, please provide the following information:

    • A summary of progress towards objectives and details for each task;
    • Highlight clearly significant results;
    • If applicable, explain the reasons for deviations from Annex I and their impact on other tasks as well as on available resources and planning;
    • If applicable, explain the reasons for failing to achieve critical objectives and/or not being on schedule and explain the impact on other tasks as well as on available resources and planning (the explanations should be coherent with the declaration by the project coordinator) ;
    • a statement on the use of resources, in particular highlighting and explaining deviations between actual and planned person-months per work package and per beneficiary in Annex 1 (Description of Work);
    • If applicable, propose corrective actions.

    2.3 Project management during the period


    Please use this section to summarise management of the consortium activities during the period. Management tasks are indicated in Articles II.2.3 and Article II.16.5 of the Grant Agreement.

    Amongst others, this section should include the following:

    • Consortium management tasks and achievements;
    • Problems which have occurred and how they were solved or envisaged solutions;
    • Changes in the consortium, if any;
    • List of project meetings, dates and venues;
    • Project planning and status;
    • Impact of possible deviations from the planned milestones and deliverables, if any;
    • Any changes to the legal status of any of the beneficiaries, in particular non-profit public bodies, secondary and higher education establishments, research organisations and SMEs;
    • Development of the Project website, if applicable;

    The section should also provide short comments and information on co-ordination activities during the period in question, such as communication between beneficiaries, possible co-operation with other projects/programmes etc.


    3. Deliverables and milestones tables

    Deliverables


    The deliverables due in this reporting period, as indicated in Annex I to the Grant Agreement have to be uploaded by the responsible participants (as indicated in Annex I), and then approved and submitted by the Coordinator. Deliverables are of a nature other than periodic or final reports (ex: "prototypes", "demonstrators" or "others"). If the deliverables are not well explained in the periodic and/or final reports, then, a short descriptive report should be submitted, so that the Commission has a record of their existence.

    If a deliverable has been cancelled or regrouped with another one, please indicate this in the column "Comments". If a new deliverable is proposed, please indicate this in the column "Comments".

    This table is cumulative, that is, it should always show all deliverables from the beginning of the project.


    Milestones


    Please complete this table if milestones are specified in Annex I to the Grant Agreement. Milestones will be assessed against the specific criteria and performance indicators as defined in Annex I.

    This table is cumulative, which means that it should always show all milestones from the beginning of the project.


    4. Explanation of the use of the resources


    Please provide an explanation of personnel costs, subcontracting and any major costs incurred by each beneficiary, such as the purchase of important equipment, travel costs, large consumable items, etc., linking them to work packages.

    There is no standard definition of "major cost items". Beneficiaries may specify these, according to the relative importance of the item compared to the total budget of the beneficiary, or as regards the individual value of the item.

    These can be listed in the following tables (one table by participant):

    TABLE 3.1 PERSONNEL, SUBCONTRACTING AND OTHER MAJOR COST ITEMS FOR BENEFICIARY 1 FOR THE PERIOD

    Work Package        | Item description       | Amount in € (2 decimals) | Explanations
    Ex: 2, 5, 8, 11, 17 | Personnel direct costs | 235000.00 €              | Salaries of 2 postdoctoral students and one lab technician for 18 months each
    5                   | Subcontracting         | 11000.02 €               | Maintenance of the web site and printing of brochure
    8, 17               | Major cost item 'X'    | 75000.23 €               | NMR spectrometer
    11                  | Major cost item 'Y'    | 27000.50 €               | Expensive chemicals xyz for experiment abc
                        | Remaining direct costs | 15000.10 €               |
                        | Indirect costs         |                          |
                        | TOTAL COSTS            | 363000.85 €              |

    5. Financial statements – Form C and Summary financial report


    Please submit a separate financial statement from each beneficiary (if Special Clause 10 applies to your Grant Agreement, please include a separate financial statement from each third party as well) together with a summary financial report which consolidates the claimed Community contribution of all the beneficiaries in an aggregate form, based on the information provided in Form C (Annex VI) by each beneficiary.

    When applicable, certificates on financial statements shall be submitted by the concerned beneficiaries according to Article II.4.4 of the Grant Agreement.

    Besides the electronic submission, Forms C as well as certificates (if applicable), have to be signed and sent in parallel by post.

    A Web-based online tool for completing and submitting forms C is accessible via the Participant Portal: http://ec.europa.eu/research/participants/portal.


    D1.4 Periodic Management Report T18


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D1.4 Periodic Management Report T18
    Security (distribution level): Confidential
    Contractual date of delivery: M18
    Actual date of delivery: 1 Oct 2011 (expected)
    Type: Report
    Status & version: Final
    Author(s): O. Caprotti et al.
    Task responsible: UGOT
    Other contributors: All


    Abstract

    Progress report for the third semester of the MOLTO project lifetime, 1 Mar 2011 - 30 Sep 2011.

    1. Publishable summary

    1.1 Project context and objectives

    The project MOLTO - Multilingual Online Translation, started on March 1, 2010 and will run for 36 months. It promises to develop a set of tools for translating texts between multiple languages in real time with high quality. MOLTO will use multilingual grammars based on semantic interlinguas and statistical machine translation to simplify the production of multilingual documents without sacrificing the quality. The interlinguas are based on domain semantics and are equipped with reversible generation functions: namely translation is obtained as a composition of parsing the source language and generating the target language. An implementation of this technology is provided by GF [2], Grammatical Framework. GF technologies in MOLTO are complemented by the use of ontologies, such as those used in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data.

    MOLTO is committed to dealing with 15 languages, which includes 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is on-going work on at least Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, Swahili, Tswana, and Turkish.

    While tools like Systran (Babelfish) and Google Translate are designed for consumers of information, MOLTO will mainly target the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating their web pages automatically without the fear that the message will change. Third-party translation tools, possibly integrated in the browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecision. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (which equals 10 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its website, then it might be committed to it. There is a well-known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage whereas MOLTO opts for precision in domains with a well-understood language.

    Three such domains will be considered during the MOLTO project: mathematical exercises, biomedical patents, and museum object descriptions. The MOLTO tools however will be applicable to other domains as well. Examples of such domains could be e-commerce sites, Wikipedia articles, contracts, business letters, user manuals, and software localization.

    1.2 Main results achieved so far

    The results achieved during the first 18 months of the project are:

    • the first web services demonstrations highlighting the fundamental features of the MOLTO system tools: high-quality multilinguality and an NLG-based multilingual query interface to semantic data;
    • the mathematical grammar library, which makes it possible to express simple mathematical problems in more than 10 languages, although not all at the same level of quality;
    • an online GF grammar editor for assisting authors of GF application grammars when editing in the cloud, including the option to carry out example-based grammar generation;
    • GF plugins in Python and C to call library functions;
    • APIs for GF grammar compilation and for the MOLTO translator's tools;
    • a prototype demonstrating GF grammar-ontology interoperability;
    • a description of the cultural heritage ontology for museum artifacts and use case scenarios for the application grammar.

    1.3 Expected final results and their potential impact and use

    The expected final product of MOLTO is a software toolkit made available via the MOLTO website. It will consist of a family of open-source software products:

    • a grammar development tool, available as an IDE and an API, enabling its use as a plugin to web browsers, translation tools, etc., for easy construction and improvement of translation systems and the integration of ontologies with grammars;
    • a translator’s tool, available as an API and as interfaces in web browsers and translation tools;
    • a grammar library for linguistic resources;
    • application grammar libraries for the domains of mathematics, patents, and cultural heritage.

    These tools will be portable to different platforms as well as generally portable to new domains and languages. By the end of the project, MOLTO expects to have grammar resource libraries for 18 languages, whereas MOLTO use cases will target between 3 and 15 languages.

    The main societal impact of MOLTO will be on contributing to a new perception for the possibilities of machine translation, moving away from the idea that domain-specific high-quality translation is expensive and cumbersome. MOLTO tools will change this view by radically lowering the effort needed to provide high-quality scoped translation for applications where the content has enough semantic structure.

    2. Core of the report

    2.1 Project objectives for the period

    This semester marks the halfway point of the project, at which all work packages are under development and the first integrations ought to take place. In particular, the initial prototypes are being delivered. They include the APIs for WP2 and WP3, the GF grammar IDE, the grammar-ontology interoperability allowing natural language generation from an ontology, translation of natural language queries to SPARQL, the GF grammar for simple mathematical exercises, and information extraction. The main integrations are taking place between the GF grammar tools developed in WP2 and the translator's workbench developed in WP3, but also between the museum ontology created by WP8 and the interoperability described above and carried out in WP4.

    2.2 Work progress and achievements during M12-M18

    This section provides a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.

    For each work-package, except project management, which is reported in Section 2.3, the work-package leader provides the following information:

    • A summary of progress towards objectives and details for each task;
    • Highlights of clearly significant results;
    • If applicable, the reasons for deviations from Annex I and their impact on other tasks as well as on available resources and planning;
    • If applicable, the reasons for failing to achieve critical objectives and/or not being on schedule with remarks on the impact on other tasks as well as on available resources and planning;
    • A statement on the use of resources, in particular highlighting and explaining deviations between actual and planned person-months per work package and per beneficiary in Annex I (Description of Work);
    • If applicable, corrective actions.

    WP2 Grammar developer’s tools - M18

    Highlights

    The period M13-M18 has seen intense activity in the development of grammar development tools, as well as in their documentation and dissemination. We can report the following new software:

    • The GF Grammar Web IDE was released at http://www.grammaticalframework.org/demos/gfse/ and has undergone a few upgrades. It is one of the two systems reported in D2.2.
    • The GF Shell Reference http://www.grammaticalframework.org/doc/gf-shell-reference.html was published and will be upgraded semi-automatically whenever new functionality appears.
    • The GF Resource Grammar Library gained three new complete languages: Punjabi, Nepali, and Persian. Partial implementations of Latvian and Swahili were published.
    • The Eclipse plug-in for GF was started and is operational with basic functionality, as reported in D2.2.
    • The example-based grammar writing feature is available as a shell program and also in the Web IDE, as reported in D2.2 (see the sketch after this list).
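
    To make the example-based method concrete, here is a sketch (file names and lexical items are illustrative; the exact tree depends on the RGL version): the developer supplies an example sentence, the resource grammar parses it, and the resulting tree is generalized into a linearization rule:

      > import alltenses/LangEng.gfo
      > parse -cat=Cl "this fish is fresh"
      PredVP (DetCN (DetQuant this_Quant NumSg) (UseN fish_N)) (UseComp (CompAP (PositA fresh_A)))

    Abstracting over fish_N and fresh_A in this tree yields a reusable linearization rule for a hypothetical application function such as Pred : Item -> Quality -> Comment, without the grammarian writing any RGL code by hand.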

    Ongoing work

    We are continuing development of the two IDEs for GF and of the example-based grammar writing method. We are also gathering material for Deliverable D2.3, Best Practices, due M24.

    WP3 Translator's tools - M18

    During the reporting period, work has progressed on the following fronts.

    1. The MOLTO translation editor demo prototype by Krasimir Angelov, built with the Google Web Toolkit, was installed under Eclipse with the Apache web server and FastCGI. The complete development and testing cycle inside Eclipse is now available. The installation instructions are in the MOLTO UHEL Wiki at XXX.

    A new tab embedding a treegrid editor, implemented with the ExtJS JavaScript library, has been added to the editor for editing term equivalents. A version is able to query Ontotext FactForge for term candidates. It remains to implement a full search and edit back end.

    2. A design and implementation plan for an extended Translation Tools API has been drawn up and submitted as Deliverable 3.1 (The Translation Tools API). Most parts of the API already exist as code.

    A development environment for the open source translation management system GlobalSight has been installed for adapting parts of the system for the MOLTO translation tools.

    It remains to develop the glue to connect the existing parts together. Some extensions to the grammar development API have been listed in a requirements section in the deliverable.

    3. A C-language runtime for parsing and generating with PGF files has been written by Lauri Alanko. The first release will be out in October 2011.

    4. A first version of an ontology/terminology acquisition toolkit for lexical resource management, by Seppo Nyrkkö, was demonstrated at the Helsinki project meeting.

    WP4 Knowledge engineering - M18

    WP4's main task is to research the possibilities for interoperability between grammars, written in GF, and ontologies and to build a prototype demonstrating it.

    During the period M12-M18 the work concentrated on refactoring and bug fixing of the prototype built earlier, on experimenting with bigger data sets, and on extending the functionality of the prototype. The main points can be summarized as follows:

    • The infrastructure has been updated with the latest versions of the semantic repository and tools; this was necessary because SPARQL 1.1 queries are supported only by the newest version of the OWLIM semantic repository.
    • The set of ontologies has been extended with DBpedia, GeoNames and other datasets; this is important for the re-usability of the prototype in the use cases WP6-WP8.
    • The requirements for the verbalization of results were analyzed.
    • Verbalization of results was implemented: the current implementation automatically generates a GF abstract representation from the results, and this representation is matched against the GF abstract and concrete grammars that are likewise built automatically from the ontology (see the sketch after this list).
    • Four more languages were added to the prototype; the current list of languages is English, Finnish, Swedish, French, German and Italian.
    • The prototype GUI was refactored and internationalized.
    • The prototype was tested and debugged for Deliverable D4.3.
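
    A minimal sketch of this scheme (class and property names are invented for illustration): ontology classes become GF categories, properties become functions, and a concrete syntax supplies the verbalization patterns:

      abstract Facts = {
        cat Fact ; Painter ; Painting ;
        fun hasPainted : Painter -> Painting -> Fact ;
        fun daVinci : Painter ;
        fun monaLisa : Painting ;
      }

      concrete FactsEng of Facts = {
        lincat Fact, Painter, Painting = Str ;
        lin hasPainted p w = p ++ "painted" ++ w ;
        lin daVinci = "Leonardo" ++ "da" ++ "Vinci" ;
        lin monaLisa = "the" ++ "Mona" ++ "Lisa" ;
      }

    In the prototype such modules are generated automatically from the ontology, so a query result (a set of triples) can be turned into trees like hasPainted daVinci monaLisa and linearized in each supported language.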

    WP5 Statistical and robust translation - M18

    Summary of progress

    M18 is the date by which Milestone MS5 (First prototypes of the baseline combination models) should be achieved. The baseline systems for this workpackage, as described in Task 5.4, include a statistical machine translation (SMT) system trained on patent data, and the GF multilingual translator with a specific grammar for patents.

    The SMT system was mainly developed in the previous six months and was already reported in the First Year Report. In the following section we explain the most significant results accomplished with respect to the GF system.

    For Task 5.5, we have started the work towards the hybrid system. Parts of the GF system, such as the lexicon building, already make use of statistical components. In addition, the methodology for combining SMT and GF alignments has been established and is waiting to be applied to the patent domain.

    The work done for these tasks has recently been published at the 4th Workshop on Patent Translation at MT Summit XIII, under the title "Patent translation within the MOLTO project".

    At the time of writing this report, the Deliverable D5.1 Description of the final collection of corpora, corresponding to Tasks 5.1 and 5.2, has been submitted as a regular publication. It is a public document accessible from the MOLTO web page.

    Highlights

    A first implementation of the English-to-French patent translator with GF is available. The translation process can be divided according to the action of three modules: a generic pre-processing, the on-line lexicon building, and the patents grammar.

    The generic pre-processing consists of a purpose-built tokeniser that deals with compound nouns, phrases separated by hyphens, chemical compounds, etc. The Stanford POS tagger is used for named entity recognition, and a recogniser of numbers has been developed. Once tagged, chemical compounds can be translated independently by the compounds grammar. This grammar is in an early stage of development within this workpackage.

    The second module is devoted to lexicon building. For this, the GF library multilingual lexicon is extended with nouns, adjectives, verbs and adverbs. The abstract syntax for these parts of speech is created from the claims in English, and the words are lemmatised and manually cleaned of noise and ambiguities. The appropriate inflection is generated using the implemented GF paradigms and the GF library's English dictionary, English being the starting language. Base forms are then translated into French and their inflection is generated in the same way. This process will be extended to other languages later in the project.
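
    As an illustration of how an extracted lemma becomes a full inflection table (module and entry names are hypothetical, not the actual MOLTO lexicon), a lexical entry is written once per language using the RGL smart paradigms:

      abstract PatentLex = Cat ** {
        fun polymer_N : N ;
      }

      concrete PatentLexEng of PatentLex = CatEng ** open ParadigmsEng in {
        lin polymer_N = mkN "polymer" ;  -- regular plural inferred
      }

      concrete PatentLexFre of PatentLex = CatFre ** open ParadigmsFre in {
        lin polymer_N = mkN "polymère" masculine ;
      }

    The English entry is created from the lemmatised claims; the French entry is obtained by translating the base form, after which mkN generates the inflected forms.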

    Finally, the core of the translator is the patents grammar. The GF resource grammar has been extended with functions that implement constructions occurring in patent claims. This grammar is also in its first stages: it currently has a large number of ambiguities and its coverage is around 15% of complete sentences. This figure increases to about 60% when dealing with chunks instead of full sentences.
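
    For instance (a hedged sketch, not the actual MOLTO grammar code), a typical claim construction can be added on top of the resource grammar API as follows:

      abstract Claims = Cat ** {
        fun ClaimComprising : NP -> NP -> Cl ;
      }

      concrete ClaimsEng of Claims = CatEng ** open SyntaxEng, ParadigmsEng in {
        -- e.g. "the device comprises a housing and a sensor"
        lin ClaimComprising dev parts = mkCl dev (mkV2 (mkV "comprise")) parts ;
      }

    Each such function captures one recurrent claim pattern; coverage grows by adding functions rather than by relaxing the grammar.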

    Deviations from Annex I

    This workpackage is tightly related to WP7. The delay in the patent corpus from WP7 has implied a reordering of some tasks within WP5. This explains the work done for Task 5.5 substituting parts of Task 5.4, which will be finished in the coming months. Also, because of the delay in the approval of the data, Deliverable 5.1 may be updated soon.

    WP6 Case study: mathematics - M18

    Summary of progress

    Deliverable D6.1 has been released as a tagged SVN repository available at svn://molto-project.eu/tags/D6.1, although bug fixing may continue in the head branch.

    With respect to the T6 progress report, we increased the number of compiled languages from 7 to 13 and checked for correctness and fluency in 3 languages.

    Dissemination activities took place at CADE-23 and the satellite conference THedu'11.

    Highlights

    • Refactoring of the WebALT code into a modular form compatible with GF 3.2 is complete.

    • The library compiles for the following languages: Bulgarian, Catalan, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, Swedish and Urdu plus an artificial language (LaTeX).

    • A demo based on the minibar demo, but including mathematical LaTeX output, is up and running at http://www.grammaticalframework.org/demos/minibar/mathbar.html

    • Testing for correctness and fluency has been done for English, German and Spanish. This amounted to (see the sketch after this list):

      1. Extracting a sample of the grammar productions to form a treebank, to be used for quality testing and as a regression tool.
      2. Asking speakers fluent in the mathematical vernacular of these three languages to provide correct renderings of the treebank entries.
      3. Incorporating the differences into the grammar library.
    • The results of Deliverable D6.1 have been presented at THedu'11.
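
    A sketch of the treebank workflow in the GF shell (grammar file names are illustrative): random trees are sampled from the grammar and linearized in all loaded languages, and the output is stored as the regression baseline:

      > import MathEng.gf MathGer.gf MathSpa.gf
      > generate_random -number=5 | linearize -treebank

    After informants correct the renderings, rerunning the same command and diffing against the corrected treebank reveals regressions.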

    WP7 Case study: patents - M18

    Summary of progress

    WP7 is due to provide a beta prototype in the forthcoming M21: D7.1 Patent MT and Retrieval Prototype Beta. Although the WP started with some delay, the objectives for each task have been accomplished, mostly in the last few months.

    The initial tasks include the definition of the architecture for the prototype (Task 7.7) and the use case scenarios (Task 7.1). We mainly consider two scenarios: the multilingual retrieval of biomedical patents and the online translation of patent claims and abstracts.

    In relation to Task 7.4 and Task 7.5, we started the work towards the selection of ontologies for the biomedical domain and the extraction of FDA terms, drugs and measurement-related models. We have also created an ontology to capture the structure of patent documents. In order to query the system, we have defined the structure and topics of the queries available to the user, detailed below. Finally, we have implemented a first version of the grammar that recognizes such queries.

    Regarding Task 7.2, we recently obtained the official license for using the EPO corpus. Earlier, we had been working with provisional data in order to create the first version of the GF grammars (Task 7.3) and to test the SMT system with this data (Task 7.6).

    Highlights

    The architecture of the multilingual patent retrieval system is based on Exopatent, a working KRI platform from Ontotext. The platform allows several search options, including NL queries. We have defined 131 query examples across 21 query topics related to the biomedical domain. The grammar developed to process the queries covers about 600 queries in English and 500 in French.
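
    A toy version of such a query grammar (all names hypothetical) shows the pattern:

      abstract PQuery = {
        cat Query ; Substance ;
        fun QAbout : Substance -> Query ;
        fun caffeine : Substance ;
      }

      concrete PQueryEng of PQuery = {
        lincat Query, Substance = Str ;
        lin QAbout x = "patents" ++ "mentioning" ++ x ;
        lin caffeine = "caffeine" ;
      }

    Parsing a user query yields an abstract tree such as QAbout caffeine, which the retrieval system can then map to a semantic query over the annotated repository; a French concrete syntax added to the same abstract syntax makes the interface bilingual.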

    Patents in the retrieval engine are being annotated following the two main ontologies selected for the domain.

    The patent translation system is tightly related to WP5. The recent work in this task includes the development of the patents grammar and the extension of the GF Resource Grammar with functions implementing constructions that occur in patent claims. Generally speaking, the coverage of the grammar is still unsatisfactory, which reinforces the case for statistical techniques.

    Deviations from Annex I

    This workpackage has suffered a delay due to the lack of a proper license for the data corpus. Nonetheless, we are achieving the objectives related to the WP7 tasks. Similarly, tasks in WP5 are being rescheduled. As we recently received approval for the use of the data, we expect to speed up the tasks that were waiting for it, such as the baseline system (Task 5.4) and the annotation of the patents and the generation of the database (Task 7.5).

    Use of resources

    The use of resources has been as planned, with no notable deviations, in line with the tasks described above.

    WP8 Case study: cultural heritage - M18

    Summary of progress

    Dana Dannélls from the Linguistics Department of UGOT and Mariana Damova from Ontotext have held a Skype meeting every month to summarize what has been accomplished by WP8 and WP4 and to coordinate the work on tasks. Dana has mainly worked with Ramona Enache, a GF expert, to share ideas and discuss the work in progress.

    Significant results

    • An empirical study of existing metadata schemata adopted by museums in Sweden resulted in the creation of an ad-hoc ontology compatible with a variety of CH data schemata.
    • A study of syntactic structures and patterns for discourse generation was carried out.
    • The well-known standard CIDOC-CRM has been implemented in GF.
    • The application-specific ontology that uses ontology standards was implemented in GF.
    • The grammar implementation of the ontologies has been tested for verbalizing the ontology axioms.

    We have developed a prototype for generating natural language descriptions using discourse patterns in English and Swedish, and have written two scientific papers about this work; both have been peer reviewed. One was presented at a WWW conference and one at a CH workshop.

    WP9 User requirements and evaluation - M18

    Summary of progress

    The work done during months 12-18 targets the preliminary work for D9.2, the final evaluation.

    Highlights

    We have gained much input from Maarit Koponen's review of post-editing analysis and quality measurement in MT evaluation, which was also presented during the Open Day at the Third Project Meeting in Helsinki.

    For designing the evaluation of the translator's tools, we have studied various translation management systems that are common in the translation industry. We have selected GlobalSight, an open-source platform, for closer study.

    We have also set up a MOLTO Content Factory server, which provides collaborative term voting and term validation. These features will be used in the evaluation of terminology work. The MOLTO server already has a URL, but currently UHEL security measures make it HTTP-accessible only via UHEL's VPN. UHEL is working on a solution for opening up the server to the MOLTO Consortium that is compatible with the university's security policies. The server base URL is "http://tfs.cc/" and the MediaWiki content is hosted at "http://tfs.cc/wiki", as demonstrated during the MOLTO Open Day.

    There is an ongoing discussion about collaboration with local entrepreneurs who are researching pre-editing and pre-validation of machine-translatable documents. The research is focused on MT quality and its evaluation metrics.

    Deviations from Annex I

    This work package is tightly related to the other work packages, as well as to the dissemination work package. Due to the changes in the Patents Case Study (WP7) we are reviewing the related material for evaluation purposes. We will announce updates to the earlier evaluation plan (D9.1) as needed.

    Use of resources

    The use of resources follows the earlier plans.

    WP10 Dissemination and exploitation - M18

    The objectives of this WP are to:

    • (i) create a MOLTO community of researchers and commercial partners;
    • (ii) make the technology popular and easy to understand through light-weight online demos;
    • (iii) apply the results commercially and ensure their sustainability over time through synergetic partnerships with the industry.

    To address (i), we have interfaced the RSS feed publishing updates from the MOLTO website to the Twitter feed http://www.molto-project.eu/moltoproject. This will further distribute the MOLTO news feed to mobile devices, alongside the project's presence on LinkedIn.

    To address (ii), we have organized a number of events geared at publicizing the core technologies employed in the project. MOLTO partners from UPC and UGOT organized a GF Summer School in Barcelona on 15-26 August 2011. The program comprised a tutorial week and an advanced week with specific topics, including work being carried out as part of the MOLTO workplan. In particular, J. Saludes presented the evaluation of WP6, T. Hallgren introduced web application programming for GF, R. Enache showed the work on the GF-ontology inter-relation and on WP8, and C. España presented the results of WP5. The web site of the school archives the presentations, the discussions (in particular the future work suggestions resulting from the panel discussion) and the calendar of the lectures. Furthermore, A. Ranta delivered a GF tutorial at CADE-23, Grammatical Framework: A Hands-On Introduction, and J. Saludes presented The GF Mathematics Library (joint work with S. Xambó) during the CADE-23 satellite workshop "CTP Components for Educational Software".

    UHEL arranged the 3rd MOLTO Project Meeting in Helsinki, 31 August - 2 September 2011.

    To address (iii), we will now enlarge the MOLTO Consortium by two new partners, one of which, Be Informed, is a commercial partner interested in exploiting the MOLTO tools in its products.

    The list of publications can be obtained from the MOLTO website, ordered by year (most recent first), http://www.molto-project.eu/biblio?sort=year&order=desc.

    Publications

    The GF book appeared in April 2011 and is expected to help new developers to get started with MOLTO tools: Aarne Ranta, Grammatical Framework: Programming with Multilingual Grammars, CSLI Publications, Stanford, 2011, 340 pp. http://www.grammaticalframework.org/gf-book/

    Aarne Ranta, Translating between Language and Logic: What is Easy and What is Difficult. In N. Bjørner and V. Sofronie-Stokkermans (eds), Automated Deduction - CADE-23 Proceedings, LNCS/LNAI 6803, Springer, Heidelberg, 2011, pp. 5-25 (invited talk mentioning work done in WP6).

    Events

    Computational Morphology. A course in European Master's Programme in Language and Communication Technologies 2011, University of Malta, 22-30 March 2011. http://www.cse.chalmers.se/~aarne/computationalmorphology/

    Computational Syntax. A course in the Masters Programme in Language Technology, University of Gothenburg, 11 April - 31 May, 2011. http://www.cse.chalmers.se/~aarne/computationalsyntax/

    Grammatical Framework: A Hands-On Introduction. Tutorial at CADE-23, Wroclaw, 1 August 2011. http://www.grammaticalframework.org/gf-tutorial-cade-2011/

    Second GF Summer School: Frontiers of Multilingual Technology. Barcelona, 15-26 August, 2011. http://school.grammaticalframework.org/

    2.3 Project management during M12-M18

    The main task of the management workpackage for the period has been to finalize the enlargement of the consortium, which was proposed last January. The new partners, the University of Zurich (UZH) and the company Be Informed (BI), will lead two new workpackages directly applying the MOLTO tools to their existing core technologies. During the negotiation phase, a new workplan was submitted and the budget was recalculated.

    Payment for Period 1 arrived on the last day of August and was distributed to the consortium early in September, after redistribution of the Matrixware budget.

    Monthly meetings of the Steering Committee were held regularly on conference calls and recorded on the wiki pages, http://www.molto-project.eu/wiki/minutes.

    3. Deliverables and milestones tables

    Deliverables of MOLTO are listed and linked for download on the web page http://www.molto-project.eu/workplan/deliverables.

    Below is a summary of deliverables due in the third semester.

    The only milestone due for the period is MS3, the web-based translation tool, which has been interpreted as an online editor interface to a GF application grammar and made available at http://www.molto-project.eu/node/1063.

    ID | Due date | Dissemination level | Nature | Publication
    D2.1 GF Grammar Compiler API | 1 March, 2011 | Public | Prototype | D2.1 GF Grammar Compiler API
    D1.3 Periodic management report 2 | 1 April, 2011 | Consortium | Report | D1.3A Advisory Board Report
    D4.2 Data Models, Alignment Methodology, Tools and Documentation | 1 May, 2011 | Public | Regular Publication | D4.2. Data models and alignments
    D2.2 Grammar IDE | 1 September, 2011 | Public | Prototype | D2.2 Grammar IDE
    D3.1 MOLTO translation tools API | 1 September, 2011 | Public | Prototype | D3.1. The Translation Tools API
    D4.3 Grammar-Ontology Interoperability | 1 September, 2011 | Public | Prototype | D4.3 Grammar ontology interoperability
    D5.1 Description of the final collection of corpora | 1 September, 2011 | Public | Regular Publication | D5.1. Description of the final collection of corpora
    D6.1 Simple drill grammar library | 1 September, 2011 | Public | Prototype | Simple drill grammar library
    D8.1 Ontology and corpus study of the cultural heritage domain | 1 September, 2011 | Public | Other | Ontology and corpus study of the cultural heritage domain
    D1.4 Periodic management report 3 | 1 October, 2011 | Consortium | Report | D1.4 Periodic Management Report T18

    4. Use of the resources

    Tables on the usage of resources are not available for midterm reporting; however, we have a rough estimate of person-months per node.

    Node | Professor | PostDoc | PhD Student | Research Engineer
    UGOT | 1 | 3 | 9 | 0
    UPC | 10 | 12 | 0 | 0
    UHEL | 2 | 0 | 4 | 9
    OntoText | 0 | 0 | 0 | 23.75

    5. Financial statements

    Not available for midterm reporting.

    D1.5 Periodic Management Report T24


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D1.5 Periodic Management Report T24
    Security (distribution level): Confidential
    Contractual date of delivery: M24
    Actual date of delivery: 5 Apr 2012
    Type: Report
    Status & version: Final
    Author(s): O. Caprotti et al.
    Task responsible: UGOT
    Other contributors: All


    Abstract

    Progress report for the fourth semester of the MOLTO project lifetime, 1 Sep 2011 - 29 Feb 2012.

    1. Publishable summary

    1.1 Project context and objectives

    The project MOLTO - Multilingual Online Translation - started on March 1, 2010 and will run for 39 months. It promises to develop a set of tools for translating texts between multiple languages in real time with high quality. MOLTO will use multilingual grammars based on semantic interlinguas and statistical machine translation to simplify the production of multilingual documents without sacrificing quality. The interlinguas are based on domain semantics and are equipped with reversible generation functions: translation is obtained as the composition of parsing the source language and generating the target language. An implementation of this technology is provided by GF, the Grammatical Framework. GF technologies in MOLTO are complemented by the use of ontologies, such as those used in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data. GF has been applied in several small-to-medium-sized domains, typically targeting up to ten languages, but MOLTO will scale this up in terms of productivity and applicability.

    A part of the scale-up is to increase the size of domains and the number of languages. A more substantial part is to make the technology accessible to domain experts without GF expertise and minimize the effort needed for building a translator. Ideally, with the tools produced by MOLTO, this can be done by just extending a lexicon and writing a set of example sentences.

    MOLTO is committed to dealing with 15 languages, which includes 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is on-going work on at least Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, Swahili, Tswana, and Turkish.

    While tools like Systran (Babelfish) and Google Translate are designed for consumers of information, MOLTO will mainly target the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating their web pages automatically without the fear that the message will change. Third-party translation tools, possibly integrated in browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecision. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (which equals 10 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its website, then it might be committed to it. There is a well-known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage whereas MOLTO opts for precision in domains with a well-understood language.

    The MOLTO Enlarged EU proposal adds two countries (Switzerland and The Netherlands) and two work packages. The Semantic Wiki work package builds a system that integrates the functionalities of MOLTO tools with a collaborative environment, where users can create content in different languages, and all edits become immediately visible in all languages, via automatic semantic-based translation. The Interactive Knowledge-Based System work package puts MOLTO technology to use in an enterprise environment, for the localization of end-user oriented systems to new languages and the generation of high-quality explanations in natural language. Noteworthy in this work package is the fact that translation grammars are constructed in house by Be Informed's non-expert staff without the intervention of grammar specialists.

    MOLTO technology will be released as open-source libraries, which can be plugged into standard translation tools and web pages and thereby fit into standard workflows. It will be demonstrated in web-based demos and applied in three case studies: mathematical exercises in 15 languages, patent data in at least 3 languages, and museum object descriptions in 15 languages.

    1.2 Main results achieved so far

    The results achieved during the first 24 months of the project have been demonstrated during the 4th Project Meeting. They include:

    • GF plugin for the Eclipse IDE
    • patent retrieval in NL in the pharmaceutical domain
    • first results on hybrid models combining GF and SMT
    • SageGF, a natural language input interface for the Sage suite of computer algebra systems, with output rendered aurally
    • GF grammars for the cultural heritage domain
    • a number of translation tools, including the GlobalSight translation system and the MOLTO editors for ontologies, translations and term equivalents in TermFactory
    • preliminary integration of GF-based multilingual editing in AceWiki.

    A detailed list with short abstracts is available at http://www.molto-project.eu/content/molto-4th-project-meeting-demos.

    In the past semester we reported:

    • the first web services demonstrations highlighting the fundamental features of the MOLTO system tools: high-quality multilinguality and an NLG-based multilingual query interface to semantic data;
    • the mathematical grammar library, which makes it possible to express simple mathematical problems in more than 10 languages, although not all at the same level of quality;
    • an online GF grammar editor for assisting authors of GF application grammars when editing in the cloud, including the option to carry out example-based grammar generation;
    • GF plugins in Python and C to call library functions;
    • APIs for GF grammar compilation and for the MOLTO translator's tools;
    • a prototype demonstrating GF grammar-ontology interoperability;
    • a description of the cultural heritage ontology for museum artifacts and use case scenarios for the application grammar.

    1.3 Expected final results and their potential impact and use

    The expected final product of MOLTO is a software toolkit made available via the MOLTO website. It will consist of a family of open-source flagship software products:

    • Cloud-IDE for GF
    • Eclipse-IDE for GF
    • a professional translator tool with integrated terminology extension, predictive parsing, syntax editing, robust parsing
    • an ontology query system, based on a generic reusable grammar (carefully built manually) with automatically derived domain-specific extensions; demoed with patents and museum
    • a hybrid patent translation system that beats its competitors, if possible
    • the Mathematical Grammar Library (MGL), supporting both math documentation (Wiki) and tutorial (Sage, word problems)
    • a museum visitors' system generating descriptions and providing a query system, in 15 languages
    • multilingual semantic wiki system integrating the technologies and resources developed within MOLTO
    • a software product from Be Informed that is useful for their customers (to be specified)
    • a set of evaluations, for each of the flagships listed before, showing that high quality can be reached with reasonable effort using MOLTO tools

    These tools will be portable to different platforms as well as generally portable to new domains and languages. By the end of the project, MOLTO expects to have grammar resource libraries for 18 languages, whereas MOLTO use cases will target between 3 and 15 languages.

    The main societal impact of MOLTO will be on contributing to a new perception for the possibilities of machine translation, moving away from the idea that domain-specific high-quality translation is expensive and cumbersome. MOLTO tools will change this view by radically lowering the effort needed to provide high-quality scoped translation for applications where the content has enough semantic structure.

    2. Core of the report

    This section describes the progress of each workpackage and discusses changes to the workplan, if necessary.

    2.1 Project objectives for the period

    The work during the fourth semester has proceeded in parallel in all workpackages, leading to a number of demonstration prototypes.

    In WP2, the work concentrated on finishing the Cloud-based Editor and the Eclipse Plugin for GF, and in WP3, the design of an integrated architecture for the translation tools led to the adoption of the third-party platform GlobalSight as the translators' workflow management framework in which to integrate the MOLTO tools.

    WP4 has come to its conclusion, delivering a prototype, hosted on Ontotext's website, showing GF-OWL interoperability.

    The first hybrid models for the statistical and robust translation promised in WP5 have been implemented and evaluated on a specific test set; the results are available in a deliverable and will form the basis for the next developments. The MGL developed in the mathematics case study has been equipped with a command line interface to the Sage suite of Computer Algebra Systems, thus providing a natural language dialog with a computation system.

    The Patent MT and Retrieval Beta Prototype was first released in M21, and the final version of D7.1 has been delivered.

    In the cultural heritage museum study, the ad-hoc ontology facts, stored in the knowledge representation infrastructure delivered by WP4, can be queried in natural language in 5 languages.

    The milestones for the period have all been achieved as described later in the report.

    2.2 Work progress and achievements during M19-M24

    This section provides a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.

    For each work-package, except project management, which is reported in Section 2.3, the work-package leader provides the following information:

    • A summary of progress towards objectives and details for each task;
    • Highlights of clearly significant results;
    • A statement on the use of resources for Period 2, in particular highlighting and explaining deviations between actual and planned person-months per work package and per beneficiary in Annex I (Description of Work).

    Moreover, if applicable:

    • The reasons for deviations from Annex I and their impact on other tasks as well as on available resources and planning;
    • The reasons for failing to achieve critical objectives and/or not being on schedule with remarks on the impact on other tasks as well as on available resources and planning;
    • Corrective actions.

    WP2 Grammar developer’s tools - M24

    Summary of progress

    This WP has delivered the GF grammar development infrastructure as anticipated, resulting in two IDEs (one cloud-based and one Eclipse-based) and a faster grammar compiler. The WP and its last deliverable have been extended in time to allow for interaction with MOLTO Enlarged EU, which was delayed. The skeleton of the final deliverable was discussed during the 4th Project Meeting.

    Highlights

    GF 3.3.3: faster compilation of grammars, permitting on-the-fly changes of running translation systems.

    GF Cloud-Based IDE: an IDE for beginners, as well as for on-the-fly changes of running translation systems. New features this year:

    • use of GF Resource Grammar Library
    • grammar cloning
    • sharing grammars on the cloud
    • example-based grammar writing

    GF-Eclipse plugin: an IDE for power users, with features such as

    • large project management
    • code cloning
    • library browsing
    • error highlighting
    • support for treebank-based testing

    The GF Resource Grammar Library has 7 new languages since March 2011: Hindi, Latvian, Nepali, Persian, Punjabi, Sindhi, and Thai. Some MOLTO applications (e.g. the Phrasebook and the Math library) have been ported to some of these languages.

    RGL support of lexicon building was evaluated in the article by Détrez and Ranta, Smart Paradigms and the Predictability and Complexity of Inflectional Morphology, to appear in EACL 2012.

    As a tutorial and reference for GF, a book has been published: Ranta, Grammatical Framework - Programming with Multilingual Grammars, CSLI, Stanford, 2011.

    Use of resources in Period 2

    Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern
    UGOT | 0 | 0 | 5 (R. Enache) | 6 (J. Camilleri)
    UPC | 1 (J. Saludes) | 1 (C. España) | 0 | 0
    UHEL | 0 | 0 | 0 | 5 (L. Alanko)
    OntoText | 0 | 0 | 0 | 4.2 (M. Chechev, M. Damova, K. Krustev)

    Deviations from Annex I

    We moved Deliverable 2.3, "User Manual and Best Practices", to Month 27 (due 20 June 2012). The reason is that we want to include the experience of the new kind of users at Be Informed, and the start of the MOLTO enlargement was delayed.

    WP3 Translator's tools - M24

    Summary of progress

    The work done during the last year relates to the promises of WP3: to combine MOLTO tools with traditional CAT tools. As described in the appendix D9.1A, MOLTO tools would be used to translate, in real time and multilingually, some formulaic parts of a more complex document type, such as descriptions of chemical formulas in a patent. The rest of the document would be translated with more traditional methods. We have chosen the translation management system GlobalSight to combine the workflows.

    We have been modifying the editor released in MS3, adding term management and user authentication. We have also been developing a term search; currently it is a separate component, but we plan to attach it to the editor. The search can be tested at http://tfs.cc/molto_term_editor./editor_sparql.html.

    The term management platform TermFactory (TF), a related project run by Lauri Carlson, is under development. The plan is to connect TF to the editor in order to allow on-the-fly user extensions of the lexicon of the grammar. The work done in WP2 by UGOT is in synergy with our WP: they have been developing ways to change the GF grammar without full recompilation, and thus significantly faster.

    As for publications, a master's thesis entitled "Ontology-based lexicon management in a multilingual translation system", written within the project, will be finished during Spring 2012.

    Highlights

    As a part of MS8 (due September 2012), GlobalSight is now running on our server at http://tfs.cc/globalsight/.

    Use of resources in Period 2

    Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern
    UGOT | 0 | 0 | 0 | 4 (A. Slaski, S. Virk, N. Frolov)
    UPC | 0.25 (L. Màrquez) | 1.75 (M. Gonzàlez) | 0 | 0
    UHEL | 2 (L. Carlson) | 0 | 0 | 5 (I. Listenmaa), 6 (J. Shen), 6 (C. Li)
    OntoText | 0 | 0 | 0 | 4.1 (M. Damova, M. Chechev, S. Enev)

    Deviations from Annex I

    WP4 Knowledge engineering - M24

    Summary of progress

    This WP has delivered two-way interoperability between natural language and ontologies. The prototype was built and made publicly available at http://molto.ontotext.com. The prototype integrates the infrastructure for knowledge modeling, semantic indexing and retrieval with tools for NL queries to the semantic repository and verbalization of the results.

    Highlights

    • natural language interface for querying the semantic repository;
    • verbalization of the results from the semantic repository;
    • integration of the KRI infrastructure;
    • a public prototype has been released;
    • D4.1, D4.2 and D4.3 delivered.

    Use of resources in Period 2

    Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern
    UGOT | 0 | 0 | 0 | 0
    UPC | 0 | 0 | 0 | 0
    UHEL | 0 | 0 | 0 | 0
    OntoText | 4 (B. Popov) | 0 | 0 | 10.1 (M. Chechev, P. Mitankin, M. Nozchev, F. Alexiev, A. Ilchev, I. Kabakov, K. Krustev, V. Zhikov)

    Deviations from Annex I

    WP5 Statistical and robust translation - M24

    Summary of progress

    The milestone MS7 has been achieved in M24 (First prototypes of hybrid combination: The methods are implemented and evaluated on a specific test set).

    The work of the fourth semester corresponds to the last three tasks of the WP (T5.4 Baseline systems, T5.5 Hybrid Models and T5.6 Systems evaluation, see http://www.molto-project.eu/workplan/statistical-and-robust-translation). The baseline systems have been improved by extending the GF translator. The translator is now able to deal with chunks, so coverage has been widened (Task 5.4).

    For Task 5.5 we have implemented two kinds of hybrid models, which we call Soft and Hard integration. The following section outlines their main characteristics.

    Finally, for Task 5.6 both the baselines and the hybrid systems have been evaluated using a variety of lexical metrics and compared with generic publicly available translators such as Google and Bing. A manual evaluation has also been carried out in order to compare the most promising hybrid system, according to the automatic evaluation, with the pure SMT translator.

    The work done for these tasks has been submitted to the 16th Annual Conference of the European Association for Machine Translation (EAMT 2012); the submitted paper, entitled "A Hybrid System for Patent Translation", can be found on the MOLTO web page.

    At the time of writing this report, the deliverable D5.2 Description and evaluation of the combination prototypes is being submitted as a regular publication. It is a public document accessible from the MOLTO web page.

    Highlights

    Two kinds of hybrid translators for patents have been developed. The final systems are not only a combination of two different engines; the subsystems themselves also mix different components. We have developed a GF translator for the specific domain that uses an in-domain SMT system to build the lexicon; an SMT system sits on top of it to translate those phrases not covered by the grammar.

    In the previous report we showed that the GF grammar-based system alone could not parse most patent sentences. Consequently, the current translation system aims at using GF for translating patent chunks and assembling the results in a later phase. As explained in D5.2, this implies several modifications to the GF baseline itself.
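
    As an illustration (grammar names hypothetical), chunk translation uses the same parse-and-linearize pipeline as full sentences, but with the start category restricted to a phrase-level category:

      > parse -cat=NP -lang=PatentsEng "a pharmaceutical composition" | linearize -lang=PatentsFre

    Chunks that fail to parse are left for the SMT system, which also decides the final ordering of the translated chunks.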

    To gain robustness in the final system, the output of the GF translator is used as a priori information for a higher-level SMT system. The SMT baseline is fed with phrases that are integrated in two different ways. First, in what we call "Hard Integration", phrases with a GF translation are forced to be translated that way. The system can reorder the chunks and translate the uncovered chunks, but there is no interaction between GF and pure SMT phrases. Second, in the "Soft Integration" system, phrases with a GF translation are included in the translation table with a certain probability, so that the phrases coming from the two systems interact.

    The hybrids exploit the high coverage of statistical translators and the high precision of GF to deal with specific issues of the language. At this moment the grammar tackles agreement in gender and number, agreement between chunks, and reordering within chunks. Although the cases where these problems apply are not extremely numerous, both manual and automatic evaluations consistently show a preference for the hybrid system over the two individual translators. In the near future we plan to widen the range of issues addressed by the grammar. Modifications to the GF translator with SMT components and new kinds of phrase combination will also be introduced.

    Use of resources for Period 2

    Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern
    UGOT | 0 | 0 | 1 (R. Enache) | 0
    UPC | 7.75 (L. Màrquez, L. Padró, H. Rodríguez, D. Farwell) | 11.25 (C. España, M. Gonzàlez, X. Carreras) | 0 | 0
    UHEL | 0 | 0 | 0 | 0
    OntoText | 0 | 0 | 0 | 0

    Deviations from Annex I

    The final hybrid translators have been developed for the French-English language pair. We also aim to include German, so in the following months the concrete syntax for German will be completed. We plan to complete this task in May; it does not affect any other tasks of the project. The systems in the three languages will be available for the final evaluation.

    WP6 Case study: mathematics - M24

    Summary of progress

    Deliverable D6.2 has been released as tagged SVN content publicly available at svn://molto-project.eu/tags/D6.2. Bug fixing and some more features may continue to be developed in the head branch.

    With respect to M18, we added an upper layer to the MGL library to support commands issued to a Computer Algebra System (CAS) and to render the answers in natural language as text or speech, using concrete syntaxes for three languages: English, Spanish and German.

    We developed software components to interact with a CAS (Sage) both externally, using the HTTP protocol, and inside the Sage shell and notebook interfaces. Furthermore, we developed a testing procedure to assist in regression tests for the tool.
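
    A minimal sketch of the underlying idea (category and function names are invented for illustration): the same abstract command can be linearized either to a natural language or to Sage's concrete syntax, so parsing English and linearizing to Sage performs the round trip:

      abstract Commands = {
        cat Command ; Expr ;
        fun Factor : Expr -> Command ;
        fun EInt : Int -> Expr ;
      }

      concrete CommandsEng of Commands = {
        lincat Command, Expr = Str ;
        lin Factor e = "factor" ++ e ;
        lin EInt i = i.s ;
      }

      concrete CommandsSage of Commands = {
        lincat Command, Expr = Str ;
        lin Factor e = "factor" ++ "(" ++ e ++ ")" ;
        lin EInt i = i.s ;
      }

    Parsing "factor 12" with CommandsEng gives the tree Factor (EInt 12), whose CommandsSage linearization (after the unlexer removes token spaces) is the Sage call factor(12).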

    Highlights

    • Developed a prototype to engage Sage in a natural language dialog that runs on Linux and Mac OS X. The system assists command composition by providing autocompletion and gives spoken output on demand.

    • Developed a Sage interface to issue commands to a Sage process from the native Sage shell or notebook. On Linux it provides autocompletion using the shell's native mechanism.

    • The dialog prototype has been demonstrated at DEIMS12.

    Use of resources for Period 2

    Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern
    UGOT | 0 | 0 | 0 | 0
    UPC | 6 (J. Saludes) | 0 | 0 | 6 (A. Ribó Mor)
    UHEL | 0 | 0 | 0 | 0
    OntoText | 0 | 0 | 0 | 0

    Deviations from Annex I

    WP7 Case study: patents - M24

    Summary of progress

    During this period, WP7 has taken a step forward in the development of the prototype: the Patent MT and Retrieval Beta Prototype was first released in M21, and the final version of D7.1 has been delivered.

    Due to the incremental development of the prototype, most of the tasks span until M27, when the final prototype must be delivered. The following describes the progress of these tasks:

    • 7.2 Patent Corpora
    • 7.3 Grammars for the patent domain
    • 7.4 Ontologies and Document Indexation
    • 7.5 Patents Retrieval System
    • 7.6 Machine Translation Systems
    • 7.7 Prototype (User Interface)

    In relation to Task 7.2, the EPO provided a parallel corpus of patents of which only 66 belong to the biomedical domain. We therefore downloaded an alternative corpus of 7,705 documents directly from their website (i.e., publicly available). The following summarizes the content of these documents: 4,274 out of the 7,705 documents have claims (6M lines); 2,058 of them are trilingual (3M lines); 2,116 documents have claims written only in English, 66 have claims only in German (260K lines), and 34 only in French (88K lines). There are no further files with other combinations of languages.

    Regarding Task 7.4 and Task 7.5, the ontologies, indexes, databases and retrieval engines have been set up for the specific domain using the patent documents described above. The semantic annotation process is carried out by a GATE pipeline on the English texts. We are working to export the annotations during the translation process in order to be able to show them also in the French and German texts.

    As for Task 7.3 and Task 7.6, grammar development and SMT adaptation to the domain are being carried out jointly with the WP5 tasks. The grammars have been developed for English and French, and will next be developed for German.

    Finally, regarding Task 7.7, the interface allows accessing the system in three different ways: the controlled language, SPARQL and terms in the index. In the future we will include free text and a combination of it with the controlled language.

    Highlights

    Since M21 there is a fully functional version of the prototype at http://molto-patents.ontotext.com/. The demo allows querying the system in English and French. The patents in the database have original text in English, French and German.

    The retrieval system can be queried in three different ways. The NL-based interface allows the user to query the system in English and French using written natural language. The SPARQL interface, more suitable for advanced users, allows accurate browsing of the repository using SPARQL queries. The keyword-based visual browsing interface uses the RelFinder tool, in which the user can search for keywords using the autocomplete functionality. The results from the RelFinder search are visualised as graphs.

    The visualization of the results displays the list of classes from the ontologies that match the query and the list of patent documents indexed under the matching criteria. The interface also provides a link to access the semantically annotated documents and the original patents. The interface that shows the annotated documents highlights in the text the words that are related to a semantic item. Colors are assigned according to the semantic annotation type. The right side of the page gives the list of semantic types and colors that are present in the text.

    A paper about the patent retrieval system was accepted at the WWW2012 conference, to be held in April.

    Use of resources for Period 2

    Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern
    UGOT | 0 | 0 | 3 (R. Enache) | 3 (A. Slaski)
    UPC | 0 | 7.5 (M. Gonzàlez, C. España) | 0 | 0
    UHEL | 0 | 0 | 0 |
    OntoText | 0 | 0 | 0 | 8.8 (M. Chechev, M. Damova, V. Zhikov, I. Kabakov)

    Deviations from Annex I

    In general, we are achieving the objectives related to the WP7 tasks within the timeframe. However, due to several issues related to the gathering of the corpora, the databases of the retrieval system do not yet include automatic translations of the patent documents, only the original human translations. The issue directly affects the annotation process of Task 7.5, but it does not imply a delay for the whole prototype. The estimate is that the automatic translations and annotations will be included in the final prototype.

    WP8 Case study: cultural heritage - M24

    Summary of progress

    The work package started with data collection, proceeded with developing the ontology interface, and has lately focused on the baseline translator. The translator covers only five languages so far, but will be extended soon. The ontology interface will permit multilingual queries about museum objects, exploiting the MOLTO Knowledge Representation Infrastructure. It also makes this case study an example of multilingual ontology verbalization.

    Highlights

    Ontology and corpus study (D8.1).

    Grammars for translation and multilingual NLG for painting descriptions in five languages: English, Finnish, French, Italian, Swedish. This was built in a modular way that is easy to extend to new languages, which we will do soon.

    Ontology verbalization in a generic way. The same languages as in translation will be usable, but are not yet.

    Use of resources for Period 2

    Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern
    UGOT | 0 | 0 | 1 (R. Enache) | 0
    UPC | 0 | 0 | 0 | 0
    UHEL | 0 | 0 | 0 | 0
    OntoText | 0 | 0 | 0 | 5.9 (M. Chechev, M. Damova, K. Krustev, V. Zhikov)

    Deviations from Annex I

    We have not been able to proceed at the planned pace and would like an extension of the WP time: we would like to extend this WP and its last deliverable, D8.3, until Month 36. We have not been able to use the person-months as planned (as can be seen from the use of resources). One of the planned key persons, Dana Dannélls at UGOT, will be able to join MOLTO full-time a few months later than originally planned, probably in October/November 2012.

    WP9 User requirements and evaluation - M24

    Summary of progress

    This WP is working on collecting evaluation plans from each site.

    An extended D9.1E Evaluation plan has been written.

    Highlights and planned evaluation methods

    Progress evaluation has mainly been carried out by each site during development. It would be a good idea to collect this more systematically.

    For the SMT/hybrid patent case, automatic measures (BLEU and related metrics) are mainly used.

    In developing the GF grammars, informants (native speakers of the relevant languages) have been used during the grammar writing process to check and correct output. The informants have been given output to read and have informed the developer if sentences are correct or if not, how they should be corrected.

    Moving forward, the final evaluations will need to cover usability of the tools as well as quality evaluation of the output. The WP9 review slides give examples of the user communities that might be mobilized for usability evaluation and of the platforms that could be used; an open question is how evaluators can be motivated to use the tools.

    For output quality, final evaluations will likely involve both automatic and manual methods. For automatic methods, UPC's Asiya evaluation kit offers some syntactically and semantically oriented metrics in addition to the purely lexical ones like BLEU, but only for a couple of languages. As all automatic metrics rely on comparison to gold standard human translations, these need to be obtained for the test sets, if they are to be used.

    Manual evaluation methods, on the other hand, require humans to do the evaluations. For the patent case, evaluators need to have sufficient understanding of the material to be able to assess whether the translations are correct, particularly since we expect one of the strengths of the GF hybrid to be the correct handling of long formulae. Plans have therefore been made to hire professional patent translators of the languages in question to do the evaluation, expected in June. Since Google now also provides patent translations, these will be used as a point of comparison. The TAUS scale, fluency measures, etc. could be used in this case.

    For the museum case one manual evaluation approach was to produce museum descriptions in various languages that combine the simpler rules - e.g. "Painter painted Painting in City in Year on Canvas" etc. and then have the native speakers check the individual relations involved (Who painted? What did they paint? Where? When? etc.) and combine these into a measure of the overall fidelity. For this, evaluators do not necessarily need to be museum experts, any native speakers of the language in question should do. If you want a reference for this, an interesting description of such approach is in http://www.cs.ust.hk/~dekai/library/WU_Dekai/LoWu_Acl2011.pdf Other measures such as fluency, TAUS fitness scale could also be used.
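    As a concrete illustration, the kind of template grammar such an evaluation would exercise can be sketched in GF roughly as follows (the module and function names here are invented for illustration; the actual WP8 grammars are considerably richer):

        abstract Museum = {
          flags startcat = Description ;
          cat Description ; Painter ; Artwork ; City ; Year ;
          fun Descr : Painter -> Artwork -> City -> Year -> Description ;
        }

        concrete MuseumEng of Museum = {
          lincat Description, Painter, Artwork, City, Year = Str ;
          lin Descr p a c y = p ++ "painted" ++ a ++ "in" ++ c ++ "in" ++ y ;
        }

    Native speakers would check each relation in the linearized output (who painted, what, where, when), and the per-relation judgments would then be aggregated into the overall fidelity measure.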

    Use of resources for Period 2

    Node Professor/Manager PostDoc PhD Student Research Engineer/Intern
    UGOT 0 0 0 0
    UPC 2.16 (L. Màrquez, L. Padró, D. Farwell) 0.5 (C. España) 0 0
    UHEL 0.5 (L. Carlson) 0 9 (S. Nyrkkö) 0
    OntoText 0 0 0 2 (M. Chechev, K. Krustev)
    UZH 0 0 0 0
    BI 0 0 0 0

    Deviations from Annex I

    WP10 Dissemination and exploitation - M24

    Summary of progress

    A few events have been organized by MOLTO and more are in the making. The young researchers in the Consortium have published a number of papers at international venues on initial results of the project. Project meetings have taken place in Helsinki and in Zürich, with the extra MOLTO-EEU kickoff meeting in Gothenburg. GF tutorials and tutorials on MOLTO work have been delivered on various occasions, most prominently during the GF Summer School: Frontiers of Multilingual Technologies in August 2011.

    Highlights

    At UGOT, R. Enache and S. Virk have passed their licentiate, a step towards the PhD, by publishing work in connection with MOLTO. Moreover, D. Dannélls has held her PhD seminar, discussing work on the natural language analysis of the cultural heritage domain. Finally, K. Angelov obtained his PhD at Chalmers with a thesis on the inner workings of GF, much of which goes to benefit WP2 and WP3.

    The list of publications, archived on the MOLTO website (http://www.molto-project.eu/biblio?sort=year&order=asc), follows here below.

    Controlled Language for Everyday Use: the MOLTO Phrasebook, Ranta, Aarne, Enache Ramona, and Détrez Grégoire, Controlled Natural Languages Workshop (CNL 2010), Marettimo, Italy, (2011)

    The GF mathematics library, Saludes, Jordi, and Xambó Sebastian, THedu'11, (2011)

    Grammatical Framework: Programming with Multilingual Grammars, Ranta, Aarne, CSLI Studies in Computational Linguistics, Stanford, p.350, (2011)

    MOLTO Enlarged EU Annex I - Description of Work, Consortium, MOLTO , (2011)

    MOLTO poster presented at EAMT Conference (European Association for Machine Translation) 2011, Leuven, Ranta, Aarne, and Enache Ramona, (2011) - also presented at META-FORUM by Listenmaa, Inari in Budapest, 2011.

    Typeful Ontologies with Direct Multilingual Verbalization, Angelov, Krasimir A., and Enache Ramona, Controlled Natural Languages Workshop (CNL 2010), Marettimo, Italy, (2011)

    Typeful Ontologies with Direct Multilingual Verbalization poster, presented at the Google Anita Borg retreat, June 2011, Zurich, Enache, Ramona, (2011)

    The GF Mathematics Library, Saludes, Jordi, and Xambó Sebastian, Proceedings First Workshop on CTP Components for Educational Software (THedu'11), 02/2012, Volume Electronic Proceedings in Theoretical Computer Science, Number 79, Wrocław, Poland, p.102–110, (2011)

    D1.3A Advisory Board Report, Hall, Keith, and Pulman Stephen, 03/2011, Number D1.3A, Gothenburg, (2011)

    MOLTO - Multilingual On-line Translation - Annual Report 2010-2011, Caprotti, Olga, España-Bonet Cristina, and Alanko Lauri, 03/2011, Gothenburg, (2011) - Published on cordis.eu.

    A Framework for Improved Access to Museum Databases in the Semantic Web, Dannélls, Dana, Damova Mariana, Enache Ramona, and Chechev Milen , RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING, 09/2011, Hissar, Bulgaria, (2011)

    Hybrid Machine Translation Guided by a Rule–Based System, España-Bonet, Cristina, Labaka Gorka, Díaz De Ilarraza Arantza, Màrquez Lluís, and Sarasola Kepa , Machine Translation Summit, 09/2011, Xiamen, China, p.554-561, (2011)

    The painting ontology, Dannélls, Dana, CIDOC 2011 conference, 09/2011, (2011)

    Patent translation within the MOLTO project, España-Bonet, Cristina, Enache Ramona, Slaski Adam, Ranta Aarne, Màrquez Lluís, and Gonzàlez Meritxell, Workshop on Patent Translation, MT Summit XIII, 09/2011, p.70-78, (2011)

    Reason-able View of Linked Data for Cultural Heritage, Damova, Mariana, and Dannélls Dana, The Third International Conference on SOFTWARE, SERVICES & SEMANTIC TECHNOLOGIES (S3T), 09/2011, Bourgas, Bulgaria, (2011)

    Deep evaluation of hybrid architectures: simple metrics correlated with human judgments, Labaka, Gorka, Díaz De Ilarraza Arantza, España-Bonet Cristina, Sarasola Kepa, and Màrquez Lluís, International Workshop on Using Linguistic Information for Hybrid Machine Translation, 11/2011, Barcelona, Spain, p.50-57, (2011)

    The Patents Retrieval Prototype in the MOLTO project, Chechev, Milen, Gonzàlez Meritxell, Màrquez Lluís, and España-Bonet Cristina, WWW2012 Conference, Lyon, France, (2012)

    MOLTO - Multilingual On-Line Translation, Ranta, Aarne, Talk given at Xerox Research Centre Europe, Grenoble, 19 January 2012, 01/2012, (2012)

    Using GF in multimodal assistants for mathematics, Archambault, Dominique, Caprotti Olga, Ranta Aarne, and Jordi Saludes, 02/2012, Digitization and E-Inclusion in Mathematics and Science 2012, (2012)

    Trips and travels, by meeting and partner

    • MOLTO 2nd Progress Meeting, Gothenburg (UGOT: S. Pulman, K. Hall) (UPC: C. España, J. Saludes, S. Xambo) (UHEL: S. Nyrkkö, M. Koponen, L. Carlson, I. Listenmaa, L. Alanko) (Ontotext: B. Popov, P. Mitankin)
    • MOLTO 1st Year Review, Luxemburg (UGOT: A. Ranta, O. Caprotti) (UPC: S. Xambo, C. España) (UHEL: L. Carlson) (Ontotext: B. Popov)
    • MOLTO WP3 Meeting, Helsinki (UGOT: A. Ranta, R. Enache, K. Angelov, A. Slaski)
    • MOLTO GF Summer School, Barcelona (UGOT: A. Ranta, R. Enache, O. Caprotti, A. Slaski, K. Kaljurand, T. Kuhn) (UHEL: I. Listenmaa)
    • META-FORUM, Budapest (UHEL: I. Listenmaa)
    • THedu 11 and CADE 23, Wrocław (UPC: J. Saludes)
    • MOLTO 3rd Progress Meeting, Helsinki (UPC: J. Saludes, C. España, M. Gonzalez) (UHEL: A. Ranta, R. Enache, O. Caprotti, D. Dannels, J. Camilleri) (Ontotext: M. Chechev, B.Popov)
    • MOLTO-EEU Kick-off Meeting, Gothenburg (UPC: J. Saludes) (UHEL: ) (Ontotext: M. Chechev) (BI: ) (UZH: N. Fuchs, K. Kaljurand)
    • MTSummit XIII, Xiamen (UPC: C. España)
    • DEIMS12, Tokyo (UGOT: O. Caprotti)
    • EAMT 2011, Leuven (UGOT: A. Ranta, R. Enache)
    • GSCL 2011, Hamburg (UGOT: O. Caprotti)

    Use of resources for Period 2

    Node Professor/Manager PostDoc PhD Student Research Engineer/Intern
    UGOT 4 (O. Caprotti) 0 0 0
    UPC 1 (S. Xambo) 0 0 0
    UHEL 0 0 0 0
    OntoText 1 (B. Popov) 0 0 2.6 (M. Chechev, M. Damova)
    UZH 0 0 0 0
    BI 0 0 0 0

    Deviations from Annex I

    None to report.

    WP11 Multilingual semantic wiki - M3

    Summary of progress

    During the first 3 months of our participation in the MOLTO project we completed an initial integration of the GF-provided services (mainly translation and look-ahead editing) into AceWiki.

    We implemented a new Java front-end to the GF Webservice and use it to connect to the GF services from AceWiki. The existing AceWiki user interface was extended to allow easy switching between different languages and to present with each sentence its GF-provided analysis (translations into other languages, word alignment diagrams, GF syntax trees, etc.). The AceWiki storage format was changed to one based on GF abstract trees (which are language-neutral).

    The other main part of our work dealt with the implementation of the ACE grammar in GF. We tested the recall and precision of an existing implementation (Angelov and Ranta, 2009), which targets an earlier version of ACE, and found that some changes need to be introduced to make it compatible with the latest version of ACE. More importantly, we decided to focus on, and started work on, a grammar of the subset of ACE that is used in the current AceWiki.

    We also experimented with taking the content of an existing AceWiki demo wiki (domain Geography) and using it to pre-populate the new GF-based AceWiki.

    Highlights

    • integration of the existing AceWiki with the existing GF webservice, allowing for multilingual viewing and editing of the wiki content
    • analysis of the existing GF implementation of the ACE grammar

    Use of resources for Period 2

    Node Professor/Manager PostDoc PhD Student Research Engineer/Intern
    UGOT 0 0 0 0
    UHEL 0 0 0 0
    UZH 0 2 (K. Kaljurand) 0 0

    Deviations from Annex I

    None to report, besides the delayed start explained later in this document.

    Links to online resources

    WP12 Interactive knowledge-based systems - M3

    Summary of progress

    In the first two months we started with the Adoption phase as described in the DoW for WP12. We have focused our efforts on the requirements for the verbalization component (D12.1). We distinguish four categories of relevant requirements:

    • Ontology verbalization
    • Sentence variant generation
    • Adoption of GF in the Be Informed modelling process
    • Adoption of GF in interactive applications for integrating generated text

    We presented a requirements draft to our partners in March 2012.

    At the kickoff hosted by UGOT, we held a first round of discussions with the UGOT team to draw up the specification for migrating Be Informed's current explanation prototype to GF.

    Next Steps

    • Finalize requirements for D12.1
    • Migrate current explanation engine to GF
    • Train Be Informed co-workers in GF
    • Start development of business-domain-related abstract grammars

    Highlights

    • Kickoff MOLTO-EEU
    • Started requirements for the verbalization component; a draft outline was shared with partners in M3.
    • Started a joint paper with Monnet and MOLTO partners. The article focuses on the integration of GF (MOLTO) and lemon (Monnet) in ontology lexicalization and verbalization.

    Use of resources

    Node Professor/Manager PostDoc PhD Student Research Engineer/Intern
    UGOT 0 0 0 0
    BI 0.25 (J. van Grondelle, J. van Aart) 0 0 0.25 (H. ter Horst)

    Deviations from Annex I

    WP12 runs longer than indicated in the Gantt chart in the Annex; however, the duration is correctly listed under Section 3.3.3. We planned for a duration of 15 months; D12.2, for instance, is projected for March 2013.

    2.4. Deliverables and milestones tables

    Deliverables of MOLTO are listed and linked for download on the web page http://www.molto-project.eu/workplan/deliverables.

    Below is a summary of deliverables due until the fourth semester.

    The milestones until now have been achieved to differing degrees of completion, either by a deliverable or by an online prototype. In particular, MS3 is the translation editor available at http://www.grammaticalframework.org:41296/editor/#translate, which is being integrated into the Translator's Tools due in September 2012. The grammar-ontology interoperability has been documented in D4.3; however, it has been requested that more details be made explicit. Concerning MS7, the methods implemented by the hybrid combination prototypes have been evaluated on a specific test set, as reported in D5.4.

    ID Title Due date
    MS1 15 Languages in RGL 1 September, 2010
    MS2 Knowledge Representation Infrastructure 1 September, 2010
    MS3 Web-based translation tool 1 March, 2011
    MS4 Grammar-ontology interoperability 1 October, 2011
    MS5 First prototypes of the cascade-based combination models 1 October, 2011
    MS6 Grammar tool complete 1 March, 2012
    MS7 First prototypes of hybrid combination models 1 March, 2012

    3. Project management during M19-M24

    The third semester of the project saw the enlargement of the Consortium by two new partners, the University of Zürich and Be Informed. While the start of the MOLTO-EEU project enlargement was originally scheduled for September 2011, which would have kept its deliverables synchronized with the ongoing workpackages, the actual kickoff only happened in January 2012. Consequently, the end date of the project has shifted to 31 May 2013, and the main deliverable of Workpackage 2 has been shifted by 3 months to take into account the feedback from the new use cases added by the enlargement.

    The following inconsistencies have been noticed in the revised Annex:

    • WP12, on page 33, starts M22 and ends M33, while on page 48 it starts M18 and ends M33. The figures on page 48 hold.

    However, due to the delay in start, both WP11 and WP12 will now be ongoing in the period M22-M37, as in the chart below. Notice also the changes affecting WP2 and WP8.

    Response to the review in March 2011

    The following actions were taken as a result of the review report quoted here below.

    Some observations, comments and remarks, raised and discussed at the review meeting, follow. These should be addressed in the respective deliverable(s) as well as in the planning for the next period.

    Rule extraction (from lexical databases, ontologies, text examples) needs to be specified in detail and a concrete schedule should be included in the updated workplan (D1.1).

    The workplan is maintained online using a dynamically generated list of tasks entered by the workpackage leaders. It is available, when logged in, at http://www.molto-project.eu/workplan, with tasks at http://www.molto-project.eu/workplan/tasks. It is the responsibility of each workpackage leader to actively use and document ongoing work with this tool.

    The topic of rule extraction will be detailed in the last, main deliverable of WP2.

    Concerning the integration of the TermFactory (TF) and Knowledge Representation Infrastructure (KRI), it seems that there are overlaps between these tools. The partners must clarify which functions of these tools will be used in the case studies in order to exploit complementarities of the tools and avoid overlaps.

    TermFactory is not only a component of MOLTO but also stand-alone software, which requires certain functionality of its own for technical purposes. Excessive development of overlapping functions will be avoided through cooperation and planning with the KRI developers. It is indeed relevant for evaluation and reporting that the case studies describe which tool provides each functionality.

    Critical issues with respect to the semi-automatic creation of abstract grammars from ontologies, as well as deriving ontologies from grammars, are still to be clarified. Concrete steps to handle these issues need to be specified in detail and a schedule should be included in the updated workplan (D1.1).

    As part of the prototype for D4.3, an abstract grammar and a concrete English grammar, both built automatically from an ontology, have been integrated. They are used to verbalize the results from the semantic repository. Experiments and discussions about using a similar approach to automatically build a query grammar from the semantic repository were carried out, but the GF query grammar provided by UGOT was selected as the better tool because of its expressive power and its ability to generate better natural language. The query grammar has different types of question templates and can easily be ported to a new domain with minor modifications to the abstract and concrete grammars. The mapping rules used to connect the abstract grammar and SPARQL were selected as the best semi-automated approach for linking the grammar to SPARQL. The mapping rules make it possible to define general transformation rules, but also to fine-tune specific cases. The rules currently in use are general enough to be applied to new domains with a ported GF query grammar; this will be demonstrated in the WP7 and WP8 prototypes.
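    To make the idea concrete, treating SPARQL as just another concrete syntax of the same abstract query grammar can be sketched as follows; the names (Query, AllInstances, ont:Painter) are invented for illustration and do not reproduce the actual D4.3 grammars:

        abstract Query = {
          flags startcat = Question ;
          cat Question ; Class ;
          fun AllInstances : Class -> Question ;
          fun Painter : Class ;
        }

        concrete QueryEng of Query = {
          lincat Question, Class = Str ;
          lin AllInstances c = "which" ++ c ++ "are there" ;
          lin Painter = "painters" ;
        }

        concrete QuerySPARQL of Query = {
          lincat Question, Class = Str ;
          -- assumes a suitable PREFIX declaration for ont:
          lin AllInstances c = "SELECT ?x WHERE { ?x rdf:type" ++ c ++ "}" ;
          lin Painter = "ont:Painter" ;
        }

    A question parsed with QueryEng, e.g. "which painters are there", can then be linearized with QuerySPARQL to obtain SELECT ?x WHERE { ?x rdf:type ont:Painter }.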

    Current description of work in WP6 lacks details on the prototype multilingual dialog system to be developed. Including an example dialog and specifications of this prototype in a new version of deliverable D9.1 is recommended.

    WP7 work should focus on the major issues examined in MOLTO, especially in relation to the grammar – ontology interoperability rather than chemical compound splitting. Specific scenarios are needed for the exploitation of MOLTO tools in this case study. It is recommended to include such scenarios in a new version of deliverable D9.1.

    Specific scenarios are needed for the exploitation of MOLTO tools in the case study on cultural heritage (WP8) which just started. It is recommended to include such scenarios in a new version of deliverable D9.1.

    Use cases are listed at http://www.molto-project.eu/workplan/usecases and include two scenarios for WP8 and two for WP7. The specific use case scenarios for WP7 are described in UC-71 and UC-72; details about them are given in Section 2 of D7.1.

    UC-71 focuses on grammar-ontology interoperability. User queries, written in CNL (controlled natural language), are used to query the information retrieval system.

    UC-72 focuses on high-quality machine translation of patent documents. It uses an SMT baseline system to translate a large dataset and populate the retrieval databases. In order to study the impact of hybrid systems on translation quality, a smaller dataset will be translated using the hybrid system developed in WP5.

    The way the project’s web site is structured, although it contains the necessary content, affects its readability in some cases.

    We have added a direct navigation link to Sites and People, and a quick link to the public deliverables list. Publications can be tagged by workpackage or event, thus making the selection of publications by tag easier.

    The deliverables on the workplan (D1.1) and the dissemination plan (D10.1) should be regularly updated (at the beginning of 2nd and 3rd year).

    We have kept an updated list of deliverables with an administrator's view at http://www.molto-project.eu/workplan/deliverables and quick links at http://www.molto-project.eu/view/biblio/deliverables. The dissemination plan is kept up to date on the wiki page, http://www.molto-project.eu/wiki/living-deliverables/d101-dissemination-.... We have now added a section summarizing exploitation plans.

    Taking into account the numerous endeavors undertaken in the translation domain, both research and commercial, the market segment addressed by MOLTO should be identified with maximum precision. The specific case studies should also be taken into account in this effort. It is suggested that careful planning is initiated as early as possible and not later than the next reporting period.

    The addition of the new partner BI will open extra markets for the MOLTO tools. We have also started to look into the use of controlled natural languages in software localization, in social networks, and in specific mathematical domains.

    4. Use of the resources

    Official tables on the usage of resources are available for yearly reporting in Forms C.

    Here we give a rough estimate of the person-months spent by each node. Note that the figures listed previously do not include management months, hence totals may differ.

    Node Professor/Manager PostDoc PhD Student Research Engineer/Intern
    UGOT 9 (O. Caprotti, A. Ranta) 0 10 (R. Enache) 12 (J. Camilleri, A. Slaski, S. Virk)
    UPC 19.16 (J. Saludes, L. Màrquez, L. Padró, H. Rodríguez, D. Farwell) 22 (C. España, X. Carreras, M. Gonzalez) 0 6 (A. Ribó Mor)
    UHEL 2.5 (L. Carlson) 0 7 (S. Nyrkkö) 5 (L. Alanko), 5 (I. Listenmaa), 12 (J. Shen, C. Li)
    Ontotext 6 (B. Popov, S. Karagova) 0 0 36 (P. Mitankin, M. Nozchev, F. Alexiev, A. Ilchev, I. Kabakov, K. Krustev, M. Damova, M. Chechev, V. Zhikov, S. Enev)
    UZH 0 2 (K. Kaljurand) 0 0
    BI 0.25 (J. van Aart, J. van Grondelle) 0 0 0.25 (H. ter Horst)

    We found a typo in the table for WP7 in the new Annex I for MOLTO EEU. The person-months must be the same as in the previous DoW (Version number: 3, Revision 1, 21/01/2011): namely, in the WP7 description on page 31, PMs: UGOT 12, UPC 15, and Ontotext 15 (and not Ontotext 0).

    Declaration by the scientific representative of the project coordinator

    I, as scientific representative of the coordinator of this project and in line with the obligations as stated in Article II.2.3 of the Grant Agreement declare that:

    1. The attached periodic report represents an accurate description of the work carried out in this project for this reporting period;

    2. The project (tick as appropriate):

    ☐ has fully achieved its objectives and technical goals for the period;
    x has achieved most of its objectives and technical goals for the period with relatively minor deviations.
    ☐ has failed to achieve critical objectives and/or is not at all on schedule.

    3. The public website, if applicable:

    x is up to date
    ☐ is not up to date

    4. To my best knowledge, the financial statements which are being submitted as part of this report are in line with the actual work carried out and are consistent with the report on the resources used for the project (section 3.4) and if applicable with the certificate on financial statement.

    5. All beneficiaries, in particular non-profit public bodies, secondary and higher education establishments, research organisations and SMEs, have declared to have verified their legal status. Any changes have been reported under section 3.2.3 (Project Management) in accordance with Article II.3.f of the Grant Agreement.


    Name of scientific representative of the Coordinator:

    Aarne Ranta
    ....................................................................

    Date: 26/4/2012


    D1.6 Periodic Management Report T30


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D1.6 Periodic Management Report T30
    Security (distribution level): Confidential
    Contractual date of delivery: M30
    Actual date of delivery: 7 Nov. 2012
    Type: Report
    Status & version: Final
    Author(s): O. Caprotti et al.
    Task responsible: UGOT
    Other contributors: All


    Abstract

    Progress report for the fifth semester of the MOLTO project lifetime, 1 Mar 2012 - 31 Aug 2012.

    1. Publishable summary

    The project MOLTO - Multilingual Online Translation - started on March 1, 2010 and will run until 31 May 2013, with the task of developing tools for translating texts between multiple languages in real time with high quality. MOLTO's grounding technology is multilingual grammars based on semantic interlinguas, complemented by statistical machine translation, to simplify the production of multilingual documents without sacrificing quality. The specific interlinguas are based on domain semantics and are equipped with reversible generation functions: translation is obtained as a composition of parsing the source language and generating the target language. An implementation of this technology is provided by GF, Grammatical Framework, which in MOLTO is furthermore complemented by the use of ontologies, as in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data. GF has been applied in several small-to-medium size domains, typically targeting up to ten languages, but MOLTO will scale this up in terms of productivity and applicability.
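    As a toy illustration of this composition (not one of the actual MOLTO grammars), a single abstract syntax with two concrete syntaxes already yields a translator via the shared abstract tree:

        abstract Greeting = {
          flags startcat = Message ;
          cat Message ;
          fun Welcome : Message ;
        }

        concrete GreetingEng of Greeting = {
          lincat Message = Str ;
          lin Welcome = "welcome to the museum" ;
        }

        concrete GreetingSwe of Greeting = {
          lincat Message = Str ;
          lin Welcome = "välkommen till museet" ;
        }

    In the GF shell, parse -lang=GreetingEng "welcome to the museum" | linearize -lang=GreetingSwe first recovers the abstract tree Welcome and then generates "välkommen till museet"; because the grammars are reversible, the same modules serve both translation directions.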

    A part of the scale-up is to increase the size of domains and the number of languages. A more substantial part is to make the technology accessible to domain experts without GF expertise and minimize the effort needed for building a translator. Ideally, the MOLTO tools will reduce the overall task to just extending a lexicon and writing a set of example sentences.

    MOLTO is committed to dealing with 15 languages, which includes 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. In addition, there is on-going work on at least Arabic, Farsi, Hebrew, Hindi/Urdu, Icelandic, Japanese, Latvian, Maltese, Portuguese, Swahili, Tswana, and Turkish.

    While tools like Systran (Babelfish) and Google Translate are designed for consumers of information, MOLTO mainly targets the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating its web pages automatically without the fear that the message will change. Third-party translation tools, possibly integrated in the browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecision. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (which equals about 10 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its own website, it might be committed to it. There is a well-known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage, whereas MOLTO opts for precision in domains with a well-understood language.

    MOLTO technology will be released as open-source libraries, accompanied by cloud services, to be used for developing plug-and-play components for translation platforms and web pages, and thereby designed to fit into third-party workflows. The project will showcase its results in web-based flagship demos applied in three case studies: mathematical exercises in 15 languages, patent data in at least 3 languages, and museum object descriptions in 15 languages. The MOLTO Enlarged EU scenarios will apply MOLTO tools to a collaborative semantic wiki and to an interactive knowledge-based system used in a business enterprise environment.

    2. Core of the report

    This section describes the progress of each workpackage and discusses changes to the workplan, if necessary.

    2.1 Project objectives for the period

    The main objective of this 5th semester has been to consolidate the project tools and technologies towards the production of the final deliverables. In order to focus the developments on clear goals, the Consortium has agreed to identify 9 "MOLTO flagships" that highlight the achievements of the project and combine what has been produced across work packages:

    • an IDE for GF including the Cloud-IDE and the Eclipse-IDE
    • a professional translator tool with integrated terminology extension, predictive parsing, syntax editing, robust parsing
    • an ontology query system, based on a generic reusable grammar (carefully built manually) that can be instantiated with automatically derived domain-specific extensions; demoed with mathematics and museum artefacts
    • a hybrid patent translation system, scoring better results than its competitors, if possible
    • the Mathematical Grammar Library, MGL, supporting both scenarios of math documentation (Wiki) and interactive tutorial (Sage, word problems)
    • a museum visitors' system generating descriptions and providing a query system, in 15 languages
    • the multilingual semantic wiki integrating almost all aspects (possibly even SMT; an open question)
    • a commercial product from Be Informed that is useful for their customers (to be specified)
    • a transversal methodology of evaluation, for each of the above, showing that high quality can be reached with reasonable effort using MOLTO tools

    2.2 Work progress and achievements during M25-M30

    This section provides a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.

    For each work-package, except project management, which is reported in Section 2.3, the work-package leader provides the following information:

    • A summary of progress towards objectives and details for each task;
    • Highlights of clearly significant results;
    • A statement on the use of resources for Period 2, in particular highlighting and explaining deviations between actual and planned person-months per work package and per beneficiary in Annex I (Description of Work).

    Moreover, if applicable:

    • The reasons for deviations from Annex I and their impact on other tasks as well as on available resources and planning;
    • The reasons for failing to achieve critical objectives and/or not being on schedule with remarks on the impact on other tasks as well as on available resources and planning;
    • Corrective actions.

    WP2 Grammar developer’s tools - M30

    Summary of progress

    The GF Eclipse plugin http://www.grammaticalframework.org/eclipse/index.html had grown to Version 1.5.1 by June. It has been adopted by Be Informed and Ontotext. Camilleri and Ranta gave a GF crash course at Be Informed using Eclipse. There are moreover two publications: one at EAMT (a poster) and one at FreeRBMT (a full paper).

    The Resource Grammar Library has been enhanced with two languages as external contributions: Japanese and Latvian. The MOLTO Phrasebook has been extended to Latvian. Work on Chinese and Maltese is ongoing.

    With the release of D2.3, Grammar Tools and Best Practices, the work in this WP finished on 30 June. Further work, such as dissemination and exploitation, is still planned.

    A preview version of libpgf, a C-based reimplementation of the GF runtime, has been available since July. When finished, it should make GF technology accessible to applications that cannot use the current Haskell- and Java-based runtimes, whether due to resource constraints or interoperability concerns. In particular, libpgf should be easier to access from non-JVM-based programming languages. Bindings for Python have been available since September.

    Highlights

    GF Eclipse plugin http://www.grammaticalframework.org/eclipse/index.html

    D2.3 http://www.molto-project.eu/biblio/deliverable/grammar-tools-and-best-pr...

    Deviations from Annex I

    The delivery date of D2.3 was postponed from M24 to M27 to be able to profit from the initial experiences with the new partners' scenarios.

    WP3 Translator's tools - M30

    Summary of progress

    After drawing up the specifications in D3.1, we published the prototype as Deliverable 3.2.

    A first prototype of the translation editor, built using the Google Web Toolkit, was tested for integration into the GlobalSight translation manager. This first prototype turned out to be too fragile and has now been replaced by a simpler version, the Simple Translation Editor, found at http://cloud.grammaticalframework.org/translator.

    Work has progressed on integrating the translation tools prototype with the TermFactory web-based term ontology editor (TF), which is to be used by ontologists and terminologists to create and maintain MOLTO's domain-dependent vocabulary.

    The milestone MS8 (Translation tool complete) is achieved in parallel with WP5 (Statistical and robust translation) in the sense that the MOLTO GlobalSight translation project management system is set up and available for use in other work packages, including the use cases and their evaluation in WP9. MS8 can also be considered achieved in the sense that TermFactory is integrated with the OWLIM ontology repository, completing the ontology backend of the tool.

    Highlights

    During the reporting period, a login and access control component (GateService) has been added to TF. This component is integrated with and maintained through the GlobalSight user manager.

    The OWLIM ontology repository has been integrated with TF, using a common interface based on the Jena assembler library. Besides OWLIM, Jena triple databases (TDB, SDB) and WebDAV ontology documents can be edited directly from TermFactory.

    Deviations from Annex I

    UHEL has unspent funds and has recruited an evaluation project manager to work with WP3 and WP9 until M36+3, beginning in November 2012 (the start of M32).

    We are asking to move Deliverable D3.3 (Translation tools / workflow manual) to M33. The D3.3 manual is expected to include documentation of the workflow, with help from other work packages and input from the new recruit, all of which have been subject to a 3-month shift.

    WP4 Knowledge engineering - M30

    Summary of progress

    During this reporting period, a new version of D4.3 Grammar Ontology Interoperability was submitted, elaborating on the automation of grammar creation for controlled language (CL) to RDF interoperability. A new way of building the transformations between CL and SPARQL was suggested, treating SPARQL as another concrete grammar alongside the language-specific ones.

    We have designed and implemented the Query Grammar Helper Builder tool, which connects to a SPARQL endpoint and supports inexperienced users in creating GF grammars: abstract, English, and SPARQL. As part of the work in WP2, we have developed an Eclipse plugin wrapper for this tool.

    We have also started designing the needed improvements to the interoperability approach in order to productize it as an outcome of the project and in alignment with the exploitation objectives.

    We have also introduced several new members to the Ontotext team - Laura Tolosi, Maria Mateva, Ilia Trendafilov and Georgi Georgiev.

    Highlights

    • improvements on Deliverable 4.3, opening the potential for better CL to RDF interoperability
    • improvement of the autocomplete suggestions in the KRI prototype
    • design, implementation, testing, and bug fixing of the Query Grammar Builder tool
    • improvements to the MOLTO-KRI prototype site (http://molto.ontotext.com)

    Deviations from Annex I

    None

    WP5 Statistical and robust translation - M30

    Summary of progress

    The milestone MS8 (Translation tool complete) has been achieved in M30, which for WP5 meant having a complete system integrating the grammars and SMT. Although the system is already available on Gothenburg's server, we are still working on improvements.

    The work done during the fifth period has been focused on 4 of the 6 tasks of the workpackage:

    5.3 Robust Parsing. First efforts are being made to incorporate the robust parsing work of the previous semesters into the hybrid systems. The work is in progress; the final aim is to use GF's robust parsing to deal with the chunks instead of relying on Genia.

    5.4 Baseline systems. The French GF grammar has been refined in order to improve performance. The German grammar has been built from scratch and is now comparable to the French one.

    5.5 Hybrid Models. The new grammars have been integrated into the final hybrid system. Different versions of the previous hybrids are now available. In particular, a new system assigns different probabilities to the GF translations according to the confidence with which they were obtained. This information can also be used in the development step of the statistical system. A one-click system has been developed around the most promising hybrid; it will be updated with new hybrids whenever we obtain better translation performance.

    5.6 Systems evaluation. A wider evaluation of the baseline systems has been carried out by including syntactic and semantic metrics. Also, the comparison with external translation systems such as Google and Bing has been redone to reflect the improvements of these systems during the last year. A comparison with Pluto has also been made. However, we realized that, since we share some data, our test sets may appear in their training data. We plan to use confidence estimation measures in order to test on patents for which neither side has translations, such as American patents.

    Highlights

    • A GF grammar for patents has been developed for German and improved for French

    • A hybrid system with the new grammars has been evaluated, and a new one that takes into account probabilities for the GF translations has been built

    • The work on robust parsing has resulted in two submissions to the Coling 2012 conference

    • A one-click system for the hybrid translator has been built and is now available as a shell command on the server in Gothenburg. Partners wishing to test the system should contact UGOT to obtain access to the server.

    Deviations from Annex I

    There are no deviations from Annex I: at M30 the workpackage has produced a hybrid system for patent translation for English-to-French and English-to-German. For the opposite directions, we use SMT as a fallback.

    However, we plan to continue the work on hybrid systems by improving the current German translator and the integration with robust parsing. Because of this, D5.3 will be postponed until January, when the new hybrid systems will also be finished and prepared for evaluation within WP9.

    WP6 Case study: mathematics - M30

    Summary of progress

    • Created a Prolog-based reasoner to deal with elementary problems in arithmetic, implementing partition by subclasses and decomposition.

    • Created GF grammars to express some word problems in English and Prolog

    • Started integration with UZH AceWiki/gfservice.

    Highlights

    • An AceWiki/gfservice fork that allows entering the problem sentence by sentence in natural language and then automatically computes the associated mathematical model.

    Deviations from Annex I

    We are running out of time to fully develop the prototype. The promised system would work in two modes: an Author mode for entering a problem and a Student mode for attempting to solve it. The former is more or less working, but the latter will take considerably more time. The scheduled Deliverable D6.3, Assistant for solving word problems, due by 1 September 2012, has been postponed to 1 December 2012. Some resources of UPC originally planned for WP8 have been moved to this workpackage in order to accomplish the promised tasks.

    WP7 Case study: patents - M30

    Summary of progress

    Due to the incremental development of the prototype, most of the tasks have spanned until M30, when the final prototype is being completed.
    The following paragraphs describe the progress of these tasks:

    • 7.2 Patent Corpora
    • 7.3 Grammars for the patent domain
    • 7.4 Ontologies and Document Indexation
    • 7.5 Patents Retrieval System
    • 7.6 Machine Translation Systems
    • 7.7 User Interface

    In relation to Task 7.2, the patents downloaded from the EPO website have been automatically translated and semantically annotated. The complete collection of files is available in the MOLTO repository; it consists of 1) the original patent documents, 2) the English version of the patent documents with the semantic annotations, and 3) the automatic translations of claims, abstracts and descriptions. These documents constitute the main content of the retrieval databases.

    As for Task 7.4 and Task 7.5, the ontologies, indexes and databases have been updated with the new dataset of documents.

    Regarding Task 7.6, we designed a process for patent translation that builds a translated document with the same XML structure as the original patent. As a result, the interface of the prototype can show the translated patents using the same user-friendly view as for the original ones. The translation of the documents is a pipeline of five steps: the patent files are preprocessed in order to extract the text contained in the sections in a structured manner (step 1); the formatting marks inline with the text are replaced by placeholders (step 2); the resulting text is segmented and tokenized as required by the translation system (step 3); the raw text is translated using the SMT system (step 4); and the translated text is post-processed in order to recover the original structure of the document (step 5), including the original formatting, claims enumeration and images.

    Regarding Task 7.3, the query grammars have been refactored using the set of primitives defined in the Query Library work conducted in WP4. Consequently, the English and French versions of the patents query grammar were adapted to the new structure, and the German version has been developed from scratch. The new grammar is equivalent to the old one; the difference is that it relies on the primitive query-building functions defined in the Query Library. Developing a grammar using the Query Library requires less linguistic knowledge: it amounts to selecting the set of primitives that is right for the task. Compared to the previous patent query grammar, the new one has fewer constructions, because it is built on top of the Query Library; as a consequence, the constructions are more natural and the number of malformed constructions has decreased considerably. The current grammar consists of 31 patterns and is able to parse/generate 359 query constructions in English, 111 in French and 147 in German.
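    The following sketch suggests what building on such primitives looks like; the library module and the operation in it (QueryLibEng, whichQ) are invented here for illustration and are not the actual Query Library API:

        resource QueryLibEng = {
          -- a reusable question template hiding the linguistic details
          oper whichQ : Str -> Str -> Str = \kind,prop -> "which" ++ kind ++ prop ;
        }

        abstract PatentQuery = {
          flags startcat = Query ;
          cat Query ; Substance ;
          fun PatentsMentioning : Substance -> Query ;
          fun Titanium : Substance ;
        }

        concrete PatentQueryEng of PatentQuery = open QueryLibEng in {
          lincat Query, Substance = Str ;
          lin PatentsMentioning x = whichQ "patents" ("that" ++ "mention" ++ x) ;
          lin Titanium = "titanium" ;
        }

    The grammarian mainly chooses templates and supplies domain vocabulary; the linguistic details stay inside the library.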

    Finally, regarding Task 7.7, the interface has been updated with the German version of the query grammars. Also, some basic tests have been carried out at two levels in order to assess the prototype's functionality. First, some deficiencies regarding the usability of the interface have been corrected, e.g., the examples on the main page, the language selection, and the visualization of results in French and German. In addition, we studied the inherent logic of the queries and the expected results, so that the system now returns results that can be considered more appropriate and accurate.

    Deliverable 7.2 gives a detailed description of the modules and their functionalities.

    Highlights

    • A fully functional version of the final prototype is available by M30.
    • The demo allows for querying the system in English, French and German.
    • The patents in the database have original text in English, French and German, together with the translated documents.
    • A fully completed pipeline for patent document translation.
    • The new query library and its application to the patents use case were presented at the Third Workshop on Controlled Natural Language (CNL 2012), held in Zurich at the end of August. http://attempto.ifi.uzh.ch/site/cnl2012/

    Deviations from Annex I

    In general, we are achieving the objectives of WP7. However, Deliverable 7.2, planned for M27, was delayed to M30 due to several issues related to the gathering of the corpora, the pre- and post-processing of the documents, and the integration of the new query library. Also, we carried out several basic tests to assess the behavior of the prototype in terms of query results and user interaction; these revealed several deficiencies that have since been corrected. Since D7.2 has been postponed, D7.3 is delayed accordingly from M33 to M36.

    WP8 Case study: cultural heritage - M30

    Summary of progress

    The data collection (D8.1) and the first prototype of the grammars (D8.2) were delivered on time. The grammar prototype covers six languages but is being extended to 15. It implements the generation of descriptive texts from facts in the database. The final system (delivered as part of D8.3) will also allow natural language queries about museum objects, applying the technology developed in WP4.

    In addition to the Gothenburg City Museum, there has been interest in this WP from the Europeana project, http://www.europeana.eu/portal/. A plan for later dissemination of the work includes generalizing the results by making them available for Europeana. The Monnet project, http://www.monnet-project.eu/, also shares an interest in this WP in the area of ontology localization and verbalization.

    The work in this workpackage particularly addresses multilingual text planning; this was exploited in D8.2 and has resulted in two publications in 2012 (see WP10).

    Highlights

    • Delivery of D8.2, Multilingual Grammar for Museum Object Descriptions

    Deviations from Annex I

    The schedule for D8.3 is postponed from M30 to M36 due to a delay in the PhD defense of Dana Dannélls, one of the key persons of this WP; as a bonus, MOLTO can then fully profit from her thesis work funded from other sources.

    WP9 User requirements and evaluation - M30

    Summary of progress

    To improve the communication between work packages, we have set up a bug tracking system at http://tfs.cc/trac and assigned the flagship leaders to be in charge of their components. Everyone using the tools is encouraged to leave comments and requests in trac.

    Maarit Koponen's work on evaluating semantic aspects of machine translation quality is progressing. We have started recruiting people for translation quality evaluation and have received a response from the University of Pisa, with Italian-English and Italian-German as possible language pairs.

    Working towards D9.2, we have started gathering evaluations from the grammar writers. Individual use cases can be measured in terms of translation quality, but good grammar design principles make any grammar easier to write and maintain. We evaluate the grammars in terms of D2.3, the best practices document.

    Highlights

    Maarit Koponen's article "Comparing human perceptions of post-editing effort with post-editing operations" was accepted at the Seventh Workshop on Statistical Machine Translation (Montreal) and published in the proceedings.

    Deviations from Annex I

    No major deviations reported. Minor actions include the addition of grammar quality evaluation.

    WP10 Dissemination and exploitation - M30

    Summary of progress

    New on the website is the publishing of news items from the RSS feeds of MOLTO Consortium partners and of the GF source code repository in the footer, and of news items from MOLTO in the header, alongside the publications and the demos. A new collective demo of the GF application grammars together with the novel GF cloud services is prominently featured on the website.

    We are also collecting the use of resources in an overall table (http://www.molto-project.eu/workplan/resources) that summarizes the data provided by the partners. Personal views (e.g. http://www.molto-project.eu/workplan/resources/olga.caprotti) and workpackage views will be available soon.

    Two major events have been organized with the sponsorship of MOLTO: FreeRBMT 2012 and CNL 2012. Free Rule-Based Machine Translation, FreeRBMT 2012, took place in Gothenburg on 13-15 June, 2012 and was organized by UGOT (see http://www.chalmers.se/hosted/freerbmt12-en). A tutorial on the Apertium system followed as an additional satellite event; it was attended by MOLTO partners from UPC and UGOT and resulted in the adoption of some of the Apertium lexicons in GF. Papers by J. Camilleri and by Cristina España-Bonet et al. presenting MOLTO results will appear in the online proceedings. Additionally, the program included a series of presentations on MOLTO's current work in GF resources and tools for machine translation (see http://www.molto-project.eu/freerbmt-program.html). Many of the MOLTO talks were streamed live on the moltoproject YouTube channel, http://www.youtube.com/moltoproject, where they can still be watched.

    The Third Workshop on Controlled Natural Language, CNL 2012, took place on 29–31 August 2012 in Zurich, Switzerland. UZH has organized this meeting over the past years, this time as a MOLTO activity. Several papers were presented by the MOLTO Consortium, listed below, and we also note a contribution by external researchers, Normunds Grūzītis, Pēteris Paikens and Guntis Bārzdiņš, FrameNet Resource Grammar Library for GF, which uses the MOLTO Phrasebook as a case study.

    On 14 August, Aarne Ranta visited Lingsoft Inc in Helsinki. Lingsoft is "a full-service language management company", producing for instance the proofing tools for the Nordic languages and German in Microsoft Office products. Lingsoft is one of the most successful language technology companies, founded in 1986 and working with numerous partners and products. Recent products range from spell checking to language education tools, speech recognition, and translation. He was invited by the CEO Juhani Reiman and the Senior Advisor Simo Vihjanen to give a presentation of MOLTO's tools and discuss possible collaborations. MOLTO and Lingsoft share the belief in precise linguistic knowledge as a key to successful language processing. Lingsoft has now set up a group to explore the possibilities offered by MOLTO and GF. The focus is on machine-assisted translation for specific domains.

    Highlights

    YouTube videos of MOLTO related talks, http://www.youtube.com/moltoproject

    Toward multilingual mechanized mathematics assistants, Saludes, Jordi, and Xambó Sebastian, EACA 2012 (Proceedings), 06/2012, p.163–166, (2012)

    The Patents Retrieval Prototype in the MOLTO project, Chechev, Milen, Gonzàlez Meritxell, Màrquez Lluís, and España-Bonet Cristina, WWW2012 Conference, Lyon, France, (2012)

    Multilingual Online Generation from Semantic Web Ontologies, Dannélls, Dana, Enache Ramona, Damova Mariana, and Chechev Milen, WWW2012, 04/2012, Lyon, France, (2012)

    MOLTO Enlarged EU - Multilingual Online Translation, Caprotti, Olga, and Ranta Aarne, 16th Annual Conference of the European Association for Machine Translation, 05/2012, Trento, Italy, (2012)

    The GF Mathematical Grammar Library, Caprotti, Olga, and Saludes Jordi, Conference on Intelligent Computer Mathematics /OpenMath Workshop, 07/2012, (2012)

    Multilingual Verbalisation of Modular Ontologies using GF and lemon, Davis, Brian, Enache Ramona, van Grondelle Jeroen, and Pretorius Laurette, Third Workshop on Controlled Natural Language (CNL 2012), Volume 7427 LNCS, (2012)

    General Architecture of a Controlled Natural Language Based Multilingual Semantic Wiki, Kaljurand, Kaarel, Third Workshop on Controlled Natural Language (CNL 2012), 09/2012, Volume 7427 LNCS, p.110--120, (2012)

    Probabilistic Robust Parsing with Parallel Multiple Context-Free Grammars, Angelov, Krasimir A., COLING 2012, (Submitted)

    How Much do Grammars Leak?, Angelov, Krasimir A., COLING 2012, (Submitted)

    The GF Eclipse Plugin: An IDE for grammar development in GF, Camilleri, John, and Angelov Krasimir, 16th Annual Conference of the European Association for Machine Translation, 05/2012, Trento, Italy, (2012)

    An IDE for the Grammatical Framework, Camilleri, John, Third International Workshop on Free/Open-Source Rule-Based Machine Translation (FreeRBMT 2012), 06/2012, (2012)

    Deep evaluation of hybrid architectures: Use of different metrics in MERT weight optimization, España-Bonet, Cristina, Labaka Gorka, Díaz De Ilarraza Arantza, Màrquez Lluís, and Sarasola Kepa, Third International Workshop on Free/Open-Source Rule-Based Machine Translation (FreeRBMT 2012), 06/2012, (2012)

    Comparing human perceptions of post-editing effort with post-editing operations, Koponen, Maarit, Proceedings of the Seventh Workshop on Statistical Machine Translation, June, Montréal, Canada, p.181–190, (2012)

    Future activities:

    Deviations from Annex I

    WP11 Multilingual semantic wiki - M9

    Summary of progress

    Three lines of work were followed: developing a multilingual ACE grammar (ACE-in-GF), extending the AceWiki system based on the GF technology (currently referred to as AceWiki-GF) and extending the Attempto reasoner RACE.

    In collaboration with UGOT (John J. Camilleri) a GF-based multilingual grammar for ACE was developed. This grammar has the following properties:

    • it covers the AceWiki subset of ACE, but is easily extendable towards full ACE
    • it is available in 10 languages (in addition to ACE), but is easily extendable to other languages given that they are available in the GF resource grammar library
    • its performance is similar to existing ACE parsers
    • it is directly usable in AceWiki-GF

    This resource is fully presented in Deliverable D11.1.
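    The easy extensibility to new languages comes from writing the concrete syntax as a functor over the language-independent resource grammar API, so that adding a language largely amounts to plugging in its Syntax instance. A minimal sketch of the pattern (not the actual ACE-in-GF modules, which are richer and also plug in per-language lexica):

        abstract AceCore = {
          flags startcat = Sentence ;
          cat Sentence ; Class ;
          fun EveryIsA : Class -> Class -> Sentence ;  -- "every X is a Y"
        }

        incomplete concrete AceCoreI of AceCore = open Syntax in {
          lincat Sentence = S ; Class = CN ;
          lin EveryIsA c d = mkS (mkCl (mkNP every_Det c) (mkNP a_Det d)) ;
        }

        -- one line per additional language, given its RGL instance
        concrete AceCoreGer of AceCore = AceCoreI with (Syntax = SyntaxGer) ;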

    AceWiki-GF was further developed by adding preliminary support for multiple grammars, multiple articles, ambiguity management, and grammar editing. A large number of AceWiki-GF demo wikis have been made publicly readable/editable on the Attempto website. Most of these wikis are based on grammars developed in MOLTO and in previous GF-related projects. Some of the simpler grammars can also be edited. The current work on AceWiki-GF and its underlying ideas was published at CNL 2012 and presented as both a talk and a demo.

    The Attempto reasoner RACE is currently being extended to handle arithmetic, linear equations and text problems. This work – not part of the actual MOLTO tasks – aims at providing AceWiki-GF with an alternative reasoning capability that covers the complete first-order subset of ACE. The current, still incomplete, version of RACE was demonstrated at CNL 2012.

    Organization of meetings and conferences:

    • 7-9 March: the 4th MOLTO project meeting was organized by Norbert E. Fuchs at the University of Zurich
    • 29-31 August: the 3rd Workshop on Controlled Natural Language (CNL 2012) was organized by Tobias Kuhn and Norbert E. Fuchs at the University of Zurich

    Highlights

    • Multilingual GF-based ACE grammar (ACE-in-GF)
    • A demonstration version of AceWiki-GF is publicly available
    • RACE extensions
    • CNL 2012

    Links to online resources

    Deviations from Annex I

    A small deviation is expected for the deliverables of this workpackage, to allow time for the contributions of Tobias Kuhn, a leading developer of AceWiki who is currently on a research visit abroad. The revised schedule is as follows:

    • D11.2 planned for M33 (end November 2012) is postponed to M34 (end December 2012)
    • D11.3 planned for M37 (end March 2013) is postponed to M38 (end April 2013).

    WP12 Interactive knowledge-based systems - M9

    Summary of progress

    During this period, two major topics from the Adoption phase described in the DoW were addressed.

    A GF bootcamp was held at Be Informed on June 4-6, in cooperation with UGOT. During this bootcamp, the Be Informed team first received an in-depth introduction to grammar building using the Grammatical Framework. Building on that knowledge, several workshop sessions were held to discuss the theory and practice of (semi-)automatically converting the Be Informed business modeling "language" (Be Informed meta models) and "speech" (Be Informed models) to Grammatical Framework constructs. Furthermore, discussions were held on a number of technical issues concerning the integration of Grammatical Framework technology into the Be Informed Business Process Platform.

    Also in this period, we tried to capture requirements from a large number of perspectives. Some requirements apply to the verbalization component to be developed in WP12, but many also apply to the functionality that can be based on this component. Requirements were derived from business usage scenarios:

    • Review, Validation and Feedback of Models
    • Text based Editing of Models
    • Self documenting Models
    • Textual User Interfaces for Model Driven Applications
    • Communicating Model Based Decisions

    Requirements were also gathered on the non-functional side of a commercial implementation of GF in the Be Informed Business Platform, such as portability, modularity and graceful performance degradation.

    A full overview of these requirements is presented in Deliverable D12.1.

    Deviations from Annex I

    none

    2.4. Deliverables and milestones tables

    Number Title Attached files
    D 1.5 Periodic Management Report T24 D1.5.pdf
    D 2.3 Grammar Tools and Best Practices MOLTO_D2.3.pdf
    D 3.2 The Translation Tools Prototype D3.2.pdf
    D 5.2 Description and evaluation of the combination prototypes D.5.2.pdf
    D 6.2 Prototype of commanding CAS D6.2.pdf
    D 7.2 Patent MT and Retrieval Prototype D72.pdf
    D 8.2 Multilingual grammar for museum object descriptions d8.2-grammars.tar.gz, D8.2.pdf
    D 9.1A Appendix to MOLTO test criteria, methods and schedule D9.1A_2012-Apr-5.pdf
    D 11.1 ACE Grammar Library d11_1.pdf
    D 12.1 Requirements for GF based Verbalization in Be Informed D12.1.pdf
    D X.2 Annual Public Report M24 DX.2.pdf

    The only milestone due in this period is that of WP5 and WP3, Translation tool complete, which has been met by its due date, 1 September 2012. The next milestone MS9, Case studies complete, involves the work-packages on mathematics, patents retrieval, and cultural heritage. We are delaying the work on cultural heritage and therefore we will have to shift part of this milestone too.

    3. Project management during M25-M30

    Project management during the period consisted mainly of maintaining routine communication with the partners through a monthly Skype call, and of distributing the second installment of the funding.

    UGOT received the 2nd interim payment from the EU and distributed it to the partners on 15 August 2012. Each partner also received the financial assessment from the EU and an overview of the payment that was sent. The Consortium has now received 85% of the MOLTO total budget, which is the maximum amount possible before the approval of the final reporting.

    Followup actions after the annual review included discussions within the Consortium on how to organize a better showcase for the final results of the project, and work on addressing the reviewers' remarks and suggestions (see Task 1.8). Updated versions of some deliverables were produced and made available on the website.

    In terms of infrastructure, the svn repository is now being used by a larger number of Consortium members, and a new bug-tracking system has been installed and is running at UHEL.

    4. Use of the resources

    Tables on the usage of resources are not available for midterm reporting; however, we have a rough initial estimate of person-months from almost all nodes. Ontotext has not been able to provide the data.

    D1.7 Final Management Report T39


    Contract No.: FP7-ICT-247914 and FP7-ICT-288317
    Project full title: MOLTO - EEU - Multilingual Online Translation
    Deliverable: D1.7 Final Management Report
    Security (distribution level): Confidential
    Contractual date of delivery: M39
    Actual date of delivery: Version 2: 25 October 2013. Version 1: 25 July 2013
    Type: Report
    Status & version: Version 2
    Author(s): O. Caprotti, A. Ranta et al.
    Task responsible: UGOT
    Other contributors: All


    Abstract

    Progress report for Period 3 of the MOLTO project lifetime, 1 Mar 2012 - 31 May 2013.

    1. Publishable summary

    The project MOLTO - Multilingual Online Translation - started on 1 March 2010 and ran until 31 May 2013. Its goal was to develop tools for translating texts between multiple languages in real time with high quality. MOLTO's grounding technology is multilingual grammars based on semantic interlinguas and grammar-based translation. It also explores ways to use statistical machine translation without sacrificing quality.

    MOLTO uses specific interlinguas that are based on domain semantics and are equipped with reversible generation functions. Thus translation is obtained as a composition of parsing the source language and generating the target language. An implementation of this technology is provided by GF, Grammatical Framework, which in MOLTO is furthermore complemented by the use of ontologies, as in the semantic web, and by methods of statistical machine translation (SMT) for improving robustness and extracting grammars from data. GF has been applied in several small-to-medium size domains, typically targeting several parallel languages. During its lifetime, MOLTO has scaled up this technology in terms of productivity, domain size, and the number of languages.
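    As a minimal illustration of this idea, consider a toy GF grammar (an invented sketch, not one of the actual MOLTO grammars) with a shared abstract syntax and English and Swedish concrete syntaxes:

        -- abstract syntax: the domain semantics, shared by all languages
        abstract Foods = {
          flags startcat = Comment ;
          cat Comment ; Item ; Quality ;
          fun
            Pred : Item -> Quality -> Comment ;
            Wine : Item ;
            Good : Quality ;
        }

        -- English concrete syntax (a real grammar would use the resource
        -- grammar library to handle agreement; plain strings suffice here)
        concrete FoodsEng of Foods = {
          lincat Comment, Item, Quality = Str ;
          lin
            Pred item quality = "this" ++ item ++ "is" ++ quality ;
            Wine = "wine" ;
            Good = "good" ;
        }

        -- Swedish concrete syntax
        concrete FoodsSwe of Foods = {
          lincat Comment, Item, Quality = Str ;
          lin
            Pred item quality = "detta" ++ item ++ "är" ++ quality ;
            Wine = "vin" ;
            Good = "gott" ;
        }

    Translation is then a pipe of parsing and linearization in the GF shell:

        Foods> parse -lang=Eng "this wine is good" | linearize -lang=Swe
        detta vin är gott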

    The size of domains has been increased to involve up to thousands of concepts, and the number of languages to twenty parallel ones. A special focus has been to make the technology accessible to domain experts without GF expertise and to minimize the effort needed for building a translator. Ideally, the MOLTO tools will reduce the overall task to just extending a lexicon and writing a set of example sentences.

    MOLTO was initially committed to dealing with 15 languages, which included 12 official languages of the European Union - Bulgarian, Danish, Dutch, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, and Swedish - and 3 other languages - Catalan, Norwegian, and Russian. The additional languages also addressed in MOLTO are Chinese, Hebrew, Hindi, Latvian, Persian, and Urdu.

    While tools like Systran (Babelfish) and Google Translate are designed for consumers of information, MOLTO's main target is the producers of information. Hence, the quality of the MOLTO translations must be good enough for, say, an e-commerce site to use in translating their web pages automatically without the fear that the message will change. Third-party translation tools, possibly integrated in the browsers, let potential customers discover, in their preferred language, whether, for instance, an e-commerce page written in French offers something of interest. Customers understand that these translations are approximate and will filter out imprecision. If, for instance, the system has translated a price of 100 Euros to 100 Swedish Crowns (which equals 11 Euros), they will not insist on buying the product for that price. But if a company had placed such a translation on its website, then it might be committed to it. There is a well-known trade-off in machine translation: one cannot at the same time reach full coverage and full precision. In this trade-off, Systran and Google have opted for coverage, whereas MOLTO opts for precision in domains with a well-understood language.

    MOLTO technology is continuously released as open-source software and linguistic modules, accompanied by cloud services, to be used for developing plug-and-play components for translation platforms and web pages, and thereby designed to fit into third-party workflows. The project showcases its results in web-based flagship demos applied in three case studies: mathematical exercises in 15 languages, patent translations and queries in 3 languages, and museum object descriptions and queries in 15 languages. The MOLTO Enlarged EU scenarios add to this an application of the MOLTO tools to a collaborative semantic wiki and to an interactive knowledge-based system used in a business enterprise environment.

    2. Core of the report

    This section describes the progress of each workpackage and discusses changes to the workplan, if necessary.

    2.1 Project objectives for the period

    The last period of the MOLTO project and of its enlargement MOLTO-EEU has been a very intensive period of work for the Consortium. The major deliverables have been delayed to this period and had to be completed. They included:

    • GF in the cloud, a sample collection of web services for GF and MOLTO grammars
    • a revision of the GF-OWL interoperability
    • a unified grammar for handling natural language queries, customized for the case studies
    • a third party translators' platform in which the MOLTO tools were integrated
    • the overall evaluation of the MOLTO work packages

    To demonstrate the usage of the MOLTO tools and technologies, the partners worked towards joint prototypes for the various case studies listed in the workplan. The coordination work involved agreement on platforms, on formats, and on the overall architecture of each demonstrator.

    The final case studies include:

    • a proof-of-concept dialog system and reasoner for word problems (WP6)
    • patent translation by the robust hybrid approach and multilingual query interface (WP3, WP4, WP5, WP7)
    • museum artifacts multilingual query and descriptions (WP4, WP8)
    • multilingual semantic wiki AceWiki (WP2, WP11)
    • multilingual business modelling by GF (WP2, WP3, WP12)

    2.2 Work progress and achievements during M25-M39

    This section provides a concise overview of the progress of the work in line with the structure of Annex I to the Grant Agreement.

    For each work-package, except project management, which is reported in Section 2.3, the work-package leader provides the following information:

    • A summary of progress towards objectives and details for each task;
    • Highlights of clearly significant results;
    • A statement on the use of resources for Period 3, in particular highlighting and explaining deviations between actual and planned person-months per work package and per beneficiary in Annex I (Description of Work).

    Moreover, if applicable:

    • The reasons for deviations from Annex I and their impact on other tasks as well as on available resources and planning;
    • The reasons for failing to achieve critical objectives and/or not being on schedule with remarks on the impact on other tasks as well as on available resources and planning;
    • Corrective actions.

    WP3 Translator's tools - M39

    Summary of progress

    We have developed the translator's tools further: lexicon extraction as an essential core technology, and the integration of GF translation into the Pootle platform as a concrete example. Preparing for the final review, UHEL has prepared a presentation of its lexicon extraction flagship. UGOT has contributed to the lexicon work with Shafqat Virk's PhD research on resources for Indo-Aryan languages. We have also collaborated with Ontotext on D4.3A to include a contrast and comparison of TermFactory and KRI, as suggested by the reviewers.

    Highlights

    • Deliverable 3.3 which describes the GF integration to Pootle
    • Lexicon conversion to GF format added to TermFactory
    • ontoR (ontology based term alignment) tool providing lexicon output to TermFactory format

    Deviations from Annex I

    A part of the work on the web-based translation tool originally scheduled for UHEL was carried out by UGOT. This means a shift of workload from UHEL to UGOT of 3 person months.

    WP4 Knowledge engineering - M39

    Summary of progress

    In the final period of MOLTO, we have finalized the model for SPARQL generation and RDF facts verbalization. The D4.3A annex deliverable was published as a follow-up to the reviewers' recommendations and to summarize the progress of work in the field of grammar-ontology interoperability in the descendants of the KRI prototype.

    Highlights

    • We deliver D4.3A annex deliverable to D4.3, that addresses reviewers' remarks and recommendations from M24 and also describes the subsequent work we have done after M24
    • Development of the YAQL core module, which was afterwards propagated to WP7, WP8, and WP12. It allowed exploring SPARQL as yet another natural language, and hence the translation from NL to SPARQL can be facilitated directly by GF (see the sketch after this list)
    • Research on verbalization of RDF facts
    • A conceptual model for automatic verbalization of RDF facts through GF means, and its implementation for the WP4 prototype (single facts) and the WP8 prototype (painting descriptions)
    • Addition of answer generation grammars for the molto-kri prototype (http://molto.ontotext.com/); Swedish and Finnish grammars were added to demonstrate the multilingual power of GF.
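
    The following miniature sketches the device (the names are invented; the actual YAQL modules are described in D4.3A and D7.3): SPARQL is given a concrete syntax of its own, so that a question parsed with the English concrete syntax can be linearized directly into a query:

        -- invented miniature in the spirit of YAQL, not the actual grammars
        abstract Query = {
          flags startcat = Question ;
          cat Question ; Class ;
          fun
            ListAll : Class -> Question ;
            Painting : Class ;
        }

        concrete QueryEng of Query = {
          lincat Question, Class = Str ;
          lin
            ListAll c = "show all" ++ c ;
            Painting = "paintings" ;
        }

        -- SPARQL treated as "yet another language"
        concrete QuerySPARQL of Query = {
          lincat Question, Class = Str ;
          lin
            ListAll c = "SELECT ?x WHERE { ?x rdf:type" ++ c ++ "}" ;
            Painting = "ont:Painting" ;
        }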

    Deviations from Annex I

    WP5 Statistical and robust translation - M39

    Summary of progress

    The main goal of this WP has been to develop a hybrid system between GF and SMT specialised in patent translation; this has implied the construction of new in-domain resources and the design of techniques to integrate both technologies. The WP also has tasks devoted to widening GF, focused on building general-purpose lexicons. Besides, a more robust GF has been achieved through the use of a robust statistical parser, which allows translating free text or, at least, the parts covered by the grammars, without being affected by unknown elements.

    Regarding the development of different types of general lexicons, GF's core idea of a common abstract syntax with multiple concrete syntaxes has been used to produce multilingual morphological lexicons. The abstract syntax is based on data from the Princeton WordNet and the Oxford Advanced Learner's Dictionary. The concrete syntaxes are produced using data from existing lexical resources (e.g., bilingual dictionaries and the Universal WordNet) and GF's morphological smart paradigms. Because words can have multiple senses, and it is often very hard to find one-to-one word mappings between languages, two different types of multilingual lexicons have been developed: uni-sense and multi-sense. In a uni-sense lexicon, each source word is restricted to represent one particular sense of the word, and hence it becomes easier to map it to one particular word in the target language. This type of lexicon is useful for building domain-specific NLP applications. A multi-sense lexicon, on the other hand, is more comprehensive: it contains multiple senses of words and their translations to other languages. This type of lexicon can be used for open-domain tasks such as arbitrary text translation. These lexicons cover a number of languages, including English, German, Finnish, Bulgarian, Hindi, and Urdu, and their sizes range from 10 to 50 thousand lemmas.
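
    A minimal sketch of the mechanism (the entries are invented; the real lexicons are extracted from the resources named above): the abstract syntax fixes one function per word sense, and each concrete syntax supplies inflection via the smart paradigms of the resource grammar library:

        -- abstract syntax: one function per sense, shared across languages
        abstract Lex = {
          cat N ;
          fun
            bank_1_N : N ;   -- financial institution
            bank_2_N : N ;   -- edge of a river
        }

        -- English: the two senses happen to share one word
        concrete LexEng of Lex = open ParadigmsEng in {
          lincat N = N ;
          lin
            bank_1_N = mkN "bank" ;   -- the smart paradigm infers "banks"
            bank_2_N = mkN "bank" ;
        }

        -- German: the senses diverge; the one-argument smart paradigm
        -- guesses gender and plural, which can be overridden when wrong
        concrete LexGer of Lex = open ParadigmsGer in {
          lincat N = N ;
          lin
            bank_1_N = mkN "Bank" ;
            bank_2_N = mkN "Ufer" ;
        }

    A uni-sense lexicon keeps only one sense per source word, whereas a multi-sense lexicon keeps all of them and leaves disambiguation to the application.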

    In WP5 we also experimented with open-domain robust translation based solely on GF. This is a huge step, since the traditional application domain of GF is controlled languages, where the domain is small and well defined, while in the task of translating running text the source language is no longer clearly delimited. As a simple numerical measure of the leap: typical GF applications deal with grammars containing hundreds of lemmas, while in this experiment our grammars contain more than 50,000 lemmas. We developed an entirely new runtime system for GF in C, which has the advantage of being more portable and more efficient. Efficiency was the first requirement we had to satisfy, since otherwise interpreting these huge grammars would be intractable. Furthermore, we turned the original non-probabilistic algorithms for parsing and reasoning into probabilistic ones. The introduction of probabilistic models is crucial for the disambiguation of the grammars, which are by necessity highly ambiguous. The third major contribution to the project is that we also made the GF parser robust, i.e. when faced with sentences which are not parseable, it returns a sequence of recognized chunks rather than an error. We evaluated our implementation against state-of-the-art statistical parsers for related grammatical formalisms, and we found that for sentences longer than 25 tokens our implementation is at least two orders of magnitude faster. We also tried to use our new architecture in machine translation, but here the results are not conclusive yet. We found that the two main limitations are the quality of the translation dictionaries which we built and the still limited coverage of the grammars. Furthermore, we need to better address word sense disambiguation and the proper translation of multiword expressions.

    The translation of patents using this robust parsing is still in an embryonic state, but we have developed a complete translation system that combines GF and SMT to overcome the controlled-language assumption on the input. This hybrid system implies the construction of in-domain dictionaries and grammars that make use of probabilistic components, and the integration with an SMT engine that is able to complement GF translations. Regarding these resources for patent translation, we emphasise the generation of static lexicons obtained from SMT translation tables, and the on-line generation of lexicons for vocabulary unseen in training but available in the monolingual dictionaries. For German, a dictionary of compounds has also been built. A grammar for dealing with patents in English, French, and German has been built on top of the resource grammar, with several additions devoted to dealing with chunks instead of sentences. Particular constructions appearing in patents are also covered by this new in-domain grammar. As demanded by the selected domain, we have also developed a detector and tokeniser of chemical compounds. The full translation system uses this tokeniser to prepare the patent for translation. This involves chunking and parsing the source sentences, which are first translated by GF and afterwards sent to an SMT decoder which is fed with this information. An SMT engine trained on the domain is also used by the top decoder. The final hybrid system is available for download and has several options controlling which lexicon-building method is used and which kind of integration is applied.

    Highlights

    The novelties since the last report correspond, on the one hand, to the improvement of the previous hybrid MT systems, their porting to German, and the development of new hybrid systems. On the other hand, we highlight the generation of lexical resources from WordNet, Apertium dictionaries, and SMT translation tables, and the development of a statistical robust parser which turns out to be two orders of magnitude faster than comparable state-of-the-art probabilistic parsers. The latter results extend the coverage of GF and are useful for general translation in any domain. The former, on the contrary, start from translation in a concrete domain and try to extend the coverage beyond that of the grammar.

    Deviations from Annex I

    Although some specific tasks have evolved through the life of the project, the three main lines have been accomplished:

    i) GF grammar for the patents domain

    ii) SMT system for patents

    iii) Combination GF-SMT translators

    By the evolution of the tasks through the project we mean, for example, that more time than estimated has been devoted to improving the GF patents grammar and to working on the soft-integration hybrid system that depends on it. The hybrid system that depends on GF tree fragment pairs is in a less mature state. The dependence on the performance of the robust parser proved to be crucial, and most efforts have been devoted in this direction.

    WP6 Case study: mathematics - M39

    Summary of progress

    In the first part of the project we developed a GF Mathematical Grammar Library (mgl) based on several OpenMath content dictionaries. This encompasses the OpenMath layer of the mgl. For the next part we developed, on top of it, the module Commands, which allows the use of human language to command a Computer Algebra System (CAS) to compute the objects described in the OpenMath layer, with the answers delivered in natural language too.
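
    The following invented fragment illustrates the layering (the actual Commands module is larger and builds on the OpenMath categories of the mgl):

        -- invented fragment in the style of the Commands module
        abstract Commands = {
          flags startcat = Command ;
          cat Command ; Object ;   -- Object stands for the OpenMath layer
          fun
            Compute  : Object -> Command ;
            Simplify : Object -> Command ;
        }
        -- concrete syntaxes: one per natural language, plus one for the
        -- CAS input language, so that a command parsed from English can
        -- be linearized into a CAS call and the answer rendered back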

    For the final part we undertook the creation of a prototype for assisting students in modeling and solving word problems. The statements of these problems relate to notions of ordinary life, and the goal of proposing them to students is for them to learn how to describe the relevant relations in the statement mathematically as equations (modeling) and then how to solve these to get the numeric solutions interpreted in terms of the original statement (solving).

    For the kind of reasoning needed here, the description logic used by WP4 (OWL reasoners) was found wanting in its arithmetical capabilities. We needed a dialog system more than a query/answer system. This led to creating a new reasoner based on Prolog, along the lines of WP11, able to cope with basic arithmetic settings. That means being able to automatically decide whether a problem statement is free from contradictions and whether it contains enough information to deduce the solution. On the other hand, since we want the system to guide the student to the proper equations, we need to account for the state of the modeling process, storing new facts discovered by the student and automatically providing next-step hints. All this took much more time than originally planned and forced us to concentrate on the novel challenge (modeling) and set aside the solving part.

    Highlights

    • We developed a tool that runs in a Scala shell for constructing simple word problems, sentence by sentence, using one of the four supported languages: Catalan, English, Spanish, and Swedish. It checks that the sentences written so far are consistent and complete enough to make a problem, and saves the problem as Prolog code with comments in GF.

    • We developed an assistant that runs in a text terminal and engages the student in a dialog in one of the aforementioned languages. This dialog starts with the statement of the problem and then proceeds by providing hints on what to do next or answering questions about the information that has been discovered. The process ends when the student provides an equation that captures the relevant information to solve the problem. Then the system delivers the solution in natural language.

    Deviations from Annex I

    We could not use the grammars of WP4 as stated, since the reasoning and language are different. In the Query Technologies workpackage, questions are about objects in classes having properties, while in our case the questions are about cardinals of sets of objects. On the other hand, we departed from the query/answer form and went into a dialog. All this required new grammars.

    Time constraints, as mentioned above, forced us to leave the integration of the solving step out of the prototype. Apart from this, a vital component that mediated the communication between the GF side and the CAS side (the Sage simple server) was deemed obsolete by the Sage community, so it was not advisable to pursue it further until a clear standard for communicating with Sage arises. At the moment of writing this document, one candidate (sagecell) seems to dominate, but it is still not distributed among the standard packages of Sage (and fails to install on some platforms for the latest version of Sage (5.9)).

    WP7 Case study: patents - M39

    Summary of progress

    The aim of "WP7:Patents Case Study" was to create a prototype for automatic translation and multilingual retrieval of patents. The online prototype is publicly available at: http://molto-patents.ontotext.com/.

    This patents case study has set the grounds for putting together several technologies into a useful platform for multilingual patent retrieval. The main challenges addressed in the prototype are a) to translate semantically enriched patent documents, including the original mark-up, b) to design the mechanisms that enable the multilingual indexing and retrieval of the patents, c) to define and develop a query language and the query grammar that enable a user-friendly interaction with the system, and d) to set up an on-line application for the retrieval of patent documents that serves as a testbed for our work.

    The patents prototype combines semantic annotations, retrieval techniques, and two different approaches to machine translation. The integration of different translation methodologies into the system has been crucial to increase its capabilities and to make possible extended features and functionalities with respect to the preliminary version of the system.

    For the massive translation of text, a statistical system has been trained and adapted to translate the text and transfer the semantic annotations into the target languages. One of the challenges in this task was to come up with a mechanism to transfer the semantics of the source texts to the target files. As a result, the patent documents are semantically enriched and translated using the statistical system. Then, the multilingual documents are used to feed the databases and indexes of the retrieval system. What remains as a future challenge is the use of these annotations to further increase either the accuracy of the annotations or the quality of the translations.

    On the other hand, a rule-based system is built in order to translate from (controlled) natural language to the semantic query language (SPARQL) in the interface. GF has proved an efficient way of generating SPARQL queries, treating SPARQL as if it were "Yet Another Query Language". In other words, it allows translating a natural language query from the user's language to SPARQL, which makes the system accessible to a broader community rather than just skilled users. This automation also facilitates the interoperability between the query grammar and the ontologies, and speeds up the development and maintenance of the querying subsystem.
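
    For illustration, the round trip looks as follows in the GF shell (the grammar, language, and ontology names are invented; the actual query grammar is described in D7.3):

        Query> parse -lang=Eng "which patents mention penicillin" | linearize -lang=SPARQL
        SELECT ?p WHERE { ?p rdf:type pat:Patent . ?p pat:mentions ?c . ?c rdfs:label "penicillin" }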

    Finally, the patent prototype is not directly comparable with the interfaces exposed by the European Patent Office, mainly because they were conceived for different purposes. Nonetheless, the MOLTO patents prototype demonstrates that a patents retrieval system that addresses multilingualism by means of automatic translation techniques is commercially viable.

    Highlights

    The preliminary version of the prototype, described in Deliverable 7.1, had only original patent documents in the databases, and the system was only available in English and French.

    A complete version of the prototype, described in Deliverable 7.2, also included resources for German, as well as patent documents translated using the Statistical Machine Translation (SMT) system trained on the domain and described in Deliverable 5.2.

    The novelties introduced with respect to previous versions of the prototype are:

    1. A new process for statistical translation of patents that allows transferring the semantic annotations and the original mark-up in the source documents to the target language.

    2. The development of the patent translator API in order to integrate the translation system into remote applications, such as online patent translation in the GF cloud.

    3. Updates to the retrieval architecture in order to improve the response time, such as snippeting.

    4. A new querying approach for SPARQL generation based on the automation of grammar-ontology interoperability, driven by the Grammatical Framework.

    5. A new query grammar for the biomedical patents domain, improved in terms of coverage and compliance with the patent domain ontology behind the information retrieval system.

    6. New functionalities integrated into the user interface in order to improve the usability of the application, such as free-text search as a back-off mechanism for the query language.

    7. Further updates to the on-line user interface that address usability aspects and additional functionalities.

    Deviations from Annex I

    The main objectives of the work package have been fulfilled:

    i) to create a commercially viable prototype of a system for MT and retrieval of patents in the bio-medical and pharmaceutical domains,

    ii) to allow translation of patent abstracts and claims in at least 3 languages,

    iii) to expose several cross-language retrieval paradigms on top of them.

    This workpackage started with six months of delay because the WP leader, Matrixware, left the MOLTO Consortium during Month 3. After the re-scheduling, the tasks related to this workpackage were kept up to date according to the calendar. The final version of the prototype was agreed to be delayed till M36 due to multiple dependencies on other workpackages.
    The new calendar allowed us to incorporate the latest developments (grammar-ontology interoperability from WP4 and hybrid translation from WP5) in the final demoed applications.

    WP8 Case study: cultural heritage - M39

    Summary of progress

    The multilingual Semantic Web system covers semantic data from the Gothenburg City Museum database and DBpedia. The grammar enables automatic coherent descriptions of paintings and the answering of queries over them, in 15 languages for baseline functionality and in 5 languages with extended semantic coverage. The system contains an automatic process for translating museum names from Wikipedia; the process can easily be extended to translate names of painters, places, etc.
    The system provides a public SPARQL endpoint against which the user can explore the knowledge base with manually written natural language queries.

    Highlights

    • We created the Museum Reason-able View where several ontologies were linked, including: the CIDOC-CRM, the Painting ontology and the Museum Artifacts Ontology (MAO).

    • We built an ontology-based system for communication of museum content on the Semantic Web and made it accessible in 15 languages. The multilingual system automatically generates coherent Wikipedia-like articles. It has been made available online for cross-language retrieval and representation using Semantic Web technology.

    • We were able to reuse the query technology that has been developed in WP4 and adapt it successfully to our needs.

    • We extended the semantic coverage of the grammar to five languages and demonstrated the benefits of exploiting a modular approach in the context of multilingual Semantic Web.

    Deviations from Annex I

    • The time-line of the work has been shifted from M30 to M39

    WP9 User requirements and evaluation - M39

    Summary of progress

    Due to the progress of other work packages, the actual evaluation work started in Spring 2013. Some of the evaluations were made within work packages; for instance, the patent cases (WP7) were evaluated with automatic evaluation metrics, and the semantic multilingual wiki (WP11) was evaluated internally for usability. WP9's contribution to the project is translation quality evaluation with native or near-native speakers.

    In the evaluations, human evaluators were presented with translations by MOLTO tools and references by other MT systems (Google, Bing, Systran), and they chose the most adequate, either for post-editing or to accept as such. From these results we calculated error rates and, in addition, the percentage of cases in which the evaluators preferred MOLTO translations over the other systems. The results vary between languages and use cases, but in general both the automatic evaluation metrics and the evaluators' preferences suggest that the MOLTO method fares better in the chosen domains.

    During the evaluations, some errors were detected and the grammars in question were sent to be corrected. The time and effort needed to fix the languages that get the poorest results is another factor which is favorable to MOLTO tools: a systematic fix in the grammars corrects all instances of an erroneous construction.

    Some methodological issues about the qualitative evaluation were raised during the project, especially concerning the evaluation of the Phrasebook. MOLTO's goal has been publishable quality automatically, but the evaluation results have been less than perfect. This does not mean that the results are incorrect, but simply that there are many ways to say the same thing, and an evaluation method that computes an edit distance to a reference does not capture the whole picture. This discrepancy between human perceptions of quality and post-editing operations is discussed in the project deliverable, and has been the topic of two contributions by Maarit Koponen: a paper in the M31-M39 period at the AMTA 2012 Workshop on Post-editing Technology and Practice, and a presentation at the XI Symposium on Translation and Interpreting: Technology and Translation in Turku, Finland.

    Highlights

    • Deliverable D9.2 published
    • Development of a methodology for evaluating limited-domain, publishing-quality systems

    Deviations from Annex I

    N/A

    WP10 Dissemination and exploitation - M39

    Summary of progress

    The major work has been to produce the final deliverables for this work-package: a report on dissemination and exploitation, and the final version of the MOLTO web services. In order to produce these, we have tweaked the website and added a number of ways to generate and view the publication activity of the Consortium. Part of the work has also included the delivery of an archival version of the software prototypes as bibliographical items, with descriptive metadata, on the project's publication list and on a dedicated page: http://www.molto-project.eu/view/biblio/type/Software.
    We have checked the Open Access policy of the partners and requested publication in OAI-PMH-compatible repositories. The listing of such archives is documented in Deliverable D10.4.

    The presence and dissemination of MOLTO via social sites has been constant throughout the lifetime of the project, and in the last period we have started to plan how to sustain the MOLTO Community after the project's end. We have been testing various platforms, most recently a Google+ Community, where we also streamed the talks from the final Open Day and archived them on YouTube.

    The final demonstrations are reachable from the website, and they are accompanied by videos in order to supply documentation also in the far future, when the technologies will be obsolete and no longer available.

    Proper documentation and archiving of all these resources is underway. The resources produced by the project are very different in nature and present a challenge in terms of sustainability and future accessibility. They include software (often depending on third-party libraries), technical reports and publications in digital and/or printed form, and multimedia material. We intend to store all of these on archival media; however, it is not clear how persistent they will remain.

    Highlights

    • Deliverable D10.3 MOLTO web service, final version documents the software that allows deploying web services for any GF application grammar, and describes some sample grammars made available online at http://www.molto-project.eu/cloud/gf-application-grammars
    • D10.4 MOLTO Dissemination and Exploitation Report contains an analysis of the possible venues of impact of MOLTO from an exploitation point of view. The industrial partners came together to discuss how to adopt the tools and technologies developed during the project, and how to ensure a sustainable future for them that would benefit all partners in the Consortium.
    • Several new demonstration web sites have been linked from the project's web page: patent translation, query for museum artifacts, query for patents, multilingual semantic wiki, cloud services, translation services and the term factory.
    • Some deliverables are published in the MOBI format. The production of an e-book-reader-compatible version has been discussed, and the final result was checked for readability within the Consortium.

    Deviations from Annex I

    WP11 Multilingual semantic wiki - M18

    Summary of progress

    We continued working on our two main projects: (1) developing ACE-in-GF (multilingual grammar of ACE) and (2) developing AceWiki-GF (multilingual CNL-based semantic wiki).

    ACE-in-GF was extended to almost all the languages supported in the GF resource grammar library (~20 languages), although only the languages reported in D11.1 are fully implemented and tested.

    The main work on AceWiki-GF was completed and reported in D11.2 and an ESWC 2013 conference paper. Smaller extensions and improvements continue.

    In the last 5 months of the project we focused on the evaluation of both ACE-in-GF and AceWiki-GF. The design and results of both of these evaluations are reported in D11.3.

    Events

    Links to online resources

    Highlights

    • Multilingual GF-based ACE grammar (ACE-in-GF)
    • AceWiki-GF
    • Evaluations of ACE-in-GF and of AceWiki-GF (translation accuracy, user acceptance, both with good results)
    • RACE extensions (arithmetic, improved wh-queries)
    • paper on ACE-in-GF and AceWiki-GF accepted at ESWC 2013 (25% acceptance rate)

    Use of resources for Period 3

    Node Professor/Manager PostDoc PhD Student Research Engineer/Intern
    UZH 3 (N. E. Fuchs) 11.5 (K. Kaljurand, T. Kuhn) 0 2.5 (L. Canedo, V. Ungureanu)

    WP12 Interactive knowledge-based systems - M18

    Summary of progress

    This period we continued the work on the further adoption of GF in the Business Process Platform of Be Informed. One of our goals in leveraging the adoption of GF was to create a framework in which models can automatically be verbalized. Domain experts usually do not have a background in modeling, and thus checking whether a rule or law is modeled correctly usually proves difficult for them. Be Informed wants to take away these barriers by creating verbalizations of its models. These verbalizations, however, should not be only a textual representation of the models; the possibility should exist to create verbalizations of the same models for a set of distinguished tasks.

    In order to do this, Be Informed created the 3D framework together with the University of Bielefeld. An article on this work will be published as Jeroen van Grondelle and Christina Unger, "A 3-Dimensional Paradigm for Conceptually-scoped Language Technology", in Towards the Multilingual Semantic Web, Paul Buitelaar and Philipp Cimiano, eds., Springer, Heidelberg, Germany, 2013. This orthogonal modularization supports specification of the conceptualization and lexical information per dimension, i.e. specifying domains independently from tasks and vice versa. The dimensions can then be freely combined by choosing the particular domains, tasks, and languages supported for a specific application.

    While the task grammars are written once by hand, each of the domain grammars is created automatically from a Be Informed or OWL ontology and plugged right into the grammars already created for the framework. In order to create these domain grammars automatically, Be Informed created three verbalizers, each with its own heuristics for creating verbalizations.
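
    A sketch of how the generated domain grammars plug into the hand-written task grammars (the module names are invented):

        -- generated automatically from a Be Informed or OWL ontology
        abstract Domain = {
          cat Concept ;
          fun Permit : Concept ; Applicant : Concept ;
        }

        -- hand-written task grammar, extending the domain with
        -- task-specific constructs via GF's module extension (**)
        abstract Explain = Domain ** {
          flags startcat = Explanation ;
          cat Explanation ;
          fun NeededFor : Concept -> Concept -> Explanation ;
        }

    Each concrete syntax then pairs a verbalizer-generated domain lexicon with the task phrasings for one language, so that domains, tasks, and languages can vary independently.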

    In an evaluation, the likelihood of the verbalizations created with grammars from these verbalizers was compared to that of the verbalizations created by the Velocity templates, the verbalizer currently implemented in the Be Informed product suite. The results show that the GF-based verbalizers are better than the Velocity verbalizers.

    Work on the adoption and evaluation has been finished and reported in D12.2 (http://www.molto-project.eu/biblio/deliverable/d122-user-studies-bis-exp...).

    Events

    • September 17th 2012, Utrecht: Research Workshop on the use of business models by multiple audiences. Participants: Be Informed customers from the public sector (SVB, Bureau Forum Standaardisatie, Dutch Prosecution Office, Immigration Department) and members from MOLTO and Monnet.
    • December 11-12 2012: course on GF at Be Informed, Apeldoorn; course given by Kaarel Kaljurand.
    • December 13 and 14, 2012, PortDial members from Bielefeld met with Aarne Ranta (MOLTO), Jeroen van Grondelle, Frank Smit and Jouri Fledderman from Be Informed, and John McCrae (lemon) in order to discuss the mapping from ontology-lexica to grammars, as well as the modular combination of induced domain grammars with dialog task grammars.
    • February 6-8, 2013 and April 4-5, 2013: subsequent Lemon/GF workshops at Bielefeld with Christina Unger and John McCrae from PORTDIAL and MONNET, and Jouri Fledderman and Jeroen van Grondelle from Be Informed.

    Use of resources for Period 3 (M31-M39)

    Node RTD MGNT OTHER
    BI WP1 0 0.4 0
    BI WP9 1.7 0 0
    BI WP10 1.3 0 0
    BI WP12 15.5 0 0

    2.3. Deliverables and milestones tables

    The list of project deliverables can be obtained from the web pages at http://www.molto-project.eu/view/biblio/deliverables and in annotated form at http://www.molto-project.eu/view/biblio/type/Deliverable. The administrative view provides the links to the final PDF but also to the online version of the documents at http://www.molto-project.eu/workplan/deliverables. Below is a summary for the final period.

    ID As planned (admin page) Due date Dissemination level Nature Publication Wiki
    D1.7 Final management report 31 May, 2013 Consortium Report D1.7 Final Management Report T39
    D10.3 MOLTO web service, final version 31 May, 2013 Public Prototype D10.3 MOLTO web service, final version
    D10.4 MOLTO Dissemination and Exploitation Report 31 May, 2013 Public Report D10.4 MOLTO Dissemination and Exploitation Report D10.4 MOLTO Dissemination and Exploitation Report
    D9.2 MOLTO evaluation and assessment report 31 May, 2013 Public Report
    D11.3 User studies for the multilingual semantics wiki 31 May, 2013 Public Report Evaluations of ACE-in-GF and of AceWiki-GF
    D12.2 User studies for BI's explanation engine 31 May, 2013 Public Report D12.2 User studies for BI's explanation engine
    D4.3A Grammar ontology interoperability - Final Work and Overview 31 May, 2013 Public Regular Publication
    D3.3 MOLTO translation tools / workflow manual 31 March, 2013 Public Regular Publication D3.3 MOLTO translation tools – workflow manual D3.3 MOLTO translation tools / workflow manual
    D5.3 WP5 final report: statistical and robust MT 31 March, 2013 Public Regular Publication WP5 final report: statistical and robust MT
    D7.3 Patent Case Study Final Report 31 March, 2013 Public Regular Publication Patent Machine Translation and Retrieval. Final Report.
    D8.3 Translation and retrieval system for museum object descriptions 1 March, 2013 Public Prototype D8.3: Translation and retrieval system for museum object descriptions
    D11.2 Multilingual semantic wiki 31 December, 2012 Public Prototype Multilingual semantic wiki
    D6.3 Assistant for solving word problems 1 December, 2012 Public Prototype D6.3 Assistant for solving word problems D6.3 Assistant for solving word problems
    D1.6 Periodic management report 5 1 October, 2012 Consortium Report Periodic Management Report T30 D1.6 Periodic Management Report T30
    D7.2 Patent MT and Retrieval Prototype 1 September, 2012 Public Prototype Patent MT and Retrieval Prototype
    D12.1 Requirements for BI's explanation engine 1 September, 2012 Public Report Requirements for GF based Verbalization in Be Informed

    3. Project management during M25-M39

    Project management during the last period aimed at strengthening the cooperation of the partners in finalizing the tools and technologies delivered. This coordination was mainly achieved through the creation of a Google group for the MOLTO project, to which all members of the Consortium have been subscribed.

    A major issue has been the relocation of the project manager Olga Caprotti to the partner node UPC for the period 1 March 2013 - 11 June 2013. This was made necessary by national Swedish regulations that prevent hiring on temporary positions for longer than 2 years (she had already used a 1-year long non-renewable temporary post as a visiting researcher). A part of the funding has been transferred to UPC to cover the costs of hiring Dr. Caprotti until the end of the project.

    The use of resources has undergone some internal shifts among work-packages in order to cover extra work that had become necessary due to initial delays and personnel turnover. Where appropriate, these shifts are documented case by case in the work-package reports.

    As reported in Deliverable D1.6, followup actions after the annual review included discussions within the Consortium on how to organize a better showcase for the final results of the project and in addressing the reviewers' remarks and suggestions. Updated versions of some deliverables were produced and made available on the website.

    Below we address each comment of the review.

    3.1 Followup to the reviewers' report

    Recommendation 1.

    Technical coordination should be strengthened. Continuous and strict monitoring should be applied. Reviewers made several recommendations in the 1st review but most of them have not been implemented or it was unclear what was done with respect to them. As it is shown in the remarks per WP, the adoption of most of these recommendations would support monitoring of the work progress towards the project’s objectives.

    The greatest effort undertaken to strengthen the coordination of the partners was to define a number of "flagships" aimed at demonstrating the integration of the MOLTO technologies. These showcase demonstrators have been developed during the final months of the project by tight cooperation of the partners, each flagship adopting and reviewing some tool or technology from a different partner.

    Recommendation 2.

    The recommendation from the 1st review “How grammar rules are extracted (from lexical databases, ontologies, text examples) needs to be specified in detail and a concrete schedule should be included in the updated workplan (D1.1)” has not been included in D1.1. It should be included in D2.3 “Grammar tool manual and best practices”, due in M27. This is a crucial deliverable, since the best practices with respect to the other work packages should be included here.

    Recommendation 3.

    The recommendation from the 1st review “Details on the integration steps (the integration of the vocabulary editor with the translation editor, the integration of the vocabulary editor with TermFactory (TF), and the integration of TF with the Knowledge Representation Infrastructure (KRI) of WP4) need to be provided in the updated workplan (D1.1). Concerning the integration of TF and KRI, it seems that there are overlaps between these tools. The partners must clarify which functions of these tools will be used in the case studies in order to exploit complementarities of the tools and avoid overlaps.” has not been addressed properly and is presented as still “less understood” by the WP leader. This is a major issue of concern. The problems of the integration of WP3 tools remain. These should be discussed in an updated D1.1.

    Follow-up: D4.3A makes a comparison between KRI and TF and suggests steps to be taken to integrate them. The integration requires a mirror of a KRI site whose semantic repository is open for editing with TermFactory. The resulting knowledge base in the mirror site will be grammatically enriched, so that the new information is presented to the user. Moreover, the integration can facilitate lexicon extraction for the GF grammars and the query language of KRI.

    Recommendation 4.

    The translator's tools that should be developed in WP3 should not be given up. Although the WP leader's impression is that the MT quality is too low for the tools to ever be used, the developed tools can be useful for those subdomains/language pairs where MT quality is better.

    Follow-up: The development of the translator's tool has been continued, but on another platform. Deliverables 3.1 and 3.2 use GlobalSight, a translation management system, and an external editor that supports GF. However, we found that GlobalSight was not maintained, and changed to Pootle, a modern and lightweight translation platform with an active user base. D3.3 describes the integration of GF translation into Pootle. A demo video can be found on MOLTO's YouTube channel.

    Recommendation 5.

    The recommendation from the 1st review “Critical issues with respect to the semi-automatic creation of abstract grammars from ontologies, as well as deriving ontologies from grammars, are still to be clarified. Concrete steps to handle these issues need to be specified in detail and a schedule should be included in the updated work plan (D1.1). In addition, as noted with respect to WP3, complementarities between KRI and TF should be exploited avoiding possible overlaps. Terminology should be added and abbreviations explained in Deliverable D4.1 in order to facilitate reading by non-experts in the field” should still be addressed. The issue of the two-way interoperability between ontologies and GF grammars still remains unclear, although as noted in the DoW this represents one of the two most research-intensive parts of MOLTO. This should be solved in the new versions of D4.2 and D4.3. The current versions of deliverables D4.2 “Data Models, Alignment Methodology, Tools and Documentation” and D4.3 “Grammar-Ontology Interoperability” are not approved. D4.2 is too general: for instance, a lot is said about LOD and the museum case and not on the alignment methodology. D4.3, on the other hand, does not give a clear picture of the interoperability issues and the degree of automation that can be expected. What is required for porting this to a new application? Concrete steps should be provided making clear what can be automated and what cannot with the provided infrastructure.

    Follow-up:

    • D4.1 includes a "Glossary" with definitions and explanations of all technical terms. (see the .pdf version of the document, available on the website)
    • D4.3A gives details on the further work in the field of grammar-ontology interoperability and summarizes what was achieved in WP4, WP7, and WP8 in this field (since the prototypes of these work packages are child projects of the KRI prototype). The highlights are the use of GF to generate SPARQL, viewing it as yet another language, and the details of the (semi-)automatic verbalization of RDF facts with the help of GF.
    • D4.2 was updated
    • D4.3 was updated and D4.3A (annex deliverable) was published. D4.3A gives more specific details than D4.3 and explains what was achieved in the WP4 goals after M24 of MOLTO.
    • At the end of D4.3A we list the steps for customizing the KRI prototype for other specific domains. A highlight is that the need for a GF expert and a knowledge engineer cannot be overcome; the two experts should work in collaboration to design the prototype's query language (auto-suggestions are possible), and a few iterations of mutual work might be needed to refine the result.

    Recommendation 6.

    The current description of work in WP6 lacks details on the prototype multilingual dialogue system to be developed. As recommended in the 1st review, an example dialogue and specifications of this prototype should be provided. These can be included in D9.1E.

    Follow-up: An example dialog and a description are available in the D6.3 cover document (http://www.molto-project.eu/sites/default/files/D6.3.pdf), Sections 1, 3, and 4.

    Recommendation 7.

    WP7 work should focus on the major issues examined in MOLTO, especially in relation to the grammar – ontology interoperability automation. Specific scenarios are needed for the exploitation of MOLTO tools in this case study. It is recommended to include such scenarios in deliverable D9.1E.

    Follow-up:

    In response, two use case scenarios were described: UC-71 and UC-72.

    UC-71 focuses on grammar-ontology interoperability. User queries, written in CNL, are used to query the patent retrieval system. We defined a query language and a new query grammar in order to a) decrease the number of ambiguities in the queries and b) increase the coverage of the ontology. As a result, we came up with a more reusable grammar (YAQL), easier to maintain, that facilitates the labour of building query grammars for the application domain and languages. NL queries are translated into SPARQL using this approach. Additional details are given in D7.3 and D4.3A.

    UC-72 focuses on high-quality machine translation of patent documents, where the ultimate goal is to endow the retrieval system with the information required to enable multilinguality. We used an SMT baseline system to translate a big dataset of patents and feed the retrieval databases. The automatic translation included the semantic annotation, available only in the English documents. This mechanism allowed extracting multilingual lexicons for the domain ontology, which were also used to build the query grammars. More details are given in D7.3.

    Finally, the exploitation plans for the technologies developed within this WP, which are further discussed in D10.4, are focused on multilingual text processing and cross-lingual translation of various domain data within search and retrieval techniques.

    Recommendation 8.

    The recommendation from the 1st review “Preparation of a new version of D9.1 is recommended including prototype specifications and scenarios for the three case studies (WP6, WP7, WP8)” should still be addressed. A concrete evaluation methodology is needed focusing on MOLTO's major goals: How will the consortium prove that its objectives were fully/partially met? We expect to see this in D9.1E “Addendum to the MOLTO test criteria, methods and schedule” hoping that the recommendations suggested above as well as in the 1st review, in relation to D9.1, will be included there.

    Follow-up: D9.1A “Appendix to MOLTO test criteria, methods and schedule” addresses these issues.

    Recommendation 9.

    The way the project’s web site is structured, although it contains the necessary content, affects its readability in some cases. It should contain a structure according to the work packages, including all documentation related to a specific work package.

    The content published on the web site can be navigated according to the way the producer has tagged it. If the author has decided to tag a certain item as belonging to a work-package, then this content will be displayed when selecting the proper tag: e.g., http://www.molto-project.eu/category/dow/potential-impact/dissemination or, for publications, http://www.molto-project.eu/biblio/keyword/88, which selects the WP7-related bibliography. However, to the casual reader of the website, the division into work-packages is not very informative, and the results are best viewed independently of the contingent organization of the work-plan. Following this principle, we have created a navigation menu that distinguishes the internal, work-plan related items from the public, more general publications.

    Recommendation 10.

    The deliverables on the work plan (D1.1) and the dissemination plan (D10.1) should be updated at the beginning of the 3rd year.

    We have adopted the methodology of continuously using online publication tools on the internal section of our web pages in order to maintain the work plan, the dissemination plan, and their updates. Partners that are undertaking new activities use the news feed to inform the Consortium. Work package leaders have been given the option to create tasks, and to allocate and manage them. Some of the work planning has been coordinated by the partners using third-party tools such as Trello (trello.com) and Symphonical (https://www.symphonical.com).

    4. Use of the resources

    The figures in the attached table come from the participants' time sheets. They are therefore a preliminary estimate: the final figures will be available in the NEF, Form C, when every participant has finalized their reporting there.

    Self Declaration of the Coordinator in D1.7

    I, as scientific representative of the coordinator of this project and in line with the obligations as stated in Article II.2.3 of the Grant Agreement declare that:

    1. The attached periodic report represents an accurate description of the work carried out in this project for this reporting period;

    2. The project (tick as appropriate):

      x has fully achieved its objectives and technical goals for the period;

      ☐ has achieved most of its objectives and technical goals for the period with relatively minor deviations.

      ☐ has failed to achieve critical objectives and/or is not at all on schedule.

    3. The public website, if applicable:

      x is up to date

      ☐ is not up to date

    4. To my best knowledge, the financial statements which are being submitted as part of this report are in line with the actual work carried out and are consistent with the report on the resources used for the project (section 4) and if applicable with the certificate on financial statement.

    5. All beneficiaries, in particular non-profit public bodies, secondary and higher education establishments, research organisations and SMEs, have declared to have verified their legal status. Any changes have been reported under section 3 (Project Management) in accordance with Article II.3.f of the Grant Agreement.

    Name of scientific representative of the Coordinator:

    ....................................................................

    Aarne Ranta

    Date: 30/7/2013

    D2.1 GF Grammar Compiler API


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D2.1. GF Grammar Compiler API
    Security (distribution level): Public
    Contractual date of delivery: M13
    Actual date of delivery: March 2010
    Type: Prototype
    Status & version: Draft (evolving document)
    Author(s): A. Ranta, T. Hallgren, et al.
    Task responsible: UGOT
    Other contributors:


    Abstract

    The present paper is the cover of deliverable D2.1 as of M13.

    1. Introduction

    GF, Grammatical Framework, is a programming language for multilingual grammars. GF is used in the MOLTO project to build translation systems. How to write GF grammars is specified in the numerous tutorials and manuals available via http://grammaticalframework.org. The present compiler API document explains the following aspects of the GF language compiler:

    • how GF source code is compiled to various formats usable in runtime systems
    • how the developers can test their grammars
    • how other formats (such as lexica and example sentences) can be converted to GF code
    • how to call the compiler in various ways (examples follow this list):
      • GF shell commands and scripts
      • Haskell programs
      • web applications
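
    For instance, assuming an application grammar with concrete syntaxes MyAppEng.gf and MyAppGer.gf (hypothetical file names), batch compilation into a single runtime grammar MyApp.pgf is done from the command line with

      gf -make MyAppEng.gf MyAppGer.gf

    and a file of GF shell commands, say test.gfs, can be run non-interactively with

      gf --run <test.gfs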

    2. Installation

    The GF compiler API document can be downloaded from the MOLTO svn server at svn://molto-project.eu/compiler-api and built by running make. For the build to succeed, and produce an HTML file readable in the browser, it is necessary to have the txt2tags software (from http://txt2tags.org) and the graphviz software (http://www.graphviz.org).

    A version maintained by the GF developers is also available online at http://www.grammaticalframework.org/compiler-api.

    D2.2 Grammar IDE


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D2.2 Grammar IDE
    Security (distribution level): Public
    Contractual date of delivery: M18
    Actual date of delivery: September 2011
    Type: Prototype
    Status & version: Final
    Author(s): A. Ranta, T. Hallgren, et al.
    Task responsible: UGOT
    Other contributors: John Camilleri, Ramona Enache


    Abstract

    Deliverable D2.2 describes the functionalities of an Integrated Development Environment (IDE) for GF (Grammatical Framework). The main question it addresses is how such a system should help programmers who write multilingual grammars. Two IDE's are presented: a web-based IDE enabling a quick start for GF programming in the cloud, and an Eclipse plug-in, targeted at expert users working with large projects, which may involve the integration of GF with other components. Example-based grammar writing is also described at the end.

    1. Introduction

    An IDE, Integrated Development Environment, is a software system targeted at programmers. It helps the programmer write, test, and maintain code. The tasks where an IDE helps include

    • editing source code
    • compiling and/or interpreting the code
    • maintaining complex projects with their module dependencies
    • testing and debugging the code

    (cf. http://en.wikipedia.org/wiki/Integrated_development_environment). While interactive command interpreters (such as Unix shells) were historically the first systems recognized as IDE's, the contemporary notion of "IDE's proper" assumes a set of visual tools. The most widely used IDE's are probably Eclipse (http://www.eclipse.org/), Microsoft Visual Studio (http://www.microsoft.com/visualstudio), and Apple's Xcode (http://developer.apple.com/xcode/). Each of these is a desktop program of substantial size. But recent times have also seen Web IDE's (WIDE), where the user can write programs without installing any software locally; an example is CodeRun (http://www.coderun.com).

    The purpose of this document is to introduce an IDE for GF, Grammatical Framework (http://www.grammaticalframework.org/). GF is a programming language designed for writing multilingual grammars and their applications (Ranta 2011). Typical applications are translation systems (with many simultaneous languages) and the localization of natural language processing systems such as question answering (with many alternative languages).

    This paper will introduce two IDE's for GF:

    • a Web IDE (http://www.grammaticalframework.org/demos/gfse/), which allows users to build GF grammars and run them "in the cloud"
    • an Eclipse plug-in, which allows users to build GF grammars on their desktop and link them with other Eclipse-enabled software, such as Android mobile applications and ontology engineering tools.

    The Web IDE is intended as a quick way to use GF: it requires no software installation and has some helpful functionalities to guide novice users. But it is less suited to large GF programs consisting of many modules, such as GF grammar libraries. The Web IDE is a mature program tested by many users, but new developments are still expected.

    The Eclipse plug-in is meant for power users of GF, who may have to maintain hundreds of GF modules simultaneously and to link them with other software. But it is less quick to get started with, since it requires the installation of the GF compiler, the Eclipse platform, and the GF Eclipse plug-in. The Eclipse plug-in is still in the beginning of its development.

    Both these tools are new, and have been built during 2011 within the MOLTO project. The traditional "IDE" for GF is one familiar from the Unix environment:

    The interactive shell is a Read-Eval-Print loop similar to those of LISP and, more recently, Haskell (GHCi). While it has more IDE functionalities than many programming languages provide, we do not call it an IDE, but reserve that name for the graphical Web IDE and Eclipse systems. Actually, the GF shell can be seen as an API (Application Programmer's Interface) to the GF compiler. It provides a set of commands that can be used for compiling, diagnosing, and testing GF grammars. More sophisticated IDE's can be built by using the shell command language to communicate with the compiler. The document The GF Grammar Compiler API (MOLTO Deliverable 2.1) gives more information on the available functionalities.

    2. Multilingual grammars

    A GF program is a multilingual grammar, which for n languages consists of 1+n modules: one abstract syntax defining the semantic content in a language-independent way, and for each language a concrete syntax showing how this content is expressed in that language. Here is a "hello world" example for English, Finnish, and Italian:

      abstract Hello = {
        cat Greeting ; Recipient ;
        fun 
          Hello : Recipient -> Greeting ;
          World, Mum, Friends : Recipient ;
      }
      
      concrete HelloEng of Hello = {
        lin 
          Hello rec = "hello" ++ rec ;
          World = "world" ;
          Mum = "mum" ;
          Friends = "friends" ;
      }
      
      concrete HelloFin of Hello = {
        lin 
          Hello rec = "terve" ++ rec ;
          World = "maailma" ;
          Mum = "äiti" ;
          Friends = "ystävät" ;
      }
      
      concrete HelloIta of Hello = {
        lin 
          Hello rec = "ciao" ++ rec ;
          World = "mondo" ;
          Mum = "mamma" ;
          Friends = "amici" ;
      }
    
    

    The GF compiler produces from this code a system that can parse phrases like hello world, ciao mamma and also generate them in each language, thus enabling translation between any pair of languages.
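
    For instance, a GF shell session with the Hello grammar might look as follows (a sketch; the exact output formatting of the shell may differ):

      $ gf HelloEng.gf HelloFin.gf HelloIta.gf
      Hello> parse -lang=HelloEng "hello friends" | linearize
      hello friends
      terve ystävät
      ciao amici
      Hello> parse -lang=HelloIta "ciao mamma" | linearize -lang=HelloEng
      hello mum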

    The Hello grammar is of course extremely simple, on purpose. But it shows the essential structure of multilingual grammars, and it is easy to see how the grammar could be extended by adding new functions (i.e. combination rules like Hello and words like Mum).

    The GF compiler checks that the abstract and concrete syntaxes are in sync. For instance, it checks that each abstract syntax function (fun) actually has a linearization (lin) in each concrete syntax. An IDE is expected to go one step further: it reminds the programmer, prior to running the compiler, of those linearizations that are missing. And when a new language (i.e. a new concrete syntax) is added to the system, the IDE initializes its code with a template for all required linearization rules, as sketched below.
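
    For instance, if German were added to the Hello grammar above, the IDE could initialize the new module with a stub along these lines (a sketch, not the tool's literal output; variants {} is GF's empty placeholder, which compiles at any linearization type):

      concrete HelloGer of Hello = {
        lin
          Hello rec = variants {} ; -- to be filled in
          World = variants {} ;
          Mum = variants {} ;
          Friends = variants {} ;
      }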

    Multilinguality is one aspect of GF's module system: each language, as well as the abstract syntax, has its own module. Larger GF applications have an additional complexity created by the inheritance and opening of modules; a large grammar can easily have 20 modules involved for each language, and this is multiplied by the number of languages plus one for the abstract syntax. While the opening and inheritance correspond to the module dependencies found in most other programming languages (such as inheritance and the use of libraries), the multilinguality aspect is an extra dimension, which makes GF programs more complex than ordinary programs.

    A GF project with 15 languages, as targeted in the MOLTO project, involves hundreds of modules in scope at the same time. These are roughly divided into two groups:

    • the application grammar: the code written by the programmer
    • the resource grammar: the code imported from libraries

    The total resource grammar code in September 2011 comprises 755 modules, addressing 20 natural languages. This code is normally distributed in binaries (although the source is also available) and never read or written by the application programmer. But the programmer of course needs to inspect the code: to see, for instance, what functions are available to construct objects of a given type such as noun or sentence. Inspecting the library code is one of the most important things that should be supported by the IDE.

    3. The Web IDE

    Traditionally, GF grammars are created in a text editor and tested in the GF shell. Text editors know very little (if anything) about the syntax of GF grammars, and thus provide little guidance for novice GF users. Also, the grammar author has to download and install the GF software on his/her own computer.

    In contrast, the GF online editor for simple multilingual grammars runs in the browser, making it easier to get started. All that is needed is a reasonably modern web browser; even Android and iOS devices can be used.

    The editor also guides the grammar author by showing a skeleton grammar file and hinting how the parts should be filled in. When a new part is added to the grammar, it is immediately checked for errors.

    Editing operations are accessed by clicking on editing symbols embedded in the grammar display: + = add an item, × = delete an item, % = edit an item. These are revealed when hovering over items. On touch devices, hovering is in some cases simulated by tapping, but there is also a button at the bottom of the display, "Enable editing on touch devices", that reveals all editing symbols.

    In spite of its name, the editor runs entirely in the web browser, so once you have opened the web page, you can continue editing grammars even while you are offline.

    3.1 Status

    At the moment, the editor supports only a small subset of the GF grammar notation. Proper error checking is done for abstract syntax, but not (yet) for concrete syntax.

    The grammars created with this editor always consist of one file for the abstract syntax, and one file for each concrete syntax.

    3.1.1. Abstract syntax

    The supported abstract syntax corresponds to context-free grammars (no dependent types). The definition of an abstract syntax is limited to

    • a list of category names, Cat_1 ; ... ; Cat_n,
    • a list of functions of the form Fun : Cat_1 -> ... -> Cat_n
    • and a start category.

    Available editing operations:

    • Categories can be added, removed and renamed. When renaming a category, occurrences of it in function types will be updated accordingly.
    • Functions can be added, removed and edited. Concrete syntaxes are updated to reflect changes.
    • Functions can be reordered using drag-and-drop.

    Error checks:

    • Syntactically incorrect function definitions are refused.
    • Semantic problems, such as duplicated definitions or references to undefined categories, are highlighted.

    3.1.2. Concrete syntax

    At the moment, the concrete syntax for a language L is limited to the following (a sketch of a small grammar in this fragment follows the list):

    • opening the Resource Grammar Library modules SyntaxL, ParadigmsL, LexiconL and ExtraL,
    • linearization types for the categories in the abstract syntax,
    • linearizations for the functions in the abstract syntax,
    • parameter type definitions, P = C_1 | ... | C_n,
    • and operation definitions, op = expr or op : type = expr.
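
    As a sketch of a small grammar in this fragment, consider the Food example familiar from the GF tutorial (our reconstruction for illustration, not code produced by the editor). Assume an abstract syntax Food with cat Comment ; Kind ; Quality and functions Pred : Kind -> Quality -> Comment, Wine : Kind, and Delicious, Fresh : Quality. A concrete syntax in the supported style then looks as follows:

      concrete FoodEng of Food = open SyntaxEng, ParadigmsEng in {
        lincat
          Comment = S ;  -- sentences
          Kind = CN ;    -- common nouns
          Quality = AP ; -- adjectival phrases
        lin
          -- "wine is delicious", with a mass noun phrase as subject
          Pred kind quality = mkS (mkCl (mkNP kind) quality) ;
          Wine = mkCN (mkN "wine") ;
          Delicious = adjAP "delicious" ;
          Fresh = adjAP "fresh" ;
        oper
          -- an operation definition of the form op : type = expr
          adjAP : Str -> AP = \s -> mkAP (mkA s) ;
      }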

    Available editing operations:

    • The LHSs of the linearization types and linearizations are determined by the abstract syntax and do not need to be entered manually. The RHSs can be edited.
    • Parameter types can be added, removed and edited.
    • Operation definitions can be added, removed and edited.
    • Definitions can be reordered (using drag-and-drop).

    Also,

    • When a new concrete syntax is added to the grammar, a copy of the currently open concrete syntax is created, since copying and modifying is usually easier than creating something new from scratch. (If the abstract syntax is currently open, the new concrete syntax will start out empty.)
    • When adding a new concrete syntax, you normally pick one of the supported languages from a list. The language code and the file name are determined automatically. But you can also pick Other from the list and change the language code afterwards, to add a concrete syntax for a language that is not in the list.

    Error checks:

    • The RHSs in the concrete syntax are checked for syntactic correctness by the editor as they are entered. (TODO: the syntax of parameter types is not checked at the moment.)
    • Duplicated definitions are highlighted. Checks for other semantic errors are delayed until the grammar is compiled.

    3.2. Compiling and testing grammars

    When pressing the Compile button, the grammar will be compiled with GF, and any errors not detected by the editor will be reported. If the grammar is free from errors the user can then test the grammar by clicking on links to the online GF shell, the Minibar or the Translation Quiz.

    3.3. Grammars in the cloud

    While the editor normally stores grammars locally in the browser, it is also possible to store grammars in the cloud. Grammars can be stored in the cloud just for backup, or to make them accessible from multiple devices.

    There is no automatic synchronization between local grammars and the cloud. Instead, the user uploads grammars to the cloud and downloads grammars from the cloud with explicit commands. In both cases, complete grammars are copied, and older versions at the destination are overwritten. When a grammar is deleted, both the local copy and the copy in the cloud are deleted.

    Each device is initially assigned to its own unique cloud. Each device can thus have its own set of grammars that are not available on other devices. It is also possible to merge clouds and share a common set of grammars between multiple devices: when uploading grammars to the cloud, a link to this grammar cloud appears. Accessing this link from another device will cause the clouds of the two devices to be merged. After this, grammars uploaded from one of the devices can be downloaded on the other devices. Any number of devices can join the same grammar cloud in this way.

    Note that while it is possible to copy grammars between multiple devices, there is no way to merge concurrent edits from multiple devices. If the same grammar is uploaded to the cloud from multiple devices, the last upload wins. Thus the current implementation is suitable for a single user switching between different devices, but not recommended for sharing grammars between multiple users.

    Also note that each grammar is assigned a unique identity when it is first created. Renaming a grammar does not change its identity. This means that name changes are propagated between devices like other changes.

    3.4. Future work

    This prototype gives an idea of how a web-based GF grammar editor could work. While this editor is implemented in JavaScript and runs in the web browser, we do not expect to create a full implementation of GF that runs in the web browser; instead, we let the editor communicate with a server running GF.

    By developing a GF server with an appropriate API, it should be possible to extend the editor to support a larger fragment of GF, to do proper error checking and make more of the existing GF shell functionality accessible directly from the editor.

    The current grammar cloud service is very primitive. In particular, it is not suitable for multiple users developing a grammar in collaboration.

    4. The Eclipse plug-in

    The aim behind developing a desktop IDE for GF is to provide more powerful tools than may be possible and/or practical in a web-based setting. In particular, the ability to resolve cross-references between source files and libraries instantaneously during development time is one of the primary goals and motivations for the project.

    The choice was made to develop this desktop IDE as a plugin for the Eclipse Platform, as it seemed to be the most popular choice among the GF developer community. Support for the platform is vast, and many tools for adapting Eclipse to domain-specific languages already exist. Unlike the zero-click WIDE approach, using the GF Eclipse plugin (GFEP) requires some manual installation and configuration on the development machine. Thus the GFEP is aimed at seasoned developers rather than the merely curious.

    4.1. Features

    Implemented (including partially)

    1. Syntax highlighting and error detection
    2. Code folding, quick block-commenting, automatic code formatting
    3. Definition outlining, jump to declaration, find usage
    4. Warnings for problems in the module dependency hierarchy
    5. Launch configurations, i.e. compilation directly from IDE

    Coming soon

    1. Auto-completion for declared identifiers
    2. Inline documentation for function calls, overloads
    3. Quick-fix suggestions for syntax and naming errors
    4. Code generation for concrete/instance modules
    5. Code generation for new languages in application grammars
    6. Grouping of concrete syntaxes by language, fast switching and linked navigation
    7. Built-in library browser (in particular for GF resource grammar library)

    Long-term goals

    1. Test-suite functionality
      • Treebank management and testing
      • Possibility to incorporate treebank tool demonstrated by Jordi Saludes in the Math Grammar Library
    2. Provide a single platform for developing and using embedded grammars
    3. Integration with ontology engineering tools

    4.2. Status

    The starting point for the GFEP is the Xtext DSL Framework for Eclipse (http://www.eclipse.org/Xtext/). By converting the GF grammar into the appropriate Extended BNF form required by the LL(*) ANTLR parser, the framework provides a good starting point for future plugin development, already including a variety of syntax checking tools and some cross-reference resolution support. The specific requirements of the GF language, particularly its special module hierarchy, mean that significant customisations to this generated base plugin are needed.

    As of 1st October 2011, a first prototype of the GFEP has been released to GF developers to gather initial feedback. This first release is not intended to be a mature development tool, but a showcase of some of the potential features that can be provided by developing GF grammars within a powerful desktop IDE. Reactions from within the GF developer community will guide the way forward, both in prioritizing the future tasks and in better gauging the person-month cost that an eventual mature version of the plugin would require.

    4.3. Trying out the GFEP prototype

    Installation

    1. Eclipse is of course required. The plugin was developed using Eclipse 3.7 but older versions should also work.
    2. Inside Eclipse, go to Help > Install New Software.
    3. Add new software site using the URL: http://www.grammaticalframework.org/eclipse/beta/.
    4. Select the GF Eclipse Plugin, click Next, accept the license agreement and install. If it takes a long time to calculate dependencies, just be patient; it is not yet clear whether this is an abnormal issue.
    5. Accept the prompt warning that the software is unsigned.
    6. Restart Eclipse when prompted.
    7. (Optional) Add the GF perspective by clicking Open Perspective > Other.
    8. (Optional) Go to Run > Run Configurations and add a new Grammatical Framework configuration. Fill in the necessary fields in the Main tab, and click Apply to save the new configuration.

    Getting started

    1. Create a new blank General project in the usual way. If asked whether you want to add the Xtext nature to your project, you can safely say no.
    2. There is a wizard for adding a new GF source file, from File > New > Other > GF Source File.

    3. You can find a small example at http://www.grammaticalframework.org/eclipse/examples/hello/. Download the files and manually add them to your Eclipse workspace.

    4. Note how changing a cat definition, for example, will produce warnings and/or errors in the other modules.

    5. Compile your source using the provided Run Configuration.

    Known issues

    1. Local parameter/binding identifiers show up as unresolved, e.g. recip in:
         Hello recip = {s = "hello" ++ recip.s ! Masc} ;
      
    2. Qualified names currently aren't treated correctly, so Masc works but ResEng.Masc does not.
    3. If the Apply button in the Run Configurations dialog remains disabled, swap to the Common tab and back.
    4. Selective inheritance has not been properly tested.
    5. Interfaces/functors are currently not supported.
    6. The in-editor validation often needs to be triggered by some keystrokes, especially when Eclipse loads with some already-opened files.

    5. Future direction: example-based grammar writing

    It is typically the case that the writer of a GF concrete grammar is at least fluent in the language, and has GF skills proportional to the complexity of the abstract syntax to implement. However, for a rather complex multilingual grammar comprising 5 or more languages, as was for instance the case with the MOLTO Phrasebook (reference...), which was first available in 14 languages and has a reasonably rich semantic interlingua, finding grammar developers is difficult. Even where such developers exist, their task can still be made easier by automating where possible and by abstracting away from certain technicalities of GF programming that would slow down grammar development.

    When writing an application grammar, one such problem is using the resource library to generate text for a given language with the help of the primitives already defined in the corresponding resource grammar. For this, however, one needs to be familiar with the almost 300 existing functions, assuming that the domain grammar writer is different from the resource grammar writer, as is often the case.

    In order to make the users' task easier, an API is provided so that the domain grammar writer only needs to know the GF categories and how they can be built from each other. This layer makes the interaction with the resource library smoother for users, and also makes it easier to make new constructions from the library available.

    For example, the sentence "I talked to my friends about the book that I read" is parsed to the following abstract syntax tree:

      UseCl (TTAnt TPast ASimul) PPos (PredVP (UsePron i_Pron) (ComplSlash 
      (Slash3V3 talk_V3 (DetCN (DetQuant DefArt NumSg) (RelCN (UseN book_N) 
      (UseRCl (TTAnt TPast ASimul) PPos (RelSlash IdRP (SlashVP 
      (UsePron i_Pron) (SlashV2a read_V2))))))) 
      (DetCN (DetQuant (PossPron i_Pron) NumPl) (UseN friend_N))))
      
    

    If we use the API constructors, the abstract syntax tree is simpler and more intuitive:

      mkS pastTense (mkCl (mkNP i_Pron) (mkVP (mkVPSlash talk_V3 (mkNP the_Art 
      (mkCN (mkCN book_N) (mkRS pastTense (mkRCl which_RP (mkClSlash 
      (mkNP i_Pron) (mkVPSlash read_V2))))))) (mkNP (mkQuant i_Pron) 
      plNum friend_N)))
    

    In this way, the domain grammar writer can just use the functions from the API, combining them with lexical terms from dictionaries and with functions from outside the core resource library that implement non-standard grammatical phenomena which do not occur in all languages.

    One step further in the direction of automating the development of domain grammars is the possibility to enter function linearizations by giving a positive example of their usage. This is particularly helpful in larger grammars containing syntactically complicated examples that would challenge even the more experienced grammarians. When an example is provided, the grammar may return more than one parse tree; the user can then select the right tree, or take advantage of the probabilistic ranking and take the most likely one.

    The example-based grammar writing system is still work in progress, but a basic prototype is already available, and it will be further developed and improved. The basic steps of the system are briefly described below, along with directions for future work.

    The typical scenario is a grammarian working on a domain concrete grammar for a given language - which we call X for convenience.

    In this case, he would need at least a resource grammar for X. Preferably, there should also be a large lexical dictionary and/or a larger-coverage GF grammar with probabilities. Currently, larger lexical resources exist for English, Swedish, Bulgarian, Finnish and French. A large lexicon also exists for Turkish, but the resource grammar is not complete.

    We also assume that the user already has an abstract syntax for the grammar, and that the _lincats_ (namely the representations of the abstract categories in the concrete grammar) are basic syntactic categories from the resource grammar (NP, S, AP).

    Consequently, the functions from the abstract syntax will be grouped in a topological order, where the ordering relation is a < b iff b takes a as argument in a non-recursive rule. There are no cycles in this ordering, since a corresponding check is performed at the compilation stage. The elements are ordered in a list of lists, where every sub-list contains mutually incomparable elements. The user is presented with the first sub-list and, after completing it, with the next ones.

    For each such function, an abstract tree from the domain grammar having the given function as root will be generated. The arguments are chosen among the functions already linearized. If another concrete grammar already exists, the user can also see a linearization of the tree in the other language, along with an example showing how the given construction fits into a context. For example, if the user needs to provide an example for Fish in a given grammar, say the tourist phrasebook, and an English grammar already exists, then he would get a message asking him to translate fish as in "fish" / "This fish is delicious".

    When providing the translation, the user is made aware of the boundaries of the grammar by the incremental parser of the resource grammar. If the example can be parsed and the number of parse trees is greater than 1, either the user picks the right one, or the system chooses the most probable tree as a linearization. From here, the system will also generalize the tree by abstracting over the arguments that the function could have. Finally, the resulting partial abstract syntax tree is translated to an API tree and written as the linearization of the given function.
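
    As a hypothetical illustration of these last steps (the words and trees are ours, for exposition only): suppose the function to linearize is Fish, and the user translates the prompt "This fish is delicious" into Italian as "questo pesce è delizioso". Parsing with the Italian resource grammar yields a tree for the whole sentence; abstracting over everything except the contribution of Fish, and translating to API constructors, leaves a one-line linearization:

      -- full parse tree, schematically (tense and polarity wrappers omitted):
      --   PredVP (DetCN (DetQuant this_Quant NumSg) (UseN pesce_N))
      --          (UseComp (CompAP (PositA delizioso_A)))
      -- after generalization, only the subtree contributed by Fish remains,
      -- written out as an API linearization in the concrete syntax:
      lin Fish = mkCN (mkN "pesce") ;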

    The key idea is based on parsing followed by compilation to the API, and it provides considerable benefits, especially for idiomatic grammars such as the Phrasebook, where the abstract syntax trees differ considerably across languages. For example, when asking for a person's name in English, the question "What is your name" would be written using API functions as:

      mkQS (mkQCl whatSg_IP (mkVP (mkNP (mkQuant youSg_Pron) name_N)))
    

    which stands for the abstract syntax tree:

      UseQCl (TTAnt TPres ASimul) PPos (QuestVP whatSg_IP (UseComp 
        (CompNP (DetCN (DetQuant (PossPron youSg_Pron) NumSg) (UseN name_N)))))
    
    

    On the other hand, in French the question would be translated as "Comment t'appelles-tu" (literally, "How do you call yourself"), which is parsed to:

      UseQCl (TTAnt TPres ASimul) PPos (QuestIAdv how_IAdv 
        (PredVP (UsePron youSg_Pron) (UseV appeler_V)))
    

    and corresponds to the following API abstract tree:

      mkQS (mkQCl how_IAdv (mkCl p.name appeler_V))
    

    Currently, steps are being taken to integrate the system with the Web Editor, and in this way combine the example-based methods with traditional grammar writing. In this case, the set of functions that can be linearized from examples will be computed incrementally, depending on the state of the code.

    A procedure similar to the one that determines which functions can be linearized from examples can be used to find the functions that can be tested, namely those that already have linearizations. In this way, the functions linearized in the editor, manually or by example, can also be tested by randomly generating an expression and linearizing it both in the language under development and in one or more languages for which a concrete grammar exists. If the linearization is not correct, the user can ask for a new example, or modify the linearization himself.

    Other plans for future work, in addition to integrating the method with the GF Web Editor, include a thorough evaluation of the utility of the method for larger grammars and with grammarians of different levels of GF skill. Moreover, we plan to include a handler for unknown words, which should make it easier for the user to build a small lexicon from examples.

    To address these difficulties, we devised the example-based grammar learning system, which is meant to automate a significant part of the grammar writing process and ease grammar development.

    The system has two main uses: to reduce the amount of GF programming necessary in developing a concrete grammar, and, more importantly, to make it possible to learn certain features of a language during grammar development.

    In recent years, the GF community has grown steadily, and so have the number of languages in the resource library and the number of domain grammars using them. The writer of a concrete domain grammar is typically not the writer of the resource grammar for the same language, has fewer GF skills, and is most likely unaware of the almost 300 constructors that the resource grammars implement for building various syntactic constructions; see http://www.grammaticalframework.org/lib/doc/synopsis.html.

    6. Conclusion

    We have presented two IDE's for GF. The web-based IDE is a stable system, which makes it easy to develop multilingual applications in the cloud. The Eclipse plugin brings GF to one of the leading desktop environments for software development. It is already usable for simple tasks such as syntax highlighting and cross-modular references, and more functionalities are being added; the further development of the Eclipse plugin will be responsive to its actual users at the other sites of the MOLTO project. In addition to the IDE's, we have introduced the technique of example-based grammar writing, which has already been implemented as a desktop shell program and within the web-based IDE.

    The IDE's are expected to make the use of GF more efficient for power users and more accessible for beginning users. The success in this will be monitored and evaluated in the case studies of the MOLTO project.

    D3.1 Translation Tools API


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D3.1 Translation Tools API
    Security (distribution level): Public
    Contractual date of delivery: M18
    Actual date of delivery: December 2011
    Type: Document
    Status & version: Final
    Author(s): L. Carlson
    Task responsible: UHEL
    Other contributors:


    Abstract

    Deliverable D3.1 explains the components of the MOLTO translation tool API. The intended audience for the tools is translators with no experience in grammar writing, but who are familiar with standard translation industry tools, such as translation memories and document managers. The API has two levels: the core API for a single author/translator, and the extended API for a community of authors, translators, and grammar engineers. We use the open source translation platform GlobalSight to handle the management.

    1. Introduction

    MOLTO promises a translation tool based on Grammatical Framework, a programming language for multilingual grammars. The Grammatical Framework (GF) is used in the MOLTO project to build translation systems for EU languages. The user of the MOLTO translation tool need not know how to write GF grammars. She is supposed to use domain specific grammars developed by others to translate documents in the domains covered. Basic domain language coverage does not guarantee that all terms and idioms in the translatable document are covered. The MOLTO translation tool should handle lexical gaps in a way that benefits and benefits from a wider community of translators. It should also provide fallback solutions when a text is not covered by the available grammar(s).

    This report explains the components of the MOLTO translation tool API. The API is the main deliverable, because end-user translation environments and tools are many and change quickly with time, while the MOLTO tools should have longer-lasting value. The components of the API include at least the following (the Core API):

    • translation editor
    • grammar manager
    • document manager
    • term manager

    Provisions for interfacing with further typical CAT tool components will be considered (the Extended API).

    WP3 design diagram:

    A model implementation of MOLTO translation tools based on the API will be demonstrated in the MOLTO prototype due as deliverable 3.2 in February 2012.

    2. Translation scenarios

    GF is a multilingual interlingual translation system geared toward multilingual generation. As a proof of concept, GF demos display immediate generation into dozens of languages from a tiny grammar. Extended to a more realistic case, this scenario could have a native-language editor producing text for more or less immediate multilingual distribution, for instance, a multilingual website. For this scenario to work, the translations should be acceptable as is without target language native revision.

    The GF approach as such best suits an authoring/pre-editing scenario, where an original author or authorised content editor can choose or edit the original message to conform to a domain specific constrained-language grammar, which GF is then expected to translate blind, and reliably, into a number of languages. In real-life situations, the text to translate is likely to be at least partially new to the system, and no guarantee can be given that the translation is correct in all the generated languages. It is specifically such extensions of the original MOLTO "fridge magnet" demo scenario that this document tries to address.

    The current professional human translation scenario is quite different. It is a post-editing scenario. The roles of author (client) and translator are separated. The translator has quite restricted authority over the target text and almost none over the original, aside from obvious errors. The translation process is normally bilingual. This is because the translation is created, or at least chosen, by a human, and human translators rarely have professional competence in more than one or two languages besides their native language.

    The preferred professional translation direction is from a second language to native language, because for humans language generation is more demanding than language understanding. In this direction, the translator can exploit external resources to understand the source text and use her native competence to check the quality of the target text. Even in this case, a native subject expert is usually needed to check the translation. The reviser need not know the source language or have the source text at hand.

    The extended MOLTO translation scenarios considered here lie between these two extremes. We may still assume that the translator has some authority over the text to produce, i.e. she is the author or is authorized to adapt the text to better satisfy the constraints of the translation grammar. The MOLTO pre-editor/translator should be native or at least fluent in the source language, and familiar with the domain or at least its special language, in order to know how the message can be (para)phrased. Thus the extended MOLTO scenario retains an element of constrained-language authoring or pre-editing.

    But we may need to relax the blind generation assumption. Although the GF engine may give warnings or suggestions when it is unsure or knows the translation fails, there are likely to be cases where the translation is technically correct but inadequate for human consumption. A native revision is then needed for one or more target languages. As in the human translation case, the author/translator can at best serve as informant for one or a few target languages. For the rest, the translation needs to be distributed to a pool of revisers. In real life, a translation has to go out even if GF fails, so there must be a way to override GF with human translations. If the translation were a one-off affair, that could end the process. However, in many real-life scenarios, the same or very similar texts will come up for (re)translation, and in that case, the results of the revisions should be fed back into the translation cycle, to avoid making the same errors twice. In other words, we should make the MOLTO translation system, consisting of GF and its human users, an adaptive whole. This is the most demanding part to conceive here. Pre-editing MT has not been very successful in the past, probably partly because not enough attention has been given to practical concerns like these.

    Translation industry standards

    As document automation progresses, professional translation is merging into localization, the adaptation of software to a new locale (language and culture). Translation used to differ from localization in that translators were not expected to worry about formats or the document lifecycle: source texts were shipped to translators as raw text and returned as such. In an intermediate phase, a specialized localization industry developed to multilingualize software, preserving the source format. More recently, with multi-channel publishing and document toolchains, there is again a push to separate form from content. The localization industry solution to these conflicting pressures is to separate content from form in a reversible fashion. Localization formats and tools like GNU gettext and XLIFF make provisions for extracting the translatable text from a document in a way that allows embedding the target text in the same document form.

    The current GF translation engine as such is neutral about the format of the text it receives, but the existing resource grammars expect text to come in raw form. It should be technically possible to include document formatting in GF parsing and generation, and if suitably restricted, that might be the most efficient solution for the translation of inline tags. For the rest, however, it seems best to take advantage of existing content extraction technologies in the translation industry. We propose to use XLIFF as the MOLTO translatable document format in the extended API.

    XLIFF is one of the OASIS LISA OAXAL standards. As of 28 February 2011, the Localization Industry Standards Association (LISA) is insolvent, but the LISA standards continue to be used by the industry. The OASIS Open Architecture for XML Authoring and Localization (OAXAL) reference model comprises the following open standards:

    • Unicode
    • XLIFF – OASIS XML Localization Interchange File Format
    • SRX – LISA Segmentation Rules Exchange
    • TMX – LISA Translation Memory Exchange
    • GMX/V – LISA Word and Character Count Standard
    • W3C ITS – Internationalization Tag Set
    • xml:tm – LISA XML based text memory

    Translation tools survey

    For the extended scenario, we may add other industry standard CAT tools for MOLTO translators to use besides the core list above. There is a plethora of packages for CAT and translation project management/automation, both commercial and open source. It seems best to borrow from existing open source packages that comply with translation industry standards, instead of reinventing the wheel. Examples of CAT packages are Swordfish and Heartsome, both commercial. Examples of translation project management and workflow automation packages are ]project-open[ and GlobalSight, which are open source, and a number of commercial systems.

    From existing open source projects we can shop for properties generally expected of TM (http://en.wikipedia.org/wiki/Translation_memory), CAT (http://en.wikipedia.org/wiki/Computer-assisted_translation), and translation project management software. Some commercial systems also have open interfaces, e.g. Across (http://en.wikipedia.org/wiki/Across_Systems). Various translation tool listings and comparisons are available on the Web; see e.g. Wikipedia.

    3. Translation process

    Here we study variants of the machine assisted translation process, to develop a version that suits MOLTO.

    The translation industry workflow

    To have a point of comparison, we review the practices in the professional translation industry today. Going beyond the 90's single-user computer-assisted translation (CAT) setup with a translation editor, translation memory, and termbase, current translation management system (TMS) packages provide tools for managing complex translation industry projects involving clients, project managers, and a distributed pool of translators, reviewers, and subject experts. Many aspects of the workflow and the associated communication (notification, document transfer) can be automated in these systems. For an example of a translation industry workflow, we take the ]project-open[ Translation Workflow. Tasks typically covered by translation project and workflow management packages include

    • Analyzing source text using a translation memory
    • Generating quotes for the customer
    • Generating purchase orders for providers and freelancers
    • Executing the project
    • Monitoring the project
    • Generating invoices to the customer
    • Generating invoices for providers and freelancers

    Roles within the translation workflow

    In ]project-translation[ five user roles can be defined.

    • Translator: A person responsible for executing the translation and/or localization projects.
    • Project Manager: A person responsible for the successful execution of projects. Project Managers frequently act as Key Account managers in small translation agencies.
    • Senior Manager: A person responsible for the financial viability of the business and ultimately responsible for the relationships to customers.
    • Key Account Manager: A person responsible for the relationship to a number of customers. We assume that customer project requests are handled by a Key Account Manager and then passed on to a Project Manager for execution.
    • Resource Manager: A person responsible for the relationship with translation resources and providers.

    GlobalSight has yet more default roles:

    • Administrator
    • Customer
    • LocaleManager
    • LocalizationParticipant
    • ProjectManager
    • SuperAdministrator
    • SuperProjectManager
    • VendorAdmin
    • VendorManager

    New roles can be invented at will in GlobalSight. As discussed in the MOLTO requirements document (Deliverable 9.1), the role cast in MOLTO can have at least these roles:

    • Author
    • Editor
    • Translator
    • Reviewer (domain/language expert)
    • Ontologist
    • Terminologist
    • Grammarian

    The Project Cycle

    The figure below shows schematically how the workflow proceeds:

    Translation workflow

    • A document from the customer is passed to the project manager for translation (this action is represented by the arrow from documents in the upper left corner, from the client to the project manager).
    • The Project Manager uploads the document into the ]project-open[ system.
    • A translator downloads the file.
    • The same translator uploads the translated files.

    Similarly for editors and proofreaders. Finally, the project manager retrieves the document and sends it to the customer. Alternatively, the project manager can allow the customer to download the files directly. In addition to tracking the status of a project at every stage, the system allows the project manager to allocate projects to the most suitable team and streamline the freelancers’ job.

    GlobalSight

    GlobalSight (http://www.globalsight.com/) is an open source Translation Management System (TMS) released under the Apache License 2.0. Version 8.2 was released on 15 September 2011. As of version 7.1 it supports the TMX and SRX 2.0 Localization Industry Standards Association standards. It was developed in the Java programming language and uses the MySQL database and OpenLDAP directory software. GlobalSight also supports computer-assisted translation and machine translation.

    According to the documentation the software has the following features:

    • Customizable workflows, created and edited using graphical workflow editor
    • Support for both human translation and fully integrated machine translation (MT)
    • Automation of many traditionally manual steps in the localization process, including: filtering and segmentation, TM leveraging, analysis, costing, file handoffs, email notifications, TM update, target file generation
    • Translation Memory (TM) management and leveraging, including multilingual TMs, and the ability to leverage from multiple TMs
    • In-context exact matching, as well as exact and fuzzy matching
    • Terminology management and leveraging
    • Centralized and simplified Translation memory and terminology management
    • Full support for translation processes that utilize multiple Language Service Providers (LSPs)
    • Two online translation editors
    • Support for desktop Computer Aided Translation (CAT) tools such as Trados
    • Cost calculation based on configurable rates for each step of the localization process
    • Filters for dozens of filetypes, including Word, RTF, PowerPoint, Excel, XML, HTML, Javascript, PHP, ASP, JSP, Java Properties, Frame, InDesign, etc.
    • Concordance search
    • Alignment mechanism for generating Translation memory from previously translated documents
    • Reporting
    • Web services API for programmatic access to GlobalSight functionality and data
    • Integrated with Asia Online APIs for automated translation

    GlobalSight Web Services API

    GlobalSight provides a web services API (http://www.globalsight.com/wiki/index.php/GlobalSight_Web_Services_API). It is used to integrate external systems with GlobalSight in order to submit content to the localization/translation workflow and monitor its status. The web services API allows any client to connect and exchange data with GlobalSight, regardless of its implementation technology or operating system. The web service provides methods for

    • Authentication
    • Content Import
    • Project management
    • Activity and task management
    • User management
    • Locale management
    • Job management
    • Content export
    • Documentum support
    • Translation Memory management
    • Term Base management
    • CVS support

    For convenience, we shall borrow parts of the MOLTO extended API from the GlobalSight translation management system.

    The web localization workflow

    The translation industry workflow is top-down controlled, built on email and file transfers. For a more collaborative bottom-up approach, we can look at web localization. Web platforms are localized through a collaborative translation workflow. Here, translation is typically crowdsourced to a pool of volunteers, who either translate manually online or download PO files to work on with local tools. The website coordinates the effort. Different projects may have assigned managers that monitor the collaboration. This approach is exemplified by the Translate Toolkit (http://en.wikipedia.org/wiki/Translate_Toolkit) used to collaboratively localize open source software packages.

    An instance of the Translate Toolkit is Pootle (http://en.wikipedia.org/wiki/Pootle), an online translation management tool with a translation interface. It is written in the Python programming language using the Django framework and is free software, originally developed and released by Translate.org.za in 2004. It was further developed as part of the WordForge project and the African Network for Localisation and is now maintained by Translate.org.za.

    Pootle is intended for use by free software translators but is usable in other situations. Its main focus is on localization of applications' graphical user interfaces as opposed to document translation. Pootle makes use of the Translate Toolkit for manipulating translation files. The Translate Toolkit also offers offline features that can be used to manage the translation of Mozilla Firefox and OpenOffice.org in Pootle. Some of Pootle's features include terminology extraction, translation memory, glossary management and matching, goal creation and user management.

    Pootle can play various roles in the translation process. The simplest is to display statistics for the body of translations hosted by the server. Its suggestion mode allows users to make translation suggestions and corrections for later review, so it can act as a translation-specific bug reporting system. It allows online translation by various translators, and lastly it can operate as a management system where translators translate using an offline tool and use Pootle to manage the workflow of the translation.

    The Translate Toolkit API is documented at http://translate.sourceforge.net/doc/api/. It is open source subject to the GPL licence.

    The Google Translator Toolkit

    Google provides a free service for translating webpages by post-editing Google MT results. The toolkit allows users to

    • Upload and translate documents: use documents from your desktop or the web.
    • Download and publish translations: publish translations to Wikipedia™ or Knol.
    • Chat and share translations online: collaborate online with other translators.
    • Use advanced tools like translation memories and multilingual glossaries.

    The Google Translator Toolkit Data API allows client applications to access and update translation-related data programmatically. This includes translation document, translation memory, and glossary data stored with Google Translator Toolkit. The Google Translator Toolkit API is now a restricted API (http://code.google.com/apis/gtt/).

    The MOLTO translation workflow

    We now consider the MOLTO translation scenario. The MOLTO translation demo editor (see figure further below) supports a one-person workflow where the same person is the author(ised editor) of the source and the translator. Technically we can extend this to a more collaborative scenario where more actors are involved, as in the professional workflow above, by adding the usual project support tools to the toolkit. A more difficult part is to adjust the workflow so that the adaptivity goal above is satisfied. In the professional workflow, corrected translations accumulate in the translation memory, which helps translators avoid the same errors next time. In the MOLTO workflow, GF has an active role in generating translations, so it is GF that should learn from the corrections. Concretely, when a translator or reviser changes a wording, the correction should not go unnoticed, but should find its way back to the GF grammar, preferably through a round of community checks.

    We next describe one round of the ideal MOLTO translation scenario.

    Although it is possible that an author is ready to create and translate in one go (especially in a hurry), it is more normal to have some document(s) to start from. The document(s) might be created in a GF constrained-language editor in the first place. In that case, the only remaining step is translation. If translation coverage and quality have been checked, nothing more is needed. But frequently, some changes are needed to a previously translated document, or a new one is to be created from existing pieces and some new material. Imaginably, some of the parts come from different domains and need to be processed with different grammars. Some such complications might be handled with document composition techniques in the manner of Docbook or DITA toolchains.

    The strength of GF is that it ought to handle grammatical variation of existing sources well, so as to avoid manual patching of previous translations. Assume there is a previously GF translated document, and we want to produce a variant. Then it ought to be enough to load the document, make desired changes to it under the control of the GF grammar, and let GF generate the modified translations.

    Is it necessary to show the translations to the user? Not unless the translator knows the target language(s). We should distinguish two profiles: blind translation, where the author does not know the target languages or is not responsible for them herself, but relies on outside revision; and plain translation, where there are one or two target languages known to the author/translator, who wants to check the translations as she goes.

    In the blind profile, the author has to rely on revisers, and the revision cycle is slower. The revisers can either notify the author that the source does not translate correctly in their language(s), or notify the grammar/lexicon developer(s) directly, or both. If there is a hurry, the reviser(s) should provide a correct translation directly for the author/publisher to use as canned text. In addition, they should notify the grammar developer(s) of the revisions needed to GF. The notification(s) could happen through messages, or be conveyed through a shared translation memory, or both. In this slower cycle, it may not be realistic to expect the author to change the source text and repeat the revision process many times over for the same source, and possibly a multiplicity of languages, to get everything translated right before publication.

    In the plain profile, a faster cycle of revision is called for. The author/translator can try a few variations of the input. If no variant seems to work, then she probably wants to use her own translation, but also to make sure that GF learns of and from the failure. The failure can be a personal preference, or a general fix that the community should profit from. If it is a personal preference, the user may want to save the corrected translation in her translation memory and/or glossary, but she may also want to tweak her GF grammar to handle this and similar cases to her liking next time. If it is just a lexical gap or a missing fixed idiom, then there should be a service in the GF translation API to modify the grammar without knowing GF. The modifications could happen at different levels of commitment. The most direct one would be to provide a modular PGF format which would allow advising the compiled user grammar on the fly. Such a runtime fix would make sure that the same error does not happen during the same translation session or subsequent ones, at least until the domain grammar is recompiled.

    The next level of commitment to a change would be to generate new GF source, possibly from example translations provided by the author/translator, compile them, and add the changed or extra modules to the user's GF grammar. The cycle involved here might be too slow to do during translation, but it could happen between translation sessions. If fully automatic grammar revision is too error prone, the author/translator could just go on with canned translations in this session, and commit change requests to the grammar developer community. In this case, the changes would be carried out in good time, with regression tests and multilingual revision cycles, especially if the changes affect the domain semantics (abstract grammar) and thereby all translation directions.


    4. MOLTO Translation tools API

    The MOLTO Translation Tools API exposes the most important operations used in translating with GF in MOLTO. It makes them available to programmers who want to create alternative front ends to the GF translation tools, besides the MOLTO web translation demo platform. The API is divided into a Core API, basically answering the needs of a single author/translator, and an Extended API, addressing the needs of a community of authors, translators, and grammar engineers.

    The components of the MOLTO TT Core API include at least the following:

    1. sign in
    2. grammar manager
    3. document manager
    4. term manager
    5. translation editor

    The components of the MOLTO TT Extended API include the following:

    1. user management
    2. grammar management
    3. document management
    4. lexical resources
    5. translation editing
    6. translation memory
    7. reviewing/feedback
    8. grammar engineering

    The first five are extensions of the corresponding facilities in the Core API. The lexical resources API borrows from TermFactory. The translation memory and the reviewing/commenting facilities are adapted from GlobalSight. The last item is based on the GF grammar development tools API.

    The MOLTO Translation Tools Core API

    The core API basically provides for the one-editor/translator scenario, where an editor/translator creates or edits a source document under constraints of a selected GF grammar in PGF form and generates translations for the source. For lexical gaps (out-of-vocabulary items) there is a simple term editor which allows looking up concepts and adding equivalents.

    The demo prototype translation editor

    This section describes the translation editor developed by K. Angelov at UGOT.

    To guide the development of a suitable translation editor API to support MOLTO translation needs, UGOT has created a prototype web-based translation editor. It is implemented in Google Web Toolkit. It is usable for authoring with small multilingual grammars. It doesn't require any downloads or use of command shells. All that is needed is a reasonably modern web browser.

    The editor runs entirely in the web browser, so once you have opened the web page and have documents and grammars loaded, you can continue translation editing while you are offline.

    Sign in

    Signing in should allow a user controlled access to her own and some (maybe not all) shared resources. Ideally, the same login should work throughout the different parts of the distributed toolkit. There should be some group scheme to set group level access restrictions.

    For basic sign in needs, the demo editor currently uses the Google authentication API.

    Web applications that need to access services protected by a user's Google or Google Apps (hosted) account can do so using the Google Authentication service. This service lets web applications get access without ever handling their users' account login information. Google offers two libraries for handling authentication: one using the OAuth open standard, and a second interface called AuthSub, developed prior to the release of the OAuth standard. Authentication and authorization for Google APIs allow third-party applications to get limited access to a user's Google accounts for certain types of activities.
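
    For illustration, the first step of the AuthSub flow amounts to redirecting the user to a Google URL that asks for consent and then returns a token to the application. The sketch below only builds that request URL; the callback address and scope are placeholders.

    # Sketch: building a Google AuthSub request URL (the pre-OAuth
    # interface mentioned above). Callback and scope are placeholders.
    from urllib.parse import urlencode

    def authsub_request_url(next_url, scope):
        params = {
            "next": next_url,   # where Google redirects with the token
            "scope": scope,     # the Google service the app wants to access
            "session": "1",     # token may be exchanged for a session token
            "secure": "0",      # unsigned (non-secure) requests
        }
        return ("https://www.google.com/accounts/AuthSubRequest?"
                + urlencode(params))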

    Grammar manager

    The demo editor has a simple grammar manager that retrieves the user's grammars from a mySQL database via a ContentService implemented in Haskell, subject to a successful login through Google.

    Available operations in ContentService (a sketch of a client call follows the list):

    • login
    • update_grammar
    • delete_grammar
    • grammars (listing)
    • save
    • load (document from mysql db)
    • search
    • delete
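
    As an illustration of how these operations might be invoked, here is a minimal client sketch. The endpoint path and the parameter names (command, token, name, content) are assumptions made for illustration; the actual FastCGI interface should be checked against the content-service source.

    # Minimal sketch of a ContentService client. The endpoint path and
    # parameter names are hypothetical; check the content-service source.
    import requests

    BASE = "http://localhost:8888/fcgi-bin/content-service"

    def list_grammars(token):
        # 'grammars' lists the grammars stored for the signed-in user
        r = requests.get(BASE, params={"command": "grammars", "token": token})
        r.raise_for_status()
        return r.text

    def save_document(token, name, content):
        # 'save' stores a document in the MySQL database
        r = requests.post(BASE, data={"command": "save", "token": token,
                                      "name": name, "content": content})
        r.raise_for_status()
        return r.text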

    Document manager

    The demo editor has a simple file database manager that uploads the user's documents to and retrieves them from the MySQL database, using the same ContentService as the grammar manager.

    Term manager

    The demo editor has a simple treegrid editor for searching and editing translation correspondences from the web of data, including TermFactory services. It is not yet connected to the GF grammar back end. The management of lexical resources and ontologies is detailed in connection with the extended API below.

    Tree grid editor

    Editor

    The editor guides the text author by showing a set of fridge magnets and offers autocompletion to hint how a text can be continued within the limits of the current grammar. In the current version, there is a sign-in box and tabs for grammars, documents, editor, and terms, plus two more tabs to query and browse the loaded grammar.

    The prototype gives a first rough idea of how a web based GF translation editor could work. While this editor is implemented in JavaScript and runs entirely in the web browser, we do not expect to create a full implementation of the MOLTO translation tools that runs in the web browser, but will let the editor communicate with outside servers, including a TMS server (GlobalSight) and a GF server.

    The MOLTO Translation Tools Extended API

    User management

    For more flexibility (as well as vendor independence), an open source user management implementation based on LDAP (Lightweight Directory Access Protocol) can be used. There is one in GlobalSight. It allows distinguishing different roles and user groups, and controlling access to resources by roles.

    Document management

    The simple document manager of the demo editor will be complemented with a more sophisticated XLIFF based document manager built using the GlobalSight document management API. Document format conversions are everyday work in the translation business, and they can be assumed to be handled by the extended document manager, using XLIFF as the pivot format.

    XLIFF (XML Localisation Interchange File Format) is an XML-based format created to standardize localization. XLIFF was standardized by OASIS in 2002. Its current specification is v1.2, released on 1 February 2008. The XLIFF Technical Committee is currently at work on XLIFF 2.0. The specification is aimed at the localization industry. It specifies elements and attributes to aid in localization.

    XLIFF-aware open source editors and localization platforms include

    • Benten - an open source XLIFF editor written in Java.
    • OmegaT - a cross-platform and open source CAT tool.
    • Pootle - a web-based localisation platform.
    • Heartsome - a suite of cross-platform CAT (Computer-assisted translation) tools founded on open standards: XLIFF, TMX, TBX, SRX, XML, GMX. It also provides a free Lite version.
    • Swordfish III - a cross-platform CAT tool that uses XLIFF 1.2 as native format.
    • Virtaal - an open source CAT tool.
    • XTM - a highly collaborative web server based CAT environment with extensive support for XLIFF (1.0 through to 1.2) as well as an implementation of the OASIS OAXAL architecture.

    Examples of XLIFF Documents

    Example 1: A simple XLIFF file with strings extracted from a Windows RC file. Here the skeleton (the data needed to reconstruct the original file) is stored in a separate file:
    
    <?xml version="1.0" encoding="windows-1252" ?>
    <xliff version="1.1" xml:lang='en'>
     <file source-language='en' target-language='fr' datatype="winres"
      original="Sample1.rc">
      <header>
       <skl><external-file href="Sample1.rc.skl"/></skl>
      </header>
      <body>
       <group restype="dialog" resname="IDD_DIALOG1">
        <trans-unit id="1" restype="caption">
         <source>Title</source>
        </trans-unit>
        <trans-unit id="2" restype="label" resname="IDC_STATIC">
         <source>&amp;Path:</source>
        </trans-unit>
        <trans-unit id="3" restype="check" resname="IDC_CHECK1">
         <source>&amp;Validate</source>
        </trans-unit>
        <trans-unit id="4" restype="button" resname="IDOK">
         <source>OK</source>
        </trans-unit>
        <trans-unit id="5" restype="button" resname="IDCANCEL">
         <source>Cancel</source>
        </trans-unit>
       </group>
      </body>
     </file>
    
    </xliff>
    

    Example 2: an XLIFF document storing text extracted from a Photoshop file (PSD file) and its translation in Japanese:

    <xliff version="1.2">
     <file original="Graphic Example.psd"
      source-language="en-US" target-language="ja-JP"
      tool="Rainbow" datatype="photoshop">
      <header>
       <skl>
        <external-file uid="3BB236513BB24732" href="Graphic Example.psd.skl"/>
       </skl>
       <phase-group>
        <phase phase-name="extract" process-name="extraction"
         tool="Rainbow" date="20010926T152258Z"
         company-name="NeverLand Inc." job-id="123"
         contact-name="Peter Pan" contact-email="ppan@xyzcorp.com">
         <note>Make sure to use the glossary I sent you yesterday.
          Thanks.</note>
        </phase>
       </phase-group>
      </header>
      <body>
       <trans-unit id="1" maxbytes="14">
        <source xml:lang="en-US">Quetzal</source>
        <target xml:lang="ja-JP">Quetzal</target>
       </trans-unit>
       <trans-unit id="3" maxbytes="114">
        <source xml:lang="en-US">An application to manipulate and 
         process XLIFF documents</source>
        <target xml:lang="ja-JP">XLIFF 文書を編集、または処理
         するアプリケーションです。</target>
       </trans-unit>
       <trans-unit id="4" maxbytes="36">
        <source xml:lang="en-US">XLIFF Data Manager</source>
        <target xml:lang="ja-JP">XLIFF データ・マネージャ</target>
       </trans-unit>
      </body>
     </file>
    
    </xliff>
    

    XLIFF is bilingual (each translation unit offers a <source> and a <target> element). There are however ways to have multilingual XLIFF documents:

    • Each <file> element can have different source/target pairs.
    • The language of the data in the <alt-trans> element can be different from the main source/target languages. This allows alternative translations coming from a different language. For example: French (fr-FR) proposed translations could be offered when translating into French Canadian (fr-CA), and so forth.

    In the GF interlingual model, the source "language" can be the abstract syntax representation of a translation unit.

    The above considerations entail some requirements for translation-time document management in the MOLTO Translation Tools API:

    • Associated with the MOLTO Translation Tools API, there must be tools for extracting XLIFF content documents out of various types of original skeleton files, and for putting translated content back into the skeleton. (These tools are outside of MOLTO proper, both because many such tools already exist and because it is up to the provider of a new document type to also provide XLIFF support for it.)
    • There must be methods in the MOLTO Translation API for extracting raw text from XLIFF source elements, feeding it into GF, and inserting the translation into the XLIFF target element (see the sketch below). The GF translation API should also have methods for handling XLIFF coded inline tags. The best solution for that could be a special purpose GF grammar, because the correct placement of inline tags can depend on the translation of the content.
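
    As an illustration of the second requirement, the sketch below extracts the source text of each translation unit from an XLIFF 1.2 file and writes a translation back into the target element. It assumes the file declares the XLIFF 1.2 namespace; translate_with_gf is a placeholder for a call to the GF translation service, and inline tags are ignored.

    # Sketch: round-tripping plain text through an XLIFF 1.2 file.
    # translate_with_gf is a placeholder for the GF translation service;
    # XLIFF inline tags are not handled in this toy example.
    import xml.etree.ElementTree as ET

    XLF = "urn:oasis:names:tc:xliff:document:1.2"
    ET.register_namespace("", XLF)

    def translate_with_gf(text):
        return text  # placeholder: call the GF translation API here

    def translate_xliff(in_file, out_file):
        tree = ET.parse(in_file)
        for unit in tree.iter("{%s}trans-unit" % XLF):
            source = unit.find("{%s}source" % XLF)
            target = unit.find("{%s}target" % XLF)
            if target is None:
                target = ET.SubElement(unit, "{%s}target" % XLF)
            target.text = translate_with_gf(source.text or "")
        tree.write(out_file, encoding="utf-8", xml_declaration=True)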
    

    Lexical resources

    A key consideration for the usability of MOLTO translation is the ease with which its text coverage can be extended by a user community. We need to pay great attention to adaptability. The most important factor in extensibility is lexical coverage. Grammatical coverage can be developed and maintained with language engineering, and grammatical gaps can often be circumvented by paraphrasing. There are two cases to consider: either the abstract grammar misses concepts, or concrete grammars for some language/s are missing equivalents. In the first case, we need to extend the domain ontology and its abstract grammar. In the second case, we need to add terms.

    For ontology and term management, we propose to bring to MOLTO the TermFactory concept of ontology based terminology management. TermFactory is a system of distributed multilingual term ontology repositories maintained by a network of collaborative management platforms. It has been described at length in the TermFactory Manual at http://www.helsinki.fi/~lcarlson/CF/TF/doc/TFManual_en.xhtml.

    The user of the MOLTO translation editor has direct access to the treegrid editor for querying and editing term equivalents for concepts already in available ontologies, either already in TermFactory or 'raw' from the Web of Data, in particular, the OntoText services serving data from FactForge repository.

    Term management

    Say for instance there is no equivalent listed for cheese in some language's concrete grammar FooLang. The author/translator can use the treegrid editor to query for terms for the concept food:Cheese in TermFactory or do a search through OntoText services for candidate equivalents, or, if she knows the answer herself, submit equivalents through the treegrid editor. The new equivalent/s are saved in the user's own MOLTO lexicon, and submitted to TermFactory as term proposals for the community to evaluate.
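
    At its simplest, the candidate lookup behind such a query is a SPARQL label search. The sketch below asks a SPARQL endpoint for the labels of a concept in one language; the endpoint URL and the concept URI are placeholders, and the treegrid editor's actual queries are more elaborate.

    # Sketch: query a SPARQL endpoint (for example OntoText FactForge)
    # for the labels of a concept in one language. The endpoint URL and
    # concept URI below are placeholders.
    import requests

    def candidate_equivalents(endpoint, concept_uri, lang):
        query = """
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT DISTINCT ?label WHERE {
              <%s> rdfs:label ?label .
              FILTER (lang(?label) = "%s")
            }""" % (concept_uri, lang)
        r = requests.get(endpoint, params={"query": query},
                         headers={"Accept": "application/sparql-results+json"})
        r.raise_for_status()
        return [b["label"]["value"]
                for b in r.json()["results"]["bindings"]]

    # e.g. candidate_equivalents("http://factforge.net/sparql",
    #                            "http://dbpedia.org/resource/Cheese", "fi")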

    Ontologies

    If there is a conceptual gap not easily filled in through the treegrid editor, there is the option of forwarding the problem to an appropriate TermFactory collaborative platform. This route is slower, but the quality has a better guarantee in the longer run, as inconsistency or duplication of work may be avoided. Say there is no concept in the domain ontology for the new notion that occurs in the source text. In easy cases, new concepts can be added through the treegrid editor, subclassing some existing concept in the ontology. In more complex cases, where negotiations are needed in the community, an ontology extension proposal is submitted through a TermFactory wiki. TermFactory offers facilities for discussing and editing ontologies and their terms. In due time, the modified ontology gets implemented in a new release of the GF domain abstract grammar.

    Translation editing

    The translation editor demo is a good prototype, but different scenarios and platforms may call for different combinations of its features. One way to go is to extend the demo with further tabs and facilities for CAT tool support. But there is also the opposite alternative to consider: calling MOLTO translation tool services from a third party editor. GlobalSight has two built-in translation editors, called the popup editor and the inline editor. The popup editor is a Trados TagEditor lookalike, while the inline editor has something of the look and feel of old Trados versions running WYSIWYG on Microsoft Word. The inline editor has been implemented in JavaScript using the FCKeditor library. It might just be feasible to embed MOLTO demo editor functionalities into the GlobalSight editor(s). In the GlobalSight setup, there is already support for importing cut-and-dried MT translations from an MT service, but here we are talking about something rather more intricate.

    It is not immediately obvious which route would provide least resistance. From the point of view of GF usability, finding a neat way of embedding GF editing functions in third party translation editors could be a better sales position than trying to maintain a whole new MOLTO translation environment. (Unless of course, the new environment is clearly more attractive to targeted users than existing ones.) We may also try to have it both ways.

    Reviewing/feedback

    It was noted above that blind translation in the case of incomplete or inadequate coverage in resource grammars can occasion a round of reviewing and giving feedback on the translations before publication. This part of the process is in its main outlines familiar from the translation industry workflow, and can be implemented as a variation of it. In the MOLTO workflow, reviewer comments are not returned (just) to the human author/translator(s), but they should have repercussions in the ontology and grammar management workflows. This part requires modifying and extending the existing GlobalSight revisioning tools to communicate with the MOLTO lexical resources and grammar services. The GlobalSight revisioning tools now use email as the human-to-human communication channel. We probably want to use a webservice channel for machine-to-machine communication, and possibly some web commenting system as an alternative to email.

    Grammar engineering

    To the extent grammar engineering can be delegated to translation tool users, it must happen transparently without requiring knowledge of GF. One way to do this is through what is known as example-based grammar writing in GF. Example-based grammar writing is a new GF technique for backward-engineering GF source from example translations. It can play a significant role in the translation-to-grammar feedback cycle. This part of the TT API will be borrowed from the MOLTO Grammar Developer Tools API. See the last section of this document.

    5. Web services API for the MOLTO Translation Tools

    To develop the above outlined web-based translation environment further, or implement other usage scenarios, a web service interface to the MOLTO editor API will be useful. The interface consists of several parts.

    • Translation editor demo
    • Globalsight API (adapted for MOLTO)
    • TermFactory API (adapted for MOLTO)
    • OntoText API
    • MOLTO Grammar Tools API
    • Translation tools glue

    The editor demo is in the MOLTO darcs repository. The services provided by the GF server are outlined in the MOLTO Grammar Tools API document. The GlobalSight WS API was described above. The TermFactory web services are documented in the TF Manual at http://www.helsinki.fi/~lcarlson/CF/TF/doc/TFManual_en.xhtml#Services .

    The translation tools glue connects the different parts of the whole. It includes at least:

    • treegrid editor back end: answers queries to populate the editor from TF and Ontotext repositories, and communicates user additions to TF and the GF grammar editing services (see next section).
    • linking between translation editor/s, translation leveraging tools (TM/termbank) and GF services
    • linking between the MOLTO (GlobalSight) TT reviewing system and TermFactory
    • linking between the MOLTO (GlobalSight) TT reviewing system and GF grammar editing services
    • conversion between TermFactory ontology format and GF abstract grammar format (a sketch follows this list)
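
    To make the last item concrete, here is a toy sketch of one direction of the conversion: it reads the OWL classes of an ontology with rdflib and emits stub GF abstract syntax declarations for them. The naming scheme and the module layout are invented for illustration; a real conversion must also handle properties, multilingual labels, and name clashes.

    # Toy sketch: emit GF abstract syntax stubs for the OWL classes of
    # an ontology. Naming scheme and module layout are invented here.
    from rdflib import Graph, RDF, OWL

    def owl_classes_to_gf(ontology_file, module="Domain"):
        g = Graph()
        g.parse(ontology_file)
        lines = ["abstract %s = {" % module, "  cat Kind ;", "  fun"]
        for cls in g.subjects(RDF.type, OWL.Class):
            # use the local name of the class URI as the function name
            local = str(cls).rsplit("/", 1)[-1].rsplit("#", 1)[-1]
            if local.isidentifier():
                lines.append("    %s : Kind ;" % local)
        lines.append("}")
        return "\n".join(lines)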

    Here is a figure showing some of the connections in the design.

    Requirements on the GF grammar and translation APIs

    The above design generates a wishlist of requirements on the GF grammar and translation API.

    Assume the GF translation goes to a reviser, working with or without another copy of the MOLTO translation tool. The corrected translation, in XLIFF form, should be brought to GF's attention. This calls for a new functionality from the GF grammar API: one which corrects the grammar and lexicon software to produce the output required by the corrected translation. This functionality is to be built on the GF example-based grammar writing methodology.

    In order for the corrections to converge, revised translations must accumulate so that the newest corrections do not falsify earlier ones. The collection of manual corrections may become ambiguous or inconsistent, a situation that should be recognised and brought to the attention of a grammar engineer. Again, it is important to pay attention to user roles and rights.

    We may want to provide ways to override GF translations with canned translations. At the translation tools level, this can happen by preferring TM translations over GF output. We should also consider ways to override compositional translations at the GF grammar level.

    Another requirement is translation-time update of the grammar, or at least the lexicon, so that the translator can make on-the-fly lexicon additions.

    If we want to support translating formatted documents using XLIFF, the minimum requirement is that the GF translation API handles XLIFF inline tags.

    D3.2 MOLTO translation tools prototype


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D3.2 MOLTO translation tools prototype
    Security (distribution level): Public
    Contractual date of delivery: M24
    Actual date of delivery: March 2012
    Type: Prototype
    Status & version: Final
    Author(s): Lauri Carlson
    Task responsible: UHEL
    Other contributors: Thomas Hallgren, Krasimir Angelov, Seppo Nyrkkö, Lauri Alanko, Chunxiang Li, Inari Listenmaa


    Abstract

    Deliverable D3.2 consists of a prototype of the MOLTO translation tools, documentation of the translation scenario and instructions on the download and installation of the prototype.

    1. Introduction

    MOLTO promises a translation tool based on Grammatical Framework, a programming language for multilingual grammars. The Grammatical Framework (GF) is used in the MOLTO project to build translation systems for EU languages. The user of the MOLTO translation tool need not know how to write GF grammars. She is supposed to use domain specific grammars developed by others to translate documents in the domains covered by the grammars. GF resource grammars offer basic grammar coverage in dozens of languages. A domain specialist is supposed to write an abstract grammar for a domain based on an ontology of the domain that provides the key concepts and their relationships. Language specific grammar engineers are supposed to map the common abstract grammar to the different resource grammars. Basic domain language and coverage does not guarantee that all terms and idioms found in a translatable document are covered. To be really usable, the MOLTO translation tool should handle lexical gaps in a way that benefits and benefits from a wider community of translators. It should also provide fallback solutions when a text is not covered by the available grammar(s).

    This document builds on the Translation Tools API document, which lays out the translation scenario/s addressed by the prototype and describes the various programming APIs available for building the prototype. This document explains the installation, use, and limitations of the MOLTO translation tool prototype. The prototype integrates some but not yet all aspects of the MOLTO translation tools design in the TT API document. As in the API document, we single out a set of core tools for a standalone editor used by a single translator, as opposed to the extended set of tools designed to support MOLTO translation communities.

    The core MOLTO Translation Tool (TT) consists of these parts.

    • translation editor
    • grammar manager
    • document manager
    • term manager

    The core TT editor can be used standalone. It is being integrated with the rest of the MOLTO translation and ontology/terminology maintenance tools (the extended prototype), in particular the GlobalSight TMS and the TermFactory term ontology management, though in the prototype deliverable it is only loosely embedded in them. Eventually, both shall also play together with the GF grammar maintenance and development machinery. To help orientation, we recapitulate the intended workflow from the Translation Tools API document.

    1.1 The MOLTO Translation Workflow

    The MOLTO TT (translation tools) editor supports a one-person workflow where the same person is the author(ised editor) of the source and the translator. The adoption of the GlobalSight TMS to MOLTO allows embedding it in a more collaborative scenario where more actors are involved as in the professional workflow described in the API document, by adding traditional CAT and translation project support tools to the toolkit. A more difficult part is to adjust the workflow so that the adaptivity goal is satisfied. In the professional workflow, corrected translations accumulate in the translation memory, which helps translators avoid the same errors next time. In the MOLTO workflow, GF has an active role in generating translations, so it is GF that should learn from the corrections. Concretely, when a translator or reviser changes a wording, the correction should not go unnoticed, but should find its way to back to GF, preferably through a round of community checks. More generally improvements should be shared by the community, so that the whole community acts adaptively.

    Next, we describe one round of the ideal MOLTO translation scenario.

    Although it is possible that an author is ready to create and translate in one go (especially in a hurry), it is more normal to have some document(s) to start from. The document/s might be created in a GF constrained language editor in the first place. In that case, the only remaining step is translation. If translation coverage and quality have been checked, nothing more is needed. But frequently, some changes are needed to a previously translated document, or a new one is to be created from existing pieces and some new material. Imaginably, some of the parts come from different domains, and need to be processed with different grammars. Some such complications might be handled with document composition techniques in the manner of DocBook or DITA toolchains.

    The strength of GF is that it ought to handle grammatical variation of existing sources well, so as to avoid manual patching of previous translations. Assume there is a previously GF translated document, and we want to produce a variant. Then it ought to be enough to load the document, make desired changes to it under the control of the GF grammar, and let GF generate the modified translations.

    Is it necessary to show the translations to the user? Not unless the translator knows the target language(s). We should distinguish two profiles: blind translation, where the author does not know or is not responsible for the target languages herself, but relies on outside revision, and plain translation, where there are one or two target languages known to the author/translator, who wants to check the translations as she goes.

    In the blind profile, the author has to rely on revisers, and the revision cycle is slower. The revisers can either notify the author that the source does not translate correctly in their language(s), or they may notify the grammar/lexicon developer(s) directly, or both. If there is a hurry, the reviser/s should provide a correct translation directly for the author/publisher to use as canned text. In addition, they should notify the grammar developer/s of the revisions needed to GF. The notification/s could happen through messages, be conveyed through a shared translation memory, or both. In this slower cycle, it may not be realistic to expect the author to change the source text and repeat the revision process many times over for the same source, and possibly a multiplicity of languages, to get everything translated right before publication.

    In the plain profile, a faster cycle of revision is called for. The author/translator can try a few variations of the input. If no variant seems to work, then she probably wants to use her own translation, but also to make sure that GF learns of and from the failure. The failure can be a personal preference, or a general fix that the community should profit from. If it is a personal preference, the user may want to save the corrected translation in her translation memory and/or glossary, but she may also want to tweak her GF grammar to handle this and similar cases to her liking next time. If it is just a lexical gap or a missing fixed idiom, then the GF translation API should offer a service to modify the grammar without knowing GF. The modifications could happen at different levels of commitment. The most direct one would be to provide a modular PGF format which would allow advising the compiled user grammar on the fly. Such a runtime fix would make sure that the same error does not recur during the same translation session or subsequent ones, at least until the domain grammar is recompiled.

    The next level of commitment to a change would be to generate new GF source, possibly from example translations provided by the author/translator, compile them, and add the changed or extra modules to the user's GF grammar. The cycle involved here might be too slow to do during translation, but it could happen between translation sessions. If fully automatic grammar revision is too error prone, the author/translator could just go on with canned translations in this session, and commit change requests to the grammar developer community. In this case, the changes would be carried out in good time, with regression tests and multilingual revision cycles, especially if the changes affect the domain semantics (abstract grammar) and thereby all translation directions.

    Here is a figure of the overall design.

    [Figure: MOLTO WP3 design]

    2. The MOLTO Translation Tools Architecture

    The MOLTO Translation Tools architecture is recapitulated here briefly. It consists of many largely independent components. There is a core, basically answering the needs of a single author/translator, and an extended set of tools addressing the needs of a community of authors, translators, and grammar engineers.

    The components of the MOLTO TT editor prototype currently include the following:

    1. sign in
    2. grammar manager
    3. document manager
    4. term manager
    5. translation editor

    The components of the MOLTO TT extended prototype include the following:

    1. user management
    2. grammar management
    3. document management
    4. lexical resources
    5. translation editing
    6. translation memory
    7. reviewing/feedback
    8. grammar engineering

    The first five are extensions of the corresponding facilities in the core. The lexical resources API borrows from TermFactory. The translation memory and the reviewing/commenting facilities are adapted from GlobalSight. The last item is based on the GF grammar development tools API.

    3. The MOLTO Translation Tools Prototype

    This section describes the code constituting the prototype. The code base of the translation tools extended prototype currently consists of the following parts.

    • MOLTO TT editor
    • GlobalSight (adapted for MOLTO)
    • TermFactory (adapted for MOLTO)
    • OntoText API
    • MOLTO Grammar Tools API

    This document describes the prototype's software packages, their installation, use and current limitations. The last two components are not discussed further in this document, because they are described in other MOLTO deliverables. The services currently provided by the GF server are outlined in the MOLTO Grammar Tools API document. The GlobalSight WS API was described in the MOLTO Translation Tools API document. TermFactory is documented at length in the TermFactory manual at http://www.helsinki.fi/~lcarlson/CF/TF/doc/TFManual_en.xhtml .

    3.1 The MOLTO Translation Tools (TT) Editor

    This section describes the GF translation editor originally developed by Bringert and Angelov at UGOT and reworked at UHEL.

    To guide the development of a suitable translation editor API to support MOLTO translation needs, UGOT created a prototype web-based translation editor. It is implemented using the Google Web Toolkit and usable for authoring with small multilingual grammars. To use it from the web, all that is needed is a reasonably modern web browser. To install it locally, one needs in addition a web server, a MySQL database and the GF services.

    The editor runs entirely in the web browser, so once you have opened the web page and have documents and grammars loaded, you can continue translation editing while you are offline.

    3.1.1 Software requirements

    In order to install the editor, you need to have the following components:

    1. The editor code itself (in the eclipse package)
    2. For developer version only:
      • Eclipse Helios JEE (3.6)
      • Google Web Toolkit plugin (tested with version 2.3.1)
    3. Web server
      • Apache (tested with 2.2.14 on Ubuntu)
      • FastCGI (libapache2-mod-fastcgi)
    4. Database
      • HSQL (tested with version 1.8.1)
      • HSQL-MySQL (1.8.1) -- a slightly modified version: hsql-mysql-1.8.1-molto.zip
      • MySQL server (tested with 5.1.54 and 5.1.62)
    5. GF server
      • GF (tested with 3.3.3)
      • Haskell (tested with ghc 7.0.4, cabal-install 0.10.2)

    In this section we assume that the user has the Apache, MySQL and GF server configurations done. Please see the Appendix for instructions on background settings.

    3.1.2 Installation

    3.1.2.1 Developer version

    The prototype TT editor code is packaged as an Eclipse project archive http://tfs.cc/molto/molto-tt-0.9-linux-eclipse-20120529.zip ready for import in Eclipse (Helios).

    Import the project in Eclipse. You need the Google Web Toolkit plugin (tested with version 2.3.1). The runtime editor files are found in TT-0.9/www/editor/. To install the runtime, place the following files under the Apache2 server root (here /var/www) as shown.

    /var/www/editor$ ls 
    grammars  index.html  org.grammaticalframework.ui.gwt.EditorApp  WEB-INF
    

    When you have placed the files under /var/www, then you can launch the project in Eclipse. Choose from the menu Run -> Run configurations -> Web Application -> (new configuration). In the tab Server untick Run built-in server. If you have put the files in directory /var/www/editor, then the launch address will be 127.0.0.1:8888/editor/index.html?gwt.codesvr=127.0.0.1:9997.

    Web server: The Apache2 fastcgi and action modules must be enabled for the services. See the installation notes at the end for a sample Apache2 virtual host configuration that handles the services from port 8888 (the default).

    GF server: The editor also requires an installation of the GF server. The server binaries are content-service (for authentication and simple MySQL database management) and pgf-service (for GF grammars). When compiling, the cabal option --global should be used; then the GF service binaries get installed in /usr/local/bin. They can be copied/linked under the web server's (by default Apache2) fcgi-bin directory as follows.

    /var/www/fcgi-bin$ ls -l 
    content-service -> /usr/local/bin/content-service
    pgf-service -> /usr/local/bin/pgf-service
    

    Database: The TT editor back end requires an installation of MySQL, HSQL and a Haskell library hsql-mysql by Krasimir Angelov. Further instructions how to create a database for MOLTO TT tools are in the installation notes.

    The content service needs to read the MySQL database connection parameters from the file /usr/local/bin/fpath. It should be in the same directory as content-service and contain four tokens: the MySQL host name, the database name, and the database owner's username and password.

    /usr/local/bin$ cat fpath
    localhost moltodb moltouser moltopass
    

    Then, the database is created by typing the following:

    /usr/local/bin$ ./content-service fpath
    


    Sign in: The prototype editor currently uses the Google authentication API for sign in. Authentication and authorization for Google APIs allow third-party applications to get limited access to a user's Google accounts for certain types of activities. A user needs to have a Google account to sign in to the application.

    3.1.2.2 User version

    All the back-end requirements of the developer version are needed for the user version as well. Now, instead of opening the package in Eclipse, the only thing needed is to place the following files under the Apache2 server root (here /var/www) as shown.

    /var/www/editor$ ls 
    grammars  index.html  org.grammaticalframework.ui.gwt.EditorApp  WEB-INF
    

    Then, to run the editor, just type the address 127.0.0.1:8888/editor/index.html?gwt.codesvr=127.0.0.1:9997 into the browser.

    3.1.2.3 Limitations

    Ideally, the same login should work throughout the different parts of the distributed toolkit. There should be some group scheme to set group level access restrictions. Eventually, we may want to provide MOLTO single-sign-on as a replacement for Google authentication.

    3.1.3 Grammar manager

    The prototype editor has a simple grammar manager that is supposed to allow a user to upload her grammars to the editor's grammar cache under her name. The cache is kept on the editor server for reasons of speed and XSS restrictions. The user chooses the current grammar from among the cached grammars using a drop-down list.

    3.1.3.1 Limitations

    The grammar manager is not yet completed.

    3.1.4 Document manager

    The prototype editor has a simple document manager that saves a translated document in and retrieves one from the MySQL database using ContentService. The current document is saved in the database using a diskette icon on the editor page. The Documents tab shows the currently saved documents and allows the user to load a selected document for continued translation.

    3.1.4.1 Limitations

    Naming of documents is not yet supported. Both the grammar manager and document manager remain to be linked to the TMS.

    3.1.5 Term manager

    The TT editor includes a simple tabular equivalents editor for searching and editing translation correspondences from the web of data, including TermFactory services. The equivalents editor is an independent web application that may also be used standalone or as a plugin to other applications. When complete, the equivalents editor lets the user extend their GF grammars with terms entered in the term editor and/or upload them as term proposals to TermFactory.

    3.1.5.1 Installation

    The equivalents editor was built with the ExtJS JavaScript library. It can be downloaded from http://tfs.cc/molto/molto-term-editor.tgz. Unpack it and put the whole molto_term_editor directory under /var/www/ (or wherever your web server serves files from; on Windows, for example, the path is typically C:\Program Files\Apache\htdocs). Open the file editor_sparql.html in a browser.

    Note that this is also included in the complete editor as one of the tabs. As for function, the versions are identical. The screenshot below is from the standalone version.

    3.1.5.2 Use

    The term editor consists of two tabular grids. In the first (left side) grid, enter a term in the text input and opt for wider or narrower concepts. With narrower concepts (the default), the editor shows on the right another grid of concepts that are classed narrower than the search term in the data source (by default, OntoText FactForge), together with their designations in a predefined selection of languages. With wider concepts, the editor fills out the left side grid with concepts that are classed in the data source as wider than the search term; clicking on one of them does a search for its subconcepts and terms, shown in the right side grid.

    The term grid is editable and the editor remembers the user's edits to the cells in the grid.

    3.1.5.3 Limitations

    The data source and choice of languages are not yet user definable. The editor is not yet connected to the TermFactory or GF grammar back ends.

    3.1.6 Editor

    In the current version, there is a sign-in box and tabs for grammars, documents, editor, and terms, plus two more tabs to query and browse the loaded grammar. The latter services are familiar from other GF front ends and are based on the GF grammar Web API.

    3.1.6.1 Use

    After sign in, the editor calls content-service to show the logged in user's grammars from the grammarusers MySQL table in the grammar list. The user chooses a domain grammar. This brings to view the initial vocabulary known by the grammar as fridge magnets to choose from. Alternatively, the user can type or paste text in the editor window. At every new input, the active translation unit is sent to the back end for translation, and the set of fridge magnets is updated. When a translation unit is complete and translatable, it is simultaneously translated to all the available languages and the translations are shown on the screen (in blue). If an input is not parsable, the editor underlines the unparsable part. The user can back off to the point of deviation using backspace. In addition, there is a button for clearing the input.
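
    To give an idea of the traffic between the editor and the back end, the sketch below issues translation and completion requests in the style of the GF web service API, with command, from, to and input parameters. The URL layout and the grammar and language names are placeholders; the exact interface of pgf-service should be checked against its documentation.

    # Sketch of editor-to-back-end calls in the style of the GF web
    # service API. URL, grammar and language names are placeholders.
    import requests

    GRAMMAR_URL = "http://localhost:8888/grammars/Foods.pgf"

    def translate(text, src, tgt):
        # translate the active translation unit into one target language
        r = requests.get(GRAMMAR_URL, params={
            "command": "translate", "input": text, "from": src, "to": tgt})
        r.raise_for_status()
        return r.json()

    def complete(prefix, src):
        # fetch the next possible words ("fridge magnets") for a prefix
        r = requests.get(GRAMMAR_URL, params={
            "command": "complete", "input": prefix, "from": src})
        r.raise_for_status()
        return r.json()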

    The editor guides the text author by showing a set of fridge magnets and offers autocompletion to hint how a text can be continued within the limits of the current grammar.

    3.1.6.2 Limitations

    The prototype gives a first rough idea of how a web based GF translation editor could work. At present, however, it remains oriented to a very small vocabulary (fridge magnets are not apt to work well with thousands of words). It is also doubtful that the setup is fast enough for the degree of interactivity required at professional translation speeds. A reconsideration of how the editor and the back end best play together is indicated. A related limitation is the strict left-to-right orientation of the parsing. UGOT seems to be working on a robust parser which would allow other ways of combining parsing and editing. The proper disposition of the translation result is not worked out yet.

    3.2 The extended translation tools prototype

    We now move on to the extended prototype. We first recapitulate how the extended translation tools extend the one-translator scenario to a community of translators collaboratively using and maintaining MOLTO translation tools.

    3.2.1 User management

    For more flexibility (as well as vendor independence), the open source user management implementation based on LDAP (Lightweight Directory Access Protocol) from GlobalSight has been adapted for MOLTO. It allows distinguishing different roles and user groups, and controlling access to resources by roles. The GlobalSight user management solution has been conservatively extended for the needs of MOLTO TermFactory users. The following screenshot displays a user's roles as an ontology editor.

    Term ontology management roles are defined per domain, where a domain is represented by a regular expression on ontology URIs. The MOLTO GlobalSight user management system lets a company project administrator create users and grant them MOLTO TermFactory ontology read and write permissions. The TermFactory back end GateService reads the permissions off the GlobalSight LDAP directory and database and controls access to TermFactory content accordingly. If a user's credentials are not sufficient, TermFactory Gate will not permit term ontology queries or commits. The MOLTO permissions come over and above any constraints that ontology endpoints may impose on the content they manage. They enable fine grained project level control on who is allowed to do what to shared or restricted TermFactory resources.

    3.2.2 Document management

    The simple document manager of the prototype editor remains to be upgraded to a more sophisticated XLIFF based document manager built using the GlobalSight document management API. See the MOLTO TT API document for more detail.

    3.2.3 Lexical resources

    A key consideration for the usability of MOLTO translation is the ease with which its text coverage can be extended by a user community. We need to pay great attention to adaptability. The most important factor in extensibility is lexical coverage. Grammatical coverage can be developed and maintained with language engineering, and grammatical gaps can often be circumvented by paraphrasing. In contrast, paraphrasing is not a real option for special domain terms. There are two cases to consider: either the abstract grammar misses concepts, or concrete grammars for some language/s are missing equivalents. In the first case, we need to extend the domain ontology and its abstract grammar. In the second case, we need to add terms.

    For both ontology and term management, we bring to MOLTO the TermFactory concept of ontology based terminology management. TermFactory is a system of distributed multilingual term ontology repositories maintained by a network of collaborative management platforms. It has been described at length in the TermFactory Manual at http://www.helsinki.fi/~lcarlson/CF/TF/doc/TFManual_en.xhtml.

    The user of the MOLTO translation editor has direct access through the equivalents editor to querying and editing term equivalents for concepts already in available ontologies, either already in TermFactory or 'raw' from the Web of Data, in particular, the OntoText services serving data from FactForge repository.

    3.2.3.1 Term management

    Say for instance there is no equivalent listed for cheese in some language's concrete grammar FooLang. The author/translator can use the equivalents editor to query for terms for the concept food:Cheese in TermFactory or do a search through OntoText services for candidate equivalents, or, if she knows the answer herself, submit equivalents through the equivalents editor. The new equivalent/s are saved in the user's own MOLTO lexicon, and submitted to TermFactory as term proposals for the community to evaluate.

    3.2.3.2 Ontologies

    If there is a conceptual gap not easily filled in through the equivalents editor, there is the option of forwarding the problem to an appropriate TermFactory collaborative platform. This route is slower, but the quality has a better guarantee in the longer run, as inconsistency or duplication of work may be avoided. Say there is no concept in the domain ontology for the new notion that occurs in the source text. In easy cases, new concepts can be added through the equivalents editor, subclassing some existing concept in the ontology. In more complex cases, where negotiations are needed in the community, an ontology extension proposal is submitted through a TermFactory wiki. TermFactory offers facilities for discussing and editing ontologies and their terms. In due time, the modified ontology gets implemented in a new release of the GF domain abstract grammar.

    3.2.3.3 Ontology-grammar interface

    TermFactory ontologies are extensible and support reasoning. Instead of implementing domain ontology-to-grammar bridges over and again for every new domain and application, it seems more promising to take advantage of the semantic network structure of (term) ontologies. Suppose verbalizations are already defined for a selection of upper or middle level ontologies. Special domain ontologies can subclass them and thereby also inherit the verbalizations that go with the superclasses and properties. UHEL is currently looking at the generalization of the MOLTO museum case ontology-to-grammar mapping in this direction.
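
    A toy sketch of the idea follows, under the assumption that verbalizations are attached to upper ontology classes with a hypothetical tf:verbalization property: a domain class without a verbalization of its own inherits the nearest one up its rdfs:subClassOf chain.

    # Toy sketch: inherit verbalizations along rdfs:subClassOf links.
    # The tf:verbalization property and its namespace are hypothetical.
    from rdflib import Graph, Namespace, RDFS

    TF = Namespace("http://example.org/tf#")  # placeholder namespace

    def verbalization(g, cls, seen=None):
        seen = seen if seen is not None else set()
        if cls in seen:       # guard against subclass cycles
            return None
        seen.add(cls)
        v = g.value(cls, TF.verbalization)
        if v is not None:
            return v
        for sup in g.objects(cls, RDFS.subClassOf):
            v = verbalization(g, sup, seen)
            if v is not None:
                return v
        return None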

    3.2.4 Translation editing

    The TT translation editor is just a prototype. Different scenarios and platforms may call for different combinations of its features. One way to go is to extend the prototype with further tabs and facilities for CAT tool support. But there is also the opposite alternative to consider: calling MOLTO translation tool services from a third party editor. GlobalSight has two built-in translation editors, called the popup editor and the inline editor. The popup editor is a Trados TagEditor lookalike, while the inline editor has something of the look and feel of old Trados versions running WYSIWYG on Microsoft Word. The inline editor has been implemented in JavaScript using the FCKeditor library. It might just be feasible to embed MOLTO prototype editor functionalities into the GlobalSight editor(s). In the GlobalSight setup, there is already support for importing cut-and-dried MT translations from an MT service, but here we are talking about something rather more intricate.

    It is not immediately obvious which route would provide least resistance. From the point of view of GF usability, finding a neat way of embedding GF editing functions in third party translation editors could be a better sales position than trying to maintain a whole new MOLTO translation environment. (Unless of course, the new environment is clearly more attractive to targeted users than existing ones.) We may also try to have it both ways.

    3.2.5 Reviewing/feedback

    It was noted above that blind translation in the case of incomplete or inadequate coverage in resource grammars can occasion a round of reviewing and giving feedback on the translations before publication. This part of the process is in its main outlines familiar from the translation industry workflow, and can be implemented as a variation of it. In the MOLTO workflow, reviewer comments are not returned (just) to the human author/translator(s), but they should have repercussions in the ontology and grammar management workflows. This part requires modifying and extending the existing GlobalSight revisioning tools to communicate with the MOLTO lexical resources and grammar services. The GlobalSight revisioning tools now use email as the human-to-human communication channel. We probably want to use a webservice channel for machine-to-machine communication, and possibly some web commenting system as an alternative to email.

    3.2.6 Grammar engineering

    To the extent grammar engineering can be delegated to translation tool users, it must happen transparently without requiring knowledge of GF. One way to do this is through what is known as example-based grammar writing in GF. Example-based grammar writing is a new GF technique for backward-engineering GF source from example translations. It can play a significant role in the translation-to-grammar feedback cycle. This part of the TT API will be borrowed from the MOLTO Grammar Developer Tools API.

    The following sections describe what parts of the above list are already in place in the prototype and what remains to do.

    3.3 GlobalSight

    GlobalSight (http://www.globalsight.com/) is an open source Translation Management System (TMS) released under the Apache License 2.0. Version 8.2 was released on 15 September 2011. As of version 7.1 it supports the Localization Industry Standards Association standards TMX and SRX 2.0. It is developed in the Java programming language and uses the MySQL database and OpenLDAP directory software. GlobalSight also supports computer-assisted translation and machine translation.

    According to the documentation, GlobalSight has the following features:

    • Customizable workflows, created and edited using graphical workflow editor
    • Support for both human translation and fully integrated machine translation (MT)
    • Automation of many traditionally manual steps in the localization process, including: filtering and segmentation, TM leveraging, analysis, costing, file handoffs, email notifications, TM update, target file generation
    • Translation Memory (TM) management and leveraging, including multilingual TMs, and the ability to leverage from multiple TMs
    • In-Context Exact matching, as well as exact and fuzzy matching
    • Terminology management and leveraging
    • Centralized and simplified Translation memory and terminology management
    • Full support for translation processes that utilize multiple Language Service Providers (LSPs)
    • Two online translation editors
    • Support for desktop Computer Aided Translation (CAT) tools such as Trados
    • Cost calculation based on configurable rates for each step of the localization process
    • Filters for dozens of filetypes, including Word, RTF, PowerPoint, Excel, XML, HTML, Javascript, PHP, ASP, JSP, Java Properties, Frame, InDesign, etc.
    • Concordance search
    • Alignment mechanism for generating Translation memory from previously translated documents
    • Reporting
    • Web services API for programmatic access to GlobalSight functionality and data
    • Integrated with Asia Online APIs for automated translation

    3.3.1 Installation

    The latest full Linux install version of GlobalSight is 7.1.0.x. It can be updated to the current version 8.2.2.0 using publicly available upgrade packages. The GlobalSight 7.2.0.0 base version and the upgrade packages are available from SourceForge. (Copies are available from tfs.cc under /srv/GlobalSight_backup/upgrade.) More detailed install instructions, including scripts to install LDAP for GlobalSight, can be found at http://tfs.cc/globalsight-molto-install/. A fully functional GlobalSight site also needs access to email services.

    To upgrade from a working install of GlobalSight 8.2.2.0 to MOLTO GlobalSight, download, unpack and run http://tfs.cc/molto/GlobalSight_Installer_8.2.2.1.zip.

    There is also a complete MOLTO GlobalSight eclipse project archive at http://tfs.cc/molto/molto-globalsight-8.2.2.1-linux-eclipse-20120529.zip containing the source as well as the runtime.

    3.3.2 MOLTO GlobalSight

    MOLTO GlobalSight differs from GlobalSight out of the box in two ways. First, MOLTO GlobalSight extends the MOLTO user roles to terminology editing; this is discussed in more detail below in connection with TermFactory. Second, GlobalSight has two built-in translation editors, called the popup editor and the inline editor. The popup editor is a Trados TagEditor lookalike, while the inline editor has something of the look and feel of old Trados versions running WYSIWYG on Microsoft Word. The inline editor has been implemented in JavaScript using the FCKeditor library. MOLTO GlobalSight extends the selection by embedding the MOLTO TT editor as a third option on the editor menu.

    Clicking the option opens the MOLTO TT editor in another window.

    3.3.2.1 Limitations

    As yet, content from the document under translation is not automatically imported into the MOLTO TT editor; it must be cut and pasted in.

    3.4 TermFactory

    The MOLTO TermFactory prototype consists of the generic TermFactory codebase plus MOLTO related ontology content. At present, such content comprises the English-Finnish WordNet ontology. Integration of the TermFactory back-end with the MOLTO KRI over JMS is underway.

    The TermFactory codebase consists of

    • a term ontology query/editing back end run as an Axis2 Tomcat web service
    • a Tomcat webapp that provides standalone term ontology query form and editor
    • a MediaWiki installation with a TermFactory editor extension
    • a link to the Disqus comment system

    TermFactory is an architecture and a workflow for Semantic Web based, multilingual, collaborative terminology work. In practice, this means that it applies Semantic Web and other document and language technology standards to the representation of multilingual special language terms and the related concepts, and provides a plan for how such terminologies can be collected, updated, and agreed upon by professionals (not only terminology professionals) all over the globe, during their everyday work on virtual work platforms over the web. As a whole, TF could be termed a semantic web framework for multilingual terminology work.

    TF provides

    • ontology and terminology formats
    • format conversions
    • query and edit tools
    • repositories
    • web services

    for people to work on terms jointly or separately, building on the results of the work of others, while maintaining quality and consistency between the different contributions.

    As a prototype, there is a MediaWiki platform for human-to-human collaboration on collecting terminological data, plus a TF editor plugin for conveying the results of the collaboration into TermFactory ontology format. Here is a snapshot of a random MOLTO TF concept in the Wiki.

    [Screenshot: MOLTO Wiki TF Editor page]

    3.4.1 Use

    MOLTO TermFactory MediaWiki is used in the usual way a wiki is used. In the demo prototype, it has been populated with the Finnish-English WordNet (ca. 100K concepts, 2 languages, ca. 200K terms per language). The pages are generated automatically on demand. A WordNet page currently consists only of a set of iframes and links to related lexical resources on the web. In actual use, each category (WordNet is one) may generate its own boilerplate page design to help users describe and discuss the concepts of a category and their designations in different languages. A commenting system is in place that can be shared between different platforms and applications. The discussion threads are indexed by the URI of the relevant resource.

    The TermFactory ontology content related to a resource can be queried and edited on the Mediawiki platform using a TermFactory ontology editor extension, shown on top of the page as the Entry Editor tab. Below is a snapshot showing the TF editor opened to the TermFactory entry corresponding to the chosen WordNet term.

    MOLTO Wiki TF Editor page

    Instead of going by way of fill-in forms, the TermFactory approach is to support direct WYSIWYG editing of localized ontology triples in an HTML textarea editor. The TermFactory editor application uses the CKEditor JavaScript textarea editor for this purpose. TF adds to the standard CKEditor release a special-purpose plugin that provides TermFactory-specific action buttons and a menu.

    While staying conceptually close to the original RDF format of the data, the TermFactory editor layout is quite versatile. With suitable parameters, it can be tweaked to show ontology content editable in shapes already familiar to professional terminologists. There is a customisable, schema-aware insertion menu to help insert relevant content, plus customisable input and output layout templates. The editor is not limited to TermFactory ontologies, as it is built on a general purpose textarea editor using a generic RDF to HTML mapping.

    A specialty of TermFactory is that it supports terminological reflection: the metaterminology used in the editor is not fixed, but can be changed by giving the editor a TF term ontology as a parameter. Using TF localization and bridge ontologies, not only the editor interface but also the content shown can be localized to a user community's conceptualization, language and terminology. Here is the same editor page fetched with the MediaWiki language settings set to Finnish. Note how the terminological metalanguage used in the entry is now shown in Finnish. (The localization is not complete, because the current localization ontology's coverage has some gaps.)

    MOLTO Wiki TF Editor page

    3.4.2 Installation

    The TermFactory source code is kept under svn at svn.it.helsinki.fi/repos/termfactory. A username and password on the repository server are needed for checkout. To check out a path, choose an installation directory, go to it, and run svn checkout https://<username>@svn.it.helsinki.fi/repos/termfactory/path .
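
    For illustration, checking out the front-end extension path mentioned below might look like this (the installation directory /opt/termfactory is an arbitrary choice):

    cd /opt/termfactory
    svn checkout https://<username>@svn.it.helsinki.fi/repos/termfactory/fe/TermFactory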

    The compiled web archive files for TF are

    io/lib/tf-io.jar            The core library (offline tools)
    ws/service/TFServices.aar   The Axis2 webservice archive
    ws/servlet/TermFactory.war  The Tomcat webapp archive
    

    These three archives should be enough to deploy TF from binaries on Linux under Tomcat running Axis2. Installations of MySQL and Jena TDB are needed for persistent storage of ontologies on the TermFactory server. File upload services require a prior installation of WebDAV. Detailed TF source build and install instructions are available on request.

    TermFactory MediaWiki is MediaWiki out of the box plus the TermFactory MediaWiki extension, downloadable from the TermFactory svn path fe/TermFactory. The extension requires the TF back end to be installed. The steps are collected in the sketch after the list below.

    • Install MediaWiki 1.16 (or newer)
    • Put everything under extensions/TermFactory
    • Add require("$IP/extensions/TermFactory/TermFactory.php"); to LocalSettings.php in the main directory
    • Go to the page Special:EditTerm
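
    Collected as shell commands, the installation might look like this (a sketch: the MediaWiki root /var/www/mediawiki is an assumption, substitute your own):

    # Fetch the extension into the MediaWiki extensions directory
    cd /var/www/mediawiki/extensions
    svn checkout https://<username>@svn.it.helsinki.fi/repos/termfactory/fe/TermFactory TermFactory
    # Register the extension with MediaWiki
    echo 'require("$IP/extensions/TermFactory/TermFactory.php");' >> ../LocalSettings.php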

    3.4.3 Limitations

    User management between MediaWiki, TermFactory services, and TermFactory WebDAV is not yet fully in sync.

    4. Integration

    This section comments on the current status of the integration of the different parts.

    • Linking between the MOLTO TT translation editor and the GlobalSight document management and translation process. This linking is important for usability, but is closely connected to the further development of the TT editor and the MOLTO GF back end. Alternatives range from cut and paste (already supported) to automatic processing of XML/XLIFF document elements marked for MOLTO translation.
    • Linking between the equivalents editor and TermFactory, in particular
      • TermFactory back end to answer equivalents editor queries to populate the editor from TF repositories
      • TermFactory back end to allow equivalents editor to communicate user additions to TF
      A JSON i/o API for converting term equivalent tables to and from TermFactory term ontology format is in place in TermFactory, so this item is almost complete.
    • Communication between the equivalents editor, TermFactory and the GF grammar services, in particular,
      • conversion of lexical entries between TermFactory ontology format and GF abstract grammar format
      • updating of GF grammars with user defined entries
      A mapping between MOLTO grammars and ontology formats has been defined by UGOT for the Museum case. UHEL has started generalizing the solution.
    • Linking between translation editor/s, translation leveraging tools (TM/termbank) and GF services, in particular
      • linking between the MOLTO (GlobalSight) comment/review system and TermFactory
      • linking between the MOLTO (GlobalSight) comment/review system and GF grammar editing services
      These steps are less urgent, so they are left for last.

    Here is a figure showing some of the connections in the design.

    Missing links

    5. Requirements on the GF grammar and translation APIs

    This section repeats the wishlist of requirements from Translation Tools on the GF grammar and translation API.

    Assume the GF translation goes to a reviser, working with or without another copy of the MOLTO translation tool. The corrected translation, in XLIFF form, should be brought to GF's attention. This calls for new functionality in the GF grammar API: functionality that corrects the grammar and lexicon so that they produce the output required by the corrected translation. This functionality can be built on the GF example-based grammar writing methodology.

    In order for the corrections to converge, revised translations must accumulate so that the newest corrections do not falsify earlier ones. The collection of manual corrections may become ambiguous or inconsistent; this situation should be recognised and brought to the attention of a grammar engineer. Again, it is important to pay attention to user roles and rights.

    We may want to provide ways to override GF translations with canned translations. At the translation tools level, this can happen by preferring TM translations over GF ones. We should also consider ways to override compositional translations at the GF grammar level.

    Another requirement is translation-time updating of the grammar, or at least the lexicon, so that the translator's on-the-fly lexicon additions are possible.

    If we want to support translating formatted documents using XLIFF, the minimum requirement is that the GF translation API handles XLIFF inline tags.
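
    To make the requirement concrete, here is a hypothetical XLIFF 1.2 segment with inline tags; the invented Finnish target shows how the <g> elements would have to be carried through parsing and linearization:

    <trans-unit id="1">
      <source>Press the <g id="1" ctype="bold">Start</g> button.</source>
      <target>Paina <g id="1" ctype="bold">Start</g>-painiketta.</target>
    </trans-unit>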

    Appendix: Developer notes

    The complete MOLTO TT prototype editor code is downloadable as an Eclipse (Helios) project archive from http://tfs.cc/molto/molto-tt-0.9-linux-eclipse-20120529.zip. The TT editor's database back-end Haskell source code is packaged as http://tfs.cc/molto/molto-tt-server-0.9-linux-20120529.zip.

    Apache web server settings

    Install Apache and FastCGI with apt-get: sudo apt-get install apache2 libapache2-mod-fastcgi

    Below is a sample Apache2 virtual host that serves the MOLTO TT back-end services from port 8888 (the default). The back-end server is assumed to be in the same domain as the editor, to avoid cross-domain scripting violations. Copy the text below to /etc/apache2/sites-available/default:

    <VirtualHost *:8888>
        ServerAdmin webmaster@localhost
    
        DocumentRoot /var/www
        AddDefaultCharset UTF-8
    
        <Directory />
            Options FollowSymLinks
            AllowOverride None
        </Directory>
        <Directory /var/www/>
            Options Indexes FollowSymLinks MultiViews
            AllowOverride None
            Order allow,deny
            allow from all
        </Directory>
    
            # Allow fastcgi services from fcgi-bin. 
    
        <Directory "/var/www/fcgi-bin/">
            Options +ExecCGI 
            AddDefaultCharset UTF-8
            SetHandler fastcgi-script
        </Directory>
    
            # Identify pgf-service as a fastcgi server
            FastCgiServer /var/www/fcgi-bin/pgf-service 
    
            # Identify content-service as a fastcgi server
            FastCgiServer /var/www/fcgi-bin/content-service 
    
            # Make action pgf-service handle pgf files
        Action pgf-service /fcgi-bin/pgf-service
        AddHandler pgf-service .pgf
        AddCharset UTF-8 .pgf
    </VirtualHost>
    

    After you have copied the above to /etc/apache2/sites-available/default, activate the changes:

    sudo a2enmod fastcgi
    sudo a2enmod actions
    

    Finally, restart apache by typing sudo service apache2 restart.

    Some attested problems

    1. The server doesn't start, and the error message is just "fail". If you copy and paste conf files, make sure they don't override any previous confs you might have.
    2. "Permission denied" error messages when trying to access (chmod 775) files in /var/www.
      The solution was to change the group ownership from root to the web server group: sudo chown -R root:www-data /var/www/ (the name of the group might vary; you can find it by checking which system user has /var/www as its home directory: grep "/var/www" /etc/passwd).

    GF server settings

    Get the latest sources for GF from the darcs repository. The instructions are here: http://www.grammaticalframework.org/download/index.html

    Assuming you have the source files, go to src/server and type sudo cabal install -f content --global. It is important to use the option --global, because by default the binaries are installed in the home directory, and that does not go well with Apache. The binaries will be installed in /usr/local/bin. Because of Apache, you need to set their owner group to the Apache group (depending on the machine, e.g. www-data or apache), as sketched below.
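
    A minimal sketch, assuming the www-data group as on Debian/Ubuntu:

    cd src/server
    sudo cabal install -f content --global
    # The binaries land in /usr/local/bin; hand them over to the Apache group
    sudo chown root:www-data /usr/local/bin/pgf-service /usr/local/bin/content-service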

    The next step is to link the binaries to /var/www/fcgi-bin.

    /var/www$ sudo ln -s /usr/local/bin/content-service fcgi-bin/
    /var/www$ sudo ln -s /usr/local/bin/pgf-service fcgi-bin/
    

    Some attested problems

    1. gf-server-1.0 depends on fastcgi-3001.0.2.3, which failed to install.
      If you get this error, you first need the C library fcgi. Install libfcgi-dev (sudo apt-get install libfcgi-dev), then try again to install the PGF service and content service: sudo cabal install -f content --global

    Developer version settings

    This applies only if you want to use the editor from Eclipse. Otherwise you don't need any of this.

    To test the TT editor under Eclipse using GWT devMode, we found it necessary to recompile content-service to add the GWT code server port parameter to page URLs. To do so, uncomment (remove the leading --) the following lines in ContentService.hs. We have been using Eclipse 3.6 JEE with Google Web Toolkit version 2.3.1.

    -- devModeScriptName = (liftM2 (++)) (getVarWithDefault "SCRIPT_NAME" "") (return "?gwt.codesvr=127.0.0.1:9997") 
    -- path <- devModeScriptName
    

    A corresponding change is needed in the client code. Uncomment (remove the leading //) the following line in TT-0.9/src/org/grammaticalframework/ui/gwt/client/SettingsPanel.java:

    // String defaultUrl = "/fcgi-bin/content-service?gwt.codesvr=127.0.0.1:9997";
    

    Launch settings for building and testing under devMode in Eclipse (in $HOME/workspace/.metadata/.plugins/org.eclipse.debug.core/.launches):

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <launchConfiguration type="com.google.gdt.eclipse.suite.webapp">
    <booleanAttribute key="com.google.gdt.eclipse.core.RUN_SERVER" 
                      value="false"/>
    <stringAttribute key="com.google.gdt.eclipse.core.SERVER_PORT"
                      value="80"/>
    <stringAttribute key="com.google.gdt.eclipse.suiteMainTypeProcessor.PREVIOUSLY_SET_MAIN_TYPE_NAME"
                      value="com.google.gwt.dev.DevMode"/>
    <booleanAttribute key="com.google.gdt.eclipse.suiteWarArgumentProcessor.IS_WAR_FROM_PROJECT_PROPERTIES"
                      value="true"/>
    <listAttribute key="com.google.gwt.eclipse.core.ENTRY_POINT_MODULES">
    <listEntry value="org.grammaticalframework.ui.gwt.EditorApp"/>
    </listAttribute>
    <stringAttribute key="com.google.gwt.eclipse.core.URL"
                     value="editor"/>
    <listAttribute key="org.eclipse.debug.core.MAPPED_RESOURCE_PATHS">
    <listEntry value="/TT-0.9-ORIGINAL"/>
    </listAttribute>
    <listAttribute key="org.eclipse.debug.core.MAPPED_RESOURCE_TYPES">
    <listEntry value="4"/>
    </listAttribute>
    <stringAttribute key="org.eclipse.jdt.launching.CLASSPATH_PROVIDER"
                     value="com.google.gwt.eclipse.core.moduleClasspathProvider"/>
    <stringAttribute key="org.eclipse.jdt.launching.MAIN_TYPE"
                     value="com.google.gwt.dev.DevMode"/>
    <stringAttribute key="org.eclipse.jdt.launching.PROGRAM_ARGUMENTS"
                     value="-startupUrl editor -war $HOME/workspace/TT-0.9/www/editor \
                     -noserver -remoteUI &quot;${gwt_remote_ui_server_port}:${unique_id}&quot; \
                     -logLevel INFO -codeServerPort 9997 \
                      org.grammaticalframework.ui.gwt.EditorApp"/>
    <stringAttribute key="org.eclipse.jdt.launching.PROJECT_ATTR" value="TT-0.9"/>
    <stringAttribute key="org.eclipse.jdt.launching.VM_ARGUMENTS" value="-Xmx512m"/>
    </launchConfiguration>
    

    Database settings

    HSQL and HSQL-MySQL installation

    First you need to install HSQL. It is on Hackage, so it can be installed by typing sudo cabal install hsql --global.

    The public version of the Haskell MySQL package hsql-mysql-1.8.1, used by the TT content service, appears to have a bug that prevents multiple successive MySQL procedure calls. A debugged version of the package can be found at http://tfs.cc/molto/hsql-mysql-1.8.1-molto.zip.

    Install the debugged version:

    • Download and extract the files in hsql-mysql-1.8.1-molto.zip
    • Go to the top level directory, where you can find hsql-mysql.cabal
    • Install by typing sudo cabal install --global

    Some attested problems

    1. HSQL 1.8.2 doesn't install. Solution: try HSQL 1.8.1. You can choose it by typing sudo cabal install hsql-1.8.1 --global or by downloading that version from Hackage.

    MySQL server settings

    In addition to the above, you need a MySQL server. You can install one by typing sudo apt-get install mysql-server.

    Content-service uses a MySQL database to store users, grammars and documents.

    To create the database connection you need to do the following steps:

    1. create the database for the MOLTO content service
    2. create the content service database owner user
    3. give that user all rights to the database
    4. create a key file fpath containing the four connection parameters host, database, user and password, in the format read by the content service (see the example after this list)
    5. call the service with fpath as its only argument to create the user, grammar and document tables:
      content-service fpath
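
    A hypothetical key file: the layout below, with the four fields host, database, user and password whitespace-separated on one line, is an assumption based on the description above; the values match the MySQL example that follows.

    localhost moltodb moltouser moltopass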

    Now create the database:

    $ mysql -u root -p
    Enter password: 
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    ...
    
    mysql> CREATE DATABASE moltodb;
    Query OK, 1 row affected (0.02 sec)
    
    mysql> CREATE USER moltouser IDENTIFIED BY 'moltopass';
    Query OK, 0 rows affected (0.00 sec)
    
    mysql> GRANT ALL ON moltodb.* TO moltouser;
    Query OK, 0 rows affected (0.00 sec)
    
    mysql> show databases;
    +--------------------+
    | Database           |
    +--------------------+
    | information_schema |
    | moltodb            |
    +--------------------+
    2 rows in set (0.00 sec)
    
    mysql> quit
    Bye
    

    Next, create the database tables.

    /usr/local/bin$ ./content-service fpath

    Then log in to MySQL as the user moltouser.

    mysql -u moltouser -p
    Enter password: 
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    
    mysql> use moltodb;
    Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A
    
    Database changed
    mysql> show tables;
    +-------------------+
    | Tables_in_moltodb |
    +-------------------+
    | Documents         |
    | GrammarUsers      |
    | Grammars          |
    | Users             |
    +-------------------+
    
    4 rows in set (0.00 sec)
    
    

    If you see this output, the tables were created OK. There is nothing in the tables yet. After you first sign in with your Google account, your user account will appear in the table Users. You can query any of the tables by writing select * from <table>.

    Tabular editor (stand-alone version)

    To install the tabular equivalents editor source, do this (the commands are collected in the sketch after the list):

    1. Check out the whole source code directory from https://svn.it.helsinki.fi/repos/molto/trunk/molto_term_editor/ and put it in any place the Apache server can reach, e.g. /var/www/.
    2. Download the extjs-4.0.2a package from http://extjs.cachefly.net/ext-4.0.2a-gpl.zip and uncompress it as extjs-4.0.3 under the directory molto_term_editor/.
    3. You can then open the link http://localhost/molto_term_editor/editor_sparql.html, assuming you put the source code under /var/www/.
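
    Collected as shell commands (a sketch; the assumption that the zip unpacks into a directory named ext-4.0.2a may need checking):

    cd /var/www
    sudo svn checkout https://svn.it.helsinki.fi/repos/molto/trunk/molto_term_editor/
    cd molto_term_editor
    sudo wget http://extjs.cachefly.net/ext-4.0.2a-gpl.zip
    sudo unzip ext-4.0.2a-gpl.zip
    # The editor page expects the library under extjs-4.0.3
    sudo mv ext-4.0.2a extjs-4.0.3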

    GlobalSight

    The GlobalSight installation values are kept in $HOME/workspace/GS-8.2.2.1/main6/tools/build/dist/GlobalSight/install/data/installValues.properties. The following shows the settings used for a MOLTO GlobalSight Eclipse installation (with $HOME replacing the installation directory and PASSWORD the password(s)):

    #Mon May 28 22:52:03 EEST 2012
    mailserver=mail.domain.com
    system_log_directory_forwardslash=$HOME/workspace/GS-8.2.2.1/main6/tools/build/dist/GlobalSight/logs
    install_data_dir_forwardslash=$HOME/workspace/GS-8.2.2.1/main6/tools/build/dist/GlobalSight/install/data
    server_host=localhost
    database_password=PASSWORD
    database_server=localhost
    database_username=globalsight
    ldap_password=PASSWORD
    ldap_install_dir=/var/lib/ldap
    server_port=9090
    ldap_host=localhost
    gs_home=$HOME/workspace/GS-8.2.2.1/main6/tools/build/dist/GlobalSight
    ldap_username=ldap_connection
    admin_email=WelocalizeAdmin@domain.com
    system4_admin_username=gsAdmin
    ldap_base=globalsight.com
    ldap_port=389
    GS_HOME=$HOME/workspace/GS-8.2.2.1/main6/tools/build/dist/GlobalSight
    cap_login_url=http\://127.0.1.1\:9090/globalsight
    

    D3.3 MOLTO translation tools / workflow manual


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D3.3 MOLTO translation tools / workflow manual
    Security (distribution level): Public
    Contractual date of delivery: M31
    Actual date of delivery: March 2013
    Type: Manual
    Status & version: Draft
    Author(s): Inari Listenmaa, Jussi Rautio
    Task responsible: UHEL
    Other contributors: Lauri Alanko, John Camilleri, Thomas Hallgren


    Abstract

    Deliverable D3.3 consists of a manual and a description of the workflow of the MOLTO translation tools. The document introduces the components: the open-source translation management system Pootle, the Simple Translation Tool, which supports many different translation methods, and the Syntax Editor, which allows the user to modify text by manipulating abstract syntax trees.

    The document presents two translation workflows. The first scenario integrates MOLTO tools into professional translation of a fixed source, using Pootle; MOLTO translations with GF grammars are added to the machine translation options. Another direction taken is the population of translation memories with GF-generated data. In the second workflow, the translator is authorised to make changes to the source. The tools used in this scenario are the Simple Translation Tool and the Syntax Editor.

    1. Introduction

    This deliverable, D3.3, is a manual and a description of the workflow for the translation tools produced within WP3. As stated in the previous deliverables D3.1 and D3.2, the users of the translation tools are not required to know how to write GF grammars. They are either translators, whose job is to translate from a fixed source, or authors, who are authorized to modify the source text so that it fits the structures covered by the domain-specific grammar(s). This document presents workflows for both scenarios.

    1.1 Outline of the document

    In section 2, we present the components: the open-source translation management system Pootle, the Simple Translation Tool and the Syntax Editor. In section 3, we present the translation workflows. (Technical details, where?) In section 4 we talk about future work (and failures? e.g. professional scenario hard to adjust to the idea of non-fixed source. TF not really in use, so lexicon adding did not work as planned.).

    1.2 Changes from Deliverables 3.1 and 3.2

    We have changed the plan from deliverables D3.1 (translation tools API) and D3.2 (prototype). The previous deliverables used GlobalSight, a translation management system, and an external editor that supports GF.

    Due to these changes, it is not necessary to include a summary: the sections of this document are self-contained, assuming that the reader is generally familiar with the MOLTO project.

    2. Components

    2.1 Simple Translation Tool

    The Simple Translation Tool (http://cloud.grammaticalframework.org/translator) is a translator's editor that supports manual and automatic translation. A document consists of a sequence of segments that are translated independently. The user can import text in the source language and obtain automatically translated text in the target language. Imported text can be segmented based on punctuation. Optionally, line breaks or blank lines can also be used to indicate segmentation in imported text. Text can be edited after it has been imported.

    In the image below, the translator chooses the source and target languages and uploads a text in the source language.

    Upload and choose language

    The text can be displayed as parallel texts or segment by segment, as shown below.

    Show original and translation as parallel texts

    Show original and translation segment by segment

    The translator can choose a translation method for the whole document or, when needed, for each segment. The translation methods include various GF grammars, transfer-based machine translation by Apertium, and manual translation. Other machine translation options can be added as well. The choice of grammars is shown in the picture above. Choosing a different grammar for different segments is relevant if the body of the text is unrestricted but there are passages where precision is required, such as formulas in a patent application. Then the unrestricted text can be translated with a method that has more coverage but less precision, and the formulaic parts with a specialised grammar.

    2.2 Syntax Editor

    The Syntax Editor (http://cloud.grammaticalframework.org/syntax-editor/editor.html), written by John Camilleri, is a tool for building and manipulating abstract syntax trees in GF.

    The image below shows an initial view of the Syntax Editor. The chosen grammar is Phrasebook and the level of the construction is Action: a verb phrase whose tense and polarity are not yet fixed. This excludes some of the possibilities for constructing a sentence in Phrasebook, for instance greetings and other fixed phrases.

    Initial view of the Syntax Editor, start constructing an Action

    A syntax tree can also be imported as raw text, as shown in the image below.

    Importing an abstract syntax tree to the Syntax Editor

    The editor maintains the structure of the abstract syntax tree (AST) and outputs linearizations in the languages of the concrete syntaxes. In the following three images, the first shows a change of an argument in the AST and the consequent change in the linearizations. Son is changed to daughter; in English only the word changes, but in French and Bulgarian, changing the gender of the argument also affects the agreement of the possessive pronoun.

    Changing an argument

    The second image shows the change of the head of the phrase. Both AKnowPerson and ALove take two arguments of type Person, so from the point of view of the AST, they are interchangeable.

    Changing the verb

    The third image manipulates the polarity of the sentence. As opposed to the previous two examples, this AST is complete, belonging to the start category of the grammar, that is, Phrase.

    Changing the polarity

    Finally, a tree created or modified in the editor can be exported as raw text.

    Exporting the AST as text

    Future work includes integration with the Simple Translation Tool; further details about the plans are given in section 4.X.

    2.3 Pootle as an example of GF integration

    Pootle (http://tfs.cc/pootle LINK TODO) is an open-source translation and project management platform that is completely implemented as a Web service. The platform accepts most of the formats used in the translation industry, like XLIFF, TMX and PO.

    Integration of GF into Pootle

    Pootle has support for both standard translation memories and machine translation. The Google Translate and Apertium MT systems are available in the standard installation, and Web server queries to other MT systems can be sent by modifying the source code of Pootle. GF translation via the GF Web API has been added to the Pootle translation environment as a proof of concept.

    Unlike Google Translate and Apertium, which only require the translatable segment and the source and target language codes as input, the GF system also needs the name of the resource grammar(s) to be used. The selection of the grammar has been added to the Pootle project administration dialog, where the project manager normally selects the languages, file formats, translation memories and other resources to be used:

    Pootle administration screen

    The translator user interface shows the buttons for different MT systems above the edit box (GF, Google and Apertium buttons are seen on the screenshot below).

    Pootle editing screen

    When the GF button is clicked, the web browser sends the source segment to the GF server, which parses and linearizes it into the target language using the given grammar(s). The translated text is then sent back to the browser. The translation suggestion can then be edited, after which the segment is added to the project translation memory.
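
    For illustration, the same request can be made directly with curl against a pgf-service such as the one configured in the Appendix (the grammar Foods.pgf and the concrete syntax names are hypothetical; the parameter names command, input, from and to follow the GF Web API):

    curl 'http://localhost:8888/grammars/Foods.pgf?command=translate&from=FoodsEng&to=FoodsIta&input=this+pizza+is+very+good'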

    This illustrates that the GF translation system can be added relatively easily to various translation editing applications via the Web API. The GF Web API can be modified to comply with any standard Web API, like the one proposed by TAUS (https://tauslabs.com/interoperability/taus-translation-api).

    3. Workflows

    This part consists of two separate workflow descriptions. The first workflow is that of a traditional professional translation, using the translation platform Pootle, with GF integrated in machine translation and translation memory. The second workflow describes a case where the translator is authorised to do changes to source. The tools of choice are the Simple Translation Tool and the Syntax Editor.

    3.1. Translation of a fixed source

    The workflow of a professional translation is often fairly complex, including roles such as project manager, translator and reviewers of both content and language. Machine translation is used as the translator's aid, along with other tools such as dictionaries and translation memories. This is an established practice in computer-assisted translation; one of the main objectives of WP3 is to demonstrate that MOLTO tools can be adapted to a traditional translation workflow.

    The translator is not allowed to modify the source text, which is a serious limitation for MOLTO translation, whose precision comes at the cost of coverage. However, in this scenario we can assume that the translator knows both the source and the target language. The role of the machine translation is not to provide publication-quality text for blind translation, but to help the translator produce translations.

    When does it make sense to use MOLTO tools? With free text from unrestricted domains, the most common case is that GF grammars do not produce any translation at all, due to missing words or constructions. With a professional translator post-editing, GF grammars beat general-purpose MT in situations where the structure is crucial: formulaic parts within unrestricted text, such as mathematical constructions (case study X) and chemical formulas (case study Y). A construction such as (2S)-2-[(4S)-4-(2,2-difluorovinyl)-2-oxopyrrolidinyl]butanamide is built by elaborate rules, which can be expressed precisely with a GF grammar. Statistical machine translation, however, fails to capture the structure, and the result is worthless for post-editing; a change, addition or omission of even one element is enough to change the formula completely.

    Thus, we have integrated GF as one of the translation options in Pootle. The technical side is handled by GF Web API calls, explained in more detail in section 2.3.

    3.1.1 The Pootle workflow

    The Pootle translation environment implements the now industry-standard workflow where the translatable material, the translation memories and the editing tools reside on the same Web server. The system also includes real-time word-count reporting, user management and terminology asset handling. This greatly reduces the effort needed for a translation project, as all the tools and resources are centralized. As the translations are updated into a shared translation memory in real time, there is no need to create, update and document the memories after the project. Pootle also allows the local downloading of necessary resources in cases where the translator does not have an always-on internet connection.

    3.1.2 Translation project manager

    The translation project manager can upload the files to be translated to the Pootle server and define the language pairs, translation memories, glossaries (either general or project-specific) and file formats to be used in the translation. The system allows the use of standard translation file formats like XLIFF (for source material), TMX (translation memories) and TBX (terminology). In our GF machine translation enabled version of Pootle, the PM can also select the GF grammar or grammars used for translation (see section 2.3 for an example).

    When the translation assets have been configured, the PM can give the necessary access rights to the translators and reviewers. The material can also be translated by crowd-sourcing, so that anyone with access to the Pootle server can participate in the translation. This method has been used in many open-source localization projects, for example the OpenOffice suite.

    3.1.3 Translators

    The translators can then log in with their credentials onto the Pootle server, and see all the translation tasks assigned to them by the PM:

    Projects View

    After clicking a project name, the Languages page shows the target languages the translator has been assigned to:

    Languages View

    Clicking the language name opens the editing environment:

    Translation with Pootle

    Any exact or fuzzy match ("translation suggestion" in Pootle terminology) found in the translation memory can be selected and edited for a translation. As explained in the previous section, the translator can use machine translation services (including GF) by clicking the relevant button. The translator is thus able to use the suggestions from the translation memory, a choice of machine translations, or translate the segment from scratch.

    The Pootle editing tool includes automatic checking for quality issues, for example missing tags, variables or numbers, wrong capitalisation, punctuation and so on, so the translator gets instant feedback on possible formatting errors in the translation. The translator is also able to include comments to reviewers and PMs in a separate field in the tool, and to add or review terms in the terminology.

    3.1.4 Review and post-processing

    During the translation, the PM can follow the progress of the project on the Projects page. When the translation of a component is ready, the PM gets a notification, and the translation can be sent to reviewers, who then check and correct the translations by accepting or rejecting the suggestions. After the review process, the PM is again notified, and the translated and reviewed file can be downloaded for post-processing.

    3.2. Translation of an editable source

    We hold on to the assumptions stated in D3.1:

    • The translator is the author, or is authorized to adapt the text to better satisfy the constraints of the translation grammar,
    • The translator is native or fluent in the source language, and familiar with the domain, or at least with its special language, in order to know how the message can be paraphrased,
    • The translator is not required to know (all of) the target languages.

    In the case of at least partially blind translation, the quality of MT needs to be excellent. External revisers can be added to this scenario as well, but we assume that the quality is in general good, that errors are due to bugs in grammars, and that grammar writers are correcting them. A concrete scenario could be a multilingual website, where authorized users can create content in any of the languages and it is updated simultaneously in all of them. Assuming there are users for every language, they can themselves act as reviewers, providing feedback in case there is an error in the grammar. A grammar writer then fixes the grammar, and all structures that had the same problem will be updated.

    There might be a source document, or one can be created from scratch. In either case, there is a need for guided authoring, to ensure that the produced text is recognized by the grammar. This is not currently implemented, but it is planned by UGOT and explained further in section 4.2.

    The Simple Translation Tool (STT) offers the functionality for pre- and post-editing of MT. When needed, machine translation can be completely overridden by manual translation. The functionalities of STT are demonstrated with a toy text about pizzas. More complex grammars produced within MOLTO include the patent grammars in WP7 and the mathematical grammar library in WP6, but they are not integrated into STT at the moment. We plan to produce a video demo with some real use cases.

    In the first image, the text is uploaded into STT and a default translation method is chosen.

    Loading a document and choosing default translation method

    In the second image, we see three errors. The first one, indicated with number 1, is an error in the grammar: an unidiomatic word choice. This type of error is easiest to fix just by modifying the target, followed by a bug report to a grammar writer. Of course, spotting this type of error requires that the translator knows the target language. Where they do not, they just have to assume that the input is correct.

    Three errors, correction of the first one

    The second error manifests as no translation. The solution here is to paraphrase the source; in this case, changing the modifier "really" to "very", which is supported by the lexicon. This example is very simplified; in any realistic situation, the possible changes are numerous. Either we need to assume that the translator/author has good documentation on the allowed constructions of the restricted language, or the program needs to guide the translator. The latter is a planned feature; the former depends on the individual use case.

    The third error also shows no translation, but this time it is because the segment comes from a totally different domain. In the example document of 5 phrases, the first is a commentary, written in completely unrestricted language, and the four remaining phrases are the sort of restricted language that translates with our chosen GF grammar. In a realistic situation, the first phrase could be generic instructions and the latter ones mathematical formulas. In any case, the error is corrected by changing the translation method for that segment: instead of a GF grammar, we choose Apertium, with more coverage but no guarantee of quality. In case the translator spots errors in the generic MT option, there is still the post-editing option.

    Correction of 2nd and 3rd error

    Finally, all three errors have been corrected. The user can view the texts in parallel and save the project.

    Complete result, saving the project

    4. Future work

    4.1 Translation memory generation

    UHEL will conduct experiments on creating translation memory data with GF grammars. Grammars of a given domain are used to generate bilingual aligned data, which can be converted into a translation memory. Then, when translating new material, the translation memory can provide fuzzy matches for cases where the constructions are similar, but words are different. This is one way to compensate for the lack of lexicon in a situation where adding new vocabulary is hard.
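
    As a sketch of the idea in the GF shell (the Foods grammar, the category Comment and the flag values are illustrative):

    $ gf Foods.pgf
    > gr -cat=Comment -number=100 | l

    Each generated tree is linearized into all languages of the grammar; the resulting parallel segments can then be aligned and converted into a TMX translation memory.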

    Jonson (2006) describes an experiment on synthesizing a corpus with GF for training speech recognition models. The idea is similar: use a grammar to generate reliable data for a data-driven approach.

    By using GF translation suggestions in a translation memory, it is also possible to use standard translation tools like Trados to generate pre-translation reports of exact and fuzzy matches. These percentages of different matches are easier to demonstrate to translation industry stakeholders, as the scientific metrics used in MT evaluation (BLEU, NIST, ROUGE and so on) are not generally used or well understood within the industry.

    4.2 GF cloud

    Services in the GF cloud will be linked to each other by UGOT. The Syntax Editor (http://cloud.grammaticalframework.org/syntax-editor/editor.html) will be used within the Simple Translation Tool (http://cloud.grammaticalframework.org/translator/). As described in the DoW, there will be a mode for editing source text, where structural changes to the document can be made by manipulating abstract syntax trees. This functionality will be added to the Simple Translation Tool.

    The Simple Translation Tool will be extended into a bilingual, controlled-language document authoring tool, with useful ways to enter and edit the source text too. Additions include text input guided by word completion, and syntax tree editing by invoking the Syntax Editor (see section 2.2) on a source segment.

    We plan to submit a system demo paper to the MT Summit (http://www.mtsummit2013.info/, deadline 22 April) about a unique MT platform:

    • scalable from controlled language to open domain systems
    • incremental parsing and syntax editing
    • visualizations
    • web interface
    • free

    4.3 Grammar editing

    Grammar editing in the translator's tools is still an open question. One of the main shortcomings of MOLTO-type translation is the limited coverage, which is why it is important that a translator can easily extend and modify the lexicon. Importing raw lexicon data, with or without TermFactory, is described in Listenmaa (2012). That is a question of adding more content; another question is modifying existing grammars, usually in the case of an error.

    Some steps have been taken to address this issue. D11.2 presents a multilingual semantic wiki, where it is possible for every user to modify the grammar behind the wiki. This is still expert work, as the editing is done with raw GF code, but there are methods for guided grammar editing, such as the cloud-based IDE (see documentation). This environment offers an easy way of writing and editing multilingual grammars.

    D4.1 Knowledge Representation Infrastructure

    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D4.1 Knowledge Representation Infrastructure
    Security (distribution level): Public
    Contractual date of delivery: 1 Nov 2010
    Actual date of delivery: 1 Nov 2010
    Type: Regular Publication
    Status & version: Final
    Author(s): Petar Mitankin, Atanas Ilchev
    Task responsible: ONTO (WP4)
    Other contributors: Borislav Popov, Reneta Popova, and Gergana Petkova


    Abstract

    This document presents the specification of the Knowledge Representation Infrastructure (KRI), which is based on pre-existing products. The KRI ensures a mature basis for storage and retrieval of structured knowledge and content. The document provides a description of the technology building blocks, overall architecture, standards used, query languages and inference rules.


    1. Introduction

    The purpose of this document is to describe the knowledge representation infrastructure in MOLTO. It clarifies the expectations concerning the back-end infrastructure, serving the various MOLTO knowledge engineering tasks. It is based on the summary and analysis of the requirements gathered from the case studies, from grammar development, and from the partners. The scope of the deliverable covers presentation of the requirements, specification and description of the MOLTO Knowledge Representation Infrastructure (KRI).

    Blending these expectations with previous experience in knowledge engineering, and adding a pinch of common sense, we come up with the specification of the MOLTO Knowledge Representation Infrastructure. This KRI is the data modeling and manipulation backbone of the entire project, aiming to serve the semi-automatic creation of abstract grammars from ontologies, the derivation of ontologies from grammars, and the extraction of instance-level knowledge from NL. In terms of retrieval, NL queries will be transformed into semantic queries, and the resulting knowledge expressed back in NL.

    The KRI is based on pre-existing products and ensures a mature basis for storage and retrieval of both knowledge and content, covering all modalities of the data. This document provides descriptions of the technology building blocks, overall architecture, standards used, query languages and inference rules.

    2. Requirements

    The KRI should allow for:

    • Building the conceptual models and knowledge bases needed for grammar development and the use cases of MOLTO - one base set and three specialized knowledge sets for the use cases.
    • Extending the PROTON ontology with a large-coverage knowledge base focused on named entities and a thesaurus. The specialized sets will include the necessary domain-specific models and instances, e.g. multilingual patent classification taxonomies, a museum ontology and instance base, etc.
    • Using a semantic alignment methodology paired with a set of data source transformation tools for each of the structured data sources.

    3. KRI Specification

    The objective of this section is to introduce the KRI specification - the technology building blocks, overall architecture, standards used, query languages and inference rules. It describes how to modify the KRI, how to change the default underlying ontology and database, and how to adjust the inference rule-set of the OWLIM semantic repository.

    A demo of the KRI is running on http://molto.ontotext.com.

    The KRI is responsible for the storage and retrieval of content metadata, background knowledge, upper-level ontology, and other possible data, if available (users and communities), and exposes convenient methods for interoperability between the stored knowledge and the toolkit for rendering natural language to machine readable semantic models (ontologies) and vice versa.

    The KRI includes:

    • OWLIM — a semantic repository that stores all structured data such as ontologies, background knowledge, etc., and provides SPARQL query mechanism and reasoning.
      For more details about the OWLIM architecture and function, please see the OWLIM section.
    • RDFDB — an API that provides a remote access to the stored structured data via JMS.
      For more details about the RDFDB architecture and function, please see the RDFDB section.
    • PROTON Ontology — a light-weight upper-level ontology, which defines about 300 classes and 100 properties, covering most of the upper-level concepts, necessary for semantic annotation, indexing and retrieval.
    • KRI Web UI — a UI that accesses OWLIM through the RDFDB layer. The web UI gives the user the possibility to browse the ontologies and the database, to execute SPARQL queries (an example follows this list), etc.
      For more details about the KRI Web UI, please see the KRI Web UI section.
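
    For example, a query such as the following could be executed from the web UI (a hypothetical query; the PROTON prefix URI shown is the 2005/04 protont module and should be checked against the version loaded in the KRI):

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ptop: <http://proton.semanticweb.org/2005/04/protont#>
    SELECT ?x WHERE { ?x rdf:type ptop:Person } LIMIT 10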

    3.1 OWLIM

    The major component of the KRI is OWLIM, a semantic repository based on full materialization and providing support for a fraction of OWL. It is implemented in Java and packaged as a Storage and Inference Layer (SAIL) for the Sesame RDF database. Below is a detailed description of its architecture and supported semantics.

    3.1.1 Overview

    Semantic Repositories are tools that combine the characteristics of database management systems (DBMS) and inference engines. Their major functionality is to support efficient storage, querying and management of structured data. One major difference from DBMS is that Semantic Repositories work with generic physical data models (e.g. graphs). This allows them to easily adopt updates and extensions in the schemata, i.e. in the structure of the data. Another difference is that Semantic Repositories use ontologies as semantic schemata, which allows them to automatically reason about the data.

    The two principal strategies for rule-based reasoning are:

    • Forward-chaining: to start from the known facts and to perform inference in an inductive fashion. The goals of such reasoning can vary: to answer a particular query or to infer a particular sort of knowledge (e.g. the class taxonomy).
    • Backward-chaining: to start from a particular fact or a query and to verify it or get all possible results, using deductive reasoning. In essence, the reasoner decomposes (or transforms) the query (or the fact) into simpler (or alternative) facts, which are available in the KB or can be proven through further recursive transformations.

    Imagine a repository which performs total forward-chaining, i.e. it tries to make sure that after each update to the KB, the inferred closure is computed and made available for query evaluation or retrieval. This strategy is generally known as materialization.
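
    A minimal illustration using standard RDFS entailment: given the explicit statements (Dog, rdfs:subClassOf, Animal) and (rex, rdf:type, Dog), materialization computes and stores the implied statement (rex, rdf:type, Animal) at update time, so a query for all instances of Animal requires no inference at query time.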

    3.1.2 Sesame

    Sesame is a framework for storing, querying and reasoning with RDF data. It is implemented in Java as an open source project by Aduna and includes various storage back-ends (memory, file, database), query languages, reasoners and client-server protocols.

    There are essentially two ways to use Sesame:

    • as a standalone server;
    • embedded in an application as a Java library.

    Sesame supports the W3C's SPARQL query language and Aduna's own query language SeRQL. It also supports most popular RDF file formats and query result formats. Sesame offers a JDBC-like user API, streamlined system APIs and a RESTful HTTP interface. Various extensions are available or are being developed by third parties. From version 2.0 onwards, Sesame requires a Java 1.5 virtual machine. All APIs use Java 5 features such as typed collections and iterators. Sesame version 2.1 added support for storing RDF data in relational databases. The supported relational databases are MySQL, PostgreSQL, MS SQL Server, and Oracle. As of version 2.2, Sesame also includes support for Mulgara (a native RDF database).

    Sesame Architecture

    A schematic representation of Sesame's architecture is shown in Figure 1 below. Following is a brief overview of the main components.

    Figure 1 - Sesame Architecture

    The Sesame framework is a loosely coupled set of components, where alternative implementations can be exchanged easily. Sesame comes with a variety of Storage And Inference Layer (SAIL) implementations that a user can select for the desired behavior (in-memory storage, file system, relational database, etc.). OWLIM is a plug-in SAIL component for the Sesame framework.

    Applications will normally communicate with Sesame through the Repository API. This provides a high enough level of abstraction so that the details of particular underlying components remain hidden, i.e. different components can be swapped in without requiring modification of the application.

    The Repository API has several implementations, one of which uses HTTP to communicate with a remote repository that exposes the Repository API via HTTP.

    The SAIL API and OWLIM

    The SAIL API is a set of Java interfaces that support the storage and retrieval of RDF statements. The main characteristics of the SAIL API are:

    • It is the basic interface for storing/retrieving/deleting RDF data;
    • It is used to abstract from the actual storage mechanism, e.g. an implementation can use relational databases, file systems, in-memory storage, etc;
    • It is flexible enough to support small and simple implementations, but also offers sufficient freedom for optimizations, so that huge amounts of data can be handled efficiently on enterprise-level machines;
    • It is extendable to other RDF-based languages;
    • It supports stacking of SAILs, where the SAIL at the top can perform some action when the modules make calls to it, and then forward these calls to the SAIL beneath it. This process continues until one of the SAILs finally handles the actual retrieval request, propagating the result back up again;
    • It handles concurrency and takes care of read and write locks on repositories. This setup allows for supporting concurrency control for any type of repository.

    Other proposals for RDF APIs are currently under development. The most prominent of these are the Jena toolkit and the Redland Application Framework. The SAIL shares many characteristics with both approaches; an important difference between these two proposals and SAIL, however, is that the SAIL API specifically deals with RDFS on the retrieval side: it offers methods for querying class and property subsumption, and domain and range restrictions. In contrast, both Jena and Redland focus exclusively on the RDF triple set, leaving interpretation of these triples to the user application. In SAIL, these RDFS inferencing tasks are handled internally. The main reason for this is that there is a strong relationship between the efficiency of inference and the actual storage model being used. Since any particular SAIL implementation has a complete understanding of the storage model (e.g. the database schema in the case of an RDBMS), this knowledge can be exploited to infer, for example, class subsumption more efficiently.

    Another difference between SAIL and other RDF APIs is that SAIL is considerably more lightweight: only four basic interfaces are provided, offering basic storage and retrieval functionality and transaction support. This minimal set of interfaces promotes flexibility and looser coupling between components.

    The current Sesame framework offers several implementations of the SAIL API. The most important of these is the SQL92SAIL, which is a generic implementation for SQL92, ISO99 [1]. This allows for connecting to any RDBMS without having to re-implement a lot of code. In the SQL92SAIL, only the definitions of the data-types (which are not part of the SQL92 standard) have to be changed when switching to a different database platform. The SQL92SAIL features an inferencing module for RDFS, based on the RDFS entailment rules as specified in the RDF Model Theory [2]. This inferencing module computes the closure of the data schema and asserts these implications as derived statements. For example, whenever a statement of the form (foo, rdfs:domain, bar) is encountered, the inferencing module asserts that (foo, rdf:type, rdf:Property) is an implied statement. The SQL92SAIL has been tested in use with several DBMSs, including PostgreSQL and MySQL [3].

    OWLIM is a high-performance semantic repository, implemented in Java and packaged as a Storage and Inference Layer (SAIL) for the Sesame RDF database. OWLIM is based on Ontotext's Triple Reasoning and Rule Entailment Engine (TRREE), a native RDF rule-entailment engine. The supported semantics can be configured through the definition of rule-sets. The most expressive pre-defined rule-set combines unconstrained RDFS and OWL-Lite. Custom rule-sets allow tuning for optimal performance and expressivity. OWLIM supports RDFS, OWL DLP, OWL Horst, most of OWL Lite and OWL2 RL.

    The two editions of OWLIM are SwiftOWLIM and BigOWLIM. In SwiftOWLIM, reasoning and query evaluation are performed in-memory, while, at the same time, a reliable persistence strategy assures data preservation, consistency, and integrity. BigOWLIM is the high-performance "enterprise" edition that scales to massive quantities of data. Typically, SwiftOWLIM can manage millions of explicit statements on desktop hardware, whereas BigOWLIM can manage billions of statements and multiple simultaneous user sessions.

    The KRI in MOLTO uses BigOWLIM Version 3.3.


    3.1.3 OWLIM Interoperability and Architecture

    OWLIM version 3.X is packaged as a Storage and Inference Layer (SAIL) for Sesame version 2.x and makes extensive use of the features and infrastructure of Sesame, especially the RDF model, RDF parsers and query engines.

    Inference is performed by the TRREE engine, where the explicit and inferred statements are stored in highly-optimized data structures that are kept in-memory for query evaluation and further inference. The inferred closure is updated through inference at the end of each transaction that modifies the repository.

    Figure 2 - OWLIM Usage and Relations to Sesame and TRREE

    OWLIM implements the Sesame SAIL interface so that it can be integrated with the rest of the Sesame framework, e.g. the query engines and the web UI. A user application can be designed to use OWLIM directly through the Sesame SAIL API or via the higher-level functional interfaces such as RDFDB. When an OWLIM repository is exposed using the Sesame HTTP Server, users can manage the repository through the Sesame Workbench Web application, or with other tools integrated with Sesame, e.g. ontology editors like Protege and TopBraid Composer.


    3.1.4 The TRREE Engine

    OWLIM is implemented on top of the TRREE engine. TRREE stands for "Triple Reasoning and Rule Entailment Engine". The TRREE performs reasoning based on forward-chaining of entailment rules over RDF triple patterns with variables. TRREE's reasoning strategy is total materialization, although various optimizations are used, as described in the following sections.

    The semantics used is based on R-entailment [4], with the following differences:

    • Free variables in the head of a rule (without a binding in the body) are treated as blank nodes. This feature can be considered "syntactic sugar";
    • Variable inequality constraints can be specified in the body of the rules, in addition to the triple patterns. This leads to lower complexity as compared to R-entailment;
    • The "cut" operator can be associated with rule premises, the TRREE compiler interprets it like the "!" operator in Prolog;
    • Two types of inconsistency checks are supported. Checks without any consequences indicate a consistency violation if the body can be satisfied. Consistency checks with consequences indicate a consistency violation if the inferred statements do not exist in the repository;
    • Axioms can be provided as a set of statements, although those are not modeled as rules with empty bodies.

    Further details of the rule language can be found in the corresponding OWLIM user guides. The TRREE can be configured via the rule-set parameter, which identifies a file containing the entailment rules, consistency checks and axiomatic triples. The implementation of TRREE relies on a compile stage, during which custom rule-sets are compiled into Java code that is further compiled and merged into the inference engine.

    The edition of TRREE used in SwiftOWLIM is referred to as "SwiftTRREE" and performs reasoning and query evaluation in-memory. The edition of TRREE used in BigOWLIM is referred to as "BigTRREE" and utilizes data structures backed by the file-system. These data structures are organized to allow query optimizations that dramatically improve performance with large data-sets, e.g. with one of the standard tests BigOWLIM evaluates queries against 7 million statements three times faster than SwiftOWLIM, although it takes between two and three times more time to initially load the data.

    3.1.5 Supported Semantics

OWLIM offers several pre-defined semantics by way of standard rule-sets (files), but can also be configured to use custom rule-sets with semantics better tuned to the particular domain. The required semantics can be specified through the rule-set for each specific repository instance. Applications that do not need the most expressive supported semantics can choose a less complex rule-set, which results in faster inference.

    Pre-defined Rule-Sets

    The pre-defined rule-sets are layered such that each one extends the preceding one. The following list is ordered by increasing expressivity:

    • empty: no reasoning, i.e. OWLIM operates as a plain RDF store;
    • rdfs: supports standard RDFS semantics;
    • owl-horst: OWL dialect close to OWL Horst; the differences are discussed below;
    • owl-max: a combination of most of OWL-Lite with RDFS.

Furthermore, the OWL2 RL profile [5] is supported as follows:

    • owl2-rl-conf: Fully conformant except for D-Entailment, i.e. reasoning about data types;
    • owl2-rl-reduced: As above, but with the troublesome prp-key rule removed (this rule causes serious scalability problems).

    Custom Rule-Sets

OWLIM has an internal rule compiler that can be used to configure the TRREE with a custom set of inference rules and axioms. The user may define a custom rule-set in a *.pie file (e.g. MySemantics.pie). The easiest way to do this is to start from one of the .pie files that were used to build the pre-defined rule-sets; all pre-defined .pie files are included in the distribution. The syntax of the .pie files is easy to follow.

    OWL Compliance

Regarding OWL compliance, OWLIM supports several OWL-like dialects: OWL Horst [4] (owl-horst), OWL Max (owl-max), which covers most of OWL-Lite and RDFS, and OWL2 RL (owl2-rl-conf and owl2-rl-reduced).

With the owl-max rule-set, which is represented in Figure 3, OWLIM supports the following semantics:

    • full RDFS semantics without constraints or limitations, apart from the entailments related to typed literals (known as D-entailment). For instance, meta-classes (and any arbitrary mixture of class, property, and individual) can be combined with the supported OWL semantics;
    • most of OWL-Lite;
    • all of OWL DLP.

    The differences between OWL Horst [4], and the OWL dialects supported by OWLIM (owl-horst and owl-max) can be summarized as follows:

    • OWLIM does not provide the extended support for typed literals, introduced with the D*-entailment extension of the RDFS semantics. Although such support is conceptually clear and easy to implement, it is our understanding that the performance penalty is too high for most applications. One can easily implement the rules defined for this purpose by ter Horst and add them to a custom rule-set;
    • There are no inconsistency rules by default;
    • A few more OWL primitives are supported by OWLIM (rule-set owl-max). These are listed in the OWLIM user guides;
    • There is extended support for schema-level (T-Box) reasoning in OWLIM.

Even though the concrete rules pre-defined in OWLIM differ from those defined in OWL Horst, the complexity and decidability results reported for R-entailment are relevant for TRREE and OWLIM. To put it more precisely, the rules in the owl-horst rule-set do not introduce new B-nodes, which means that R-entailment with respect to them takes polynomial time. In KR terms, this means that owl-horst inference within OWLIM is tractable.

Inference using owl-horst is of lower complexity than other formalisms that combine DL formalisms with rules. In addition, it imposes no constraints on meta-modeling.

    The correctness of the support for OWL semantics (for those primitives that are supported) is checked against the normative Positive- and Negative-entailment OWL test cases [6]. These tests are provided in the OWLIM distribution and documented in the OWLIM user guides.

    Figure 3 - Owl-max and Other OWL Dialects


    3.2 RDFDB

The RDFDB stores all knowledge artifacts - such as ontologies, knowledge bases, and other available data - in RDF form. It is the MOLTO store and query service. The RDFDB has the following features:

    • Default repository for structured data, which uses ORDI-SG integration layer over OWLIM – the fastest and most scalable Sesame Sail implementation;
    • Storage and query APIs for the default repository, accessible over JMS;
    • Storage and query APIs for the default repository, accessible through Sesame HTTP Server on port 9090.

    3.2.1 Overview

Using the RDF model to represent all system knowledge allows easy interoperability between the stored data and the conceptual models and instances. Therefore, if the latter are enriched, extended or partially replaced, the implementation does not need to change considerably. However, the requirements for provenance tracking, versioning and meta-data about the stored knowledge make plain RDF triples insufficient. Therefore, we use RDF quintuples and a repository model that supports them.

We expose an open API, based on the ORDI-SG data model, that can be implemented for integration with alternative back-ends. The ORDI-SG model is presented here in further detail, as it is the basis for the RDFDB API.

The atomic entity of the ORDI-SG tripleset model is a quintuple, built from the RDF primitives (URI, blank node and literal) as follows:

{S, P, O, C, {TS1, ..., TSn}}, where:

    • S is the subject of a statement; of the type URI or blank node;
    • P is the predicate of a statement; of the type URI;
    • O is the object of a statement; of the type URI, blank node or literal;
    • C is the context of a statement (i.e. Named Graph); of the type URI or blank node;
    • {TS1, ..., TSn} is an unordered set of identifiers of the triplesets to which the statement is associated; of the type URI or blank node.

    The ORDI data model is realized as a directed labeled multi-graph. For backward compatibility with RDF, SPARQL and other RDF-based specifications, a new kind of information is introduced to the RDF graph. The tripleset model is a simple extension of the RDF graph enabling an easy way for adding meta-data to the statements.

The tripleset is a new element in the RDF statement, previously expressed as a triple or a quadruple, describing the relation between the statement and an identifiable group of statements. The new term is introduced to distinguish this model from several similar, already existing, RDF extensions and the terms associated with them:

    • Context, as defined in several RDF APIs like Sesame 2.0, YARS and others;
    • Datasets, as defined in SPARQL;
    • Named-graphs, as introduced at http://www.w3.org/2004/03/trix/, implemented in Jena and used in SPARQL.

    The following is true for the tripleset model:

• Each contextualized triple {S, P, O, C} can belong to multiple triplesets;
• The triples are not "owned" by the triplesets; a triple can be disassociated from a tripleset without being deleted;
• When a triple is associated with several triplesets, it is still counted as a single statement, i.e. a single arc in the graph;
    • A tripleset can contain triples from different contexts;
    • Each tripleset can be converted to a multi-set of triples, as one triple can correspond to multiple contextualized triples, belonging to the tripleset.
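To make the model concrete, here is a minimal Python sketch of a quintuple store. The names are ours, for illustration only, and not the ORDI API:

    # Illustrative quintuple store (not the ORDI API).
    # A contextualized triple {S, P, O, C} is stored exactly once; the attached
    # set of tripleset identifiers can change without deleting the statement.
    class QuintupleStore:
        def __init__(self):
            self.statements = {}  # (s, p, o, c) -> set of tripleset ids

        def add(self, s, p, o, c, triplesets=()):
            self.statements.setdefault((s, p, o, c), set()).update(triplesets)

        def dissociate(self, quad, tripleset):
            # triples are not "owned" by triplesets: this does not delete the triple
            self.statements[quad].discard(tripleset)

        def tripleset(self, ts):
            # converting a tripleset to triples can yield duplicates (a multi-set),
            # because the same (s, p, o) may occur in several contexts
            return [(s, p, o) for (s, p, o, c), tss in self.statements.items()
                    if ts in tss]

    store = QuintupleStore()
    store.add("ex:tom", "rdf:type", "ex:Cat", "ex:graph1", {"ex:ts1"})
    store.add("ex:tom", "rdf:type", "ex:Cat", "ex:graph2", {"ex:ts1", "ex:ts2"})
    print(len(store.statements))      # 2 contextualized triples, each counted once
    print(store.tripleset("ex:ts1"))  # the same triple appears from two contexts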

    Figure 4 below is a diagram of the relationship between the major elements in the ORDI data model.

    Figure 4 - Entity-Relationship Diagram of the ORDI Data Model

    3.2.2 RDFDB Use

The RDFDB service is already available in the distribution; one can use it either by generating a JMS client or through the OpenRDF Sesame API.

    Using the RDFDB by Generating a JMS Client

    Generating a Client

    Using your shell, navigate to the bin directory of the deployed platform and invoke the following commands:

    mkproxy -proxy com.ontotext.platform.qsg.ServiceClass $PATH_TO_EXAMPLES/target/classes/

    mkproxy -client com.ontotext.platform.qsg.ServiceClass $PATH_TO_EXAMPLES/target/classes/

Both commands dump the generated code to standard output. Copy the code, clean it up as appropriate and place it in your project's source tree. After building the project, the client is located in the project's target directory. The client implements the interface of the service.

    Generating RDF-DB Clients

In order to use the RDF-DB, clients must be generated for the following services:

    com.ontotext.rdfdb.ordi.OrdiService

    com.ontotext.rdfdb.ordi.RdfStoreService

    com.ontotext.rdfdb.ordi.RdfQueryService

    Using the RDFDB through the OpenRDF Sesame API

Through the OpenRDF Sesame API, one can manage and query the default repository via the Sesame Workbench, or operate on it using the HTTP repository interface. OpenRDF Sesame is integrated in the RDFDB; for more details on how to use it, see the OpenRDF Sesame User Guide.
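For example, since the default repository is exposed through the Sesame HTTP Server on port 9090, any HTTP client can issue SPARQL queries against it. The sketch below uses Python's requests library and the standard Sesame 2 repository endpoint; the repository id molto is hypothetical and depends on your configuration:

    # Query the RDFDB default repository over the Sesame HTTP protocol.
    # The repository id "molto" is hypothetical; substitute your own.
    import requests

    ENDPOINT = "http://localhost:9090/openrdf-sesame/repositories/molto"
    QUERY = """
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?s ?label WHERE { ?s rdfs:label ?label } LIMIT 10
    """

    resp = requests.get(ENDPOINT,
                        params={"query": QUERY},
                        headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    for b in resp.json()["results"]["bindings"]:
        print(b["s"]["value"], b["label"]["value"])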

    3.3 Data Sources

This section describes the conceptual models, ontologies and knowledge bases loaded as background context in the RDFDB component of the MOLTO KRI.

• PROTON ontology - a lightweight upper-level ontology defining about 300 classes and 100 properties in OWL Lite.
    • The default KRI knowledge base - common world knowledge that contains information about people, organizations, locations, and job titles.

    Most applications of the KRI require extending the conceptual models with domain ontologies and the underlying knowledge base with domain specific entities and facts.

    3.3.1 PROTON Ontology

The PROTON ontology covers the most general concepts, with a focus on named entities (people, locations, organizations) and concrete domains (numbers, dates, etc.).

    The design principles can be summarized as follows:

    • domain-independence;
    • light-weight logical definitions;
    • alignment with popular standards;
    • good coverage of named entities and concrete domains (i.e. people, organizations, locations, numbers, dates, addresses).

The ontology is originally encoded in a fragment of OWL Lite and split into four modules: System, Top, Upper, and KM (Knowledge Management), shown in Figure 5 below.

    Figure 5 - PROTON Ontology

    System module

    The System module consists of a few meta-level primitives (5 classes and 5 properties). It introduces the notion of 'entity', which can have aliases. The primitives at this level are usually the few things that have to be hard-coded in ontology-based applications. Within this document and in general, the System module of PROTON is referred to via the "protons:" prefix.

    Top module

    The Top module is the highest, most general, conceptual level, consisting of about 20 classes. These ensure a good balance of utility, domain independence, and ease of understanding and usage. The top layer is usually the best level to establish alignment to other ontologies and schemata. Within this document and in general, the Top module of PROTON is referred to via the "protont:" prefix.

    Upper module

    The Upper module has over 200 general classes of entities, which often appear in multiple domains (e.g. various sorts of organizations, a comprehensive range of locations, etc.). Within this document and in general, the Upper module of PROTON is referred to via the "protonu:" prefix.

    Knowledge Management (KM) module

    The KM module has 38 classes of slightly specialized entities that are specific for typical Knowledge Management tasks and applications. Within this document and in general, the PROTON KM module is referred to via the "protonkm:" prefix.

    3.3.2 The Default KRI Knowledge Base

The default KB contains numerous instances of PROTON Upper module classes such as Public Company, Company, Bank, IndustrySector, HomePage, etc. It covers the best-known entities in the world, such as:

    • Locations: mountains, cities, roads, etc.
    • Organizations, all important sorts of: business, international, political, government, sport, academic, etc.
    • Specific people

Content

• collected from various sources, such as geographical and business intelligence gazetteers
• predefined: the KRI uses entities only from trusted sources
• can be enriched or replaced

    Entity Description

The named entities (NEs) are represented with semantic descriptions comprising:

    • Aliases (Florida & FL);
    • Relations with other entities (Person hasPosition Position);
    • Attributes (latitude & longitude of geographic entities);
• the proper class of the NE.

    The last build of the KRI KB contained 29104 named entities: 6006 persons, 8259 organizations, 12219 locations and 2620 job titles.

    3.4 KRI Web UI

    Although the KRI presented in this document comes directly from preexisting products that have not been developed specially for the needs of the MOLTO project, it provides the basic semantic functionality required by some MOLTO-specific applications. As an illustration, we present the following screen-shots of the KRI web UI. It allows the user to enter a natural language query, as shown in Figure 6. The natural language query is converted into a SPARQL query and the SPARQL query is executed by OWLIM through the RDFDB layer.

The conversion of the natural language query into a SPARQL query is out of the scope of this document; it will be described in detail in Deliverable 4.3 Grammar-Ontology Interoperability.

    Figure 6 - Natural Language Query

    The web UI shows the results of the executed SPARQL query, as shown in Figure 7 below.

    Figure 7 - Results from the SPARQL Query

    The user can see and edit the SPARQL query or enter a new SPARQL query. The KRI web UI also gives the user the possibility to browse the underlying ontology and database.


    4. KRI as a Virtual Image

A virtual image of the KRI is available at sftp://ftp.ontotext.com. Table 1 below presents some of its basic characteristics.

vmdk files: VM/MOLTO.tar
operating system: Ubuntu 10.04.1
user name: onto
password: guest
rdfdb folder: /home/onto/rdfdb
knowledge base folder: /home/onto/rdfdb
owlim configuration file: /home/onto/rdfdb/config/rdfdb.ttl

    Table 1 - KRI as a Virtual Image

    The user starts the RDFDB (and respectively OWLIM) by /home/onto/rdfdb/bin/rdfdb start.sh and stops it by /home/onto/rdfdb/bin/rdfdb stop.sh. To change the knowledge base one has to:

    • stop the RDFDB
    • delete the folder /home/onto/rdfdb/bin/populated
• change the content of the knowledge base folder

The reasoning rule-set used by OWLIM is set in the OWLIM configuration file as the value of the owlim:ruleset parameter. The default rule-set is owl-horst, but one could change it to owl-max, for example. If a custom rule-set is needed, it has to be specified in a *.pie file, which is part of the knowledge base. The KRI internally uses a lighttpd server, started by cd /home/onto/gf-src-3.1.6/src/server followed by lighttpd -f lighttpd.conf.

The KRI web UI is accessible via a Tomcat server on port 8080. The Tomcat server is started by sudo /etc/init.d/tomcat6 start and stopped by sudo /etc/init.d/tomcat6 stop.

    5. Conclusion

In this deliverable we have presented the requirements for the MOLTO KRI and defined its specification, starting with the architecture of the major KRI component - the OWLIM semantic repository. We have continued with the presentation of the RDFDB, which provides remote access to the ORDI-SG layer over OWLIM via JMS, with emphasis on its use. We have also described the default KRI data sources.

The Knowledge Representation Infrastructure will enable MOLTO's baseline and use-case-driven knowledge modeling, with the expressivity of metadata-about-metadata descriptions necessary for tracking the provenance of the diverse sources of structured knowledge (upper-level, domain-specific and grammar-derived ontologies; thesauri; domain knowledge bases; content and metadata).

    6. References

    [1] ISO. Information Technology-Database Language SQL. Standard No. ISO/IEC 9075:1999, International Organization for Standardization (ISO), 1999. (Available from American National Standards Institute, New York, NY 10036, (212) 642-4900.).

[2] HAYES, P. RDF Model Theory. Working draft, World Wide Web Consortium, September 2001. See http://www.w3.org/TR/rdf-mt/.

    [3] BROEKSTRA, J; Kampman, A; van Harmelen, F. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. International Semantic Web Conference, Sardinia, Italy, 2002.

    [4] TER HORST, H J. Combining RDF and Part of OWL with Rules: Semantics, Decidability, Complexity. In Proc. of ISWC 2005, Galway, Ireland, Nov. 6-10, 2005. LNCS 3729, pp. 668-684.

[5] MOTIK, B.; GRAU, B. C.; HORROCKS, I.; WU, Z.; FOKOUE, A.; LUTZ, C. OWL 2 Web Ontology Language, 2009. See http://www.w3.org/TR/owl2-overview/.

[6] CARROLL, J. J.; DE ROO, J. OWL Web Ontology Language: Test Cases. W3C Recommendation, 10 Feb. 2004. See http://www.w3.org/TR/owl-test/.

    A1. Abbreviations

    API Application Programming Interface

    DBMS Database Management System

    JMS Java Message Service

    KRI Knowledge Representation Infrastructure

    RDF Resource Description Framework

    SAIL Storage and Inference Layer

    SPARQL SPARQL Protocol and RDF Query Language

    TRREE Triple Reasoning and Rule Entailment Engine

    UI User interface

    D4.3 Grammar-Ontology Interoperability

Draft

    The following three tasks can be based on this interoperability:

    1. Natural Language Generation from ontology
    2. Translation of natural language queries to SPARQL
    3. Information extraction

    Natural Language Generation from an Ontology

    We could follow the scheme:

    • Determine small ontology subgraphs.
    • For each ontology subgraph, define natural language patterns that describe it.
    • Define GF grammars that can parse all natural language patterns.

    Ramona Enache has applied a similar approach for SUMO using the patterns that go with this ontology.
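As a toy illustration of the first two steps (ours, not the SUMO implementation), a one-edge ontology subgraph can be verbalized with a fixed pattern per predicate; GF grammars would then parse and linearize such patterns in multiple languages:

    # Toy subgraph-to-pattern verbalizer (illustrative only).
    # One English template per minimal subgraph shape; the predicates reuse
    # the prt:/pru: prefixes of the SPARQL examples below.
    PATTERNS = {
        "prt:locatedIn": "{s} is located in {o}.",
        "pru:hasTitle": "{s} has the job title {o}.",
    }

    def verbalize(triple):
        s, p, o = triple
        return PATTERNS[p].format(s=s, o=o)

    print(verbalize(("Ontotext", "prt:locatedIn", "Sofia")))
    # -> Ontotext is located in Sofia.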

    Translation of Natural Language Queries to SPARQL

It is possible to apply the pattern approach to the translation of natural language queries into SPARQL, but the resulting natural language will be very restricted. I developed in Java a demo of such a restricted natural language for PROTON. The small ontology subgraphs encode information about persons, locations, organizations and job titles. The table below shows some examples from the demo: the left column gives the natural language queries, the right column their SPARQL translations.

    NATURAL LANGUAGE SPARQL
    Give me all persons associated with the organization X.
    select distinct 
    ?person where { ?s rdf:type prt:Organization . 
                    ?s rdfs:label "X" . 
                    ?j prt:withinOrganization ?s . 
                    ?person prt:hasPosition ?j . }
    
    Give me all persons and related job titles associated with the organization X.
    select distinct 
    ?person ?job_title where { ?s rdf:type prt:Organization . 
                               ?s rdfs:label "X" . 
                               ?j prt:withinOrganization ?s . 
                               ?j prt:holder ?person . 
                               ?j pru:hasTitle ?job_title . }
    
    Give me all organizations associated with the location North America.
    select distinct 
    ?organization where { ?s rdf:type prt:Location . 
                          ?s rdfs:label "North America" .
                          ?organization prt:locatedIn ?s . 
                          ?organization rdf:type prt:Organization . }
    
    Give me all organizations associated with the person X.
    select distinct 
    ?organization where { ?s rdf:type prt:Person . 
                          ?s rdfs:label "X" . 
                          ?j prt:holder ?s . 
                          ?j prt:withinOrganization ?organization . }
    
    Give me all job titles associated with the person X.
    select distinct 
    ?job_title where { ?s rdf:type prt:Person .
                       ?s rdfs:label "X" . 
                       ?j prt:holder ?s . 
                       ?j pru:hasTitle ?job_title . }
    
    Give me all job titles and related organizations associated with the person X.
    select distinct 
    ?job_title ?organization where { ?s rdf:type prt:Person . 
                                     ?s rdfs:label "X" . 
                                     ?j prt:holder ?s . 
                                     ?j pru:hasTitle ?job_title . 
                                     ?j prt:withinOrganization ?organization . }
    
    Give me all organizations and related job titles associated with the person X
    select distinct 
    ?organization ?job_title where { ?s rdf:type prt:Person . 
                                     ?s rdfs:label "X" . 
                                     ?j prt:holder ?s . 
                                     ?j pru:hasTitle ?job_title . 
                                     ?j prt:withinOrganization ?organization . }
    

The demo example I developed does not use GF, because the available GF resource API grammars are too restrictive and cannot parse the desired sentences. Of course, a more robust solution is needed, and for this we need suitable GF grammars. Given those, it is possible to handle the correspondence between the ontology subgraphs and the trees that result from parsing the input queries with GF, as sketched below.
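The pattern idea itself is easy to sketch. The following Python fragment (ours; a regular expression stands in for GF parsing) maps the first question pattern of the table above to its SPARQL template:

    # Pattern-to-SPARQL translation, with a regex as a stand-in for GF parsing.
    # Mirrors the first row of the table above (prt: prefix as used there).
    import re

    PATTERN = re.compile(
        r"give me all persons associated with the organization (?P<org>.+)\.",
        re.IGNORECASE)

    TEMPLATE = """select distinct ?person where {{
        ?s rdf:type prt:Organization .
        ?s rdfs:label "{org}" .
        ?j prt:withinOrganization ?s .
        ?person prt:hasPosition ?j . }}"""

    def to_sparql(question):
        m = PATTERN.match(question.strip())
        if m is None:
            raise ValueError("question outside the controlled language")
        return TEMPLATE.format(org=m.group("org"))

    print(to_sparql("Give me all persons associated with the organization Ontotext."))

A GF-based solution would replace the regular expression with parsing into an abstract tree, from which the appropriate SPARQL template is selected.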

    Information Extraction

The same holds for information extraction: given suitable GF grammars, it is possible to handle the correspondence between the ontology subgraphs and the trees that result from parsing the input texts with GF.

    D4.3A Annex to Grammar-Ontology Interoperability

Contract No.: FP7-ICT-247914
Project full title: MOLTO - Multilingual Online Translation
Deliverable: D4.3A Appendix to D4.3 Grammar ontology interoperability
Security (distribution level): Public
Contractual date of delivery: April 2013
Actual date of delivery:
Type: Regular Publication
Status & version: Draft
Author(s): Maria Mateva, Laura Tolosi, Ramona Enache, Aarne Ranta, Inari Listenmaa
Task responsible: Ontotext
Other contributors:


    Abstract

During the review on March 20, 2012, an appendix was requested to document the heuristics, namely the rules expressing the interoperability, underlying the automated tools. Documentation on how to retrieve the software tools, their limitations and their usage is given in an appendix to D4.3. In 2012 we provided a renewed version of D4.3.

In 2013, Ontotext decided to deliver an annex to D4.3 that extends it and summarizes our overall experience, and that of our consortium partners, with grammar-ontology interoperability. The document also addresses the reviewers' remarks and recommendations from the M24 MOLTO review report, for example on possible steps for integrating Term Factory and the KRI, and on the degrees of automation achieved in this field within MOLTO.

Next, this annex provides a brief summary of the techniques we used to build our MOLTO prototypes, aiming to give the required technical details. It also presents the Grammar-Ontology helper built as part of the GF Eclipse Plugin in the scope of WP2. Finally, we give a short summary of present NL-to-ontology approaches.

    D6.1 Simple drill grammar library


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D6.1. Simple Drill Grammar Library
    Security (distribution level): Public
    Contractual date of delivery: M18
    Actual date of delivery: September 2011
    Type: Prototype
    Status & version: Final (evolving document)
    Author(s): J. Saludes, et al.
    Task responsible: UPC
    Other contributors:


    Abstract

This document is the cover of deliverable D6.1 of WP6. It gives installation instructions and a short manual for the Mathematical Grammar Library.

    1. How to get it

The living end (development version) of the library is publicly available via Subversion:

         svn co svn://molto-project.eu/mgl
    

    A stable version can be found at:

        svn co svn://molto-project.eu/tags/D6.1
    

    2. Library structure

The mgl library consists of the following files and directories:

• One directory per language
• abstract directory: the abstract modules of the library
• resources directory: general resource modules, incomplete concrete modules and the generic lexicon
• server: code for the mathbar demo
• test: testing facilities and data
• transfer: Haskell transfer modules

    2.1 Logical structure

At the same time, the library can be organized in three layers of increasing complexity:

• Ground layer: basic and atomic elements (modules Ground and Variables).
• OpenMath layer: the bulk of the library resides here. There is a module for each targeted OpenMath Content Dictionary, namely Arith1, Arith2, Calculus1, Complex1, PlanGeo1, Fns1, Integer1, Integer2, Interval1, Limit1, LinAlg1, LinAlg2, Logic1, MinMax1, Nums1, Relation1, Rounding1, Set1, SetName1, Transc1, VecCalc1 and Quant1.
• Operations layer: the top layer is for expressing simple mathematical drills by combining an imperative (Compute, Prove, Find, etc.) with the productions of the OpenMath layer. It is also possible to express a sequence of simple computations and to set pre-conditions.

    3. Compiling the library

    Inside the mgl directory:

        make
    

    will compile the top (Operations) layer and produce Test.pgf. To compile only the OpenMath layer:

        make om
    

    4. Demo

An online version of the mathbar demo is available at http://www.grammaticalframework.org/demos/minibar/mathbar.html.

    5. Testing

    The library compiles for the following EU languages: Bulgarian, Catalan, English, Finnish, French, German, Italian, Polish, Romanian, Spanish, Swedish.

Regression testing of the OpenMath productions is possible through a treebank containing about 140 productions from this layer. At present it contains linearizations for English, German, Polish and Spanish. At the time of writing, the entries for these languages (except Polish) had been corrected by fluent speakers of the respective languages. To allow for discrepancies, earlier corrections are also stored in the treebank, tagged with author and revision number.

    The structure of the treebank is described in the evaluating document.

    To test the library, make sure you have an up-to-date OpenMath.pgf. You can recreate it by issuing:

         make om
    

and then, in the test directory:

         ./tbm table
    

This produces a table indexed by treebank entry and testing language (English, German and Spanish), showing the number of differences between the actual linearization and the corrected one.

Each time a new revision is committed to the repository, the output of this command is saved into test/table. Comparing different revisions of this file makes it possible to measure the progress of the bug-fixing effort.

    To review the current defects for language L:

         ./tbm review -lL
    

This walks through all the defects, showing the differences, the stored corrected concrete forms, the abstract syntax and the current linearization. For a list of available sub-commands, press h.

    6. Acknowledgments

Krasimir Angelov, Olga Caprotti, Ramona Enache, Thomas Hallgren, Alba Hierro, Inari Listenmaa, Aarne Ranta, Ares Ribó, Adam Slaski, Shafqat Virk and Sebastian Xambó.

    A1. Current differences

    English

    • (46) there {is → exists} x in C such that y divides x
    • (84) is x equal to x{→ ?}
    • (87) {it is →} not {true that →} p
    • (115) the set {with → whose} elements{→ are}y and z

    German

    • (0) der inverse hyperbolische Kosinus des Produkts über {Gamma → Gamma,} wobei x von der Differenz von x und {von →} y bis dem Arcuskosinus von z läuft
    • (5) der absolute Wert des Bruches {von →} x über z
    • (25) das kartesische Produkt von A und {von →} B
    • (26) die Summe über x{→ aufgerundet,}wobei x {über →} die Menge B durchläuft {aufgerundet →}
    • (28) die komplexe Zahl mit polaren Koordinaten dem Quadrat von z und dem Produkt über {z → z,} wobei z von x bis y läuft
    • (38) das Integral des Arcussinus {auf dem → über das} Intervall {aus → von} dem Kubus von Pi {nach → bis} minus Gamma
    • (46) es gibt x in C so dass{→ x}y {durch x dividiert → teilt}
    • (53) für alle z {, → gilt} p
    • (54) für alle z in A {, → gilt} p
    • (55) der größte gemeinsame Teiler von x und {von →} y
    • (61) der Durchschnitt von A und {von →} B
    • (62) die {Funktion aus → Funktion, welche} y {nach der → auf die} Differenz von y und {von →} z{→ abbildet}
    • (63) das kleinste gemeinsame Vielfache von y und {von →} z
    • (65) die {Links-Inverse → linksinverse} Funktion der {Rechts-Inversen → rechtinversen} Funktion des hyperbolischen Kosinus
    • (69) die Menge {Werte → von Werten} der Form der Fakultät von {x → x,} so dass y in x in {A → A,} so dass r ist
    • (74) das maximale Element der Differenz von A und {von →} B
    • (75) der Mittelwert von z und {von →} y
    • (76) der Median von x , {von →} y und {von →} z
    • (78) die Differenz der ganzzahligen Division von x und {von →} z und der Summe über {Pi → Pi,} wobei x {über →} die Menge A durchläuft
    • (85) der Modus von x , {von →} y und {von →} z
    • (86) das siebte Moment von x , {von →} y und {von →} z {an der → über die} Differenz von Pi und {von →} x
    • (87) es ist nicht {wahr → wahr,} daß p
    • (96) die Summe von x und {von →} z
    • (97) y hoch die Summe über {z → z,} wobei z von x bis y läuft
    • (99) der Kubus des {Produkts über z → Produktes von z,} wobei z von x bis dem Kosekans von x läuft
    • (100) das Produkt über der Quadratwurzel von {z → z,} wobei z {über die Menge →} das linksseitige {geschlossene → abgeschlossene} Intervall von Gamma bis y durchläuft
    • (101) das{→ stetige}Intervall von Pi bis x ist eine echte Teilmenge des Definitionsbereiches des {Kosekans → Kosecans}
    • (102) die ganzzahlige Division von x und der Summe über {z → z,} wobei z von dem Argument von z bis dem Rest von x dividiert durch y läuft
    • (105) der Rest der ganzzahligen Division von Gamma und {von →} z dividiert durch Pi
    • (107) die {Rechts-Inverse → rechtinverse} Funktion der Ableitung des Tangens
    • (112) die standarde Abweichung von y und {von →} z
    • (116) der Quotient von x und {von →} Pi ist ein Element des {geschlossenen → abgeschlossenen} Intervalls von x bis z
    • (118) die Differenz der Differenz von A und {von →} B und des offenen Intervalls von z bis y
    • (127) die Größe des linksseitigen {geschlossenen → abgeschlossenen} Intervalls von y bis x
    • (130) die Summe über der fünften Wurzel von {x → x,} wobei x {über die Menge →} das ganzzahlige Intervall von z bis Pi durchläuft
    • (133) das Produkt von x und {von →} y
    • (136) die Summe über {y → y,} wobei y von der Kubikwurzel von x bis der inversen Zahl von Pi {läuft → läuft,} abgeschnitten
    • (138) die Vereinigung von A und {von →} B
    • (139) die Varianz von z und {von →} x
    • (143) das vektorielle Produkt des {Vektoren → Vektors} mit einzigen Komponente Gamma und {von →} v

    Spanish

    • (5) el valor absoluto de la fracción {de →} x entre z
    • (25) el producto cartesiano de A y {de →} B
    • (26) el redondeo hacia arriba del sumatorio de x cuando x varía en {los elementos de →} B
    • (27) el número complejo con coordenadas cartesianas la fracción {de →} z entre y y el truncamiento de e
    • (37) la integral de la arcosecante sobre el intervalo abierto por la izquierda {de → desde} z {a → hasta} y
    • (38) la integral del arcoseno {sobre → desde} el {intervalo del →} cubo de pi {al → hasta el} opuesto de Gama
    • (42) el cociente {de → entre} x y {de →} la raíz cuadrada de y
    • (46) {hay → existe} x en C tal que y divide a x
    • (55) el máximo común divisor de x{→ e}y {de y →}
    • (61) la intersección de A y {de →} B
    • (62) la función {de → desde} y {a → hasta} la diferencia entre y {y → e} z
    • (63) el mínimo común múltiplo de y y {de →} z
    • (70) la matriz con una fila {con → de} componentes x e y y una fila {con → de} componentes y y x
    • (71) la matriz con una fila {con → de componente} única {componente →} x
    • (72) una fila {con → de} componentes x e y
    • (73) una fila {con → de componente} única {componente →} el elemento máximo de A
    • (74) el elemento máximo de la diferencia de A y {de →} B
    • (75) la media de z{→ e}y {de y →}
    • (76) la mediana de x , {de →} y y {de →} z
    • (77) el elemento mínimo del intervalo cerrado {de → desde} x {a → hasta} y
    • (78) la diferencia entre la división entera de x {y de → entre} z y el sumatorio de pi cuando x varía en {los elementos de →} A
    • (82) el intervalo abierto {de → desde} x {a → hasta} y no es un subconjunto propio de A
    • (83) el intervalo abierto por la izquierda {de → desde} y {a → hasta} x no es un subconjunto del intervalo abierto {de → desde} x {a → hasta} y
    • (85) la moda de x , {de →} y y {de →} z
    • (86) el séptimo momento de x , {de →} y y {de →} z en la diferencia entre pi y x
    • (96) la suma de x y {de →} z
    • (100) el producto de la raíz cuadrada de z cuando z varía en {los elementos del → el} intervalo cerrado por la izquierda {de → desde} Gama {a → hasta} y
    • (101) el intervalo {de → continuo desde} pi {a → hasta} x es un subconjunto propio del dominio de la cosecante
    • (102) la división entera de x {y del → entre el} sumatorio de z cuando z varía desde el argumento de z hasta el resto de x dividido por y
    • (105) el resto de la división entera de Gama {y de → entre} z dividida por pi
    • (112) la desviación estándar de y y {de →} z
    • (115) el conjunto {compuesto por los → de} elementos y y z
    • (116) el cociente {de → entre} x y {de →} pi es un elemento del intervalo cerrado {de → desde} x {a → hasta} z
    • (118) la diferencia {de → entre} la diferencia {de → entre} A y {de →} B y {del → el} intervalo abierto {de → desde} z {a → hasta} y
    • (127) el cardinal del intervalo cerrado por la izquierda {de → desde} y {a → hasta} x
    • (128) el intervalo abierto {de → desde} y {a → hasta} x es un subconjunto del intervalo abierto por la izquierda {de → desde} x {a → hasta} y
    • (129) z en {el → un} conjunto {con único elemento → de componente} x tal que r
    • (130) el sumatorio de la raíz quinta de x cuando x varía en {los elementos del → el} intervalo entero {de → desde} z {a → hasta} pi
    • (133) el producto de x{→ e}y {de y →}
    • (138) la unión de A y {de →} B
    • (139) la varianza de z y {de →} x
    • (140) el vector {con → de} componentes y y x
    • (141) el vector {con → de componente} única {componente →} el cardinal de A
    • (143) el producto vectorial del vector {con → de componente} única {componente →} Gama y {de →} v

    A2. RGL tickets for the MGL

    #2: ! with imperative (completed)

    Imperative mode forces "!" at the end?

    Not what we want for exercises.


    Test> l DoComputeF DefineV (Var2Fun f)

    define f !


#21: x {hoch, gleich} y (completed)

    We want to express:

    "x gleich y"

    or

    "x hoch y"


    #38: Ger mkAdA (wont-fix)

    mkAdA : Str → AdA


    It doesn't exist

    #39: Spa from_Prep must be "desde" (wont-fix)

    #40: gilt/holds -- gilt nicht/does not hold (open)

    Example

    for all z , r , it isn't true that r and if p , then , it isn't true that r

    für alle z, r , ist es nicht wahr daß r und wenn p dann ist es nicht wahr daß r


I think it would be better to write "gilt", "gilt nicht" (English "holds", "does not hold") instead of "es ist nicht wahr", "es ist wahr" ("it is true", "it isn't true"):


    for all z , r , r does not hold and if p , then , r does not hold

    für alle z, r , r gilt nicht und wenn p dann r gilt nicht


    #54: there is/exists (46) (wont-fix)

    Abstract

    exist (BaseVarNum x) (Var2Set C) (mkProp (divides (Var2Num y) (Var2Num x)))

    Difference

    there {is → exists} x in C such that y divides x

    #56: set of values of the form ... (69) (completed)

    Abstract

    map y (factorial (Var2Num x)) (suchthat (Var2Set A) x r)

    Difference

    the set of values of the form {the →} factorial of x {such that → for} y {is → ranging} in{→ the set of elements}x {in → of} A such that r

    #62: set whose elements (115) (wont-fix)

    Abstract

    set (BaseValNum (Var2Num y) (Var2Num z))

    Difference

    the set {of components → whose elements are} y and z

    #67: hay → existe , divida (wont-fix)

    l exist (BaseVarNum x) (Var2Set C) (mkProp (divides (Var2Num y) (Var2Num x)))

    hay x en C tal que y divida a x


    hay→existe

    divida → divide

    #74: part_Prep before vowel in Cat (assigned)

    el conjunt amb element únic el cub de pi


    DefNPwithbaseElem : CN → MathObj → MathObj =

    \cn,o → DefSgNP (mkCN cn (prepAdv with_Prep (mkNP (mkCN (mkCN (mkA "únic") element_CN) o)))) ;


Problem:

I cannot write "d'element únic", because if I change the with_Prep to a possess_Prep or a part_Prep (of), it omits the preposition! Why?


    #78: de a y b, no de a y de b (25) (wont-fix)

    Abstract

    cartesian_product (BaseValSet (Var2Set A) (Var2Set B))

    Difference

    el producto cartesiano de A y {de →} B

    #88: Imaginärteil (58) (completed)

    Abstract

    imaginary (Var2Num y)

    Difference

    der {imaginäre Teil → Imaginärteil} von y

    #90: kleinstes gemeinsames Vielfaches (63) (wont-fix)

    Abstract

    lcm (BaseValNum (Var2Num y) (Var2Num z))

    Difference

    das {am wenigstene gemeine → kleinstes gemeinsames} Vielfaches von y und z

    #96: dem reele Teil (109) (completed)

    Abstract

    root2 (real (Var2Num x))

    Difference

    die quadratische Wurzel von dem {reellen → reele} Teil von x

    #99: and_Conj "y" in spanish does not include the case "e" (wont-fix)

Problem: and_Conj in Spanish does not include the case "e", for example for "x e y". It should be

    and_Conj = {s1 = [] ; s2 = etConj.s ; n = Pl} ;

For the moment, we have created a new

    myAnd_Conj = and_Conj ;

in MathI.gf and redefined it as

    and_Conj = {s1 = [] ; s2 = etConj.s ; n = Pl} ;

in MathSpa.gf.

This should be fixed in StructuralSpa.gf.

D6.2 Prototype of commanding CAS


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
Deliverable: D6.2. Prototype of commanding CAS
    Security (distribution level): Public
    Contractual date of delivery: M23
    Actual date of delivery: February 2012
    Type: Prototype
    Status & version: Final (evolving document)
    Author(s): Jordi Saludes, Ares Ribó
    Task responsible: UPC
    Other contributors:


    Abstract

This document is the cover of deliverable D6.2 of WP6. It describes the executables included in this deliverable and gives installation instructions for them.

    Dependencies

The following table describes what is needed in order to use the executables. In all cases you'll need GF and Sage.

gfsage is the simple dialog executable; shell denotes the component that allows using natural language inside Sage; shell-complete is the same with auto-completion of commands.

Component       O.S.                    Extra requirements    Spoken output    Autocompletion
gfsage          Mac OS X, Linux Ubuntu  ghc, curl             OS X (1), Linux  yes
shell           all (2)                 -                     -                no
shell-complete  Linux                   gf python bindings    -                yes

(1) OS X 10.7.

(2) Not tested on Windows, but in that case Sage runs inside a Linux virtual machine.

    Installation

Depending on your permission settings, you might have to run some of these commands with sudo. First, check out the Mathematics Grammar Library from:

    svn co svn://molto-project.eu/mgl
    

Be warned that development will continue for some time in this HEAD branch. For a frozen version, check out from:

    svn co svn://molto-project.eu/tags/D6.2
    

You'll find detailed instructions for installing each executable in the following pages. For the moment, note that it is necessary to modify some files in your Sage installation for these executables to run. These changes usually need to be made just once; the first time, the installation procedure will warn you about it:

    Please add 'sage.nlgf' to /usr/local/sage-4.7.2/devel/sage/setup.py
    

Since ours is not a regular Sage package, we must add a package reference manually by tweaking the setup.py given above (notice that yours may be at a different path). This is a Python file that Sage reads to configure the system using the setup command. Find the setup call in the file; in our copy it is at line 882 and looks like this:

    code = setup(name = 'sage',
    

The setup call lists several items. Locate packages (which is a Python list) and add 'sage.nlgf' (quotes included) among the other packages listed there. Python is picky about indentation and does not like spaces and tabs mixed, so check that you're using the same spacing as the rest of the file.
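Schematically, the relevant part of setup.py then looks like the sketch below. This is heavily abridged: the real file lists many more packages and keyword arguments, and only the placement of the new entry matters:

    # Abridged sketch of Sage's setup.py after the edit (not the real file).
    from distutils.core import setup

    code = setup(name = 'sage',
                 packages = ['sage',
                             'sage.interfaces',
                             'sage.nlgf',   # <- the added entry, quotes included
                            ],
                 )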

The installation has been tested on Sage 4.7.1, 4.7.2 and 4.8.

    gfsage: a natural language interface for Sage

    The goal of this work is to develop a command-line tool able to take commands in natural language and have them executed by Sage, a collection of Computer Algebra packages presented in a uniform way. We present here instructions on how to build the interface and examples of its intended use.

    Building the executable

    You'll need:

    • ghc with cabal, as in Haskell platform
    • curl
• a way to call Sage from a terminal (usually the sage command; it is assumed to be in your PATH)
    • A POSIX system
    • The source version of GF.

    You can get this source version by:

    cabal install gf
    

    We can install the other dependencies too by:

    cabal install json curl
    

    Checkout the mathematics grammar library from:

     svn co svn://molto-project.eu/mgl
    

    This is the active branch. For the fixed one use:

    svn co svn://molto-project.eu/tags/D6.2
    

    Go into the mgl/sage directory (D6.2/sage if you're using the fixed branch) and make it:

    cd mgl/sage
    make
    

The first time, make will fail, asking you to make modifications to the Sage installation. Please refer to the installation page.

    Now try to build gfsage again. All these build operations will ask Sage to "rebuild" itself. Be warned that the first rebuild takes some time:

    make
    

The system has been tested on Mac (OS X 10.7) and Linux (Ubuntu).

    Usage

    Run the tool as:

    ./gfsage english
    

    giving the input language as argument. It will take some seconds to start the server. After that it will reply with some server information and will show the prompt:

        sage>
    

    You can then enter your query:

        sage> compute the product of the octal number 12 and the binary number 100.
        (3) 40
        answer: it is 40 .
    

    To show that a CAS is actually behind the scene, let's try something symbolic:

        sage> compute the greatest common divisor of x and the product of x and y.
        (4) x
        answer: it is x .
    

    and compare it with:

        sage> compute the greatest common divisor of x and the sum of x and y.
        (5) 1
        answer: it is 1 .
    

    Sage does the right thing in both cases, x and y being unbound numeric variables.

        sage> compute the second iterated derivative of the cosine at pi.
        (6) 1
        answer: it is 1 .
    

    Exiting

Exit the session by pressing CTRL+D; this way the server exits cleanly.

    Just another example in a different language:

        ./gfsage spanish
        Login into localhost at port 9000
        Session ID is c1ef10dfd49e4fdb3214fa6d3a3b9c92
        waiting... EmptyBlock 2
        finished handshake. Session is c1ef10dfd49e4fdb3214fa6d3a3b9c92
        sage> calcula la parte imaginaria  de la derivada de la exponencial en pi.
        (4) 0
        answer: es 0 .
    

    More recent examples involving integer literals and integration:

        sage> compute the sum of 1, 2, 3, 4 and 5.
        (3) 15
        answer: it is 15 .
       
        sage> compute the summation of x when x ranges from 1 to 100.
        (4) 5050
        answer: it is 5050 .
    
        sage> compute the integral of the cosine from 0 to the quotient of pi and 2.
        waiting... (5) 1
        answer: it is 1 .
    
        sage> compute the integral of the function mapping x to the square root of x from 1 to 2.
        (6) 4/3*sqrt(2) - 2/3
        answer: it is 4 over 3 times the square root of 2 minus the quotient of 2 and 3 .
    

    Other invocation options

Use English (the default):

    gfsage      
    

    Use LANGUAGE:

    gfsage LANGUAGE
    

    General invocation:

    gfsage [OPTIONS]
    

    where OPTIONS are:

    short form long form description
    -h --help Print usage page
    -i LANGUAGE --input-lang=LANGUAGE Make queries in LANGUAGE
    -o LANGUAGE --output-lang=LANGUAGE Give answers in LANGUAGE
    -V LEVEL --verbose=LEVEL Set the verbosity LEVEL
    -t FILE --test=FILE Test samples in FILE
    -v[VOICE] --voice[=VOICE] Use voice output. To list voices use ? as VOICE.
    -F --with-feedback Restate the query when answering.

    Limitations

    • On Darwin (OS X 10.6 and 10.7) a bug in the Sage part makes the system unresponsive after some computations (between 7 and 10)
    • On some machines, it takes time for the Sage server to respond.

    This condition is signaled by the message:

    gfsage: Connecting CurlCouldntConnect 
    

I used a Linux virtual machine to reproduce this condition and found that it sometimes takes about 10 retries for the server to come up, but it then keeps running fine for hours. My guess is that this is related to some timeout limit in the server. Killing the orphaned Python processes from the previous retries might help too (killall python).

    Realsets

    realsets.py is a Sage module to support subsets of the real field consisting of intervals and isolated points and was developed to demonstrate set operations of the MGL Set1 module.

It is based on previous work from Interval1Sage, adding integration over real sets and real intervals.

An object in this module consists of a list of disjoint open intervals plus a list of isolated points (not belonging to these intervals). Notice that Infinity is acceptable as an interval bound. Therefore, one can define:

• All sorts of real intervals: open, closed and half-open
    • Finite sets
    • Unbounded intervals
    • And combinations of these by union, intersection and taking complements.

A RealSet object represents a set that can be the union of some intervals and isolated points. It consists of:

    • A list of disjoint open non-empty intervals.
    • A list of points. Each of these points belongs at most to one interval.

    Examples

    A closed interval:

    ? RealSet.cc_interval(1,4); 
    [ 1 :: 4 ]
    

    A single point:

    ? RealSet.singleton(1)
    {1}
    

    Union

Union is supported with intervals and can be nested:

    ? I = RealSet.co_interval(1, 4)
    ? J = RealSet.co_interval(4, 5)
    ? M = RealSet.oc_interval(7, 8)
    ? I.union(J).union(M)
    [ 1 :: 5 [ ∪ ] 7 :: 8 ]
    

    Intersection

    ? I.intersection(J)
    ()
    ? I.intersection(RealSet.cc_interval(2,5))
    [ 2 :: 4 [
    

    Queries

    Is a point in the set?

    ? I = RealSet.oo_interval(1, 3)
    ? 2 in I
    True
    ? 3 in I
    False
    

Is a set discrete (i.e. it does not contain intervals)?

    ? RealSet.oo_interval(0,1).discrete
    False
    ? RealSet(points=(1,2,3)).discrete
    True
    

The size of a discrete set is the number of points:

    ? RealSet(points=range(5)).size
    5
    ? RealSet.oo_interval(0,3).size
    +Infinity
    

Is A a subset of B?

    ? A = RealSet.oo_interval(0,1)
    ? B = RealSet.cc_interval(0,1)
    ? RealSet().subset(A)
    True
    ? B.subset(A)
    False
    ? A.subset(B)
    True
    ? A.subset(A)
    True
    ? A.subset(A, proper=True)
    False
    

    Return the infimum (greatest lower bound)

    ? RealSet(points=range(3)).infimum()
    0
    ? RealSet.oo_interval(1,3).infimum()
    1
    

    The opposite of a set: –A = {-x | x ∈ A}

    ? -RealSet.oo_interval(1,2)
    ] -2 :: -1 [
    

    Return the supremum (least upper bound)

    ? RealSet(points=range(3)).supremum()
    2
    ? RealSet.oo_interval(1,3).supremum()
    3
    

The complement of a set:

    ? RealSet.oo_interval(2,3).complement()
    ] -Infinity :: 2 ] ∪ [ 3 :: +Infinity [
    ? RealSet(points=range(3)).complement()
    ] 0 :: 1 [ ∪ ] 1 :: 2 [ ∪ ] 2 :: +Infinity [ ∪ ] -Infinity :: 0 [
    

The set difference of A and B: A \ B = {x | x ∈ A, x ∉ B}

    ? I = RealSet.oo_interval(2,+Infinity)
    ? J = RealSet.oo_interval(-Infinity, 5)
    ? I.setdiff(J)
    [ 5 :: +Infinity [
    ? J.setdiff(I)
    ] -Infinity :: 2 ]
    

    gfsage internal workings

    gfsage is a prototype to demonstrate two-way natural language communication between a user and a Sage system.

    When you invoke the gfsage command interactively:

• A Sage process is started in the background, listening for incoming HTTP requests;
    • A GF pgf module is read and set to mediate between the user and the Sage process;

    The details of these components are given below.

    The GF side

A GF module acts as a post office, translating messages between the different parties (nodes) composing a dialog. This section is more a description of a proposed design strategy for a generic post-office interface based on GF. The actual code implements ideas of this design but, for instance, contains no edges or nodes as explicit entities.

    Nodes and edges

gfsage deals with just two agents:

    1. The user
    2. The Sage system

If the input language differs from the output language, we may consider a third node (the output user).

There is a unique pgf module containing all GF information for the dialog system to work: Commands.pgf. Each node has a language (a GF concrete module) assigned: the user uses a natural language (e.g., CommandsEng for English).

A node reacts to received messages by sending a reply. The chain of messages between two nodes is called a dialog. An active node, such as the user, can start a dialog by sending a message. A passive node, like the Sage system here, just replies to the received messages.

    A node can receive:

    • A regular message from another node: This is a GF linearization in the receptor language.
• A no_parse message from the post office, indicating that a previous outgoing message could not be parsed.
• An is_ambiguous message from the post office, related to a previous message sent by the node, specifying that it was ambiguous and carrying additional information for the node to decide among the possible meanings. To respond, the node must send a disambiguate message to the post office (see below).

    A node can send:

    • A regular message to another node: This is a parseable string for the emitter language.
    • A disambiguate message sent in response to an ambiguous message. In this message the node chooses one of the options or aborts the transaction.

    A regular message between two given nodes corresponds to a fixed GF category. In the case of gfsage it is Command for messages traveling from User to Sage and Answer for messages going the other way.

    Up and Down pipeline

    A regular message from node N1 to node N2 goes through the following steps:

    1. Input string is lexed, that is: separated into parse-able units (tokens);
    2. It is then parsed using the node N1 language and edge category (i.e. node N1 to node N2) into a set of GF abstract trees;
3. This set is, hopefully, reduced by paraphrasing the trees and removing duplicates (this is the compute step);
4. If the resulting set is empty, a no_parse message is sent back to the sending node; if it contains more than one entry, an is_ambiguous message is sent. In both cases the process stops here. Only when the computed set contains exactly one entry is it pushed downstream to node N2.
    5. The abstract tree is linearized using the node N2 language;
    6. The result is unlexed, that is: assembled into a string that is delivered to the receiving node.
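The parse and linearize steps can be sketched with the GF Python bindings (the pgf module); lexing, unlexing and the compute step are simplified away, and the concrete syntax name CommandsSage for the Sage node is hypothetical:

    # Sketch of the regular-message pipeline using the pgf Python bindings.
    # "CommandsSage" is a hypothetical name; Commands.pgf and CommandsEng
    # follow the naming convention described above.
    import itertools
    import pgf

    grammar = pgf.readPGF("Commands.pgf")
    user_lang = grammar.languages["CommandsEng"]   # node N1: the user
    sage_lang = grammar.languages["CommandsSage"]  # node N2: the Sage system

    def deliver(message):
        try:
            # take at most two parses: enough to detect ambiguity
            parses = itertools.islice(user_lang.parse(message), 2)
            trees = [expr for prob, expr in parses]
        except pgf.ParseError:
            return "no_parse"                  # bounce back to the sender
        if len(trees) > 1:
            return "is_ambiguous"              # ask the sender to disambiguate
        return sage_lang.linearize(trees[0])   # push downstream to node N2

    print(deliver("compute the factorial of 5"))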

    The Sage side

For Sage to work alongside GF, we need an HTTP server listening for Sage commands and some scripts to set up the environment and respond to the types of queries that can be expressed in the Mathematics Grammar Library (MGL).

    The Sage server

A Sage process is started in the background by the start-nb.py script in -python mode. This script starts a Sage notebook, as described in Simple server API, listening on port 9000 for requests in HTTP format. It also installs a handler for cleanly disposing of the notebook object whenever the parent process terminates.

The parent process then sends an initial request to load some functions and variables needed by the dialog system (defined in prelude.sage) and goes into the main evaluation loop.

    Sage scripts

    realsets.py
    is a Sage module developed to support set operations as described in the Set1 module of the MGL. (See the page about it)
    prelude.sage
defines Sage functions implementing derivation in the style of the MGL, and state storage for numbers, sets and functions to support anaphora in the dialog.

    Adding voice output to gfsage

    Description

OS X has voice output built-in, usable from the shell by way of the say command. You can use several voices in English or download more for other languages.

    Usage

1. You must build the system in mgl/sage as described previously.
2. Check that you have at least one voice for your preferred languages: go to System Preferences > Speech and click on System Voice.
3. See that you have the right ones. If not, click Customize in the pop-up.
4. Select the ones you want and click OK. When downloading terminates, you may run the tool.
    5. You can call gfsage in 3 different ways, but for voiced output you must use the one with OPTIONS:
         gfsage Use english
         gfsage LANGUAGE Use this language   
         gfsage [OPTIONS] where OPTIONS are:
         -h --help print this page
         -i INPUT --input-lang=INPUT Make queries in LANGUAGE
         -o OUTPUT --output-lang=OUTPUT Give answers in LANGUAGE
         -v[VOICE] --voice[=VOICE] use voice output. To list voices use ? as VOICE.
         -F --with-feedback Restate the query when answering.
    

The options relevant here are -v and -F. Use the first to select voice output. With no argument it will pick the first available voice for the selected OUTPUT language:

    ./gfsage -i english -v
    Voiced by Agnes
    

... It will use Agnes as the English voice. Notice that if you do not give a -o option, the OUTPUT language is assumed to be the same as the INPUT language.

    To list the available voices use:

    ./gfsage -i english -v?
    Agnes, Albert, Alex, Bahh, Bells, Boing, Bruce, Bubbles, Cellos, Daniel, Deranged, Fred, Hysterical, Junior, Kathy, Princess, Ralph, Trinoids, Vicki, Victoria, Whisper, Zarvox
    

    It will list the English voices. To use a specific voice write:

    ./gfsage -i german -vYannick
    Voiced by Yannick
    

The -F option makes the system paraphrase your query when answering. First, a plain answer:

    ./gfsage -i english
    Login into localhost at port 9000
    Session ID is df7ad7c769f2faac68b6bb9489bb97e2
    waiting... EmptyBlock 3
sage> compute the factorial of 5.
    (4) 120
    answer: it is 120 .
    

    ... and now the same with paraphrasing:

    ./gfsage -i english -F
    Login into localhost at port 9000
    Session ID is 88549994a28940fe0657eb9e506a5e84
    waiting... EmptyBlock 3
sage> compute the factorial of 5.
    (4) 120
    answer: the factorial of 5 is 120 .
    

    So, to experience voice output in its full glory you have to use both -v and -F.

    Experiences with Google voice

Following a suggestion from Aarne, I found a Google service for speech input, but the experiments are not encouraging:

1. I recorded Compute this into an m4a file using QuickTime Player on the Mac

    2. Converted it to flac using:

      sox compute.m4a compute.flac rate 16k

3. And queried the service with:

  curl -H "Content-Type:audio/x-flac; rate=16000" "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US" -F "myfile=@compute.flac"

    But got:

     `{"status":0,"id":"56bdb158dd66b25fc2e221364004e620-1","hypotheses":[{"utterance":"coffee lol","confidence":0.46219563}]}`
    

    Other examples:

    • "I like pickles" ⇒ "I like turtles"

    • "The determinant of x" ⇒ "new york" (with confidence 0.88!)

    • "Compute this" ⇒ "coffee lol"

Of course I'm not a native English speaker, but I expected better performance.

    Adding tests to gfsage

    To help with regression testing I recently added a test option to gfsage for batch-testing the system by reading dialog samples from a file.

The samples must be in a text file and consist of a sequence of dialogs, which are sequences of query/response triplets addressed to the Sage system. Notice that a dialog might carry state, in the form of assumptions that are asserted or variables that are assigned. Each dialog, however, is completely independent of the others.

Each dialog starts with a BEGIN or BEGIN language line, which marks the beginning of the dialog's triplets and specifies the natural language for these triplets. The dialog runs until an END line. The language specified becomes the current language; dialogs with no given language are assumed to be in the current language. At the start of a testing suite, the current language is English.

    A triplet is a sequence of 3 lines:

    • The query passed to Sage in the current language
• The raw Sage response
    • This response translated to the current language.

    Example of a test suite

BEGIN spanish
calcula el factorial del número octal 11.
362880
es 362880 .
END
BEGIN english
let x be 4 .
4
it is 4 .
compute the sum of x and 5 .
9
it is 9 .
compute the sum of it and 5 .
14
it is 14 .
END
    

Notice that blank lines are relevant: they mark that Sage responded nothing to the query. Therefore, it is not allowed to insert blank lines between triplets or between dialogs.
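A hedged sketch of a reader for this format follows (gfsage's actual parser may differ; this just mirrors the BEGIN/END and triplet rules described above, keeping blank lines since they mean that Sage responded nothing):

    def read_suite(path):
        dialogs, language = [], 'english'
        lines = open(path).read().split('\n')
        i = 0
        while i < len(lines):
            head = lines[i].split()
            if head and head[0] == 'BEGIN':
                if len(head) > 1:
                    language = head[1]   # else: keep the current language
                i += 1
                triplets = []
                while lines[i].strip() != 'END':
                    # query / raw Sage response / response in `language`
                    triplets.append((lines[i], lines[i + 1], lines[i + 2]))
                    i += 3
                dialogs.append((language, triplets))
            i += 1
        return dialogs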

    Usage

gfsage --test FILE

will test the dialogs in FILE and report on the differences. You get a summary of the results:

    Dialog 'compute Gamma....' failed
    18 out of 19 dialogs successful.
    

    Using natural language inside Sage

    By defining new Sage interfaces we can command the Sage shell and notebook server using natural language.

    Installation

    Move to the sage directory and build sage-shell:

    cd mgl/sage
    make sage-shell
    

The first time you build it, you may run into a warning like the one in the installation section of the front page, or:

    Please add nlgf components to the interfaces list in /usr/local/sage-4.7.2/devel/sage/sage/interfaces/all.py
    

We must inform Sage that there are some new interfaces for it: open interfaces/all.py (notice that your actual path might be different), go to the end of the file, and add something like this:

    from nlgf import english, spanish
    interfaces.extend(['english', 'spanish'])
    

The first line asks the system to load the interfaces for commanding Sage using English and Spanish. The next line adds these to the list of available interfaces.

    Now retry building:

    make sage-shell
    

    At the time of writing, the module nlgf provides catalan, english, german, and spanish interfaces.

    Sage shell with command auto-completion

On some systems you can have the Sage shell commands auto-completed by pressing the tab key. This is experimental, and you have to perform the installation completely by hand.

First you have to build the Python bindings for GF, which for the moment only work on Linux. The build produces a shared library called gf.so. Copy or move it into one of the directories that Python scans when resolving imports. Note that the Python instance run by Sage may be different from the one your machine runs by default; to be sure, do as follows:

    sage -python -c 'import sys; print sys.path'
    

It will list all the directories that Sage's Python scans.

You'll know everything is all right when:

    sage -python -c 'import gf'
    

exits with no complaint. The next time you enter the Sage shell you'll have autocompletion for the GF interfaces.

    Usage

    Shell interface

    Start a Sage shell:

    sage
    

    and switch to one of the defined natural language interfaces:

    sage: %english
    

    will reply with:

--> Switching to Gf <--
    

    If you didn't install autocompletion (which is the usual case, auto-completion being experimental), a warning will appear:

    No autocompletion available
    

Now you're ready to issue Sage commands in English:

    english: compute the summation of x when x ranges from 1 to 100.
    5050
    english: add 3 to it.
    5053
    english: let x be the factorial of 6.
    720
    english: let y be the factorial of 5.
    120
    english: compute the greatest common divisor of x and y.
    120
    english: compute the least common multiple of x and y.
    720
    

Go back to the standard interface by pressing Ctrl+D or typing quit.

    Notebook interface

    Sage has a notebook interface that gives a more flexible way to interact with it. To use it, start the shell as above and then:

sage: notebook(secure=True, interface='')
    The notebook files are stored in: sage_notebook.sagenb
    ****************************************************
    *                                                  *
    * Open your web browser to https://localhost:8000  *
    *                                                  *
    ****************************************************
    There is an admin account.  If you do not remember the password,
    quit the notebook and type notebook(reset=True).
    2012-02-13 12:48:19+0100 [-] Log opened.
    ...
    

On some systems a browser will open automatically. Now you can use Sage from the browser.

    Click on New Worksheet. You'll be asked to rename the worksheet (this is optional). A single cell will be ready for your input. Write your command and press evaluate. Notice that a cell can contain more than one command, separated by newlines.

    Start a new cell by writing:

    %english
    

    and add one or more new lines with commands in English.

Attachment: sage-notebook.jpg (95.69 KB)

    D6.3 Assistant for solving word problems


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: Assistant for solving word problems
    Security (distribution level): Public
    Contractual date of delivery: December 2012
    Actual date of delivery: May 2013
    Type: Prototype
    Status & version: Final
    Author(s): Jordi Saludes
    Task responsible: Jordi Saludes
    Other contributors:


    Abstract

We introduce a prototype for dealing with simple arithmetical problems involving concepts of the physical world (word problems). The first software component allows an author to state a word problem by writing sentences in several languages and converts it into Prolog code. The second component takes this code and presents the problem in the student's language. It then provides step-by-step assistance in natural language for writing equations that correctly model the given problem.

    1. Introduction

This software deliverable is a prototype of a word problem solver, namely a system that interactively poses a word problem (in many languages), then constructs a solution and a reasoning context for it. The overall architecture is based on third-party, open-source software components, which provide the reasoning infrastructure for the system and are not distributed with this deliverable.

    This document describes:

• how to install the prototype;

• how to create word problems involving simple arithmetic;

• how to assist a student in finding the equations that model a problem.

The first component is a Scala library (http://www.scala-lang.org) to be used inside the Scala interpreter shell (in a Read-Evaluate-Print Loop), while the second component is a dialog system that runs on the command line. Both components were developed within the framework of the MOLTO project.

    2. Installation

    The source code for this deliverable can be downloaded from the MOLTO svn repository by:

      svn co svn://molto-project.eu/mgl/wproblems
    

It will appear in the wproblems directory. Prior to building the system for the natural language interpretation of the word problems, you need to install the external components that handle the computational aspects of the system.

    Software requirements

    Third-party software components provide the following functionalities in the system:

• SWI-Prolog (in our architectures: (1) SWI-Prolog version 6.2.2 for i386-darwin11.3.0, (2) SWI-Prolog Multi-threaded, 64 bits, Version 6.2.6). Set the environment variable SWIPL_LIBDIR to the path to the SWI-Prolog library. The prototype employs Prolog as the domain reasoner for certain schemata of word problems.
• Scala (in our architectures: (1) Scala version 2.9.2, (2) Scala version 2.10.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)). Scala is a general purpose programming language designed to express common programming patterns using both object-oriented and functional paradigms; conciseness is a key feature of Scala (scalable language). The prototype employs Scala for constructing, from natural language input, a Prolog program that can solve the problem.

In addition, the system requires the jpl library for accessing Prolog from Java code. It is installed by the Prolog installer; check that jpl.jar exists and write down its path (it will be needed in the configuration step below).

    Project-related software components:

• GF (in our architecture, version 3.4). GF is used to provide the natural language parsing and generation in the dialog system.

• gf-java, to use the GF web services from Java (distributed under lib, version gf-java-0.8.1.jar).

Install all the components as directed. Now configure, passing SWIPL_LIBDIR (the path to the SWI-Prolog library) and JPL_LIBDIR (the path to the directory containing jpl.jar). In our case:

      ./configure SWIPL_LIBDIR=/opt/local/lib/swipl-6.2.2/lib/i386-darwin11.3.0/ JPL_LIBDIR=/opt/local/lib/swipl-6.2.2/lib/
    

and then build the system:

      make
    

    3. Quick examples of usage

Having finished the installation step, we are now ready to use the system [1].

In this document, word problem means a mathematical problem that requires writing the equations describing all the relevant information needed to get the solution. We will split the solution of such a problem into:

Modeling:
Finding the equations describing all the relevant information needed to get the solution. This requires the student to use common-sense reasoning about the world and the ability to write these relations in mathematical language.
Solving:
Determining the solution by manipulating the equations in a purely formal way.

There are a lot of applications for helping students with the solving step, but only a few for the modeling step. We present here a prototype addressing the modeling step for problems requiring just elementary arithmetic.

    The system allows two modes of usage, for authors (teachers) and for students:

    • creation of a word problem
    • tutorial dialog for solving a word problem.

The first application runs inside the Scala REPL and consists of a library implementing the class Problem, with resources for constructing problems from natural language sentences. Problems are saved as Prolog clauses, with comments used to reconstruct the originating sentences. The second application is a Scala executable that loads a saved problem and engages the student in a natural language dialog leading to the problem being correctly modeled.

Both applications use a Prolog database to reason about the problem. The basic difference is that for the author tool the system constructs the model automatically, in order to check that the problem is consistent (it does not contain contradictions) and complete (it has enough information to give a single solution), while for the student tool the model construction is driven by the sentences proposed by the student. The system leads the student through several discovery steps (see the next section) and checks that the proposed sentences are correct and relevant.
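To make the two checks concrete, here is a toy rendering in Python with sympy; the prototype itself does this through its Prolog reasoner, not sympy:

    # Toy illustration of consistency and completeness (not the
    # prototype's Prolog mechanism), using the fruit problem shown
    # below: x oranges, 2 apples and 3 bananas out of 7 fruit.
    from sympy import symbols, solve

    x = symbols('x')
    equations = [x + 2 + 3 - 7]
    solutions = solve(equations, [x], dict=True)

    print len(solutions) >= 1   # consistent: no contradictions
    print len(solutions) == 1   # complete: a single solution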

    Writing a word problem

    Invoke the author tool by:

    ./create
    

Create a new problem to be saved into the file fruit.pl:

    scala: val p = new Problem("fruit.pl")
    p: wp.Problem = Problem with 0 statements
    

We could use the Statement class to add new statements to the problem with the += operator. However, it is more convenient to define a statement factory for entering them in a natural language (denoted by its 3-letter ISO code):

    scala: val en = new StatementFactory("Eng")
    

    We can now use a predictive parser to enter a new sentence into the problem:

    scala: p += en.read
    Eng: John has seven fruit .
    

Notice the final period. We can check how many statements our problem has:

    scala: p
    res1: wp.Problem = Problem with 1 statements
    

    Let us add some more facts:

    scala: p += en.read
    Eng: John has two apples , some oranges and three bananas .
    scala: p += en.read
    Eng: how many oranges does John have ?
    

To take a look at the internal representation of the problem, use print:

    scala: p.print
    

    We can check if the problem is consistent (it does not contain contradictory statements) or complete (it has a single solution) by using the methods consistent and complete:

    scala: p.complete
    res3: Boolean = true
    

    Remember to save the problem:

    scala: p.save()
    Saved to 'fruit.pl'
    

    and now we can exit:

    :q
    

    Solving the problem

    We can now try to solve our problem, by calling model with the file containing the problem:

     ./model fruit.pl 
    

    It shows us the statement of the problem:

    John has seven fruit .
    John has two apples , some oranges and three bananas .
    how many oranges does John have ?
    

    and displays the prompt:

     ?
    

    We can always press return at the prompt (or type help) for the system to suggest the proper action:

    you must assign a variable to the oranges that John has .
    

    But we do not know how to assign variables. Let us ask for an example:

    ? give me an example
    let $x$ denote the animals that Mary has
    

    Using this template we can now compose a definition for the variable x:

    ? let x denote the oranges that Mary has
    you must assign a variable to the oranges that John has .
    

    I forgot that we were dealing with John's fruit, not Mary's:

    ? let x denote the oranges that John has
    it is right .
    

Press return again for the next suggestion:

    you must split the fruit that John has .
    

This means that we have to specify how John's fruit is split into different classes:

     ? the fruit that John has are the apples that John has and the bananas that John has
     you must consider oranges .
    

    Yes, there are oranges too. Let us correct it:

     ? the fruit that John has are the apples that John has , the bananas that John has and the oranges that John has
     it is right .
    

    Good. Next suggestion:

    you must write an equation which says that the fruit that John has are the bananas that John has , 
    the oranges that John has and the apples that John has .
    

    What about this?

    ? y plus 2 plus 3 is equal to 7
    it doesn't follow .
    

This means that the proposed equation cannot be deduced from the statement of the problem. Let us see what is wrong with the variable y:

    ? tell me about y
    nothing is known about it .
    

    Perhaps we used a different variable to denote the amount of oranges:

    ? tell me about the oranges that John has
    the oranges that John has are $x$ oranges .
    

    So we used x for it. Just to confirm it:

      ? tell me about x
      $x$ denotes the oranges that John has .
    

    We rewrite the equation using x:

    ? x plus 2 plus 3 is equal to 7
    it is right .
    

    Now the problem is correctly modeled. The next action will give us the solution:

    the oranges that John has are two oranges .
    

    Going multilingual

To run the same problem in Spanish, add the 3-letter ISO code of the language as a second argument:

    ./model examples/fruit.pl spa
    ...
    Juan tiene siete frutas .
    Juan tiene dos manzanas , algunas naranjas y tres plátanos .
    ¿ cuantas naranjas tiene Juan ?
    

    Asking for help:

    ? 
    debes asignar una variable a las naranjas que Juan tiene .
    

    Asking for an example:

    ? dame un ejemplo
    denota las cartas que María tiene por $z$
    

[1] The system will start and stop the GF-java service for you, but if you run into trouble you can check the state of the service with bin/wpserver status and stop it with bin/wpserver stop.

    4. Reasoning aspects of word problems

    Word problem schemata

The current prototype allows one to state word problems of the following form:

       John|Mary has|owns  one|two|...|seven|some fruit|apples|oranges|bananas|animals|rabbits|cows
    

    in the languages: English, Catalan, Swedish and Spanish.

    Amounts

The building block for the reasoning is the amount: a relaxed version of a set in which one does not have access to the constituent elements, but can know the number of elements in it.

An amount is constructed by (see the sketch after this list):

• Giving the cardinal and the class of its elements (e.g. three oranges). Notice that the cardinal may be undefined (e.g. some oranges);

• The own predicate binding an individual and a class (e.g. the apples that John has);

• Disjoint unions of these constructions (e.g. three apples and two oranges).
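The following Python fragment renders the notion informally; the names are hypothetical, and the prototype itself represents amounts as the Prolog terms shown in the next subsection:

    # Illustrative-only rendering of amounts (hypothetical names).
    class Amount(object):
        def __init__(self, cls, cardinal=None, parts=()):
            self.cls = cls            # class of the elements, e.g. 'orange'
            self.cardinal = cardinal  # None encodes an undefined cardinal ('some')
            self.parts = list(parts)  # disjoint union of sub-amounts

    three_oranges = Amount('orange', 3)
    some_oranges  = Amount('orange')                 # 'some oranges'
    mixed = Amount('fruit', parts=[Amount('apple', 3), Amount('orange', 2)])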

    Propositions

Available sentences express the equality between two amounts (e.g. The fruit that John has are two apples and some oranges).

    The modeling process implies transforming a set of propositions into another set in which the numerical interpretation is evident.

    Setting the problem model

    We consider two grammars to express these facts:

    • The plain language is for direct communication with the user;

    • The core language is for the reasoner to work with.

This is how we express the amount John's apples in the plain language (Prolog concrete syntax):

    own(john, apple)
    

    while in core:

    p(X, apple, own(john,X))
    

The latter is better suited to reasoning.

Another step in normalizing an amount (making it core) is to disaggregate sums. In this way a statement like John has three apples and six bananas is converted into: John has three apples and John has six bananas.

Another case is converting questions such as how many apples does Mary have?, which are represented in plain as:

    find(own(mary,apple))
    

    into the core expression:

    find(X, apple, own(mary,X))
    

A set of statements in the core language is what is needed to process a word problem. This is what the create tool saves: a Prolog file consisting of:

    • A GF abstract tree for the plain sentence of a problem. This is written as a Prolog comment.

    • Core statements in Prolog format that correspond to the plain expression.

    As an example, this is a complete problem in core Prolog clauses. The comments contain the GF abstract tree corresponding to the original plain expression:

    % abs:fromProp (E1owns john (gen Fruit n7))
    % Eng:John has seven fruit .
    -(p(_1, fruit, own(john, _1)), *(7, unit(fruit))).
    % abs:fromProp (E1owns john (aplus (ConsAmount (gen Apple n2) (BaseAmount (some Orange) (gen Banana n3)))))
    % Eng:John has two apples , some oranges and three bananas .
    -(p(_5, apple, own(john, _5)), *(2, unit(apple))).
    -(p(_6, banana, own(john, _6)), *(3, unit(banana))).
    -(p(_7, orange, own(john, _7)), some(orange)).
    % abs:fromQuestion (Q1owns john Orange)
    % Eng:how many oranges does John have ?
    find(_19, orange, own(john, _19)).
    

    Workflow for modeling a problem

    When the model tool is started on a word problem file, the system uses the GF abstract lines to display the statement of the problem in the selected language. Now the student must go through a sequence of steps to have the problem correctly modeled:

1. Assigning variables. At the beginning the student must choose variables to designate unknowns that are relevant to the problem. This includes the target unknowns (they appear as arguments of find clauses) and expressions like some apples.

    2. Discovering relations. In this step the student has to combine information from different statements into new relations. For example, decomposing the fruits that John has into the apples and bananas that John has.

3. Stating equations. In the next step, the student converts the relations uncovered in the previous step into numerical equations. This step finishes when there are enough equations to determine the unknowns of the problem. The system checks that the student's equations are consistent and are entailed by the problem information.

    4. Final. At the last step, the system displays the solution for the unknowns of the problem and exits.

    5. Current limitations and future work

The current prototype is a proof of concept aiming to demonstrate that the semantics of word problems can be handled given a formalization of the specific domain, a decision procedure on the resulting model, and a natural language application that allows one to express and semantically interpret the facts describing the specific world instance.

In this work we have considered problems of a specific kind, but we maintain that, in the e-Learning scenario, which is our target area of application, word problems can be classified according to schemes which can be formalized along the lines shown here. In many problems, understanding the natural language formulation means translating it to facts in the knowledge base, where two seemingly independent facts are put in a relation and become a new assumption for solving the problem (if A is an animal tamer, then A is not afraid of animals; if F is the father of S, then F is older than S; every orange is a fruit). Construction of the correct assumption can be done in an exhaustive way only under a finite world assumption (what is known is what is explicitly stated).

    Future work

Replacing the Prolog engine by a proof assistant. These systems deliver proofs of propositions in a theory. Using a theory that supports a kind of word problem, and forcing the problem author to express the problem as a valid theorem in this theory, would have the benefit of uncovering hidden assumptions in the problem statement.

Also, such systems are more expressive than Prolog clauses, and their support for complex tactics for automatic proving could benefit the maintenance of the system.

On the student side, the discovery of new facts is converted into asserting propositions that can be transparently proved by tactics or presented to the student to deal with. This would allow reusing the same problem at different education levels, according to what is assumed and what is proved by the student.

    D8.2 Multilingual grammar for museum object descriptions


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D8.2 Multilingual grammar for museum object descriptions
    Security (distribution level): Public
    Contractual date of delivery: 1 Mar 2012
    Actual date of delivery: 16 Mar 2012
    Type: Prototype
    Status & version: Draft
    Author(s): D. Dannélls et al.
    Task responsible: UGOT
    Other contributors: All


Attachments: WP8-D2.pdf (241.34 KB), d8.2-grammars.tar.gz (9.21 KB)

    D9.1 MOLTO test criteria, methods and schedule


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D9.1. MOLTO test criteria, methods and schedule
    Security (distribution level): Confidential
    Contractual date of delivery: M7
    Actual date of delivery: October 2010
    Type: Report
    Status & version: Draft (evolving document)
    Author(s): L. Carlson et al.
    Task responsible: UHEL
    Other contributors:



    Abstract

    The present paper is the summary of deliverable D 9.1 as of M6. Workpackage materials can be found at the UHEL MOLTO website (https://kitwiki.csc.fi/twiki/bin/view/MOLTO/WebHome). This document also links to the MOLTO official website (http://www.molto-project.eu/).

    (The official MOLTO website is the prime place for coordinating the project as (long as) material on it is uncluttered, reliable and up to date. For local work, informal project communication and creative planning, the UHEL MOLTO website is open to all MOLTO partners.)

This paper is structured as an introduction followed by sections per workpackage. The WPs are divided into the front end WPs (WP3 and the use cases) and the back end ones (WPs 2, 4, 5). For each WP we survey promises from the DoW and ongoing work, and derive requirements from them, followed by evaluation plans or recommendations. Text in brackets refers to its source. Action points are in boldface.

The wealth of cited content aims to bring together different strains of documented work, planned or in progress, in order to get an updated view of the ongoing MOLTO process, and thus cover the bases for making the tool and user WP requirements meet. We take as a base what the technology offers and scale user expectations from that.


    1. Introduction

    We go over the later WP9 tasks first:

• performing evaluation: individual partners, as contributors to WP9, perform WP-wise evaluation subject to the criteria collected in this document. The WP10 leader collects the contributions to be included in the periodic reports as required. ???
• bug fixing and consolidation: the WP leader and/or the coordinator will maintain bug tracking tools and distribute bug reports or feature requests to the appropriate partner(s).

D9.1 is to define the requirements for both the generic tools and the case studies in a coherent way that can lead to maximal synergy between work packages. To do this we need to detail the project plan and schedule. This then implies the main outline of the evaluation schedule.

    2. Schedule

The MOLTO dependency chart only shows dependencies of WP9 on the use cases (WPs 6-8) plus the dissemination WP10. The boldfaced bits above entail that there are dependencies on the tools workpackages as well.

By the MOLTO timetable, WPs 2, 4 and 9 (tools, ontology, requirements/evaluation) started at once. The translation tools WP3 and the use case WPs 5-6 start at M7 (Varna). The patents use case WP7 has not started, due to the loss of a partner.

By the DoW, MOLTO aims to have working prototypes along the way. So far, each partner has been providing their own demos. Progressively, there will be more need for integration; WP3 in particular will use most of the rest as components. In the best case, integration can be just plugging in APIs, with local bilateral negotiation between a provider and a user. But to ensure this, we must agree in time on what the APIs will provide.

As suggested in the DoW text (but not spelled out in the schedule), specification/version checkpoints should be agreed more often between the tools WPs. At Varna, we get the first update of the tools and ontology workpackages. We should get together to fix times and expected contents for the remaining internal checkpoints as well. It would help to add checkpoint dates plus time dependencies to the above schedule (turning it into a proper Gantt chart --- the “Gantt chart” in the DoW is more like a PERT chart). It also helps to be clear about just what capabilities each release is planned to offer. Proposals for what to insert into the project schedule are made along the way below.

    Checkpoints can be constructed from the deliverables list and the milestones table.

The deliverables list implies these checkpoints, with implications for the evaluation timetable:

• M03 MOLTO web service, first version – we got the Phrasebook running from the MOLTO website.
• M12 GF grammar compiler API – what does this add?
• M18 Grammar IDE, Ontology-GF interface, Translation tools API – a web translation platform that allows on-the-fly extension of lexicon and grammars with ontology tools. Evaluation of lexicon and grammar extension can start.
• M24 Translation tool prototype running – translator tool evaluation can start with test users (on the museum case and, if relevant, the math case).
• M30 GF tools integrated with SMT tools – evaluating the combined system on the patent use case can start. Manuals done: testing with new users becomes possible.

Milestone MS3 may need updating relative to the deliverables list. No important deliverables are scheduled between M6 and M12 that would motivate a demonstrator there. A more appropriate place for the next version of the translation tool (after the Phrasebook) is after M18. M18 should make available ontology interoperability and, along with that, new lexical tools.

Having firmed up the schedule, we go through the WP9.1 tasks boldfaced in the WP9 statement of purpose.

    [From DoW] The work will start with collecting user requirements for the grammar development IDE (WP2), translation tools (WP3), and the use cases (WP6-8). We will define the evaluation criteria and schedule in synchrony with the WP plans (D9.1). We will define and collect corpora including diagnostic and evaluation sets, the former, to improve translation quality on the way, and the latter to evaluate final results.

    3. Collecting user requirements

We have not been able to do much interviewing here, because the patent user partner (WP7) is missing and the two others have not started their WPs yet. We do not have real end users in the use cases. In the mathematics case, the end users could be math teaching platform developers; in the patent case, patent office staff; in the museum case, museum workers. These are content professionals with more than average technical facility.

    The use cases were scheduled as follows.

    • WP6: case study Mathematics start month 7
    • WP7: case study patents start month 4 (not started due to loss of patent partner)
    • WP8: case study cultural heritage start month 13

This problem was implicit in the original timetable, which expected WP9 to work on the use cases before the use case WPs started working. This was noted in the kickoff meeting, and it was agreed that this task would be rescheduled as necessary.

Pending user input, we decided to derive requirements from MOLTO's promises and compare them to the tools' resources. The promises made by MOLTO in the DoW are summarised below.

    [DoW 5]

    The single most important S&T innovation of MOLTO will be a mature system for multilingual on-line translation, scalable to new languages and new application domains. The single most important tangible product of MOLTO is a software toolkit, available via the MOLTO website. The toolkit is a family of open-source software products:

1. a grammar development tool, available as an IDE and an API, to enable use as a plug-in to web browsers, translation tools, etc., for easy construction and improvement of translation systems and the integration of ontologies with grammars (WP2)
    2. a translator’s tool, available as an API and some interfaces in web browsers and translation tools (WP3)
    3. a grammar library for linguistic resources
    4. a grammar library for the domains of mathematics, patents, and cultural heritage

    4. Defining evaluation criteria

A helpful list of quality dimensions relevant to MOLTO evaluation can be derived from the DoW list of links between the main objectives and the tasks in the WPs:

    1. adaptability of translation systems: WP2
    2. user friendliness and integration in workflows: WP3
    3. integration with semantic web technology: WP4
    4. usefulness on different domains: WP6,WP7,WP8
    5. scaling up towards more open text: WP5,WP7
    6. quality of translation: WP9
    7. wide user adaptation and exploitability: WP10

Here are some measurable expected outcomes. Most of them are directly applicable as testable quantitative evaluation measures. It is another question how many test rounds we can do, given the need for fresh test subjects.

Feature                  | Current        | Projected            | Remarks
Languages                | up to 7        | up to 15             | languages treated simultaneously
Domain size              | 100’s of words | 1000’s of words      | 4 domains with substantial applications (“substantial” not quantified here)
Robustness               | none           | open-text capability | translation quality “complete” or “useful” on the TAUS scale (Translation Automation Users Society)
Development per domain   | months         | days                 |
Development per language | days           | hours                |
Learning (grammarians)   | weeks          | days                 |
Learning (authors)       | days           | hours                | source authoring: the MOLTO tool for writing translatable controlled text can be learned in less than one hour; the speed of writing translatable controlled text is of the same order of magnitude as writing unlimited plain text

The figure of 18 grammar library languages is the minimum number of languages we expect to be available at the end of MOLTO. The figures of 3 to 15 are the numbers of languages actually implemented in MOLTO’s domain grammars (3 in WP7, 15 in WP6 and WP8).

The measurements of all these features are performed within WP9 in connection with the project milestones. The advisory group will confirm the adequacy and accuracy of the measurements.

    The objects of evaluation – even the translated texts – vary considerably per WP. We detail some criteria per WP below. Evaluation criteria and methods have been collected on the UHEL MOLTO website (esp. https://kitwiki.csc.fi/twiki/bin/view/MOLTO/EvaluationCookbook).

    5. Defining evaluation corpora and tools

Not much could be done here (yet). We do not have patent corpora. The mathematicians have yet to collect their word problems. We got a small museum text corpus (approx. 25,000 words in Swedish, with a set of 9 short passages translated into English, presumably by non-native speakers) from Gothenburg.

We have translated parts of this corpus, both manually and using MT, as test material for BLEU evaluation. A pilot comparing BLEU scores on this material to a manual error analysis is under way.

A small test GF grammar for a sample of the corpus has been written (link). It has helped make the requirements on grammar-ontology interoperability more concrete (below).

We have also fetched the usual EU multilingual corpora onto our test platform (hippu.csc.fi).

We have found time to install an evaluation platform, to collect and test standard issue translation quality evaluation tools, to develop forthcoming MOLTO lexicon tools, and to learn GF and develop ideas about the ontology-to-grammar interface. The IQmt evaluation platform was tested on a small sample of machine and human translated text (English into Finnish) (see https://kitwiki.csc.fi/twiki/bin/view/MOLTO/EvaluationCookbook).

UHEL also took part in the MOLTO phrasebook task, a demo for translating touristic phrases between 14 European languages: Bulgarian, Catalan, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Romanian, Spanish, Swedish. This experiment presents one way to evaluate the effort required for adding new language versions (more on this below).

We divide the rest of the paper by WPs into the front end: the translation tool, the use cases and associated lingware (ontologies and grammars), and the back end: the translation system (WPs 2, 4, 5), presented in this order. We also try to form an idea of what the WPs are currently about, to see how they are construing their tasks. Information about this (at least task titles) was found on the MOLTO website.

    6. Front end

    WP3: Translator's tools

    MOLTO translation scenario and user roles

The MOLTO workflow is a break with tradition in the professional translation business, as well as at the consumer end, in that it merges the roles of content author and translator. In professional translation, a document is authored at the source and the translator's work on the source is read-only. At the consumer end, MT is largely used for gisting from unknown languages into familiar ones.

The main impact is expected to be on how the possibilities of translation are viewed in general. The field is currently dominated by open-domain browsing-quality tools (Google Translate and Systran), and domain-specific high-quality translation is considered expensive and cumbersome.

MOLTO will change this view by making it radically easier to provide high-quality translation on its scope of application—that is, where the content has enough semantic structure—and it will also widen this scope to new domains. Socioeconomically, this will make web content more available in different languages, including interactive web pages.

At the end of MOLTO, the technology will be illustrated in case studies that involve up to 15 languages with a vocabulary of up to 2,000 special terms (in addition to the basic vocabulary provided by the resource grammar).

The generic tools developed in MOLTO will moreover make it possible for third parties to create such translation systems with very little effort. Creating a translation system for a new language covering an unlimited set of documents in a domain will be as smooth (in terms of skill and effort) as creating an individual translation of one document.

(The last sentence sounds like a tall order. But probably it just points out that once MOLTO has been primed for one text, it can translate any number of (sufficiently) similar ones.)


The MOLTO change of roles will also entail a change of scenarios.

The translator's new role (parallel to WP3: Translator's tools) will be designed and described in the D9.1 deliverable. Most current translator's workbench software treats the original text as a read-only source. The tools to be developed within WP3 (+ WP2) will lead towards a more mutable role for the source text. The translation process will resemble structured document editing or multilingual authoring more than transformation from a fixed source into a number of target languages.

Since the MOLTO scenario implies major differences from the received translation workflow and from the current roles and requirements of translation client, translator, revisor etc., MOLTO is not likely to impact the translation business at large in the near future. Instead, it has its chances in entering and creating new workflows, in particular in multilingual web publishing. Multilingual websites are currently developed by means of crowdsourcing translation with tools borrowed from the software localization business (links). MOLTO could complement or replace this workflow with its new role cast of a content producer or technical editor who generates multilingual content from a single language source. Applications may include multilingual Wikipedia articles, e-commerce sites, medical treatment recommendations, tourist phrasebooks, social media, SMS.

The introductory scenario of this proposal is the multilingual Wiki system presented in (Meza Moreno and Bringert 2008). In this system, users can add and modify reviews of restaurants in three languages (English, Spanish, and Swedish). Any change made in any of the languages gets automatically translated to the other languages.

As for CAT in general, the advantages of MOLTO can be particularly clear in versioning already existing sites.

We next review user requirements by type of user and the expected expertise of each. Consider the role cast around MOLTO; it can have at least these:

• Author
• Editor
• Translator
• Checker
• Ontologist
• Terminologist
• Grammarian
• Engineer

    So far, all of these roles are merged. Different use scenarios may separate some and merge others. Peculiar to MOLTO is the merge of the author/editor/translator roles. In the MOLTO scenario, the editor-translators cannot be expected to know (all) the target language(s). The target checker(s) and terminologist(s)-grammarian(s) are likely to be different from them, possibly a widely distributed crowd.

The translator's tool serves primarily the author/editor/translator/checker roles. It links to TF, which serves the ontologist/terminologist roles (and connects them to the former). Presumably, the Grammar IDE supports the last four roles on the above list.

The author is likely to be some sort of expert on the subject matter, but not necessarily an expert on ontology work. The editor, if separate from the author, could be less of a subject expert but possibly more of an ontologist. How much of a difference there needs to be between these roles depends on the cleverness of the MOLTO tools.

Say an author types away and MOLTO counters with questions caused by the underlying ontology (of the type: do you mean this or that?). Unless the author agrees with the ontology, he may be hard put to answer, while an editor/ontologist (familiar with the ontology and/or the way MOLTO works) may know how to proceed – to choose the right thing, or to realize that the right alternative is missing and how to fix it.

Analogous comments can be made about the relations between author, translator, checker and terminologist. It is all very well for the author to immediately see translations in umpteen languages he does not know. He has no way of knowing whether they are correct (unless MOLTO provides some way for him to check – say, back translation with paraphrase?). Also, concrete grammars may ask awkward questions (of the type: do you mean male or female, familiar or polite?). To get things right, the author would need to know whether one should be familiar or polite in language N. Here, he needs (to be) a translator or native checker. Considerations like this need to be taken into account in the WP3 requirements analysis.

WP3 requirements

The following lengthy quote from the DoW recaps the main ingredients of the translation tools made available to WP3 by WP2.

[9    Translator’s tools in DoW]

For the translator’s tools, there are three different use cases:

• restricted source
  – production of source in the first place
  – modifying source produced earlier
• unrestricted source

Working with a restricted source language recognizable by a GF grammar is straightforward for the translating tool to cope with, except when there is ambiguity in the text. The real challenge is to help the author to keep inside the restricted language. This help is provided by predictive parsing, a technique recently developed for GF (Angelov 2009). Incremental parsing yields word predictions, which guide the author in a way similar to the T9 method in mobile phones. The difference from T9 is, however, that GF’s word prediction is sensitive to the grammatical context. Thus it does not suggest all existing words, but only those that are grammatically correct in the context.
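For concreteness, such completions can be requested over HTTP from a running GF server. The sketch below assumes a grammar served by gf -server on its default port; the grammar name, concrete syntax name and example prefix are placeholders, and only the complete command of the PGF web service is assumed:

    # Hedged sketch: grammar-driven word prediction via the GF web
    # service (gf -server, default port 41296). Names are placeholders.
    import urllib

    GRAMMAR = 'http://localhost:41296/grammars/Phrasebook.pgf'

    def predict(prefix, source='PhrasebookEng'):
        query = urllib.urlencode({'command': 'complete',
                                  'from': source,
                                  'input': prefix})
        return urllib.urlopen(GRAMMAR + '?' + query).read()

    # Only grammatically possible continuations are suggested.
    print predict('this wine is ')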


Predictive parsing is a good way to help users produce translatable content in the first place. When modifying the content later, e.g. in a wiki, it may not be optimal ... This is where another utility of the abstract syntax comes in: [syntax editing]. In the abstract syntax tree, all that is changed is the noun, and the regenerated concrete syntax string automatically obeys all the agreement rules. This functionality is implemented in the GF syntax editor (Khegai & al. 2003).

The predictive parser of GF does not try to resolve ambiguities, but simply returns all alternatives in the parse chart. This is not always a problem, since it may be the case that the target language has exactly the same ambiguity, and then it remains hidden in the translation. In practice this happens often in closely related languages. But if the ambiguity makes a difference in translation, it has to be resolved. There are two complementary approaches: using statistical models for ranking, or using manual disambiguation. … For users less versed in abstract syntax, however, a better choice is to show the ambiguities as different translation results. Then the user just has to select the right alternatives. The choice is propagated back into the abstract syntax, which has the cumulative effect that a similar ambiguity in a third language gets fixed as well. This turns out to be very useful in a collaborative environment such as Wikipedia.

Both predictive parsing and syntax editing are core functionalities of GF and work for all multilingual grammars. While the MOLTO project will exploit these functionalities with new grammars, it will also develop them into tools fitting better into users’ work flows. Thus the tools will not require the installation of specific GF software: they will work as plug-ins to ordinary tools such as web browsers, text editors, and professional translators’ tools such as SDL and WordFast.

The snapshot in Figure 2 is from an actual web-based translation prototype using GF. It shows a slot in an HTML page, built using JavaScript via the Google Web Toolkit (Bringert & al. 2009). The translation is performed on a server, which is called via HTTP. Client-side translators, with similar user interfaces, can also be built by converting the whole GF grammar to JavaScript (Meza Moreno and Bringert 2008).

To deal with unrestricted legacy input, such as in the patent case study, predictive parsing and syntax editing are not enough. The translator will then be given two alternatives: to extend the grammars, or to use statistical translation.

For grammar extension, some functionalities of the grammar writer’s tools are made available to the translator—in particular, lexicon extension (to cope with unknown words) and example-based grammar writing (to cope with unknown syntactic structures). In statistical translation, the worst-case solution is to fall back on phrase-based statistical translation. In MOLTO, we will study ways to specialize this to translation in limited domains, so that the quality is higher than in general-purpose phrase-based translation. We will also study other methods to help translators with unexpected input.

WP3 has its main deliverables at months 18, 24 and 30.

WP3 Deliverables

Del. no | Del. title                                | Nature   | Date
D 3.1   | MOLTO translation tools API               | P        | M18
D 3.2   | MOLTO translation tools prototype         | P        | M24
D 3.3   | MOLTO translation tools / workflow manual | RP, Main | M30

[WP3 in DoW]

The standard working method in current translation tools is to work on the source and translation as a bilingual text. Translation suggestions, sought from a TM (Translation Memory) based on similarity or generated by an MT system, are presented for the user to choose from and edit manually. The MOLTO translator tool extends this with two additional constrained-language authoring modes, a robust statistical machine translation (UPC) mode, plus vocabulary and grammar extension tools (UGOT), including: (i) a mode for authoring source text, where context-sensitive word completion is used to help in creating translatable content; (ii) a mode for editing source text using a syntax editor, where structural changes to the document can be performed by manipulating abstract syntax trees; (iii) back-up by robust statistical translation for out-of-grammar input, as developed in WP5; (iv) support for on-the-fly extension by the translator using a multilingual ontology-based lexicon builder; and (v) example-based grammar writing based on the results of WP2.

The WP will build an API (D3.1, UHEL) and a Web-based translator tool (D3.2, by Ontotext and UGOT). The design will allow the usage of the API as a plug-in (UHEL) to professional translation memory tools such as SDL and WordFast. We will apply UHEL’s ContentFactory as a distributed repository system with a collaborative workflow for multilingual terminology.

This is what we say about the eventual translation platform in the DoW (the section numbering 1.2.5 seems to be a random error):

1.2.5  Multilingual services

MOLTO will provide a unique platform for multilingual document management, satisfying the five desired features listed in Section 1.1. [?] It will enable truly collaborative creation and maintenance of content, where input provided in any language of the system is immediately ported to the other languages, and versions in different languages are thereby kept in synchrony. This idea has had previous applications in GF (Dymetman & al. 2000, Khegai & al. 2003, Meza Moreno and Bringert 2008). In MOLTO, it will be developed into a technology that can be readily applied by non-experts in GF to any domain that allows for an ontology-based interlingua.

The methodology will be tested on three substantial domains of application: mathematics teaching material, patents, and museum object descriptions. These case studies are varied enough to show the generalisability of the MOLTO technology, and also extensive enough to produce useful prototypes for end users of translations: mathematics students, intellectual property researchers, and visitors to museums. End users will have access in their own languages to information that may be originally produced in other languages.

This does not actually say that all three use cases use one and the same platform (unless 'unique' means just one). It is not even certain that they want the same features. The mathematicians are likely to need a math editing tool and perhaps access to a computational algebra solver. Patent translators may need access to patent corpora and databases. Museum people may need to work with images. Future MOLTO users may have their own favourite platforms with such facilities in place.


Rather, the WP3 translation tools deliverable should be a set of plugins usable on many different platforms, in turn variously using the common GF back-end plugins listed above.

Still, we need a flagship demonstrator for the project. The flagship demonstrator should be a generic web editing platform. Minimally, it can be an extension of the existing GF web translation demo. In the best case, it could be installed as a set of plugins to some existing web platform like Mediawiki, Drupal and/or some open source CAT tool(s).


The demonstrator should be able to have at least the following plugins:

• GF translation editor (including autocompletion and syntax editing)
• GF grammar IDE
• TF ontology/lexicon manager
• Ontotext ontology tools (if separate from above)
• SMT translator (if separate from above)
• TM (translation memory)


The TM on the list is a stand-in for tools to support non-constrained editing. (It appears that some use cases will need to mix GF translation with manual (CAT- or SMT-supported) translation.)

All or parts of some existing web translation/localization platform(s) could be taken as a starting point. Or conversely, some existing CAT tool components could be plugged into ours. (The latter plan may now seem more promising.)


Translator’s tools promised by WP2 include:

• text input + prediction (= autocompletion from grammar)
• syntax editor for modification
• disambiguation
• on-the-fly extension


The MOLTO workflow and role cast must be spelled out in the grammar tool manual (D2.3) and the MOLTO translation tools / workflow manual (D3.3). We should start writing these manuals now, to fix and share our ideas about the user interfaces.


    WP3 evaluation

The main claims to fame of MOLTO are to produce high automatic translation quality, particularly with respect to faithfulness, into multiple languages from one pre-editable source, and, as a means to that, practically (= economically) feasible multilingual online translation editing with a minimum of training:

[DoW]
The expertise needed for using the translation system will be minimal, due to the guidance provided by MOLTO.

Feature            | Current | Projected
Learning (authors) | days    | hours

These claims should then be among the items to evaluate.


Quantified evaluation of translation tool features makes sense starting with the translation tool prototype developed in WP3 (M24). The tests can be developed and calibrated on the initial demonstrator at M18.

    We distinguish below between evaluating the translation result and evaluating the translation process.

    3a. Evaluating the translation result

We argue below that there is little sense in WP9 quantitatively measuring MOLTO translation quality with standard MT evaluation tools except at the end of MOLTO (D9.2). On the way there, the WPs (in particular the GF grammar and SMT WPs) should institute their own progress evaluation schedules. They may then outsource translation quality evaluations to WP9 when appropriate. What we want to avoid is an externally imposed evaluation drill during WP work, which can produce skewed results and cause useless delays.


We have created a UHEL MOLTO TWiki website to coordinate our workpackages internally (link). The website is open to the other MOLTO partners as well.

We have installed standard SMT evaluation tools (hippu.csc.fi). A pilot study on measuring translation fidelity has been conducted in a PhD project associated with MOLTO (Maarit Koponen).


    This is what MOLTO promised in the DoW about translation quality assessment:


To measure the quality of MOLTO translations, we compare them to

(i) statistical and symbolic machine translation (Google, SYSTRAN); and
(ii) human professional translation.

We will use both

• automatic metrics (IQmt and BLEU; see section 1.2.8 for details (???)), and
• TAUS quality criteria (Translation Automation Users Society).

As MOLTO is focused on information-faithful, grammatically correct translation in special domains, the TAUS results will probably be more important.

Given MOLTO’s symbolic, grammar-based interlingual approach, scalability, portability and usability are important quality criteria.

These criteria are quantified in D9.1 and reported in the final evaluation (D9.2).

In addition to the WP deliverables, there will be continuous evaluation and monitoring, with internal status reports according to the schedule defined in D9.1.


    The criteria (scalability, portability, and usability) mean that MOLTO should have wider coverage, be easier to extend and need less expertise than similar (symbolic, grammar-based, interlingual) solutions heretofore.


    [12   Translation quality]


    We will compare the results of MOLTO to other translation tools, using both automatic metrics (BLEU, Bilingual Evaluation Understudy; Papineni et al. 2002) and, in particular, the human evaluation of “utility” as defined by TAUS. The comparison is performed with the freely available general-purpose tools Google Translate and SYSTRAN. While the comparison is “unfair” in the sense that MOLTO works with special-purpose domain grammars, we want to perform measurements confirming that MOLTO's quality really is essentially better. Comparisons with domain-specific systems will be performed as well, if any can be found; domain-specific translation systems are still rare and/or not publicly available.

     

    Regarding automatic metrics for MT, lexical n-gram based metrics (WER, PER, BLEU, NIST, ROUGE, etc.) have been the usual practice of the last decade. However, recent studies showing some limitations of lexical metrics at capturing certain kinds of linguistic improvements and at appropriately ranking heterogeneous MT systems (Callison-Burch et al. 2006, 2007, 2008; Giménez 2008) have fostered research on more sophisticated metrics that combine several aspects of syntactic and semantic information. The IQmt suite, developed by the UPC team, is one example in this direction (Giménez and Amigó 2006; Giménez and Màrquez 2008). In IQmt, a number of automatic metrics for MT, exploiting linguistic information from morphology to semantics, are available for English and will soon be extended to other languages (e.g., Spanish). These metrics are able to capture more subtle improvements in translation and show high correlation with human assessments (Giménez and Màrquez 2008; Callison-Burch et al. 2008). We plan to use IQmt in the development cycle whenever possible. For languages not covered by IQmt, we will rely on BLEU (Papineni et al. 2002).

     

    Regarding human evaluation, the TAUS method is the most appropriate one for the MOLTO tasks, since we are aiming for reliable rendering of information. It consists of inspecting a significant number of source/target segments to determine the effectiveness of information transfer. The evaluator first reads the target sentence, then reads the source to determine whether additional information is gained or misunderstandings are revealed.

     

    The scoring method is as follows:

     

    4. Complete: All of the information in the source was available from the target; reading the source did not add to information or understanding.

     

    3. Useful: The information in the target was correct and clear, but reading the source added some additional information or understanding.

     

    2. Marginal: The information in the target was correct, but reading the source provided significant additions or clarifications.

     

    1. Poor: The information in the target was unclear and/or incorrect; reading the source would be necessary for understanding.

     

    We aim to reach “complete” scores in mathematics and museum translation, and “useful” scores in patent translation.
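    As a sketch of how such targets could be tracked (our own illustration, not part of the promised tooling; the judgment figures are invented), evaluator scores can be aggregated per use case and compared against the target level:

      # Aggregate TAUS utility scores (4 = Complete ... 1 = Poor) per use
      # case and check the average against the target quoted above.
      from statistics import mean

      TARGETS = {"mathematics": 4, "museum": 4, "patents": 3}

      def meets_target(use_case, segment_scores):
          """segment_scores: a list of 1-4 judgments on source/target segments."""
          avg = mean(segment_scores)
          return avg, avg >= TARGETS[use_case]

      # Hypothetical judgments for ten patent segments:
      print(meets_target("patents", [3, 4, 3, 3, 2, 4, 3, 3, 4, 3]))  # (3.2, True)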

     

    Dimensions not mentioned in the TAUS scoring are “grammaticality” and “naturalness” of the produced text. The grammar-based method of MOLTO will by definition guarantee grammaticality; failures in this will be fixed by fixing the grammars. Some naturalness will be achieved in the sense of “idiomaticity”: the compile-time transfer technique presented in Section 1.2.3 will guarantee that forms of expression which are idiomatic for the domain are followed. The higher levels of text fluency reachable by Natural Language Generation techniques such as aggregation and referring expression selection have been studied in some earlier GF projects, such as (Burke and Johannisson 2005). Some of these techniques will be applied in the mathematics and cultural heritage case studies, but the main focus is just on rendering information correctly. On all these measures, we expect to achieve significant improvements in comparison to the available translation tools, when dealing with in-grammar input.

     

    Applying BLEU and similar methods, which compare MT output to human model translations, promises to be laborious in the case of MOLTO, because we have a large number of less common target languages and lack use-case-related corpora. Though we do not yet know exactly which corpora we shall have access to, they are not likely to provide a wealth of (preferably multiple parallel) human model translations for comparison in our special domains:

     

    • We expect the mathematics WP to involve a small number (tens or hundreds) of short (one-paragraph) examples.

    • The museum corpus (at least so far) is not much larger (25K words in all). The largest subset is Swedish only.

    • We do not know yet what to expect from the patent partner.

     

    The main difficulty for automatic comparison measures is ambiguity in natural language: usually there is more than one correct translation of a source sentence, with ambiguity in the choice of synonyms as well as in the order of words. Allowance for free variation through synonymy and paraphrase (free translation in general) is made by adding comparison text. For instance, the NIST evaluation campaign uses four parallel translations (into the same language) of texts on the order of 15-20K words.

     

    More to the point, BLEU results are not likely to prove MOLTO's strengths, because they are not sensitive to fidelity, being in this respect like the n-gram SMT methods they simplify. Preliminary tests to this effect have been conducted by Maarit Koponen (links).

     

    BLEU and similar tests were developed in the context of SMT and for the assimilation (gisting) scenario. Most of the weight in BLEU- or WER-like measures comes from matched words and shorter n-grams. These measures point in the right direction only as long as translation quality is low (as long as long-distance dependencies and fidelity do not matter).
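    To make this concrete, here is a minimal single-reference sketch of BLEU-style scoring (our illustration, not the official implementation, which clips against multiple references): nearly all of the score mass comes from matching words and short n-grams, so long-distance fidelity errors go largely unpunished.

      # Minimal BLEU-style score: modified n-gram precision up to 4-grams,
      # geometric mean, and a brevity penalty for short candidates.
      from collections import Counter
      from math import exp, log

      def ngrams(tokens, n):
          return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

      def modified_precision(candidate, reference, n):
          cand = Counter(ngrams(candidate, n))
          ref = Counter(ngrams(reference, n))
          overlap = sum(min(c, ref[g]) for g, c in cand.items())
          return overlap / max(sum(cand.values()), 1)

      def bleu(candidate, reference, max_n=4):
          ps = [modified_precision(candidate, reference, n)
                for n in range(1, max_n + 1)]
          if min(ps) == 0:
              return 0.0
          bp = min(1.0, exp(1 - len(reference) / len(candidate)))
          return bp * exp(sum(log(p) for p in ps) / max_n)

      cand = "the translation preserves most content words".split()
      ref = "the translation preserves most of the content words".split()
      print(bleu(cand, ref))  # a respectable score despite the dropped words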

     

    The distinction between fluency and fidelity made in human evaluation measures has no counterpart in automatic evaluation measures. Each such measure is taken to judge the overall quality of a candidate sentence or system, rather than quality with respect to particular aspects. Leusch (link) shows that some measures have preferences for certain aspects – the unigram PER correlates with adequacy to a higher degree than the bigram PER, whereas the reverse holds for fluency – but the observation remains to be exploited.

     

    To evaluate fidelity as well as fluency, more grammar-sensitive measures are needed. In smaller use cases, human evaluation is likely to be the cost-effective solution (link). An innovative approach suggested by work in Koponen (to appear) would be to develop the MOLTO evaluation methodology using MOLTO's own technology. The idea is to use simplified (MOLTO or other) parsing grammars to test fidelity and domain ontologies to test fluency.

     

    Fidelity (preservation of grammatical relations) would be gauged by using simplified grammars to parse summaries of text and comparing MOLTO translations of summaries with summaries of translations. The assumption (as implicitly in BLEU) is that the translator is more reliable with shorter bits (and there are more of them).
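    A toy sketch of the idea (ours; the simplified-grammar parses are stubbed as ready-made relation tuples): extract grammatical relations from both directions and use their overlap as a fidelity proxy.

      # Compare relations from a translation of a summary against relations
      # from a summary of the translation; the parse output is stubbed.
      trans_of_summary = {("museum", "houses", "collection"),
                          ("collection", "includes", "silverware")}
      summary_of_trans = {("museum", "houses", "collection"),
                          ("collection", "includes", "cutlery")}

      overlap = len(trans_of_summary & summary_of_trans)
      union = len(trans_of_summary | summary_of_trans)
      print(overlap / union)  # 0.33: one relation of three preserved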

     

    Acceptability of lexical variation in the target text would be checked not against parallel human translations but against multilingual domain ontologies (e.g., allowing vessel or boat instead of ship).
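    A minimal sketch of such a check (ours; in practice the lookup would go against TF or the WP4 knowledge base, and the concept table below is invented):

      # Accept a target word if it verbalizes the same ontology concept
      # as the source term, regardless of which synonym was chosen.
      LEXICALIZATIONS = {
          "ship_concept": {"en": {"ship", "vessel", "boat"},
                           "sv": {"fartyg", "skepp"}},
      }

      def acceptable(concept, lang, word):
          return word.lower() in LEXICALIZATIONS[concept][lang]

      print(acceptable("ship_concept", "en", "vessel"))  # True
      print(acceptable("ship_concept", "en", "raft"))    # False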

     

    Note the analogy to BLEU's use of n-grams as a simplification of SMT methods for comparing SMT output to human targets. Work developing these ideas is in progress in a PhD project associated with MOLTO (Koponen, to appear). The planned GF/SMT hybrid system is interesting here: it suggests analogous ideas for hybridizing statistical and grammar-based evaluation measures.

     

    At the evaluation phase towards the end of MOLTO, a comparison of (say) the patent case output to competing methods using generic tools like the SMT evaluation tools and TAUS criteria is worth doing, and has been promised in the DoW. On the way there, however, we prefer developing and applying MOLTO specific evaluation methods. 

     

    UHEL needs to synchronise evaluation plans with the SMT workpackage.

    3b. Evaluating the translation process 

    WP9 aims to set requirements and evaluate the MOLTO translation workflow from the beginning. We argue below that evaluating the translation workflow and translator productivity is particularly important in MOLTO. For related work in other projects, see https://kitwiki.csc.fi/twiki/bin/view/MOLTO/EvaluationCookbook. Our initial proposals follow below.

     

    The MOLTO pre-editing strategy lets an author or technical editor modify the text, the translator enrich the vocabulary, and the grammarians perfect the grammar until the translation result is acceptable. Therefore the success criterion for the MOLTO approach must be how much effort it takes to get a translation from its initial state to a break-even point (as defined by the use case). A translation can always be made better with more work on the tool; the crux is when the result pays for the effort. The DoW sets these quantitative expectations on source editing:

     

    1. Source authoring: the MOLTO tool for writing translatable controlled text can be learned in less than one hour, and the speed of writing translatable controlled text is of the same order of magnitude as writing unlimited plain text.

     

    “Of the same order” mathematically means that writing with MOLTO is not ten times slower than writing without it. We should clock this.
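    Clocking it could be as simple as the following sketch (ours; the session figures are invented stand-ins for logged editor sessions):

      # Compare authoring speed with and without the MOLTO editor.
      def words_per_minute(word_count, seconds):
          return word_count / (seconds / 60.0)

      plain = words_per_minute(250, 600)        # plain-text session
      controlled = words_per_minute(250, 1500)  # controlled-text session
      # "Same order of magnitude" = slowdown factor below 10.
      print(plain / controlled)       # 2.5 for these example figures
      print(plain / controlled < 10)  # True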

     

    We pick up this discussion again under WP2 in connection with measuring the vocabulary and grammar extension effort.

     

    WP6 Case Study: Mathematics

    The description of this case study in the DoW and on the MOLTO website makes it apparent that the math use case demonstrator is not so much a translation editor as a natural language front end to computer algebra.

     


    Leader:  jordi.saludes

    Timeline: July, 2010 - May, 2012

    Objectives

    The ultimate goal of this package is to have a multilingual dialog system able to help the math student in solving word problems.

    Description of work

    The UPC team, being a main actor in the past development of GF mathematical grammars and having ample experience in mathematics teaching, will be in charge of the tasks in this work package, with help from UGOT and UHEL on technical aspects of GF and translator's tools, along with Ontotext on ontology representation and handling. We will start by compiling examples of word problems. In parallel, we will take the mathematical multilingual GF library which was developed in the framework of the WebALT project and organize the existing code into modules, remove redundancies, and format them in a way acceptable for enhancement by way of the grammar developer's and translator's tools of work packages 2 and 3 (D6.1). The next step will be writing a GF grammar for commanding a generic computer algebra system (CAS) by natural language imperative sentences and integrating it into a component (D6.2) to transform the commands issued to the CAS (maybe as a browser plugin). For the final deliverable (D6.3), we will use the outcome of work package 4 to add small ontologies describing the word problem: we will end with a multilingual system able to engage the student in a dialog about the progress being made in solving the problem. It will also help in performing the necessary computations.

     

    The impression is confirmed by an email from Jordi Saludes:


    "The simplest implementation will be a terminal-based question/answer system like ELIZA, but focused on solving word problems. It will start by giving the statement of the problem, then it will do computations for the student/user, list unknowns, list relations between unknowns, state the progress of the resolution and, maybe, give hints.

    We are thinking about the kind of word problems which require solving a system of (typically two) linear equations. In Spain these are addressed to first or second year high school students."

    On the way to the demonstrator, the plan is to devise small ontologies describing math word problems and verbalise them using the MOLTO platform and the WebALT project math GF grammars. These phases of the work can be evaluated along the lines indicated under WP2-3. Since the corpus is small, manual quality evaluation using TAUS criteria is appropriate. We need to buy the TAUS criteria if we are not getting them from the patent partner.

     

     

    Tasks



    Deliverables


     

    ID    Title                         Due date      Dissemination level  Nature     Publication
    D6.1  Simple drill grammar library  1 June, 2011  Public               Prototype

     


     

     

    WP7 Case study: Patents

    The description of this use case is on hold pending a new partner. There is another EU project about translating patents, PLuTO (described below); one way to assess MOLTO could be to compare our results to theirs.

    PLuTO will develop a rapid solution for patent search and translation by integrating a number of existing components and adapting them to the relevant domains and languages. CNGL bring to the target platform a state-of-the-art translation engine, MaTrEx, which exploits hybrid statistical, example-based and hierarchical techniques and has demonstrated high quality translation performance in a number of recent evaluation campaigns. ESTeam contributes a comprehensive translation software environment to the project, including server-based, multi-layered, multi-domain translation memory technology. Information retrieval expertise is provided by the IRF which also provides access to its data on patent search use-cases and a large scale, multilingual patent repository. PLuTO will also exploit the use-case holistic machine translation expertise of Cross Language, who have significant experience in the evaluation of machine translation, while WON will be directly involved in all phases of development, providing valuable user feedback. The consortium also intends to collaborate closely with the European Patent Office in order to profit from their experience in this area.

    WP8 Case study: Cultural heritage

    WP No 8  Leader UGOT  Start M13  End M30

    WP Title Case Study: Cultural Heritage

    Objectives

    The objective is to build an ontology-based multilingual grammar for museum information, starting from a CRM ontology for artefacts at the Gothenburg City Museum[1], using tools from WP4 and WP2. The grammar will enable descriptions of museum objects and answering of queries over them, covering 15 languages with baseline functionality and 5 languages with more complete coverage. We will moreover build a prototype of a cross-language retrieval and representation system to be tested with objects in the museum, and automatically generate Wikipedia articles for museum artefacts in the 5 languages with extensive coverage.

    Description of work

    The work is started by a study of the existing categorizations and metadata schemas adopted by the museum, as well as a corpus of texts in the current documentation which describe these objects (D8.1, UGOT and Ontotext). We will transform the CRM model into an ontology aligning it with the upper-level one in the base knowledge set (WP4) and modeling the museum object metadata as a domain specific knowledge base. Through the interoperability engine from WP4 and the IDE from WP2, we will semi-automatically create the translation grammar and further extend it (D8.2, UGOT, UHEL, UPC, Ontotext). The final result will be an online system enabling museum (virtual) visitors to use their language of preference to search for artefacts through semantic (structured) and natural language queries and examine information about them. We will also automatically generate a set of articles in the Wikipedia format describing museum artefacts in the 5 languages with extensive grammar coverage (D8.3, UGOT, Ontotext).

    Deliverables

     

    Del. no  Del. title                                                        Nature    Date
    D 8.1    Ontology and corpus study of the cultural heritage domain        O         M18
    D 8.2    Multilingual grammar for museum object descriptions              P         M24
    D 8.3    Translation and retrieval system for museum object descriptions  P, Main   M30

     

     

    The CIDOC Conceptual Reference Model (CRM) is a high-level ontology to enable information integration for cultural heritage data and their correlation with library and archive information. The CIDOC CRM is now in the process of becoming an ISO standard.

    The CIDOC CRM analyses the common conceptualizations behind data and metadata structures to support data transformation, mediation and merging. It is property-centric, in contrast to terminological systems. It is now in a very stable form, and contains 80 classes and 130 properties, both arranged in multiple isA hierarchies.

    The Semantic Computing Research Group (SeCo, Eero Hyvönen) has an ontology for the museum domain (MAO), used for describing content such as museum items. MAO is ontologically mapped to the Finnish General Upper Ontology YSO and has been created as part of the FinnONTO project. The most important application of MAO is the semantic portal for Finnish culture, Kulttuurisampo. SeCo is specialised in indexing websites with ontologies. They are currently translating their ontologies into Finnish and Swedish.

    To be completed...

    7. Back end

    WP2: Grammar developer's tools

    WP2 requirements

    The deliverables promised from WP2:

     


    ID    Title                                   Due date           Dissemination level  Nature               Publication
    D2.1  GF Grammar Compiler API                 1 March, 2011      Public               Prototype
    D2.2  Grammar IDE                             1 September, 2011  Public               Prototype
    D2.3  Grammar tool manual and best practices  1 March, 2012      Public               Regular Publication

     


     

    [this comes from the MOLTO website:]

    Objectives

    The objective is to develop a tool for building domain-specific grammar-based multilingual translators. This tool will be accessible to users who have expertise in the domain of translation but only limited knowledge of the GF formalism or linguistics. The tool will integrate ontologies with GF grammars to help in building an abstract syntax. For the concrete syntax, the tool will enable simultaneous work on an unlimited number of languages and the addition of new languages to a system. It will also provide linguistic resources for at least 15 languages, among which at least 12 are official languages of the EU.

    Description of work

    The top-level user tool is an IDE (Integrated Development Environment) for the GF grammar compiler. This IDE provides a test bench and a project management system. It is built on top of three more general techniques: the GF Grammar Compiler API (Application Programmer's Interface), the GF-Ontology mapping (from WP4), and the GF Resource Grammar Library. The API is a set of functions used for compiling grammars from scratch and also for extending grammars on the fly. The Library is a set of wide-coverage grammars, maintained by an open source project outside MOLTO but, via MOLTO efforts, made accessible to programmers at lower levels of linguistic expertise. Thus we rely on the available GF resource grammar library and its documentation, available through digitalgrammars.com/gf/lib. The API is also used in WP3, as a tool for limited grammar extension, mostly with lexical information but also for example-based grammar writing. UGOT designs the APIs and the IDE, coordinates work on grammars of individual languages, and compiles the documentation. UHEL contributes to terminology management and work on individual languages. UPC contributes to work on individual languages. Ontotext works on the Ontology-Grammar interface and contributes to the ontology-related part of the IDE.

    Here we try to make a bit clearer what the functionalities of the WP2 tools are, and how they relate to the translator's tool.

    We surmise that the grammar compiler's IDE is meant primarily for the grammarian/engineer roles, i.e. for extending the system to new domains and languages. But it may contain facilities or components which are also relevant for the translation tool. In many scenarios, we must allow the translator to extend the system, i.e. to switch to some of the last four roles. Just how the translation tool is linked to the grammar IDE needs specifying.

    What the average user can do to fix the translation depends on how user-friendly we can get. Minimally, a translator only supplies a missing translation on the fly, and all necessary adaptation is handled by the system. Maximally, an ontology or grammar needs extending as a separate chore by hand, using the grammar IDE.

    An author/editor/translator can be expected to translate with the given lingware. The next level of involvement is extending the translation. This may cause entries or rules to be added to a text, company, or domain specific ontology/lexicon/grammar. If the tool is used in an organization, roles may be distributed to different people and questions of division of labor and quality control (as addressed in TF) already arise.

    For it is not only, or even primarily, a question of being able to change the grammar technically, but of managing the changes. A change in the source may cause improvement in some languages and deterioration in others. The author can't possibly check the repercussions in all languages. Assume each user site makes its own local changes: how many different versions of MOLTO lingware will there be? One for each website maintained with MOLTO? How can sites share problems and solutions? A picture of a MOLTO community not unlike the one envisaged for multilingual ontology management in TF starts to form. The challenge is analogous to ontology evolution. There are hundreds of small university ontologies in Swoogle. Quality can be created in the crowd, but there must be an organisation for it (cf. Wikipedia).

    The MOLTO workflow and roles must be spelled out in the grammar tool manual (D2.3) and the MOLTO translation tools / workflow manual (D3.3). We should start writing these manuals now, to fix and share our ideas about the user interfaces.

     

    The way disambiguation now works is that translating a vague source against a finer-grained target generates the alternative translations with disambiguating metatext to help choose the intended meaning (try “I love you” in http://www.grammaticalframework.org/demos/phrasebook/; compare Boitet et al.'s 1993 dialogue-based MT system Lidia, e.g. http://www.springerlink.com/content/kn8029t181090028/).

     

    This facility could link to the ontology as a source of disambiguating metatext, taken either from meta comments or verbalised directly from the ontology.
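    A toy sketch of the presentation (ours; the senses and glosses are invented, and in the proposal the glosses would be verbalised from the ontology):

      # Present alternative translations of a vague source together with
      # disambiguating metatext so the author can choose the intended sense.
      ALTERNATIVES = [
          ("Je t'aime",    "familiar: addressing one person informally"),
          ("Je vous aime", "polite or plural: formal address, or several persons"),
      ]

      for target, gloss in ALTERNATIVES:
          print(f"{target}  [{gloss}]")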

     

    Some of the GF 3.2 features, like parse ranking and example-based grammar generation, have consequences for front-end design as enabling technology.


    WP2 evaluation

     

    [11   Productivity and usability]

     

    Our case studies should show that it is possible to build a completely functional high-quality translation system for a new application in a matter of months—for small domains in just days.

     

    The effort to create a system dynamically applicable to an unlimited number of documents will be essentially the same as the effort it currently takes to manually translate a set of static documents.

     

    The expertise needed for producing a translation system will be low, essentially amounting to the skills of an average programmer who has practical knowledge of the targeted language and of the idiomatic vocabulary and syntax of the domain of translation.

     

    1. Localization of systems: the MOLTO tool for adding a language to a system can be learned in less than one day, and the speed of its use is of the same order of magnitude as translating an example text where all the domain concepts occur.

     

    The role requirements for extending the system remain quite high, not because of the requirements on the individual skills, but because it is less common to find their combination in one person.

     

    The user requirements entail an important evaluation criterion: the guidance provided by MOLTO. It should also lead to system requirements, such as online help, examples, and profiling capabilities.

     

    One part of MOLTO adaptivity is meant to come from the grammar IDE. Another part should come from ontologies. While the former helps extend GF “internally”, the latter should allow bringing in semantics and vocabulary from OWL ontologies. We discuss these two parts in this order.

     

    [8    Grammar engineering for new languages in DoW]

     

    In the MOLTO project, grammar engineering in GF will be further improved in two ways:

    • An IDE (Integrated Development Environment), helping programmers to use the RGL and manage large projects.

    • Example-Based Grammar Writing, making it possible to bootstrap a grammar from a set of example translations.

    The former tool is a standard component of any library-based software engineering methodology. The latter technique uses the large-coverage RGL for parsing translation examples, which leads to translation rule suggestions.

     

    The task of building a new language resource from scratch is currently described in http://grammaticalframework.org/doc/gf-lrec-2010.pdf. As this is largely a one-shot language engineering task outside of MOLTO (MOLTO was supposed to have its basic lingware done ahead of time), it should not call for evaluation here.

     

    Building a multilingual application for a given abstract domain grammar, by applying and extending concrete resource grammars, can use a lighter process. The proposed example-based grammar writing process is described in the Phrasebook deliverable (http://www.molto-project.eu/node/1040). The tentative conclusions were:

     

    • The grammarian need not be a native speaker of the language. For many languages, the grammarian need not even know the language; native informants are enough. However, evaluation by native speakers is necessary.

    • Correct and idiomatic translations are possible.

    • A typical development time was 2-3 person working days per language.

    • Google Translate helps in bootstrapping grammars, but its output must be checked. In particular, we found it unreliable for morphologically rich languages.

    • Resource grammars should give some more support, e.g. higher-level access to constructions like negative expressions, and large-scale morphological lexica.

    Effort and Cost


     

    Based on this case study, we roughly estimated the effort used in constructing the necessary sources for each new language and compiled the following summarizing chart.

    Language   Language skills  GF skills  Informed development  Informed testing  Impact of external tools  RGL Changes  Overall effort
    Bulgarian  ###              ###        -                     -                 ?                         #            ##
    Catalan    ###              ###        -                     -                 ?                         #            #
    Danish     -                ###        +                     +                 ##                        #            ##
    Dutch      -                ###        +                     +                 ##                        #            ##
    English    ##               ###        -                     +                 -                         -            #
    Finnish    ###              ###        -                     -                 ?                         #            ##
    French     ##               ###        -                     +                 ?                         #            #
    German     #                ###        +                     +                 ##                        ##           ###
    Italian    ###              #          -                     -                 ?                         ##           ##
    Norwegian  #                ###        +                     -                 ##                        #            ##
    Polish     ###              ###        +                     +                 #                         #            ##
    Romanian   ###              ###        -                     -                 #                         ###          ###
    Spanish    ##               #          -                     -                 ?                         -            ##
    Swedish    ##               ###        -                     +                 ?                         -            ##
     

    The Phrasebook deliverable is one simple example of what can be done to evaluate the grammar workpackage's promises. The results from the Phrasebook experiment may be positively biased because the test subjects were very well qualified. But this and similar tests can be repeated with more “ordinary people”, and changes in the figures followed as the grammar IDE develops.

     

    It could be instructive to repeat the exact same test with different subjects and compare the solutions, to see how much creativity was involved in them. The less variation there is, the better the chances of automating the process. Even failing that, analysis of the variant solutions could help suggest guidelines and best practices for the manual. Possible variation here also raises the issue of managing changes in a community of users.

    WP4: Knowledge engineering

    Ontotext's contributions to MOLTO through WP4 are:

    • Semantic infrastructure

    • Ontology-grammar interoperability

     

    WP4 requirements
    Semantic infrastructure

     

    The semantic infrastructure in MOLTO will also act as a central multi-paradigm index for (i) conceptual models—upper-level and domain ontologies; (ii) knowledge bases; (iii) content and metadata as needed by the use cases (mathematical problems, patents, museum artefact descriptions); and provide NL-based and semantic (structured) retrieval on top of all modalities of the data modelled.

     

    In addition to the traditional triple model for describing individual facts,

     

     <subject, predicate, object> 

     

    the semantic infrastructure will build on quintuple-based facts,

     

     <subject, predicate, object, named graph, triple set>

     

    The infrastructure will include an inference engine (TRREE), a semantic database (OWLIM), a semantic data integration framework (ORDI) and a multi-paradigm semantic retrieval engine, all of which are previous work resulting from private (Ontotext) and public funding (TAO, TripCom). This approach will enable MOLTO's baseline and use-case-driven knowledge modelling with the necessary expressivity of metadata-about-metadata descriptions for provenance of the diverse sources of structured knowledge (upper-level, domain-specific and derived (from grammars) ontologies; thesauri; domain knowledge bases; content and its metadata).
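    In data-structure terms, the step from triples to quintuples looks roughly like this (a sketch of ours, not Ontotext code; the identifiers are invented):

      # A quintuple extends the RDF triple with a named graph and a
      # triple-set identifier, enabling metadata about metadata.
      from collections import namedtuple

      Triple = namedtuple("Triple", "subject predicate object")
      Quintuple = namedtuple(
          "Quintuple", "subject predicate object named_graph triple_set")

      fact = Quintuple("molto:GF", "rdf:type", "molto:GrammarFormalism",
                       named_graph="graphs:baseline",
                       triple_set="sets:provenance-2010")
      print(fact)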

     

    From Ontotext webpages, we can guess that the infrastructure builds on the following technologies:

     

    • KIM is a platform for semantic annotation, search, and analysis.

    • OWLIM is the most scalable RDF database with OWL inference.

    • PROTON is a top ontology developed by Ontotext.

     

    Milestone MS2 says the knowledge representation infrastructure is opened for retrieval access to partners at M6. The infrastructure deliverable D4.1 is due at M8.

     

    Grammar-ontology interoperability

     

    [7    Grammar-ontology interoperability for translation and retrieval in DoW]

     

    At the time of the TALK project, an emerging topic was the derivation of dialogue system grammars from OWL ontologies. A prototype tool for extracting GF abstract syntax modules from OWL ontologies was built by Peter Ljunglöf at UGOT. This tool was implemented as a plug-in to the Protégé system for building OWL ontologies and was intended to help programmers with an OWL background to build GF grammars. Even though this tool remained a prototype within the TALK project, it can be seen as a proof of concept for the more mature tools to be built in the MOLTO project.

     

     

    A direct way to map between ontologies and GF abstract grammars is a mapping between OWL and GF syntaxes.

     

    In slightly simplified terms, the OWL-to-GF mapping translates OWL’s classes to GF’s categories and OWL’s properties to GF’s functions that return propositions. As a running example in this and the next section, we will use the class of integers and the two-place property of being divisible (“x is divisible by y”). The correspondences are as follows:

     

      Class(pp:integer ...)   <==>   cat integer ;

      ObjectProperty(pp:div   <==>   fun div : integer -> integer -> prop ;
        domain(pp:integer)
        range(pp:integer))

     

    Less syntax-directed mappings may be more useful, depending on what information is relevant to pass between the two formalisms. The mapping is then also less generic, as it depends on the intended use and interpretation of the ontology. The mapping through SPARQL queries below is one example. A mapping over TF could be another one.

     

    The GF-Protégé plug-in brings us to the development cost problem of translation systems. We have noticed that in the GF setting, building a multilingual translation system is equivalent to building a multilingual GF grammar, which in turn consists of two kinds of components:

    • a language-independent abstract syntax, giving the semantic model via which translation is performed;

    • for each language, a concrete syntax mapping abstract syntax trees to strings in that language.

     

    In MOLTO, GF abstract syntax can also be derived from sources other than OWL (e.g. from OpenMath in the mathematical case study) or even written from scratch and then possibly translated into OWL ontologies, if the inference capabilities of OWL reasoning engines are desired. The CRM ontology (Conceptual Reference Model) used in the museum case study is already available in OWL.

     

    MOLTO's ontology-grammar interoperability engine will thus help in the construction of the abstract syntax by automatically or semi-automatically deriving it from an existing ontology. The mechanical translation between GF trees and OWL representations then forms the basis of using GF for translation in the Semantic Web context, where huge data sets become available in RDF and OWL in initiatives like Linked Open Data (LOD).

     

    The interoperability between GF and ontologies will also provide humans with natural ways of interaction with knowledge based systems in multiple languages, expressing their need for information in NL and receiving the matching knowledge expressed in NL as well:

     

      Human -> NL -> GF -> ontology -> GF -> NL -> Human

     

    providing an entirely new dimension to the usability of semantics-based retrieval systems, and opening extensive structured bodies of knowledge in human understandable ways.

     

    Note also that the OWL-to-GF mapping allows a wider human input to GF. OWL ontologies are written by humans (at present at least, by many more humans than GF grammars are).

     

    The MOLTO website gives detail on what is going to be delivered first by way of ontology-GF interoperability. The first round uses a GF grammar to translate NL questions to the SPARQL query language (http://www.molto-project.eu/node/987).


     

    The ontology-GF mapping here is an NL interface to PROTON ontologies, by way of parsing (fixed) NL to (fixed) GF trees and transforming the trees into SPARQL queries to run on the ontology DB.

    Indirectly, this does define a mapping between (certain) GF trees and RDF models, using SPARQL in the middle. SPARQL is not RDF, but a SPARQL query does retrieve an RDF model given a dataset; the model depends on the dataset. With an OWL reasoner thrown in, we can get OWL query results.
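    For illustration (our example; the vocabulary is invented, not actual PROTON terms), a question like “Who painted this portrait?” parsed to a GF tree might come out as a query of roughly this shape:

      # The kind of SPARQL a parsed NL question might be transformed into;
      # the bindings would then be verbalised back to NL via GF.
      question = "Who painted this portrait?"
      query = """
      PREFIX ex: <http://example.org/ns#>
      SELECT ?painter WHERE {
        ?painting a ex:Portrait ;
                  ex:paintedBy ?painter .
      }
      """
      print(question, "=>", query)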

    What WP3 had in mind is a tool to translate between OWL models and GF grammars, i.e. to convert OWL ontology content into GF abstract syntax. This tool is forthcoming next according to the MOLTO presentation slides (http://www.molto-project.eu/node/1008).

    This was confirmed by email from Petar (https://kitwiki.csc.fi/twiki/bin/view/MOLTO/MoltoOntologyEvaluationPlanWP4).

    The translation tools WP3 will consider using the TermFactory multilingual ontology model and tools as middleware between (non-linguistic) ontologies and GF grammars. The idea is to (semi)automatically match or bridge third-party ontologies to TF, a platform for collaborative development of ontology-based multilingual terminology. It then remains to define an automatic conversion between TF and GF.

    The Varna meeting should adjudicate between WP3 and WP4 here.

     

    A concrete subtask that arises here is to define an interface between the knowledge representation infrastructure (due November 2010) and TF (finished in the ContentFactory project at the end of 2010).

     


     

    WP4 evaluation

    Since the aims are related more to use cases and framework development than to enhancing the performance of existing technologies, the evaluation to be done during the project will be of a qualitative rather than quantitative kind.

    The evaluation of these features should reflect and demonstrate the multiple possibilities of GF that are gained through inter-operation with external ontologies. The evaluation of progress will exploit proof-of-concept demos and plans for further development. For further discussion, see https://kitwiki.csc.fi/twiki/bin/view/MOLTO/MoltoOntologyEvaluationPlanD91

     

     

     

    WP5: Statistical and robust translation

    WP5 requirements

    Objectives

    [From DoW] 

    The goal is to develop translation methods that complement the grammar-based methods of WP3 to extend their coverage and quality in unconstrained text translation. The focus will be placed on techniques for combining GF-based and statistical machine translation. The WP7 case study on translating patent texts is the natural scenario for testing the techniques developed in this package. Existing corpora for WP7 will be used to adapt SMT and grammar-based systems to the patents domain. This research will be conducted on a variety of languages of the project (at least three).

    Deliverables

    Del. no  Del. title                                                 Nature    Date
    D 5.1    Description of the final collection of corpora            RP        M18
    D 5.2    Description and evaluation of the combination prototypes  RP        M24
    D 5.3    WP5 final report: statistical and robust MT               RP, Main  M30

    
    
    
    Description of work
    [10 Robust and statistical translation methods in DoW]


    The concrete objectives in this proposal around robust and statistical MT are:

    • Extend the grammar-based approach by introducing probabilistic information and confidence scored predictions.
    • Construct a GF domain grammar and a domain-adapted state-of-the-art SMT system for the Patents use case.
    • Develop combination schemes to integrate grammar-based and statistical MT systems in a hybrid approach.
    • Fulfill the previous objectives on a variety of language pairs of the project (covering three languages at least).

    Most of the objectives depend on the patents corpus. Even the languages of study depend on the data that the new partner provides. To compensate for the resulting delay, both in WP5 and mainly in WP7, we have started working here on hybrid approaches. The methodology now is to develop hybrid methods in a way that is independent of the domain and data sets used, so that they can later be adapted to patents.

    Bilingual corpora are needed to create the necessary resources for training/adapting statistical MT systems and to extend the grammar-based paradigm with statistical information (1 and 2). We will compile and annotate general-purpose large bilingual and monolingual corpora for training basic SMT systems. This compilation will rely on publicly available corpora and resources for MT (e.g., the multilingual corpus with transcriptions of European Parliament Sessions).

    Domain-specific corpora will be needed to adapt the general-purpose SMT system to the concrete domain of application in this project (the patents case study). These corpora will come from the compilation to be made in WP7, led by Mxw.

    We already have the European Parliament corpus compiled and annotated for English and Spanish. The languages will probably be English, German, and Spanish or French, so as soon as this is confirmed the final general-purpose corpus can easily be compiled. The depth of the annotation will depend on the concrete languages and the available linguistic processors.

    Combination of the grammar-based and statistical paradigms is a novel and active research line in MT. (...) We plan to explore several instantiations of the fallback approach, from simple to complex:

    • Independent combination: in this case, the combination is set up as a cascade of independent processors. When grammar-based MT does not produce a complete translation, the SMT system is used to translate the input sentence. This external combination will be set as the baseline for the rest of the combination schemes.

    • Construction of a hybrid system based on both paradigms. In this case, a more ambitious approach will be followed, which consists of constructing a truly hybrid system incorporating an inference procedure able to deal with multiple proposed fragment translations, coming from the grammar-based and SMT systems. Again we envision several variants:

    • Fix translation phrases produced by the partial GF analyses in the SMT search. In this variant we assume that the partial translations given by GF are correct, so we can fix them and let SMT fill the remaining gaps and do the appropriate reordering. This hard combination is easy to apply but not very flexible.

    • Use translation phrase pairs produced by the partial GF analyses, together with their probabilities, to form an extra feature model for the Moses decoder (probability of the target sentence given the source).

    • Use tree fragment pairs produced by the partial GF analyses, together with their probabilities, to feed a syntax-based SMT model, such as the one by Carreras and Collins (2009). In this case the search process to produce the most probable translation is a probabilistic parsing scheme.

    
    
    

    The previous text describes the hybrid MT systems we consider. The baseline is clear; in fact, one can define three baselines: a raw GF system, a raw SMT system, and the naïve combination of both. Regarding real hybrid systems there is much more to explore. Here we list four approaches to be pursued:

    • Hard integration. Force fixed GF translations within a SMT system.

    • Soft integration I. Led by SMT. GF partial output, as phrase pairs, is integrated as a discriminative probability feature model in a phrase-based SMT system.

    • Soft integration II. Led by SMT. GF partial output, as tree fragment pairs, is integrated as a discriminative probability model in a syntax-based SMT system.

    • Soft integration III. Led by GF. Complement with SMT options the GF translation structure and perform statistical search to find the final translation.

    At the moment, we are able to obtain phrases and alignments from a GF-generated synthetic corpus. This is a first step towards the hard integration of both paradigms, and also towards the soft integration methods led by SMT. We are currently going deeper into the latter, as it is a domain-independent study.
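    A minimal sketch of the cascade baseline (ours; gf_translate and smt_translate are stand-ins for the real systems): use the GF output when there is one and fall back to SMT otherwise.

      # Independent combination: grammar-based translation first,
      # statistical fallback when GF has no complete parse.
      def gf_translate(sentence):
          # Stand-in for GF: returns None outside grammar coverage.
          grammar = {"x is divisible by y": "x es divisible por y"}
          return grammar.get(sentence)

      def smt_translate(sentence):
          # Stand-in for the SMT system: always produces some output.
          return f"<smt translation of: {sentence}>"

      def translate(sentence):
          return gf_translate(sentence) or smt_translate(sentence)

      print(translate("x is divisible by y"))    # GF path
      print(translate("an arbitrary sentence"))  # SMT fallback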

    In the evaluation process, these families of methods will be compared to the baseline(s) introduced above according to several automatic metrics.

    WP5 evaluation

    WP5 is going to have its own internal evaluation, complementary to that of WP9. Since statistical methods need fast and frequent evaluation, most of the evaluation within the package will be automatic. For that, one needs to define the corpora and the set of automatic metrics to work with.

    Corpora

    Statistical methods are linked to the patents data. This is the quasi-open domain where the hybridization is going to be tested. The languages of the corpus are still not completely defined, but judging by other work with patents we expect they will be English, German, and French or Spanish.

    Besides the large training corpus, we need at least two smaller data sets, one for development purposes and another for testing. The order of magnitude of these sets is usually around 1,000 aligned segments or sentences. We expect to reach this size, but the final amount will depend on the available data.

    Metrics

    BLEU (Papineni et al. 2002) is the de facto metric used in most machine translation evaluation. We plan to use it, together with other lexical metrics such as WER or NIST, in the development process of the statistical and hybrid systems.

    Lexical metrics have the advantage of being language-independent, since most of them are based on n-gram matching. However, they are not able to capture all aspects of a language, and they have been shown not to always correlate well with human judgements. So, whenever possible, it is good practice to include syntactic and/or semantic metrics as well.

    The IQmt package1 provides tools for (S)MT translation quality evaluation. For a few languages, it provides metrics for this deeper analysis. At the moment, the package supports English and Spanish, but other languages are planned to be included soon. We will use IQmt for our evaluation on the supported language pairs.

    1 http://www.lsi.upc.es/~nlp/IQMT/

    D9.1A Appendix to MOLTO test criteria, methods and schedule

    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D9.1A Appendix to MOLTO test criteria, methods and schedule
    Security (distribution level): Public
    Contractual date of delivery: April 2012
    Actual date of delivery: April 2012
    Type: Report
    Status & version: Final
    Author(s): Lauri Carlson, Inari Listenmaa, Seppo Nyrkkö et al. (UHEL)
    Task responsible: UHEL
    Other contributors:


    Abstract

    During the review on March 20, 2012, an appendix was requested to better specify the methodology that MOLTO intends to adopt to carry out the evaluation of the work and results related to each workpackage. This document tries to clarify the goals and how they will be achieved in Workpackage 9.

    Requirements of the addendum D9.1A

    Requirements of the addendum:

    The first year review recommendation states:

    1. Taking into account the numerous endeavors undertaken in the translation domain, both research and commercial, the market segment addressed by MOLTO should be identified with maximum precision.
    2. The specific case studies should also be taken into account in this effort.

    The second year review recommendation adds:

    1. A concrete evaluation methodology is needed focusing on MOLTO's major goals: how the consortium will prove that its objectives were fully/partially met (target: producers, input: predictable, coverage: limited, quality: publishing).
    2. This should also include an updated description of the test criteria and used methods for each of the use cases as they are progressing, so each of the use cases can be properly evaluated at the end of the project. This also holds for the new use cases.

    MOLTO use scenarios and the market segment

    The scope of applicability for MOLTO translation is a function of the domain and language coverage. The locale and grammar coverage at the start of the project was fixed by the contributed GF resource grammar library. One of the main tasks of the MOLTO project is to provide tools for extending domain coverage and the associated lexical coverage by MOLTO translation users themselves. The tools should make it feasible for user communities to extend MOLTO translation to new domains and vocabularies. The market segment that can be targeted by MOLTO tools by the end of the project is in turn a function of the availability and efficiency of these tools, and thereby of the potential coverage of MOLTO translation. We are aiming at making it feasible to build and use domain-specific grammars with lexicons on the order of thousands of words (instead of hundreds).

    The two properties, restricted coverage and predictable input, restrict the market segment to production (dissemination). The constrained language property means MOLTO will not offer a replacement for CAT, i.e. translation tools that help human translators with complex third-party-authored documents which they are not allowed to modify. But MOLTO translation can be added as an additional facility in the CAT toolkit. Conversely, traditional TMS facilities may add value to the application and extension of MOLTO methods. These ideas are explored in WP3.

    MOLTO remains at the core a tool for constrained language multilingual generation. Its potential strengths are 1) multiple simultaneous target languages and 2) reliable enough quality for blind translation (translation from a known language to unknown languages). The second can only be obtained if the quality is higher than human translation. In practice, some level of human revision is probably going to be needed, but the need can be significantly less than in current workflows.

    From this, we conclude that the most promising market segment for MOLTO translation is constrained language content localization. In the current translation industry, there is a more or less clear split between interface localization, which involves translation of fixed short strings from a list of interface messages by professional or volunteer translators, and content translation, which is mostly done outside of the website using CAT tools.

    MOLTO targets an as yet less explored and little exploited niche between them, viz. multilingual content localization of constrained language content. Typical use cases are a webstore inventory, a museum guide, rule-generated correspondence, or formulaic parts of a more complex document type (say, descriptions of chemical formulas in a patent). Here the content is already regulated and predictable. There are further such scenarios beyond those included in the MOLTO use cases, typically involving database-generated information (e.g. product descriptions, user guides, chemical manufacturers' data sheets, job tickets, medical reports). In some such scenarios, real-time blind translation to multiple languages would be a major selling point.

    In the MOLTO translation scenario related to this market segment, there is a close interaction between some database/ontology and a human/ruleset that generates the text to translate, and the translation process itself. The content to translate can co-evolve with the grammar by which it is translatable. Such use cases will be tested in the MOLTO semantic wiki platform.

    If or as the vision of Linked Data becomes reality, there is bound to be a growing demand for natural language verbalization of the web of linked data ontologies. The Web of Data is supposed to become an additional layer of the web that is tightly interwoven with the classic document Web and has many of the same properties:

    • The Web of Data is generic and can contain any type of data.
    • Anyone can publish data to the Web of Data.
    • The Web of Data is open, meaning that applications do not have to be implemented against a fixed set of data sources, but can discover new data sources at run-time by following RDF links.

    In particular,

    • Data publishers are not constrained in choice of vocabularies with which to represent data.
    • Data is self-describing. If an application consuming Linked Data encounters data described with an unfamiliar vocabulary, the application can dereference the URIs that identify vocabulary terms in order to find their definition.

    The growing linked data cloud can create a growing market segment for a matching linked cloud of multilingual MOLTO ontology verbalizers. Ontotext's GF-based natural language query interface into Ontotext linked data is a first application of MOLTO resources in this direction. As the review points out, a generalization of the ad hoc ontology/GF mappings in the KRI and museum cases gets a high priority here.

    A. The multilingual semantic wiki scenario

    The new workpackage 11 aims to use GF to extend AceWiki into a multilingual constrained language semantic wiki. Like the original AceWiki, it allows users to express in natural language logical constraints that are subject to automated reasoning. AceWiki already has facilities for extending the lexicon. A subset of the constraints expressible in ACE is intertranslatable with the OWL ontology language. In the scenario envisaged here, the multilingual semantic wiki works as a tool for extending a special domain ontology through natural language verbalization. This platform supports the scenario where a special domain ontology and its verbalizations are extended simultaneously.

    In one natural scenario, a special domain expert expresses the constraints in unconstrained natural language as comments in the wiki. One or more ontology experts refine the description into a set of simpler statements in a constrained subset that maps to OWL, using already existing ontologies as a base and creating the missing ontology resources and their verbalizations in a common natural language using the lexicon editor. The domain experts can test the conceptualization by asking questions of the ontology. The questions are answered in natural language using the wiki's reasoners. When the coverage of the ontology and its verbalization in the chosen language(s) is sufficient, the lexicon is extended for the remaining languages by target language experts, using existing term ontologies as a base.

    B. The MOLTO CAT scenario

    More traditional translation projects can also contain parts which can be handled with constrained language translation. The MOLTO patents case has shown that certain sections of patent text, in particular complex chemical compound descriptions, are not well covered by SMT. The MOLTO translator tools workpackage looks into ways of embedding MOLTO constrained language translation as one tool in the toolkit of a more traditional CAT platform. In this use case, we also test the ability of a translation community (company) to collaboratively extend the coverage of the fragment handled with MOLTO tools. This sort of hybrid SMT+MOLTO+CAT workflow is tested with the patents use case in the MOLTO translator tools platform, as described in D3.1. Note that the two scenarios are not exclusive: in an overarching scenario, a domain translation capability is developed in the first scenario and then applied in production translation in the MOLTO CAT scenario. Actors in some of the supporting roles of the MOLTO CAT scenario may use the wiki tool in their work.

    The CAT scenario is described in more detail below under WP 3.

    Relating the scenarios to the MOLTO use cases

    The following details the MOLTO use cases relating them to the scenarios above. Each section lists the evaluation criteria, measures and methods applied in the use case.

    WP2: Grammar developer's tools

    The grammar developer's tools promise to enable quick development of a new domain and language. This promise is best tested directly, by measuring the time and expertise taken

    1. to create or extend an ontology to a new domain using MOLTO tools
    2. to generate an abstract grammar for a domain from an ontology for it
    3. to create or extend the concrete grammar for a language to a new domain
    4. to extend the vocabulary for a language to cover a new or extended domain

    The measures are taken for a system with coverage on the order of a) 100 concepts and b) 1000 concepts. The platforms used in carrying out the tests include the multilingual semantic wiki (tasks 1 and 2), the TermFactory platform (tasks 1 and 4) and the grammar editing tools (tasks 2, 3, 4). To test these claims, we need to fix one or more domains to create or extend. We haven't got a great many domains to choose from yet. We would do well to extend in the direction of known 'good' ontologies:

    1. phrasebook < travel ontologies, e.g. http://sites.google.com/site/ontotravelguides/Home/ontologies
    2. museum/painting < music ontology, http://musicontology.com/
    3. patents/pharmaceutics < chemical safety data sheets + chemical ontologies, e.g. ChEBI, ChemINF

    Baseline evaluation figures prior to the use of MOLTO tools for a domain of smaller size were obtained in the phrasebook exercise reported in Ranta et al. 2010 [9]. For comparability, the same criteria and measures are to be applied in subsequent evaluations.

    WP3: MOLTO CAT tools

    The MOLTO CAT scenario is designed to serve a translation community that carries out translation projects using MOLTO tools as an additional CAT tool. The translation community members are assigned different roles. What they may do depends on the role. Roles are assigned in the translation management system. In the MOLTO demonstration system, the TMS is Globalsight. The TMS manages the resources of a project. The resources include

    • documents
    • grammars
    • translation memories
    • term collections

    A MOLTO CAT translation project is composed of a collection of resources and a community of actors playing different roles in the project. One actor can bear more than one role.

    The roles include

    • project manager (rights to manage the resources and the workflow)
    • editor (source competence in domain, domain expert, authority to edit the source)
    • translator (bilingual competence in domain, not necessarily domain expert)
    • revisor (target language competence in domain, domain expert)
    • ontologist (competence to extend the domain ontology)
    • terminologist (bilingual or target competence in the domain)
    • grammarian (competence to extend domain GF grammar)

    The TMS manages the project workflow, that is, routes documents through different steps between the actors. The actions include

    • project manager:
      • create users
      • assign roles to users
      • create a translation project
      • prepare resources for a translation project
      • plan the workflow
      • assign actors to actions

    • editor
      • split source to constrained/unconstrained sections
      • indicate allowed deviations and authorize new deviations from the constrained language source

    • translator:
      • translate unconstrained sections using CAT tools (including SMT proposals from translation memory)
      • translate constrained language sections using MOLTO
      • propose term for lexical gap
      • create grammar extension request

    • ontologist
      • find or create missing concept
      • create grammar extension request
      • create terminology extension request

    • terminologist
      • find or define equivalents to a new concept

    • grammarian
      • carry out grammar extension

    The typical envisaged workflow is as follows. A translator in a multilingual translation project works on a structured multipart document, some of whose parts are marked as amenable to translation with the MOLTO editor. The rest is translated with traditional CAT tools. A subsection appropriate for MOLTO translation is opened in the MOLTO translation editor. The appropriate GF grammar and terminology are specified in the project resources. If the section is properly within the fragment covered by the grammar, the section should parse and translate correctly without translator intervention. This is the default if the MOLTO-marked section has been created in scenario A. However, until the domain grammar has been fully tested for blind translation in all target languages, a target language translator or revisor must check that the target text is correct.

    If the grammar coverage is not complete, the translation editor shows some parts of the section marked as untranslatable.

    In the easy case, the coverage problem can be fixed by a conservative paraphrase or, if the translator's brief permits pre-editing, by a more creative rewrite of the section source to bring it under the coverage of the MOLTO grammar. The original source and its paraphrase get stored in the translation memory as an instance of source rewrite, and will be available for other translators as a model solution of the coverage problem. If a rewrite is not possible, the next move depends on the workflow.

    1. If the translator's brief is just to produce a complete translation to the target language in a bilingual project, the translator just translates the part not covered by MOLTO using traditional CAT tools. The out-of-coverage segment gets marked as a manually translated MOLTO section segment in the translation memory. Such segments can be collected and sent off as non-coverage tickets to the project's terminology and grammar management.

    2. The task may be to extend MOLTO translation to a language whose coverage in the given domain is not complete.
      1. In the case of a simple out-of-vocabulary term or concept belonging to a category known to the grammar, the MOLTO equivalents editor can be used to extend the concrete and/or abstract vocabulary of the grammar (see the sketch after this list). If a concept with a matching GF category and verbalizations is found in an existing MOLTO term ontology, the missing term can be added to the translation project's GF grammar extension module, so as to become immediately available to further MOLTO translation in the project, and subsequently included in the project ontology.
      2. If a candidate term is found using some non-authoritative lexical source, the candidate term gets added as a term candidate to the relevant domain for community approval. That is, the translation unit containing the proposed candidate concept/term in its abstract/concrete grammar context is saved in translation memory and sent to the terminology management platform for terminology checking and approval.

    3. The task may be to develop a master text or pilot translation, in preparation for a subsequent multilingual translation project (pre-editing). A gap in the MOLTO coverage can arise when the special domain section subject to MOLTO translation has not been authored in the semantic wiki, but for instance generated from a database or merged from text from more than one subdomain. In this case, it is worth spending more effort to extend the coverage of the MOLTO grammar to the source before proceeding to multilingual translation.
      1. In the case of out of vocabulary terms or concepts, the grammar can be extended through the translation editor as above.
      2. In more complex cases needing grammar extension, the translator just creates a model translation and submits it back to the ontology/grammar editing workflow. The model translation is saved in translation memory and can be used in regression testing against the edited grammar.
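
    As a rough illustration of the grammar extension mentioned in case 2 above, such a project-specific extension could be a small GF module pair on top of the domain grammar. The names below (Patent, Compound, mkCompound) are hypothetical, not taken from a MOLTO grammar; this is only a sketch of the mechanism, with each module normally kept in its own file:

    -- hypothetical sketch: extending a domain grammar with one new term
    abstract PatentExt = Patent ** {
      fun acetylsalicylic_acid_C : Compound ;
    }
    concrete PatentExtEng of PatentExt = PatentEng ** {
      -- mkCompound is an assumed paradigm from the base grammar
      lin acetylsalicylic_acid_C = mkCompound "acetylsalicylic acid" ;
    }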

    The MOLTO translation editor

    As indicated in the MOLTO CAT system design, the MOLTO translation editor is integrated as a plugin to the translation management system alongside more traditional CAT editors. The MOLTO CAT scenario sets the following requirements on the editor and its integration to the TMS.

    • editor
      • the MOLTO translation editor's parser can pick out from the source the parts it can translate, and indicate what it lacks for the parts that do not translate.
      • the GF back end is able to include proposed extensions into the grammar.

    The development of the translation editor to satisfy these requirements is taken over by UGOT, as it is closely coupled to the ongoing development of the GF robust parsing and grammar extension services.

    • integration
      • the TMS environment is able to extract from structured source text the parts which are subject to MOLTO translation.
      • the editor has access to a term/ontology manager to propose terms/concepts to fill the indicated gaps and submit new proposals for approval

    These requirements remain the responsibility of UHEL.

    Term ontology management with TermFactory

    The TermFactory term management specification and query/editing API is a Tomcat Axis2 web service API for querying, editing, and storing small RDF/OWL ontologies representing concepts and the multilingual expressions/terms associated with the concepts. TermFactory contains a term ontology schema that follows professional terminology standards, but the tools can also be used to edit any RDF/OWL ontologies through an XHTML representation of RDF. The XHTML representation is highly configurable: it can be parametrized for the presentation layout (concept oriented, lemma oriented), filtered for content, and even localized with another TF term ontology, so that the names of properties and classes shown to the user are chosen from the localization ontology. The term ontology editor is a pluggable JavaScript editor that is offered as a standalone Tomcat servlet as well as a MediaWiki extension. A simpler tabular editor exists for the common task of adding different language equivalents to an existing ontology term.

    TermFactory is to be integrated with the MOLTO KRI over the JMS transport interface provided in the KRI. Besides the Ontotext repositories, TermFactory also talks to Jena RDB and triple set repositories. TermFactory user management is planned to happen through the GlobalSight API.

    WP3 Evaluation

    The GlobalSight translation management system forms a platform to test the MOLTO TT scenario that combines traditional CAT tools with the MOLTO translation editor. The best dataset for testing the full MOLTO CAT scenario is the patents case, since it already uses hybrid methods and generates translations with less than 100% coverage. To have a complete use case of the mixed scenario, a pure GF grammar for chemical compounds could be applied to translate chemical compound definitions in the patent text.

    The MOLTO CAT review workflow will be used to manage translation quality evaluation of the multilingual translations produced in the other use cases. This exercise in itself also serves to test the usability of MOLTO scenario B.

    WP4: Ontology-grammar interoperability

    The second year review considered Deliverables D4.2 and D4.3 insufficient, and they were not approved by the reviewers in their status at the time. The objectives of WP4 are, as stated in the DoW:

    (i) research and development of two-way grammar-ontology interoperability bridging the gap between natural language and formal knowledge; (ii) infrastructure for knowledge modeling, semantic indexing and retrieval; (iii) modeling and alignment of structured data sources; (iv) alignment of ontologies with the grammar derived models.

    D4.2 should contain a report on the Data Models, Alignment Methodology, Tools and Documentation. More specifically, it should contain information about the aligned semantic models and instance bases. While D4.2 contains information about Reason-able views, and the key principles constituting these views are stated in the document, it does not state how these key principles have been implemented in the MOLTO project. D4.2 does not comply with the key principle stating "Clean up, post-process and enrich the datasets if necessary, and do this in a clearly documented and automated manner." D4.2 should contain all details about the automation process for the multiple ontologies, so that this knowledge and technique can be re-used to integrate new ontologies with the existing ones.

    D4.3 should clarify the issue of two-way interoperability between ontologies and GF grammars. This is still unclear, although objective (i) of WP4 makes clear that this is a research-intensive part of MOLTO. Based on the WP4 presentation given in the review, this process requires the manual writing of mapping rules (NL query -> GF, GF -> SPARQL query), which means limited potential for further re-use. The partners must clarify the degree of automation that can be achieved. What is required for porting this to a new application? Concrete steps should be provided, making clear what can and cannot be automated with the provided infrastructure. Details about mapping rule induction etc. should be provided.

    As for the ontology/grammar mappings, here is what we have concretely obtained so far:

    • Ontotext has defined one instance of single ontology triple to GF translation in WP 4.3.
    • Aarne et al. have defined a more complex property tuple to text translation for the Museum case.

    The examples show that the OWL-to-GF mapping need not be difficult in any given case. What remains open is how to generalize these examples to the general case of generating a mapping for a new domain. In particular, we want a solution that allows the reuse of ontology-to-GF mappings to create more complex grammars from existing parts. The modularity of both OWL and GF suggests ways of approaching this goal.

    One approach to a more general solution is to use the term ontologies developed in TermFactory to also store parts of the mappings needed for GF verbalization. In a TermFactory term ontology, a term is a pair of a general language expression and a special language concept. In this approach, an ontology concept would map to an abstract grammar term. Individual language expressions and terms associated with the concept map to concrete grammar terms. A term or expression would inherit GF grammar properties from the classes to which it belongs (say, exp:Noun). Grammatical properties common to all uses of a given general language expression would be stored as properties of the expression. GF terms or grammatical properties that are specific to a domain GF grammar would be stored as properties of a domain specific term.

    Instead of having to define a new grammar and create concept to grammar associations from scratch, a grammar would be compiled from appropriate choices of resource from the term ontology plus a language and/or domain specific syntactic base. To extend a vocabulary, we add a new term (expression, concept) instance, typed in the appropriate categories, and add to it any further GF properties that are relevant to its correct linearization. The concrete expression associated to a compositional abstract grammar term need not be specified in the ontology, if it can be compositionally derived from the GF abstract syntax associated to the concept and other resources in the ontology. The above does not claim to do more than propose a way to decompose the ontology to grammar mapping into reusable parts.
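
    Schematically, the decomposition could look as follows. The module and function names are our own illustration, not an existing MOLTO grammar: the abstract function is generated from the ontology concept, and its linearization comes from the term attached to that concept in the term ontology.

    -- abstract syntax: one fun per ontology concept
    abstract Chem = {
      cat Compound ;
      fun ethanol_C : Compound ;   -- from a concept such as chem:Ethanol
    }
    -- concrete syntax: linearizations come from the concept's terms
    concrete ChemEng of Chem = {
      lincat Compound = Str ;
      lin ethanol_C = "ethanol" ;  -- the English term stored in the term ontology
    }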

    If the approach seems useful, UHEL is prepared to invest effort to building a test case using the museum case as a starting point.

    WP5: Statistical Machine Translation

    The research goal was to develop translation methods that complement the grammar-based methods of WP3 to extend their coverage in unconstrained text translation. Specifically, WP5 promised (i) to create a commercially viable prototype of a system for MT and retrieval of patents in the bio-medical and pharmaceutical domains, (ii) to allow translation of patent abstracts and claims in at least 3 languages, and (iii) to expose several cross-language retrieval paradigms on top of them.

    WP5 evaluation

    WP5 has its own internal evaluation complementing that of WP9. Since statistical methods need fast and frequent evaluations, most of the evaluation within the package is automatic. The WP7 case study on translating patent text is the use scenario for testing the techniques developed in this package. Ultimately, Ontotext will examine the feasibility of the prototype as part of a commercial patent retrieval system (D7.3).

    Corpora

    Statistical methods are linked to patents data. This is the quasi-open domain where the hybridization is going to be tested. The languages of the corpus are English, German, and French, the official languages of the European Patent Office (EPO).

    Besides the large training corpus, we need at least two smaller data sets, one for development purposes and another one for testing. The order of magnitude of these sets is usually around 1,000 aligned segments or sentences. For this, we have used a subset of MAREC patents (http://www.ir-facility.org/prototypes/marec), and a collection of 66 patents provided by the EPO. The concrete figures are explained in WP5 and summarised in the table below.

                 Seg DE-EN   Seg FR-EN   Seg FR-DE
    dev MAREC    993         993         993
    test MAREC   1,008       1,008       1,008
    test EPO     847         858         831

    Metrics

    BLEU [3] is the de facto metric used in most machine translation evaluations. We plan to use it together with other lexical metrics such as WER or NIST in the development process of the statistical and hybrid systems. Lexical metrics have the advantage of being language-independent, since most of them are based on n-gram matching. However, they are not able to capture all the aspects of a language, and they have been shown not to always correlate well with human judgments. So, whenever possible, it is good practice to include syntactic and/or semantic metrics as well. The Asiya package provides tools for (S)MT translation quality evaluation. For a few languages, it provides metrics to do this deeper analysis. At the moment, the package supports English and Spanish, but other languages are planned to be included soon. We will use Asiya for our evaluation on the supported language pairs.
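
    For reference, BLEU combines modified n-gram precisions p_n with a brevity penalty BP:

    \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
    \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}

    where c is the total length of the candidate translation, r the effective reference length, and the weights w_n are typically uniform with N = 4.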

    Manual evaluation

    Final translations will also be manually evaluated. This is the most reliable way to quantify the quality of a translation, since, as said in the previous section, automatic metrics cannot capture all the aspects that a human evaluator takes into account.

    We now propose to follow the ranking for evaluation that is used in patent offices such as the EPO. It can be applied to sentences but also to full patents. Automatic metrics will therefore also be adapted to deal with full-patent evaluation, so that we can see how they correlate. This way we will be able to perform a deeper study.

    Quality level: Ranking for human evaluation

    • Accurate + consistent IPC vocabulary

    The translation is understandable and actionable, with all critical information accurately transferred. Most of the text is well written using a language consistent with patent literature.

    • Fluent - consistent IPC vocabulary

    The translation is understandable and actionable, with almost all critical information accurately transferred. Some of the text is well written using a language consistent with patent literature.

    • Actionable

    The translation is not entirely understandable and actionable, but some critical information is accurately transferred. Part of the text is well written using a language consistent with patent literature.

    • May be actionable

    Possibly understandable and actionable (given enough context and/or time to work it out), with some information stylistically or grammatically odd, but the language may still convey sound content to a patent professional. Most of the text is written using a language consistent with patent literature.

    • Not useful

    Absolutely not comprehensible and/or little or no information is transferred accurately.

    WP6: Math

    The math use case remains as it was, except that it may now assume that the premises requiring encyclopedic knowledge needed to frame word problems are given. Assuming that the math scenario will be embedded in the semantic wiki, the background premises may be given by the author of the problem in the facts database where the problems are formulated.

    The mathematics use cases involve a problem author, a student and a teacher. The usability of the scenario is tested with realistic subjects playing each of these roles, and the evaluation is collected with a questionnaire and/or a journal. In addition, we should try to estimate the savings from the system when it is scaled up to a larger user base and a greater variety of languages, since these are the novelties in the MOLTO solution.

    Evaluation for WP6

    Diagnostic and progress evaluation for translation quality

    WP6 has developed a treebank-based method for doing regression testing on the translations produced by the math grammar. A treebank entry consists of:

    • an abstract tree for the GF grammar
    • for each language (encoded as an ISO three-letter code), one or more Changesets.

    A Changeset has:

    • source: the person submitting it;
    • revision: an integer equal to the svn revision in which this item is committed;
    • concrete: the proposed linearization;
    • and, optionally, a comment.

    A defect is a difference between the actual linearization of an entry and the sample in the last changeset.

    The procedure is as follows.

    1. Using the gr command, create a list of abstract trees
    2. Refine this list by removing or modifying unnatural productions (too deep, too long, too meaningless);
    3. Add linearizations for all targeted languages: This makes the initial changeset;
    4. Send the pairs (abstract tree, L linearization) to a fluent speaker of language L and ask for corrections;
    5. Add the corrections to the treebank as new changesets.
    6. Generate a list of defects and tackle them
    7. Generate new linearizations and go to step 4; cycle until satisfied or out of resources (a sketch of the shell commands for steps 1 and 3 is given below).
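
    As a sketch, steps 1 and 3 can be carried out in the GF shell roughly as follows (the grammar file names are hypothetical); the resulting file is then diffed against the linearizations in the latest changesets:

    -- import the concrete syntaxes under test
    > i DrillEng.gf DrillGer.gf
    -- generate random abstract trees and linearize them in treebank format
    > gr -number=50 | l -treebank | wf -file=trees.current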

    See http://www.molto-project.eu/wiki/living-deliverables/d61-simple-drill-grammar-library/5-testing for further discussion.

    Use case and usability evaluation

    WP7: Patents

    The first year review recommended that WP7 work should focus on the major issues examined in MOLTO, especially in relation to the grammar-ontology interoperability rather than chemical compound splitting. Specific scenarios are needed for the exploitation of MOLTO tools in this case study. It was recommended to include such scenarios in a new version of deliverable D9.1.

    In response, two use case scenarios were described: UC-71 and UC-72.

    • UC-71 focuses on grammar-ontology interoperability. User queries, written in CNL (controlled natural language), are used to query the information retrieval system.
    • UC-72 focuses on high-quality machine translation of patent documents. It uses an SMT baseline system to translate a big dataset and populate the retrieval databases. In order to study the impact of hybrid systems on translation quality, a smaller dataset will be translated using the hybrid system developed in WP5.

    Evaluation related to WP7

    WP7 corresponds to the patents case study. Its objective is to build a multilingual patent retrieval prototype. The prototype consists of three main modules: the multilingual retrieval system, the patent translation and the user interface. This document proposes a methodology to evaluate these modules within the MOLTO framework.

    Translation system

    The automatic translations included in the retrieval database have been produced by the machine translation systems developed within WP5. Hence, the evaluation related to this module is the same as the one described for the WP5 systems.

    Retrieval system

    The IR Facility currently organizes the TREC Chemical IR evaluation campaign (http://www.ir-facility.org/trec-chem-2011-cfp). The campaign has three different tracks, one of which is closely related to our objective in this WP: Technology Survey - given an information need (from the bio-chemistry domain) expressed in natural language, retrieve all patents and scientific articles which may satisfy this need.

    Following the guidelines described in the TREC campaign, the methodology proposed to evaluate the patents retrieval system is as follows.

    1. Select a set of topics (between 5 and 10) and create a natural language query for each topic (preferably, they must be manually created by experts). Each query must express an information need based on the data described in a patent. The priority is to be as similar as possible to a genuine information need of an expert searcher.
    2. The system will have to return a set of documents that answer this information need as well as possible. For any of the runs, it may return a maximum of 100 relevant documents (our database will contain ~8000 documents), preferably using the standard trec_eval format: Topic_number query_number document_id rank score run_name.
    3. Manually annotate the retrieved documents as match/mismatch.
    4. Calculate the AP (Average Precision, [1]) and NDCG (Normalized Discounted Cumulative Gain, [2]), which are common metrics for these kinds of systems [4]; the formulas are recalled below.
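
    For reference, with R the number of relevant documents for a query, P(k) the precision at cut-off k, and rel(k) in {0,1} the relevance of the document at rank k over a ranking of n documents:

    \mathrm{AP} = \frac{1}{R} \sum_{k=1}^{n} P(k)\, rel(k), \qquad
    \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \qquad
    \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}

    where IDCG@k is the DCG@k of the ideal ranking of the same documents.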

    The user interface

    User interfaces are usually evaluated by means of their usability. According to ISO 9241-11, usability measures the "extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use".

    Hence, to get a complete picture of usability, we need to measure user satisfaction (the users' reaction to the interface), effectiveness (can people complete their tasks?) and efficiency (how long do people take?). The three measures are independent, and all three must be measured to get a rounded picture of usability.

    1. Effectiveness. This can be measured automatically by logging the user's interactions with the system and manually analysing the system responses. The measure can also be contrasted with a specific question in the satisfaction questionnaire.
    2. Efficiency. This measure can be obtained automatically by logging the user interaction. To do so, the experiment requires implementing mechanisms to a) determine the start and end of the experiment (for each scenario and/or for the complete experiment), and b) relate this record to a specific user and to the other two measures (effectiveness and satisfaction). We could also ask the users to time themselves, but such measurements would be less reliable.
    3. Satisfaction. This measure can be obtained by asking the users to answer a questionnaire. Commonly used questionnaires for this task are the IBM CSUQ [5] and the SUS questionnaire [6]. Another, more novel method is the word cloud, in which users select a subset of words describing the system from a predefined set of adjectives. A general description of this method can be found in [7].

    The experiment setting may consist of two scenarios: a closed one (i.e., specifying the information that must be obtained) and an open one (i.e., letting the user search for any type of information). The users are requested to complete both scenarios, and the order in which they are done must be balanced (i.e., half of them will do the open scenario first). They must answer the questionnaire twice, once just after each scenario.

    The potential users are of two types: MOLTO participants and related people (internal), and external users. The internal users can serve as a control group. External participants can be engaged through tools like the Mechanical Turk Requester [8].

    WP8: Museum

    The museum grammar creates multilingual descriptions from a museum ontology, using a GF grammar for the verbalization. The GF grammar provides a direct verbalization of the triples as well as different types of complex discourse patterns: a text generated by the grammar has the obligatory elements painting, painting type and painter, and, as optional information, year, museum, colour, size and material. For a detailed description, see D8.2 (Ranta et al. 2012).

    An abstract syntax for the direct verbalization grammar can be generated automatically from the ontology. The discourse patterns have been human-generated, and they can be reused for different language versions and for more objects. For example, the type of a complete painting is described in the abstract syntax as follows:

    cat CompletePainting Painting PaintingType Painter OptYear OptMuseum OptColour OptSize OptMaterial ;

    CompletePainting is a type constructor that takes type parameters to construct a type for a painting. A painting from Gothenburg City Museum has the following type:

    data GSM940042ObjPainting : CompletePainting GSM940042Obj MiniaturePortrait JKFViertel (MkYear (YInt 1814)) (MkMuseum GoteborgsCityMuseum) (MkColour Grey) (MkSize (SIntInt 349 776)) (MkMaterial Wood) ;

    In the concrete syntax all this complexity is hidden, as the sketch below illustrates. Porting the grammar to a new language requires only writing the concrete syntax. However, the underlying ontology makes sure that the grammar generates only valid descriptions, and not random combinations of paintings, painters and other properties.
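
    A minimal sketch of the idea (the linearizations below are simplified to plain strings and the function name is hypothetical; the real grammar builds on the resource grammar library):

    -- the dependent type arguments of the abstract category are erased in
    -- the concrete syntax, so a new language only supplies surface patterns
    lincat Painting, PaintingType, Painter = Str ;
    lin MkDescription painting ptype painter =
      painting ++ "is a" ++ ptype ++ "by" ++ painter ;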

    As of March 2012, the translation of the museum objects and the additional lexicon (painting materials, colours) needs to be done manually. The future plan is to combine tools developed in WP3 to make the lexicon extension automatic, using multilingual lexicon harvesting from term ontologies or other reliable sources (DBpedia, TermFactory).

    Evaluation for WP8

    D8.2 has promised to increase the coverage from 5 to 10 languages, and to extend the grammar and the lexicon for at least 5 languages. The GF grammar can be tested continuously, while developing, with the treebank method described earlier in this document. A grammar developer should be fluent in the language for which she is developing the concrete syntax, and the treebank testing should be thorough. If the testing is done properly in the grammar development phase, there should be no need for specific translation quality evaluation experiments. The best way to spot problems is through real usage, so UHEL is offering a bug tracking platform where users can report all kinds of issues, including language errors.

    The idea is not to translate existing texts, but to generate descriptions in response to user queries. As described in D8.2,

    D8.2: The grammar presented here allows to generate well-formed multilingual natural language descriptions about museum artefacts with the aim of empowering users who wish to access cultural heritage information through different computing devices.

    Another question is how to evaluate the use of the queries. Currently the grammar has one discourse pattern with optional elements; the variety comes from adding or leaving out some information. One possibility discussed in D8.2 is to include more variety in the generated text. A qualitative evaluation study with non-expert human subjects would serve this purpose. The aspects to test in this experiment would be the ease of querying and whether the results answer the query. However, as long as this plan is not certain, we are not designing any concrete test methods.

    A third question is the ease of grammar writing and the reusability of the grammar: is it possible for other museums to use the grammar if they have their own standards? Currently a prerequisite for the museum grammar is an ontology that follows the CIDOC CRM standard. This is an important aspect if MOLTO tools are to be used outside the test cases within the project. The step from the specified format to verbalizations is well defined; more thought should now be given to the first step of the process: getting from an arbitrary museum database to the CRM format. As part of the evaluation, we could interview some domain specialists to survey the needs and interests for this kind of system, and to find out whether the first step is a big enough threshold to prevent them from using it.

    WP11, WP12 - Multilingual semantic wiki and beInformed

    WP11: Multilingual semantic wiki

    The main goal of the proposed work-package is to build an engine for a multilingual semantic wiki, where the involved languages are precisely defined (controlled) subsets of the 15 languages that are studied in the MOLTO project.

    The wiki engine would allow the input and presentation of the wiki content in all the languages, and perform formal logic based reasoning on the content in order to enable e.g. natural language based question answering. The users of the wiki can contribute to the wiki in any of the supported languages by adding statements to the wiki, as well as extending its concept lexicon. The wiki would integrate a "predictive editor" that helps the user cope with the restricted syntax of the input languages, so that explicit learning of the syntactic restrictions is not required. Ideally, the wiki would also integrate semantic support, e.g. a paraphraser and a consistency checker that could be used to enhance the quality of the wiki articles. The wiki engine is going to be implemented by combining the resources and technologies developed in the MOLTO project (GF grammar library, tools for translation and smart text input) with the resources and technologies developed in the Attempto project (Attempto Controlled English, AceWiki).

    The task of WP11 will be to combine the technologies developed in the MOLTO project with ACE and AceWiki, concretely:

    • porting the ACE grammar from English to the 15 MOLTO languages. The work in this task will be supported by the other MOLTO work-packages that are involved in developing GF-based grammars;
    • extending AceWiki to allow input in multiple languages, i.e. developing AceWiki into a multilingual controlled language wiki. This task includes work on modularizing AceWiki and integrating existing GF tools for translation and smart text input;
    • using existing ACE application domains and test cases to evaluate the new multilingual wiki system (a sketch of the intended round-trip is given after this list).
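
    As a sketch of what the evaluation round-trip could look like in the GF shell once the ACE grammar has been ported (the grammar file names are hypothetical): a statement is parsed in one language and linearized in all the others.

    -- import the ported AceWiki concrete syntaxes
    > i AceWikiEng.gf AceWikiGer.gf AceWikiFin.gf
    -- parse an ACE statement and linearize it in every imported language
    > p -lang=AceWikiEng "every country is an area" | l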

    WP 11 evaluation

    In this document, the list of application domains to evaluate multilingual semantic wiki becomes longer, since we envisage using the multilingual wiki as a common testbed for those MOLTO use cases where an ontology and its verbalization are developed in parallel. This can include some or all of the following cases:

    • Creation/extension of a domain and its verbalization (WP 2)
    • Developing mathematics exercises (WP 6)
    • The museum guide browser (WP 8)
    • beInformed scenario

    WP12: beInformed

    It is too early to describe the evaluation of this case in detail, pending a description of the use case itself. But we can suggest that the beInformed use case could be framed and tested as an instance of the multilingual semantic wiki scenario, if the business logic reasoning rules can be expressed in the semantic wiki database.

    References

    [1] AP. E. M. Voorhees and D. K. Harman, editors. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.

    [2] NDCG. K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422--446, 2002.

    [3] BLEU. Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. (2002). "BLEU: a method for automatic evaluation of machine translation", in ACL-2002: 40th Annual Meeting of the Association for Computational Linguistics, pp. 311--318.

    [4] IR Metrics. http://en.wikipedia.org/wiki/Information_retrieval#Mean_average_precision

    [5] IBM CSUQ. http://hcibib.org/perlman/question.cgi?form=CSUQ

    [6] SUS. http://www.usabilitynet.org/trump/methods/satisfaction.htm

    [7] Word Cloud. Usability. http://www.userfocus.co.uk/articles/satisfaction.html

    [8] Mechanical Turk Requester. https://requester.mturk.com/

    [9] Ranta, Aarne, Ramona Enache, and Grégoire Détrez. Controlled Language for Everyday Use: the MOLTO Phrasebook. Controlled Natural Language Workshop (CNL 2010). http://www.molto-project.eu/sites/default/files/everyday.pdf

    D9.2 MOLTO evaluation and assessment report


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D9.2 MOLTO evaluation and assessment report
    Security (distribution level): Public
    Contractual date of delivery: M36
    Actual date of delivery: March 2013
    Type: Report
    Status & version: Draft
    Author(s): Jussi Rautio, Maarit Koponen
    Task responsible: UHEL
    Other contributors: UPC


    Abstract

    • Evaluation of the results. (manual, automatic, ..)
    • Evaluation of the grammars and the grammar writing process in terms of D2.3 Best practices document.

    X. Grammar evaluation

    The impact of MOLTO is not just about individual use cases. During the 3 years of the project, we have developed methods for efficient grammar writing, dividing the task so that grammar experts and domain experts each get to do what they do best. These guidelines are documented in D2.3, Best practices.

    We have conducted a grammar evaluation survey for people who have written grammars. The results of the survey and an overview of the practices are documented in Part 1.

    We have also noted the time and measures for correcting grammars. Since the release of the first MOLTO demo (D10.2, tourist phrasebook), we have collected feedback and bug reports, and corrected the bugs. Part 2 describes these bugs and the effort that has been needed to fix them.

    X.1 Grammars

    The best practices document was published in October 2012, but many of the grammars were written before that. Below is first an overview of the best practices, and a discussion of whether the grammars are written accordingly.

    Best practices

    (This summary is excerpted from that document.)

    To make your work reusable, and to enable a division of labour:
     Divide the grammar into a base module (syntactic) and domain extension (lexical).
    To make it maximally simple to add languages:
     Consider defining the base part by a functor.
    To avoid low-level hacking and guarantee grammatical correctness:
     In the concrete syntax, use only function applications and string tokens, maybe records - but no tables, no concatenation.
    To guarantee that the grammar will continue to work in the future:
     Only use the API level of the resource grammar library.
    For scalability:
     Choose solutions that remain stable when new languages are added.
    A corollary:
     Never use lexical categories as linearization types.
    A scalability tool:
     Use type synonyms and constructors rather than raw types for linearization.
    To monitor your progress:
     Create a treebank for unit and regression testing, and use it often with the diagnostic tools.

    The following tools are standard and well-tested in MOLTO’s and other applications:

    • the GF compiler and shell
    • the GF run-time for Haskell, Java, and C, as well as the web API
    • the RGL for the 15 MOLTO languages
    • the GF-Eclipse IDE
    • the use of smart paradigms for lexicon building (illustrated below)
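
    To illustrate the last point: a lexicon module can rely on smart paradigms to infer full inflection tables from one or a few forms. A minimal sketch using the English resource grammar paradigms (the module and its entries are our own example, not a MOLTO grammar):

    -- a toy lexicon: ParadigmsEng infers the missing forms
    abstract ShopLex = {
      cat Item ;
      fun house_N, mouse_N : Item ;
    }
    concrete ShopLexEng of ShopLex = open ParadigmsEng in {
      lincat Item = N ;
      lin house_N = mkN "house" ;         -- regular plural "houses" inferred
      lin mouse_N = mkN "mouse" "mice" ;  -- irregular plural given explicitly
    }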

    Phrasebook

    The grammar has two modules: Sentences, which contains phrases that can be defined by a functor over the resource grammar API, and Words, which contains the phrases that are likely to have different implementations in different languages.

    Semantic validity is handled with simple, restrictive abstract syntax. For example, an abstract syntax tree like

    HowFarBy : Place -> ByTransport -> Question

    guarantees that we can say "How far is the church by taxi" but not "How far is John by beer": the arguments need to be a place and a transport. A sketch of how a functor shares such a rule across languages is given below.
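
    The following deliberately simplified sketch uses hypothetical names and plain strings; the real Phrasebook functor works over the resource grammar API, and each GF module goes in its own file:

    -- abstract syntax shared by all languages
    abstract Phrase = {
      cat Question ; Place ; ByTransport ;
      fun HowFarBy : Place -> ByTransport -> Question ;
    }
    -- interface: the language-dependent words
    interface LexPhrase = {
      oper how_far : Str ;
      oper by_prep : Str ;
    }
    -- functor: the word order is defined once for all languages
    incomplete concrete PhraseI of Phrase = open LexPhrase in {
      lincat Question, Place, ByTransport = Str ;
      lin HowFarBy place transport = how_far ++ place ++ by_prep ++ transport ;
    }
    -- per-language instantiation
    instance LexPhraseEng of LexPhrase = {
      oper how_far = "how far is" ;
      oper by_prep = "by" ;
    }
    concrete PhraseEng of Phrase = PhraseI with (LexPhrase = LexPhraseEng) ;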

    Module structure: Common constructions with a functor

    The starting point for the grammar was a test corpus of sentences we want to express in the grammar. These sentences are used as documentation for the abstract syntax:

    AHasAge     : Person -> Number -> Action ;    -- I am seventy years
    AHasChildren: Person -> Number -> Action ;    -- I have six children
    AHasName    : Person -> Name   -> Action ;    -- my name is Bond

    ACE-GF

    ACE-GF: based on Attempto Controlled English (ACE), a controlled natural language, i.e. a precisely defined subset of English whose sentences have a formal logical interpretation.

    AceWiki works on the AceWiki subset of ACE; grammars exist for Cat, Dan, Dut, Eng (not ACE), Est, Fin, Fre, Ger, Ita, Lav, Nor, Pol, Ron, Rus, Spa, Swe, Urd (https://github.com/Attempto/ACE-in-GF/tree/master/grammars/acewiki_aceowl).

    Grammar modules: an ACE base, plus domain lexicons (e.g. Geography).

    (AceWiki also hosts normal grammars that are not ACE-based, but these are unrelated to the ACE grammar.)

    Museum

    Query grammars


    Grammar evaluation survey

    Questionnaire
    
    Basic information: 
    
    Use of development tools:
    
    Diagnostic tools
    Compilation diagnostics: 
    Grammar display modes: 
    
    Testing
    Tools for generation and testing: 
    
    RGL
    Resource grammar tools:
    
    Grammar writing
    Starting point for your grammar:
    
    Basic unit of the grammar:
    
    Semantic control:
    
    Module structure: 
    
    Concrete syntax:
    

    Analysis of answers: ....

    Some answers were given under "Other" and are not covered by the best practices document:

    Other method for treebanks: Haskell code to store, edit and show differences in treebanks.

    Other development tool: Haskell and shell scripts generating grammars

    X.2 Grammar modification

    Examples of grammar modification

    Case study: Phrasebook

    Phrasebook was published as deliverable D10.2 in June 2010, in the third month of MOLTO. Initially it translated between 14 European languages (now 20 languages) and was written by 8 authors. These include people with varied GF skills, from a two-day GF course to major developers of GF. Some of the language versions were written by people with no skills in the language, using example-based grammar writing (see the report for more information).

    During the 2.5 years, we have received feedback and bug reports. The issues can be divided into Phrasebook errors and resource grammar library (RGL) errors. Both of course show up as errors in the application grammar, but the errors need to be fixed at different levels. The time spent fixing the problem and the expertise required of the grammar writer also differ between the two error types.

    Feedback

    Feedback has been given in various ways. There is a feedback button in the demo for anonymous feedback; this has gone to ____ (WHERE) and has been assigned to ____ (WHO). The Phrasebook demo has been shown in various presentations, and sometimes during the presentation an audience member or the presenter has noticed a problem. The problem has then either been fixed by the presenter or, in cases where the presenter lacks the time, language skills or GF skills to fix the bug, given to someone with the skills and time.

    Initially there was no project-wide reporting system, but since autumn 2012, UHEL has set one up at http://tfs.cc/trac. Each application grammar has an owner who gets a notification about new tickets, and can fix the bug or assign the job to someone.

    Crowdsourcing is another possible source for bug detection. However, in order to profit from that we would need a large number of people browsing the site and our apps, which is not realistic. Most of the bug reports come from people already involved in MOLTO.

    List of grammar issues

    Here I list issues that I know of. This is not necessarily a complete list.

    The difference between an application grammar issue and an RGL issue can be unclear; for instance, incorrect morphology in the application grammar may result from using the wrong RGL functions, or from there not being a correct RGL function in the first place. In cases where a correct RGL function exists but the user has chosen a wrong one, I have classified the error as an application grammar issue, as the fix has been made in the application grammar.

    Application grammar issues

    Spanish:

    1) HowFar, HowFarFrom, HowFarBy and HowFarFromBy

    • Error: structure of "How far" questions. The initial structure was more common in Latin America and sounded odd to speakers in Spain.
    • Fix: copied the structure from French into the application grammar.
    • Time: < 30 minutes
    • Skills: Medium GF skills (have made a mini resource and some application grammars)

    2) Plane

    • Error: The word for plane (avión) had the wrong gender. The word had been defined in the application grammar and not in the resource grammar.
    • Fix: Changing the gender in the application grammar, mkN "avión" masculine.
    • Time: < 5 minutes
    • Skills: Medium GF skills

    3) Fish

    • Error: The word used for fish means live fish, whereas the Phrasebook context needs a word for fish as food. The word was taken from the RGL lexicon, which has only one fish_N, whose meaning is live fish.
    • Fix: Defining the word in the application grammar, mkN "pescado".
    • Time: < 5 minutes
    • Skills: Medium GF skills

    4) Adjectives ending in consonant inflect wrong

    • Error: Wrong paradigm chosen in the RGL functions.
    • Initial fix: choose the right paradigm of mkA. With smart paradigms this means giving the right number of arguments, which in this case is 5 as opposed to 1. Applied to 8 adjectives in the application grammar.
    • Time: < 30 minutes
    • Skills: Medium GF skills

    Catalan:

    1) HowFar, HowFarFrom, HowFarBy and HowFarFromBy

    • Same error as in Spanish. Due to Catalan having been copied from Spanish. Same fix.

    Finnish:

    1) Locative cases for geographical names

    • Error: all geographical names get the same locative case, which is wrong for some of them.
    • Fix: added parameters to the data structure of geographical names, so that the right locative case can be chosen (see the sketch below).
    • Time: < 30 minutes?
    • Skills: Advanced GF skills, native speaker of Finnish
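
    A minimal sketch of the kind of parameter involved (the names are hypothetical, and the real fix must also handle vowel harmony and consonant gradation):

    -- a parameter choosing between internal and external locative cases
    param LocCase = Inessive | Adessive ;   -- "Helsingissä" vs. "Venäjällä"
    oper locative : Str -> LocCase -> Str = \name,c -> case c of {
      Inessive => name + "ssä" ;   -- simplification: fixed front-vowel endings
      Adessive => name + "llä"
      } ;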

    RGL errors

    Spanish and Catalan:

    1) Negative imperatives

    • Error: negative imperatives were formed by taking the positive imperative and adding a negation particle. They should really be formed with the subjunctive mood plus the negation particle.
    • Fix: Created an ImpNeg function in Spanish and Catalan RGL and used it in the application grammar.
    • Time: ~1 hour
    • Skills: Medium GF skills, fluent non-native Spanish & Catalan

    2) Adjectives ending in consonant inflect wrong

    • Error: the same as above: wrong paradigm chosen in the RGL functions.
    • Fix: make a new smart paradigm for these adjectives that takes only 2 forms. In Catalan, a more thorough revision of the smart paradigm system.
    • Time: ~1 hour in Spanish
    • Time: half day in Catalan
    • Skills: Medium GF skills, fluent non-native Spanish & Catalan

    French:

    1) Wrong agreement in French superlative forms

    • Error: The superlative is formed with DetNP, which only produces masculine versions.
    • Fix: make DetNPFem for all Romance languages and have the application grammar choose the construction based on the gender of the noun.
    • Time: ~1 hour
    • Skills: Medium GF skills

    Finnish:

    1) Vowel harmony of possessive suffixes

    • Error: vowel harmony of possessive suffixes not working; gives all words the back vowel variant.
    • Fix: implement a new parameter for vowel harmony in the Finnish resource grammar, change the categories for nouns and determiners, and change the functions that handle them.
    • Time: ~1 day (counting a first attempt that turned out to be too slow, and the redesign)
    • Skills: Medium GF skills

    2) Wrong word forms in Finnish genitive + possessive suffix: http://tfs.cc/trac/ticket/34

    3) Pronoun problems with the modal verb "must" in Finnish: http://tfs.cc/trac/ticket/23

    4) Incorrect plural stem for "children" in Finnish: http://tfs.cc/trac/ticket/27

    5) Translation of modal verb + a location not working for Finnish; modal verb problems also in Italian, Catalan and Russian: http://tfs.cc/trac/ticket/15

    • All of these were corrected by a user with advanced GF skills; the time taken was in total around half a day.

    D10.1 Dissemination Plan with Monitoring and Assessment

    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D10.1 Dissemination plan with Monitoring and Assessment
    Security (distribution level): Confidential
    Contractual date of delivery: M3
    Actual date of delivery: 1 Jun 2010
    Type: Report
    Status & version: Draft
    Author(s): Olga Caprotti and Aarne Ranta
    Task responsible: UGOT ( WP10 )
    Other contributors: Lluís Màrquez, Borislav Popov, and Jordi Saludes

    Abstract

    This deliverable describes the range of dissemination activities planned for MOLTO. It also formally introduces the Advisory Board, the Steering Group members, and their deputies, as ratified during the kick-off meeting of the project.

    1. Dissemination activities

    List of dissemination activities per year. This list is intended for planning purposes. According to our workplan: "Dissemination on conferences, symposiums and workshops will be in the areas of language technology and translation, semantic technologies, and information retrieval and will include papers, posters, exhibition booths and sponsorships (by Ontotext at web and semantic technology conferences like ISWC, WWW, SemTech), and academic/professional events such as the Information Retrieval Facility Symposium. We will also organize a set of MOLTO workshops for the expert audience, featuring invited speakers and potential users from academia and industry."

    1.1 International Conferences and Meetings

    MOLTO research and results will be published in conferences in the fields of computational linguistics, statistical machine translation, artificial intelligence and machine translation in general but also in specialized areas related to the domain case studies. We envision the possibility to showcase the results of the MOLTO studies in meetings on mathematical user interfaces, patent translation (information retrieval symposium for patent and scientific content), semantic web and OWL technologies, and natural language processing.

    A small sample of dissemination activities has already taken place at the beginning of MOLTO:

    • Creating Linguistic Resources with GF, a GF tutorial at the International Conference on Language Resources and Evaluation, LREC 2010 in Malta
    • MOLTO: Multilingual On-line Translation, 14th Annual Conference of the European Association for Machine Translation, EAMT 2010 short presentation and poster
    • Robust Estimation of Feature Weights in Statistical Machine Translation 14th Annual Conference of the European Association for Machine Translation, EAMT-2010.
    • Tools for Multilingual Grammar-Based Translation on the Web 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden, July 11–16, 2010.
    • MOLTO will be presented at the information retrieval symposium for patent and scientific content - IRF Symposium 2010 in Vienna, Austria, 1-4 June 2010. Ontotext holds an exhibition booth.
    • Possible paper at the 23rd OpenMath Workshop, Paris, France, 8th July 2010
    • Possible exhibition at ACL 2010 (http://acl2010.org/call_exhibits.html)

    Possible future conferences include:

    • CICLing Conference on Intelligent Text Processing and Computational Linguistics
    • Annual Conference of the North American Chapter of the Association for Computational Linguistics
    • International Conference on Computational Semantics, (IWCS 2011 in Oxford, Jan. 2011)
    • Annual Conference of the European Association for Machine Translation
    • Annual Meeting of the Association for Computational Linguistics
    • International Conference on Language Resources and Evaluation, see http://www.lrec-conf.org/
    • AISC, Artificial Intelligence in Symbolic Computation
    • AMTA, Conference of the Association for Machine Translation in the Americas
    • MT Summit, (MT Summit XIII in China in 2011)
    • ACL WMT, ACL Workshop on Statistical Machine Translation
    • EMNLP, Conference on Empirical Methods in Natural Language Processing
    • CoNLL, Conference on Computational Natural Language Learning
    • COLING, International Conference on Computational Linguistics
    • AAAI, AAAI Conference on Artificial Intelligence
    • IJCAI, International Joint Conference on Artificial Intelligence (IJCAI-11 will be in July 2011 in Barcelona, Spain)
    • ECAI, European Conference on Artificial Intelligence
    • CNL, Workshop on Controlled Natural Language (CNL 2012 will be held in August 2012 in Zurich, Switzerland, see http://attempto.ifi.uzh.ch/site/cnl2012/)

    The Association for Computational Linguistics and the European Association for Machine Translation maintain lists of relevant events at http://www.eacl.org and at http://www.eamt.org.

    MOLTO also plans to disseminate at a regional level by presenting the work in meetings organized by national organizations, or at the university and professional level in each partner country, e.g.:

    • Annual Conference of The Spanish Association for Natural Language Processing (SEPLN, http://www.sepln.org)
    • Conference of the Spanish Association for Artificial Intelligence (CAEPIA, http://www.aepia.org/)
    • Conference of the Catalan Association for Artificial Intelligence (ACIA, http://www.acia.org/)
    • Nordic Conference on Computational Linguistics NODALIDA (in 2011 at the University of Latvia, Riga)
    • SLTC, Swedish Language Technology Conference

    As an example, MOLTO will be presented at the meeting La Indústria de la Traducció entre Llengües Romàniques (1st Workshop on the Industry of Translation of Romance Languages), organized by the Polytechnical University of Valencia (UPV) on September 8, 2010. See http://www.upv.es/contenidos/JORTRAD/info/indexnormalc.html.

    1.2 Journal Publications

    Aside from proceedings of conferences and special issues arising in connection with presentations given at international conferences, MOLTO expects to publish results of the work in scientific journals such as:

    • MT Journal
    • Computational Linguistics
    • Linguistic Issues in Language Technology
    • Research on Language and Computation
    • Language and Linguistics Compass
    • International Journal of Computational Linguistics and Applications
    • Natural Language Engineering
    • Language Resources and Evaluation
    • Journal of Artificial Intelligence Research
    • AI Magazine (Artificial Intelligence Magazine)

    We are also monitoring aggregation sites such as eLanguageNet.

    1.3 Project events

    Events organized by MOLTO in primis include the project meetings. A preliminary schedule is the following:

    Title                 Date                Location
    Kickoff meeting       8-11 March 2010     Barcelona, Spain
    1st Project Meeting   Sept. 2010          Varna, Bulgaria
    2nd Project Meeting   2nd week Mar 2011   UGOT
    3rd Project Meeting   Sept. 2011          UHEL
    4th Project Meeting   Mar. 2012           UZH
    5th Project Meeting   Sept. 2012          BI
    6th Project Meeting   Mar. 2013           UGOT

    MOLTO also envisions the possibility to organize specific events targeted to its user groups: either because of the case studies, or because of the scientific results. Training activities such as hands-on sessions and special courses will be organized during the final year of the project's lifetime. Likely venues for such events are the major conferences as well as graduate schools in linguistics, e.g. GSLT, Graduate School in Language Technology (http://www.gslt.hum.gu.se). Project members visiting partners' nodes will be encouraged to present their work to a wider audience in departmental seminars, tutorials and intensive courses. For example, A. Ranta and R. Enache from UGOT gave a GF tutorial during the exchange visit to UHEL on May 4-5 2010.

    In addition, MOLTO is planning to organize a high-profile scientific meeting on machine translation in connection to a major event, attracting prominent speakers from the field.

    1.4 Press releases

    The target audience of press releases is the general public. Press releases need to address the goals and results of the project in a way that popularizes them and informs about the area of machine translation.

    Press releases will be produced on a yearly basis and circulated using each partner's channels, in addition to publication on the website. The yearly release will also be circulated in bulletins of professional associations, such as that of the European Chapter of the ACL (EACL), the primary professional association for computational linguistics in Europe.

    Sample activities that have already taken place include:

    UGOT has a specific office in charge of public relations, which will be contacted to distribute news of the project. Helena Åberg keeps us informed of the coverage of MOLTO in the media. The project was prominently featured at its start, as the following list shows:

    • EU funds effective translation tool CORDIS News – Wed, 20 Jan 2010 22:47 Europeans recognise the importance of communicating in other languages as well as their native tongue, and the availability of effective tools facilitating high-quality translation of texts between multiple languages is pivotal to this. [...]The five-strong consortium, which is being coordinated by the University of Gothenburg in Sweden, will develop prototypes that cover most of[...]
    • Multilingual Translation System Receives Over 2 Million Euro in EU Funding Communications of the ACM – Wed, 20 Jan 2010 21:41 University of Gothenburg professor Aarne Ranta is leading an effort to create a reliable translation tool that covers most of the European Union languages [...] Multilingual Translation System Receives Over 2 Million Euro in EU Funding University of Gothenburg professor Aarne Ranta is leading an effort to create[...]
    • Swedish translation system gets EU funding European Journalism Centre – Wed, 20 Jan 2010 10:40 A research group led by the University of Gothenburg in the west of Sweden, has been granted SEK 25m (USD 3.5m) in EU funding to develop an online multilingual translation system covering most European languages. [...] Swedish translation system gets EU funding A research group led by the University of Gothenburg in the west of Sweden, has been granted SEK 25m [...]
    • Multilingual Translation System Receives over 2 Million Euro in EU Funding Resource Shelf – Mon, 18 Jan 2010 23:12 From the Announcement: All citizens, regardless of native tongue, shall have the same access to knowledge on the Internet. The MOLTO project, coordinated by University of Gothenburg, Sweden, receives more than 2 million Euro [...]
    • Multilingual translation system receives over 2 million euro in EU funding Uni-Online.de – Mon, 18 Jan 2010 19:10 All citizens, regardless of native tongue, shall have the same access to knowledge on the Internet. The MOLTO project, coordinated by University of Gothenburg, Sweden, receives more than 2 million euro [...]
    • Multilingual translation system receives over 2 million euro in EU funding Juraforum.de – Mon, 18 Jan 2010 16:08 All citizens, regardless of native tongue, shall have the same access to knowledge on the Internet. The MOLTO project, coordinated by University of Gothenburg, Sweden, receives more than 2 million euro [...]
    • Multilingual translation system receives over 2 million euro in EU funding Uni-protokolle – Mon, 18 Jan 2010 16:02 (idw) University of Gothenburg All citizens, regardless of native tongue, shall have the same access to knowledge on the Internet. The MOLTO project, coordinated by University of Gothenburg, Sweden, [...]
    • Multilingual translation system receives over 2 million euro in EU funding idw - Informationsdienst Wissenschaft – Mon, 18 Jan 2010 15:38 All citizens, regardless of native tongue, shall have the same access to knowledge on the Internet. The MOLTO project, coordinated by University of Gothenburg, Sweden, receives more than 2 million euro [...]
    • Multilingual translation system receives over 2 million Euro in EU funding Alpha Galileo – Mon, 18 Jan 2010 15:55 All citizens, regardless of native tongue, shall have the same access to knowledge on the Internet. The MOLTO project, coordinated by University of Gothenburg, Sweden, receives more than 2 million Euro [...]
    • La UE destina más de 2 millones a financiar un traductor online Noticiasdot.com – Tue, 26 Jan 2010 15:00 La Unión Europea ha financiado con más de 2,3 millones de euros - procedentes del tema ‘Tecnologías de la información y las comunicaciones’ del Séptimo Programa Marco (7PM) - el proyecto MOLTO (Traducción plurilingüe en Internet), que trabaja para desarrollar una herramienta de traducción eficiente en Internet.[...]cada idioma. La encargada de coordinar el consorcio será la Universidad de Gotemburgo (Suecia), que desarrollará prototipos que abarcarán la[...]
    • La UE destina más de 2 millones a financiar un traductor online Hoytecnologia – Mon, 25 Jan 2010 10:51 La Unión Europea ha financiado con más de 2,3 millones de euros - procedentes del tema 'Tecnologías de la información y las comunicaciones' del Séptimo Programa Marco (7PM) - el proyecto MOLTO (Traducción plurilingüe en Internet), que trabaja para desarrollar una herramienta de traducción eficiente en Internet.[...]
    • La UE financia una herramienta de traducción eficiente madri+d – Tue, 26 Jan 2010 10:42 Los europeos son conscientes de la importancia de la comunicación en otros idiomas además de en su lengua materna, y para ello es fundamental la disponibilidad de herramientas eficientes que logren traducciones de gran calidad entre muchos idiomas.[...]
    • La UE financia una herramienta de traducción eficiente Faq-Mac – Mon, 25 Jan 2010 08:45 Los europeos son conscientes de la importancia de la comunicación en otros idiomas además de en su lengua materna, y para ello es fundamental la disponibilidad de herramientas eficientes que logren traducciones de gran calidad entre muchos idiomas[...]
    • La UE financia una herramienta de traducción eficiente Asociación Española de Empresas de Consultoría – Thu, 21 Jan 2010 16:08 Los europeos son conscientes de la importancia de la comunicación en otros idiomas además de en su lengua materna, y para ello es fundamental la disponibilidad de herramientas eficientes que logren traducciones de gran calidad entre muchos idiomas[...]
    • La UE financia una herramienta de traducción eficiente CORDIS Noticias – Wed, 20 Jan 2010 15:26 Los europeos son conscientes de la importancia de la comunicación en otros idiomas además de en su lengua materna, y para ello es fundamental la disponibilidad de herramientas eficientes que logren traducciones de gran calidad entre muchos idiomas[...]
    • La UE destina más de 2 millones de euros a financiar una herramienta de traducción eficiente en Internet Granada Digital – Sun, 24 Jan 2010 13:58 La Unión Europea ha financiado con más de 2,3 millones de euros - procedentes del tema 'Tecnologías de la información y las comunicaciones' del Séptimo Programa Marco (7PM) - el proyecto MOLTO (Traducción plurilingüe en Internet), que trabaja para desarrollar una herramienta de traducción eficiente en Internet. ...cada idioma. La encargada de coordinar el consorcio será la Universidad de Gotemburgo (Suecia), que desarrollará prototipos que abarcarán [...]
    • EU finanziert leistungsfähiges Übersetzungs-Tool CORDIS Nachrichten – Wed, 20 Jan 2010 15:20 In Fremdsprachen ebenso wie in der Muttersprache kommunizieren zu können, ist eine überaus wichtige Sache: Die Europäer wissen das sehr wohl. Effiziente Instrumente verfügbar zu machen, die eine hochwertige Übersetzung von Texten zwischen mehreren Sprachen erleichtern, ist daher ein zentrales Anliegen[...]
    • L'UE finance un outil de traduction efficace CORDIS Nouvelles – Wed, 20 Jan 2010 15:11 Les Européens comprennent l'importance de communiquer dans d'autres langues que leur langue maternelle, ce qui impose de disposer d'outils efficaces facilitant la réalisation de traductions de haute qualité entre plusieurs langues[...]
    • University of Gothenburg | Multilingual translation system receives over 2 million euro in EU funding Pressrelations.de – Mon, 18 Jan 2010 15:29 (idw) Multilingual translation system receives over 2 million euro in EU funding "It has so far been impossible to produce a translation tool that covers entire languages," says Aarne Ranta, professor at the Department of Computer Science and Engineering at the University of Gothenburg, Sweden[...]

    1.5 Liaison Activities

    MOLTO plans to establish contact with related machine translation projects such as EuroMatrix and T4ME, with a view to organizing joint meetings in the future. Initial discussions with members of these projects took place at LREC 2010 last May; with the representatives of EuroMatrix, the development of hybrid translation systems was identified as a common interest. UPC is a partner in both MOLTO and the FAUST project (Feedback for User Adaptive Statistical Translation). The FP7 project HATS (Highly Adaptable and Trustworthy Software using Formal Models) has approached MOLTO to discuss possible future cooperation in the area of translation between formal and informal software specifications. ATLAS (Applied Technology for Language-Aided CMS) is another EU project whose aims are similar to MOLTO's. We will monitor their work to evaluate possible overlaps and areas of cooperation.

    Furthermore, as a research and technology development project, MOLTO is entitled to join META-SHARE, an effort to set up a pool of language resources and technologies. META is the Multilingual Europe Technology Alliance network; see http://www.meta-net.eu.

    A number of personal contacts and email communications have already taken place during this first trimester. Below is a short summary, aimed at giving an overview of which target groups are interested in the project's results.

    • Ralf Steinberger, European Commission - Joint Research Centre (JRC) IPSC - GlobeSec - OPTIMA (OPensource Text Information Mining and Analysis) [...] We (the Joint Research Centre of the European Commission) collect, cluster and categorise 100,000 news articles per day in 50 languages, and we perform a whole lot of text mining applications on 20 of them, including novel things such as cross-lingual topic tracking for all language pairs and the automatic merging of name variants across many languages and scripts. Have a look at http://emm.jrc.it/overview.html, and especially at NewsExplorer (http://emm.newsexplorer.eu/)[...]I will be at LREC[...]
    • Flora Muir LLM, Grant & Project Director WHICHMUIR CONSULTING Ltd I represent and act as project coordinator for a group of organisations, both enterprises and educational institutions and professional regulators in 6 different countries, who are very interested in developing a Transfer of Innovation project. [...] Some of the sectors involved are legal work-based education & qualification and information systems, and wine & spirits/hospitality work-based education & qualification. These organisations are already working together on developing standard cross-border/language materials under the Lifelong Learning Programme (LLP). Cross-border multilingual work-based education, training and information systems for both professionals and consumers of their services are at the heart of the project objectives. We would be very pleased to discuss with you the potential of the MOLTO project, current capacities to participate in developing and informing our group objectives, and how your organisation(s) might be able to work with us in future to develop mutual capabilities.
    • Brian McConnell, Worldwide Lexicon Project http://www.worldwidelexicon.org WWL (www.worldwidelexicon.org) is an open source translation memory that combines inputs from machine, volunteer and professional translation services, and provides developers with a simple standards based web API to interact with the system. First, I'd be interested in supporting MOLTO as an MT service. If you have a web API, we can easily build a connector to it, so that users can easily request translations without having to implement different protocols for different MT engines, a big problem at present. We work with a number of MT engines including Google, Moses, Apertium, Language Weaver, and are adding more in the near future. I'd be glad to implement this for you. It takes a few hours to add new MT engines in most cases. Second, our translation corpus is open content, so as we collect user edited translations and professional translations, we archive these and are making them available to researchers. The corpus is currently at about a million sentences (mostly English Arabic), but should grow significantly as we roll out new services, such as our Word Press plugin that enables machine and professional translation for Word Press blogs. [...]
    • Paraskevopoulos Spyros European Mobility, http://www.geniusmobility.eu I recently read about the project you coordinate, called “Molto”. I am interested to be informed on a regular basis about its results, as I think it might prove useful to a lot of individuals in the E.U. Please add me to any newsletter lists if there are any.
    • Christian Fraunholz, www.php10.de, I have been working as a web developer since 1999. Today I heard of the MOLTO project...will it be possible to use the translator in PHP as a web service?
    • Patrizia Biani Country Coordinator - European Affairs - Member States Co-op. - Dir. 5.1.5.1 European Patent Office, As an initiative of general European interest, the European Patent Office (EPO), in co-operation with the national patent offices of the member states, intends to set up language technology services for patents for European languages, so to enable the users of the patent system to access, disseminate and work with multi-lingual patent information. We are currently preparing the phase of collecting the patent corpora existing within the EPO and in the local patent offices, which is planned to start in the 3rd quarter of 2010. The collected corpora should enable a proper training of one or more translation engines to be defined. At present we have established contacts with the PluTO project. However, we are keen to follow all relevant initiatives in this area, so to be able to provide the highest quality translation for patents in a continuative way, keeping pace with the evolution of technology. The MOLTO research project is certainly one of great interest for us. We should be very pleased if we could be in touch and, in any case, you could keep us informed on the developments and results.
    • Henk Becker, I have designed and tested a system for multilingual communication.[...] My system is called Sociolinguafranca. It is possible to use it for communication in two or three languages at the same time. It can be used without and with machine translation. If machine translation is applied, two types can be used. The first type is a closed system, applying terms and sentences already translated, controlled and stored in a digital archive. The second type is a half-closed system based on a general translation system, like Babylon. In a half-closed system, mistakes are to be found. [...] Sociolinguafranca is a registered trademark. It has been registered by the Utrecht Centre for Applied Sociology Ltd. I use UCAS Ltd for continuing my work in science since my retirement. For details see: www.ucas.nl. I am convinced that your MOLTO project could profit from integrating my system. If you would be interested in looking for opportunities for some kind of cooperation, I am prepared for a discussion, for instance by video conferencing.
    • Ana González Escudero, Murcia City Council, in South-East Spain, is preparing its candidature for the European Capital of Culture event in 2016. Our programme will include a section on languages, and more specifically about multilingualism and translation. That is why we are very interested in knowing more about the MOLTO translation tool you are developing, and would like to know if you would consider explaining its functioning in Murcia at a later stage. [...] we are preparing Murcia's candidature (cultural programme overview) for the European Capital of Culture event in 2016. Just an overview, but already connecting Murcia to cultural institutions in Europe. For the language section, we would like to design an event about multilingualism together with our university (Translation and Interpreting degree) and other universities in Europe, and we would be delighted to include conferences about innovative projects like yours. At this stage you don't need to do anything but say whether you would consider coming to Spain for a conference in that framework.
    • Anton Cpin, Technical Team Lead, Jonckers Translation & Engineering Our company with head office in Brussels is specializing in Localization, Internationalization and Translation services and is one of only four Premier Vendors worldwide performing product localization for the Microsoft Corporation, CISCO, Adobe, HP, Canon. For expanding our business activity to meet market demand and to optimize our internal processes, we are looking into all kind of Technologies and one of our priorities today is also Machine translation. During our research we have found this emergent and promising MOLTO project. Therefore I am writing you this letter to ask for information about possible MOLTO project partnership and cooperation.
    • Sophie Oestreich, Junior Marketing Manager, http://www.tolingo.de thank you very much for the information about your project. I would be pleased if you kept me informed about the progress of MOLTO. Please take a look at our website www.tolingo.com. As we also work on tools to make the job of the translators more easy and fluent, let me know if you could imagine any way of cooperation.

    1.6 Web channels

    MOLTO will use the web as a main channel of dissemination. The web site will initially be designed mainly to assist the management of the project with a section for registered users not available to anonymous readers.

    News of the project will appear regularly and is available either by subscription (members only) or via the RSS feed: http://www.molto-project.eu/news/rss.xml. The RSS feed is public and will distribute the public news. To be able to read the internal news, members have to be authenticated.

    Subscriptions allow members to fine tune the type of information that is sent automatically from the website. It can be personalized from the profile pages.

    Registered project members can post news items by either:

    • creating a Story,
    • entering a Biblio item such as a paper or a presentation (or even software),
    • creating an Event, for instance a relevant call for papers.

    All these types of content appear in the news flow, while an Event is also added to the calendar. If some item is meant only for the Consortium, its access controls can be modified accordingly.

    Events of interest will be advertised via newsletters (examples ....) and social sites.

    Partners are featuring the MOLTO project on their websites.

    The project can be reached for questions via a contact form accessible online, and questions will be answered in the FAQ: http://www.molto-project.eu/view/faq. Any registered user can add questions and answers to the FAQ. Current categories include:

    • Goals and Promises
    • Technology
    • People and Organization

    Analytics

    Access and usage of the MOLTO website is monitored via Google Analytics. Reports are available to interested project members.

    2. Monitoring and assessing progress

    The MOLTO management structure comprises two bodies whose function is to monitor the progress of the project: an internal and an external one.

    2.1 The MOLTO Steering Group

    MOLTO has set up a Steering Group to help manage the project. The Steering Group is composed of the Coordinator, assisted by the Project Manager, and a representative of each site. During the kickoff meeting, each partner nominated a Site Leader to be active in the Steering Group as follows:

    • UGOT: Aarne Ranta (Coordinator), deputy: Olga Caprotti
    • UHEL: Lauri Carlson, deputy: Seppo Nyrkkö
    • UPC: Jordi Saludes, deputy: Lluís Màrquez
    • Ontotext AD: Borislav Popov, deputy: Marin Nozhchev
    • MXW: Neil Tipper, deputy: Dominique Maret

    The work package leaders, who were also nominated during the kickoff meeting, are listed in the online work plan.

    Schedule of regular meetings

    The Steering Group holds monthly calls (usually during the 3rd week of the month) via Skype, and extraordinary calls when the need arises. The minutes of these calls are posted on the confidential pages of the web site at http://www.molto-project.eu/node/867.

    Monitoring activities of the SG

    The Steering Group convenes at project meetings, during which a Business Meeting is called to ratify major decisions and to resolve conflicts.

    2.2 The MOLTO Advisory Board

    The task of the MOLTO Advisory Board is to perform independent quality assurance and assess the progress of the work. It is composed of leading scientists from outside the MOLTO Consortium. This choice serves two purposes: to obtain an independent opinion on the research and approaches taken by MOLTO, and to disseminate the work done by the project to related scientific communities.

    The final composition of the Advisory Board is the following:

    Schedule of regular meetings

    Members of the Advisory Board are expected to attend the second yearly meeting, namely project meetings 2, 4, and 6 to learn about the yearly outcome of MOLTO. Their travel costs will be funded by MOLTO.

    Activities of the Advisory Board

    The Advisory Board will write an assessment report which will be delivered as part of the yearly report to the Commission. This report will evaluate the results and, if desirable, suggest ways to improve them.

    3. Quality evaluation

    In the MOLTO workplan, workpackage WP9 Requirements and evaluation runs throughout the entire project's lifetime. In the beginning it will define the requirements for both the generic tools and the case studies; later it will perform evaluation and deliver feedback, including bug fixing.

    The liaison person from UHEL is Mirka Hyvärinen, who will be in contact with other project members. UHEL has also set up an internal working wiki, "MOLTO kitwiki" (https://kitwiki.csc.fi/twiki/bin/view/MOLTO/WebHome), open to all project members who request access.

    D9.1 MOLTO test criteria, methods and schedule, due on 1 September 2010, will contain the detailed schedule and plan for the quality evaluation workflow.

    4. Exploitation

    This section collects plans for the exploitation of MOLTO results.

    Multilingual description layer for monuments or other points of interests in maps

    See e.g. the text describing the monument in this screenshot of an Art Guide for Genova. It is rather simple, and one can easily imagine the information content that should be carried by the underlying ontology. This is a generalization of the work done in WP8. Note that to produce this, the ontology must also contain information on location, opening hours, etc.

    As it happens, there is no such information in English.

    In general, I think this would be a contribution to the area of Geographical Information Systems.


    Online auctions

    A very interesting scenario for exploitation of the WP8 cultural heritage descriptions is online auctions; see for instance the catalogue listing on such a site in Skåne.

    It is not clear what kind of knowledge base the auction houses adopt, namely whether they use the same metadata as the museums. In any case, these are potential customers for WP8 results.

    D10.2 MOLTO web service, first version


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D10.2 MOLTO web service, first version
    Security (distribution level): Public
    Contractual date of delivery: M3
    Actual date of delivery: 2 June 2010
    Type: Prototype
    Status & version: Final
    Author(s): Krasimir Angelov, Olga Caprotti, Ramona Enache, Thomas Hallgren, Inari Listenmaa, Aarne Ranta, Jordi Saludes, Adam Slaski
    Task responsible: UGOT
    Other contributors: UPC, UHEL


    Abstract

    This phrasebook is a program for translating touristic phrases between 14 European languages included in the MOLTO project (Multilingual On-Line Translation): Bulgarian, Catalan, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Romanian, Spanish, Swedish. A Russian version is not yet finished but will be added later. Other languages may also be added.

    The phrasebook is implemented using the GF programming language (Grammatical Framework). It is the first demo for the MOLTO project, released in the third month (by June 2010). The first version is a very small system, but it will be extended in the course of the project.

    The phrasebook is available as open-source software, licensed under GNU LGPL, at http://code.haskell.org/gf/examples/phrasebook/.


    1. Purpose

    The MOLTO phrasebook is a program for translating touristic phrases between 14 European languages included in the MOLTO project (Multilingual On-Line Translation):

    • Bulgarian, Catalan, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Romanian, Spanish, Swedish. A Russian version is not yet finished but is planned. Other languages may be added at a later stage.

    The phrasebook is implemented in the GF programming language (Grammatical Framework). It is the first demo for the MOLTO project, released in the third month (by June 2010). The first version is a very small system, but it will be extended in the course of the project.

    The phrasebook has the following requirement specification:

    • high quality: reliable translations to express yourself in any of the languages
    • translation between all pairs of languages
    • runnable in web browsers
    • runnable on mobile phones (via web browser; Android stand-alone forthcoming)
    • easily extensible by new words (forthcoming: semi-automatic extensions by users)

    The phrasebook is available as open-source software, licensed under GNU LGPL. The source code resides in http://code.haskell.org/gf/examples/phrasebook/.

    2. Points Illustrated

    We consider both the end-user perspective and the content producer perspective.

    From the user perspective

    • Interlingua-based translation: we translate meanings, rather than words
    • Incremental parsing: the user is at every point guided by the list of possible next words
    • Mixed input modalities: selection of words ("fridge magnets") combined with text input
    • Quasi-incremental translation: many basic types are also used as phrases, so one can translate both words and complete sentences, and get intermediate results
    • Disambiguation, especially of politeness distinctions: if a phrase has many translations, each of them is shown and given an explanation (currently just in English, later in any source language)
    • Fall-back to statistical translation: currently just a link to Google translate (forthcoming: tailor-made statistical models)
    • Feed-back from users: users are welcome to send comments, bug reports, and better translation suggestions

    From the programmer's perspective

    • The use of resource grammars and functors: the translator was implemented on top of an earlier linguistic knowledge base, the GF Resource Grammar Library
    • Example-based grammar writing and grammar induction from statistical models (Google translate): many of the grammars were created semi-automatically by generalization from examples
    • Compile-time transfer, especially in the actions in Words: the structural differences between languages are treated at compile time, for maximal run-time efficiency
    • The level of skills involved in grammar development: testing different configurations (see table below)
    • Grammar testing: use of treebanks with guided random generation for initial evaluation and regression testing

    3. Files

    The phrasebook is available as open-source software, licensed under GNU LGPL. The source code resides in http://code.haskell.org/gf/examples/phrasebook/. Below is a short description of the source files.

    Grammars

    • Sentences: general syntactic structures implementable in a uniform way. Concrete syntax via the functor SentencesI.
    • Words: words and predicates, typically language-dependent. Separate concrete syntaxes.
    • Greetings: idiomatic phrases, string-based. Separate concrete syntaxes.
    • Phrasebook: the top module putting everything together. Separate concrete syntaxes.
    • DisambPhrasebook: disambiguation grammars generating feedback phrases if the input language is ambiguous.
    • Numeral: resource grammar module directly inherited from the library.

    The module structure image is produced in GF by

        > i -retain DisambPhrasebookEng.gf
        > dg -only=Phrasebook*,Sentences*,Words*,Greetings*,Numeral,NumeralEng,DisambPhrasebookEng
        > ! dot -Tpng _gfdepgraph.dot > pgraph.png
    

    Ontology

    The abstract syntax defines the ontology behind the phrasebook. Some explanations can be found in the ontology document, which is produced from the abstract syntax files Sentences.gf and Words.gf by make doc.

    Run-time system and user interface

    The phrasebook uses the PGF server written in Haskell and the minibar library written in JavaScript. Since the sources of these systems are available, anyone can build the phrasebook locally on her own computer.

    4. Effort and Cost

    Based on this case study, we roughly estimated the effort used in constructing the necessary sources for each new language and compiled the following summarizing chart.

    Language Language skills GF skills Informed development Informed testing Impact of external tools RGL Changes Overall effort
    Bulgarian ### ### - - ? # ##
    Catalan ### ### - - ? # #
    Danish - ### + + ## # ##
    Dutch - ### + + ## # ##
    English ## ### - + - - #
    Finnish ### ### - - ? # ##
    French ## ### - + ? # #
    German # ### + + ## ## ###
    Italian ### # - - ? ## ##
    Norwegian # ### + - ## # ##
    Polish ### ### + + # # ##
    Romanian ### ### - - # ### ###
    Spanish ## # - - ? - ##
    Swedish ## ### - + ? - ##

    Legend

    Language skills

    • - : no skills
    • # : passive knowledge
    • ## : fluent non-native
    • ### : native speaker

    GF skills

    • - : no skills
    • # : basic skills (2-day GF tutorial)
    • ## : medium skills (previous experience of similar task)
    • ### : advanced skills (resource grammar writer/substantial contributor)

    Informed Development/Informed testing

    • - : no
    • + : yes

    Impact of external tools

    • ?: not investigated
    • - : no effect on the Phrasebook
    • # : small impact (literal translation, simple idioms)
    • ## : medium effect (translation of more forms of words, contextual preposition)
    • ### : great effect (no extra work needed, translations are correct)

    RGL changes (resource grammars library)

    • - : no changes
    • # : 1-3 minor changes
    • ## : 4-10 minor changes, 1-3 medium changes
    • ### : >10 changes of any kind

    Overall effort (including extra work on resource grammars)

    • # : less than 8 person hours
    • ## : 8-24 person hours
    • ### : >24 person hours

    5. Example-based grammar writing process

    The figure presents the process of creating a Phrasebook using an example-based approach for a language X, in our case Danish, Dutch, German, or Norwegian, for which we had to employ informed development and testing by a native speaker different from the grammarian.

    Remarks: The arrows represent the main steps of the process, whereas the circles represent the initial and final results after each step. Red arrows represent manual work and green arrows represent automated actions. Dotted arrows represent optional steps. For every step, the estimated time is given. These times vary, are greatly influenced by the features of the target language and the semantic complexity of the phrases, and hold only for the Phrasebook grammar.

    Initial resources :

    • English Phrasebook
    • resource grammar for X
    • script for generating the inflection forms of words and the corresponding linearizations of the lexical entries from the Phrasebook in language X (a small sketch of such generation follows this list). For example, in the case of nationalities, since we are interested in the names of countries, languages and the citizenship of people and places, we would generate constructions like "I am English. I come from England. I speak English. I go to an English restaurant", and from the results of the translation we infer the right form of each feature. In English, in most cases there is an ambiguity between the name of the language and the citizenship of people and places, but in other languages all three could have completely different forms. This is why it is important to make the context clear in the examples, so that the translation is more likely to succeed. The correct design of the set of examples is language-dependent and also assumes analysis of the resource grammar. For example, in some languages we need only the singular and the plural form of a noun in order to build its GF representation, whereas in other languages, such as German, in the worst case we would need 6 forms, which need to be rendered properly from the examples.
    • script for generating random test cases that cover all the constructions from the grammar. It is based on the current state of the abstract syntax and it generates for each abstract function some random parameters and shows the linearization of the construction in both English and language X, along with the abstract syntax tree that was generated.
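
    As an illustration of the first script, the following sketch (not the actual project script; the sentence patterns are simply the ones mentioned in the list above) generates the context sentences for a nationality entry:

        // Generate context sentences for a nationality, so that translating each
        // sentence reveals the form (country, language, citizenship) used in context.
        function nationalityExamples(country, language, citizenship) {
          const article = /^[aeiou]/i.test(citizenship) ? "an" : "a";
          return [
            "I am " + citizenship + ".",
            "I come from " + country + ".",
            "I speak " + language + ".",
            "I go to " + article + " " + citizenship + " restaurant.",
          ];
        }

        // nationalityExamples("England", "English", "English") ==>
        // ["I am English.", "I come from England.", "I speak English.",
        //  "I go to an English restaurant."]
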
    Step 1 : Analysis of the target grammar

    The first step assumes an analysis of the resource grammar and extracts the information needed by the functions that build new lexical entries. A model is built so that the proper forms of the word can be rendered, and additional information, such as gender, can be inferred. The script applies these rules to each entry that we want to translate into the target language, and one obtains a set of constructions.

    Step 2 : Generation of examples in the target language

    The generated constructions are given to an external translator tool (Google translate) or to a native speaker for translation. One needs the configuration file even if the translator is human, because formal knowledge of grammar is not assumed.

    Step 3 : Parsing and decoding the examples with GF

    The translations into the target language are further processed in order to build the linearizations of the categories first, decoding the information received. Furthermore, having the words in the lexicon, one can parse the translations of functions with the GF parser and generalize from that.

    Step 4 : Evaluation and correction of the resulting grammar

    The resulting grammar is tested with the aid of the testing script, which generates constructions covering all the functions and categories from the grammar, along with some other constructions that proved to be problematic in some languages. A native speaker evaluates the results, and if corrections are needed, the algorithm runs again with the new examples. Depending on the language skills of the grammar writer, the changes can be made directly in the GF files, and the correct examples given by the native informant are just kept for validating the results. The algorithm is repeated as long as corrections are needed.

    The time spent preparing the configuration files for a grammar is a one-time cost, since the files are reusable for other applications. The time for the second step can be saved if automatic tools like Google Translate are used. This is only possible for languages with a simpler morphology and syntax, and with large corpora available. Good results were obtained for German and Dutch with Google Translate, but for languages like Romanian or Polish, which are both complex and lack sufficient resources, the results are discouraging.

    If the statistical oracle works well, the only step where the presence of a human translator is needed is the evaluation and feedback step. On average, 2 rounds of about 4 hours each were needed for the languages for which we performed the experiment. It is possible that more effort is needed for more complex languages.

    Further work will be done in building a more comprehensive tool for testing and evaluating the grammars; the impact of external machine translation tools from English to various target languages will also be analysed, so that the process can be automated to a higher degree in future work on grammars.

    6. Future and ongoing work

    Disambiguation
    Disambiguation grammars for languages other than English are in most cases still incomplete.
    Lexicon extension
    The extension of the abstract lexicon in Words, by hand or (semi)automatically, for items related to the categories of food, places, and actions will result in an immediate increase in the expressiveness of the phrasebook.
    Customizable phone distribution
    Allow the ad-hoc selection of any of the 2^15 language subsets when downloading the phrasebook to a phone.

    7. How to contribute

    The basic things "everyone" can do are:

    • complete missing words in concrete syntaxes
    • add new abstract words in Words and greetings in Greetings

    The missing concrete syntax entries are added to the WordsL.gf files for each language L. The morphological paradigms of the GF resource library should be used. Actions (prefixed with A, as in AWant) are a little more demanding, since they also require syntax constructors. Greetings (prefixed with G) are pure strings.

    Some explanations can be found in the implementation document, which is produced from the concrete syntax files SentencesI.gf and WordsEng.gf by make doc.

    Here are the steps to follow for contributors:

    1. Make sure you have the latest sources from GF Darcs, using darcs pull.
    2. Also make sure that you have compiled the library by make present in gf/lib/src/.
    3. Work in the directory gf/examples/phrasebook/.
    4. After you've finished your contribution, recompile the phrasebook by make pgf.
    5. Save your changes in darcs record . (in the phrasebook subdirectory).
    6. Make a patch file with darcs send -o my_phrasebook_patch, which you can send to GF maintainers.
    7. (Recommended) Test the phrasebook on your local server: a. Go to gf/src/server/ and follow the instructions in the project Wiki. b. Make sure that Phrasebook.pgf is available to your GF server (see project wiki). c. Launch lighttpd (see project wiki). d. Now you can open gf/examples/phrasebook/www/phrasebook.html and use your phrasebook!

    Finally, a few good practice recommendations:

    • Don't delete anything! But you are free to correct incorrect forms.
    • Don't change the module structure!
    • Don't compromise quality to gain coverage: non multa sed multum!

    8. Conclusions (tentative)

    The grammarian need not be a native speaker of the language. For many languages, the grammarian need not even know the language; native informants are enough. However, evaluation by native speakers is necessary.

    Correct and idiomatic translations are possible.

    A typical development time was 2-3 person working days per language.

    Google translate helps in bootstrapping grammars, but must be checked. In particular, we found it unreliable for morphologically rich languages.

    Resource grammars should give some more support, e.g. higher-level access to constructions like negative expressions, and large-scale morphological lexica.

    Acknowledgments

    The Phrasebook has been built in the MOLTO project funded by the European Commission. The authors are grateful to their native speaker informants helping to bootstrap and evaluate the grammars: Richard Bubel, Grégoire Détrez, Rise Eilert, Karin Keijzer, Michał Pałka, Willard Rafnsson, Nick Smallbone.

    MOLTO Phrasebook (version 1)

    Powered by GF, see doc. We also have a mobile-enhanced version.

    MOLTO Phrasebook Help

    The user interface is kept slim so as to also be usable from portable devices, e.g. mobile phones. These are the buttons and their functionality:

    • To start: click on a word or start typing.
    • From: source language
    • To: target language (either a single one or "All" simultaneously)
    • Del: delete last word
    • Clear: start over
    • Random: generate a random phrase
    • Google translate: send the current input and language choice to Google Translate; opens in a new window or tab.

    The symbol &+ means binding of two words. It will disappear in the complete translation.

    The translator slightly overgenerates, which means you can build some semantically strange phrases. Before reporting them as bugs, ask yourself: could this be correct in some situation? Is the translation valid in that situation?

    D10.3 MOLTO web service, final version


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - Multilingual Online Translation
    Deliverable: D10.3 MOLTO web service, final version
    Security (distribution level): Public
    Contractual date of delivery: M39
    Actual date of delivery: May 2013
    Type: Prototype
    Status & version: Final
    Author(s): Thomas Hallgren, Olga Caprotti et al.
    Task responsible: UGOT
    Other contributors: UPC, UHEL, Ontotext

    Abstract

    In this deliverable we document the web services that have been provided by the MOLTO project. Many of them have been released with dedicated deliverables; for those we do not go into the specific details. Instead we focus on the web services powering some of the MOLTO flagships at the end of the project's lifetime.

    1. The GF Cloud Service API

    The GF Cloud Service API exposes any compiled PGF grammar as a web service via the PGF web service API. In addition, it provides the functionality of some commands of the GF shell, as well as services for grammar compilation and persistent storage of files in the cloud. These features are used, for instance, in the implementation of the GF Simple Editor (http://cloud.grammaticalframework.org/gfse/), developed as a translators' tool during MOLTO.

    1.1 Availability and Protocol

    The service is available from http://cloud.grammaticalframework.org/. The source code for hosting a local version of the web service is included in the GF distribution, so users who have GF installed on their own computers can also run the service locally by starting GF with the parameter

    --server[=port] Run in HTTP server mode on given port (default 41296).

    Requests are made via HTTP with the GET or POST method. (The examples below show GET requests, but POST is preferred for requests that change the state on the server.) Data in requests is in the application/x-www-form-urlencoded format (the format used by default by web browsers when submitting form data). Data in responses is usually in JSON format. The HTTP response code is usually 200, but can also be 204 (after file upload), 404 (file to download or remove was not found), 400 (for unrecognized commands or missing/unacceptable parameters in requests) or 501 (for unsupported HTTP request methods). Unrecognized parameters in requests are silently ignored.
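
    As a minimal illustration (a sketch, not part of the service itself; it assumes the public cloud server and the Foods.pgf example grammar used later in this document), such a request can be made from JavaScript with the fetch API:

        // One GET request to the cloud service; the JSON response is decoded
        // and logged, e.g. [{"tree":"Pred (That Pizza) (Very Boring)"}].
        fetch("http://cloud.grammaticalframework.org/grammars/Foods.pgf?command=random")
          .then(response => response.json())
          .then(result => console.log(result));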

    More details on how to run the service are given in Deliverable 2.3 on page 7 under "Building a web application".

    1.2 PGF Service Requests

    The GF Cloud Service supports a set of PGF service requests, for example, a request like

    http://cloud.grammaticalframework.org/grammars/Foods.pgf?command=random

    might return a result like

    [{"tree":"Pred (That Pizza) (Very Boring)"}]

    The PGF Service in the GF Cloud is the application which exposes the PGF API as a web service. The application uses FastCGI to communicate with the web server; the data format is JSON. Information on how to compile and install the service can be found in the GF documentation.

    A compiled GF grammar can be used in web applications in the same way as JSP, ASP or PHP pages. The compiled PGF file is just placed somewhere in the web site directory. When there is a request for a .pgf file, the web server redirects the request to the GF web service. The service knows how to load the grammar and interpret the parameters given in the URL.

    If my_grammar.pgf is a grammar placed in the root folder of the web site for localhost, then the grammar can be accessed using this URL:

    http://localhost/my_grammar.pgf

    Since no parameters are passed in this case, the web service responds with some general information about the grammar, encoded in JSON format. To perform a specific command, you specify its name in the command parameter, e.g.:

    http://localhost/my_grammar.pgf?command=cmd

    where cmd is the name of the command. Usually every command also requires a specific list of other arguments, which are encoded as parameters as well. The list of all supported commands follows.

    Commands


    Grammar

    This command provides some general information about the grammar. It is also executed if no command parameter is given.

    Input

    Parameter Description Default
    command grammar -

    Output

    An object with the following fields:

    Field Description
    name the name of the abstract syntax in the grammar
    userLanguage the concrete language in the grammar which best matches the default language set in the user's browser
    categories list of all abstract syntax categories defined in the grammar
    functions list of all abstract syntax functions defined in the grammar
    languages list of concrete languages available in the grammar

    Every language is described by an object with these two fields:

    Field Description
    name the name of the concrete syntax for the language
    languageCode the two-character language code according to the ISO standard, e.g. en for English, bg for Bulgarian, etc.

    The language codes should be specified in the grammar because they are used to identify the user language. The web service receives the code of the language set in the browser and compares it with the codes defined in the grammar. If there is a match then the service returns the corresponding concrete syntax name. If no match is found then the first language in alphabetical order is returned.
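
    The following sketch illustrates the matching rule just described; it is an illustration of the behaviour, not the server's actual code:

        // Pick the concrete syntax whose languageCode matches the browser's code;
        // if none matches, fall back to the first language in alphabetical order.
        function userLanguage(languages, browserCode) {
          const match = languages.find(l => l.languageCode === browserCode);
          return match ? match.name : languages.map(l => l.name).sort()[0];
        }

        // With the Foods grammar from the examples in section 1.4:
        // userLanguage([{name:"FoodsEng", languageCode:"en-US"},
        //               {name:"FoodsSwe", languageCode:"sv-SE"}], "sv-SE") ==> "FoodsSwe"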


    Parsing

    This command parses a string and returns a list of abstract syntax trees.

    Input

    Parameter Description Default
    command parse -
    cat the start category for the parser the default start category for the grammar
    input the string to be parsed empty string
    from the name of the concrete syntax to use for parsing all languages in the grammar will be tried
    limit limit how many trees are returned (gf>3.3.3) no limit is applied

    Output

    A list of objects, where each object represents the analyses for one input language. The objects have the following fields:

    Field Description
    from the concrete language used in the parsing
    brackets the bracketed string from the parser
    trees list of abstract syntax trees
    typeErrors list of errors from the type checker

    The abstract syntax trees are sent as plain strings. The type errors are objects with two fields:

    Field Description
    fid forest id which points to a bracket in the bracketed string where the error occurs
    msg the text message for the error

    The current implementation either returns a list of abstract syntax trees or a list of type errors. By checking whether the field trees is not null we check whether the type checking was successful.
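
    A client could apply this check as follows (a minimal sketch, assuming the response shape described above):

        // A parse response carries either trees or type errors, so a non-null
        // "trees" field in every analysis signals that type checking succeeded.
        function parseSucceeded(results) {
          return results.every(r => r.trees !== null);
        }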


    Linearization

    The command takes an abstract syntax tree and produces a string in the specified language(s).

    Input

    Parameter Description Default
    command linearize -
    tree the abstract syntax tree to linearize -
    to the name of the concrete syntax to use in the linearization linearizations for all languages in the grammar will be generated

    Output

    Field Description
    to the concrete language used for the linearization
    text the output text

    Translation

    The translation is a two-step process. First the input sentence is parsed with the source language, and after that the output sentence(s) are produced via linearization with the target language(s). For that reason, the input and the output for this command are the union of the input/output of the parsing and linearization commands.

    Input

    Parameter Description Default
    command translate -
    cat the start category for the parser the default start category for the grammar
    input the input string to be translated empty string
    from the source language all languages in the grammar will be tried
    to the target language linearizations for all languages in the grammar will be generated
    limit limit how many parse trees are used (gf>3.3.3) no limit is applied

    Output

    The output is a list of objects with these fields:

    Field Description
    from the concrete language used in the parsing
    brackets the bracketed string from the parser
    translations list of translations
    typeErrors list of errors from the type checker

    Every translation is an object with two fields:

    Field Description
    tree abstract syntax tree
    linearizations list of linearizations

    Every linearization is an object with two fields:

    Field Description
    to the concrete language used in the linearization
    text the sentence produced

    The type errors are objects with two fields:

    Field Description
    fid forest id which points to a bracket in the bracketed string where the error occurs
    msg the text message for the error

    The current implementation either returns a list of translations or a list of type errors. By checking whether the field translations is not null we check whether the type checking was successful.
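
    As an illustration, assuming the Foods.pgf grammar from section 1.4 is placed in the web site root, a translation request from English to Swedish could look like:

        http://localhost/Foods.pgf?command=translate&from=FoodsEng&to=FoodsSwe&input=this+fish+is+fresh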


    Random_Generation

    This command generates a random abstract syntax tree, where the top-level function is of the specified category. The categories for the subtrees are determined by the type signatures of the parent function.

    Input

    Parameter Description Default
    command random -
    cat the start category for the generator the default start category for the grammar
    limit maximal number of trees generated 1

    Output

    The output is a list of objects with only one field:

    Field Description
    tree the generated abstract syntax tree

    The length of the list is limited by the limit parameter.
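
    For instance, following the URL pattern used above (and again assuming the Foods.pgf grammar), a request for three random trees would be:

        http://localhost/Foods.pgf?command=random&limit=3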


    Word_Completion

    Word completion is a special case of parsing. If there is an incomplete sentence then it is first parsed and after that the state of the parse chart is used to predict the set of words that could follow in a grammatically correct sentence.

    Input

    Parameter Description Default
    command complete -
    cat the start category for the parser the default start category for the grammar
    input the string to the left of the cursor that is already typed empty string
    from the name of the concrete syntax to use for parsing all languages in the grammar will be tried
    limit maximal number of completions returned all words will be returned

    Output

    The output is a list of objects with two fields which describe the completions.

    Field Description
    from the concrete syntax for this word
    text the word itself
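
    For example, assuming the Foods.pgf grammar again, a completion request for the incomplete sentence "that pizza is" could be:

        http://localhost/Foods.pgf?command=complete&from=FoodsEng&input=that+pizza+is

    and the response might include entries such as {"from":"FoodsEng","text":"very"}.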

    Abstract Syntax Tree Visualization

    This command renders an abstract syntax tree into an image in PNG format.

    Input

    Parameter Description Default
    command abstrtree -
    tree the abstract syntax tree to render -
    format output format (gf>3.3.3) PNG

    Output

    By default, the output is an image in PNG format. The content-type is set to image/png, so the easiest way to visualize the generated image is to add an HTML <img> element which points to the URL for the visualization command, e.g.:

    <img src="http://localhost/my_grammar.pgf?command=abstrtree&tree=..."/>
    

    The output can also be in GIF ('image/gif'), SVG ('image/svg+xml') or GV (Graphviz) format by setting the 'format' option.


    Parse Tree Visualization

    This command renders the parse tree that corresponds to a specific abstract syntax tree. The generated image is in PNG format.

    Input

    Parameter Description Default
    command parsetree -
    tree the abstract syntax tree to render -
    from the name of the concrete syntax to use in the rendering -
    format output format (gf>3.3.3) PNG
    options additional rendering options (gf>3.4) -

    The additional rendering options are: noleaves, nofun and nocat (booleans, false by default); nodefont, leaffont, nodecolor, leafcolor, nodeedgestyle and leafedgestyle (strings, with built-in defaults).

    Output

    By default, the output is an image in PNG format. The content-type is set to 'image/png', so the easiest way to visualize the generated image is to add an HTML <img> element which points to the URL for the visualization command, e.g.:

    <img src="http://localhost/my_grammar.pgf?command=parsetree&tree=..."/>
    

    The output can also be in GIF ('image/gif'), SVG ('image/svg+xml') or GV (Graphviz) format by setting the 'format' option.


    Word Alignment Diagram

    This command renders the word alignment diagram for some sentence and all languages in the grammar. The sentence is generated from a given abstract syntax tree.

    Input

    Parameter Description Default
    command alignment -
    tree the abstract syntax tree to render -
    format output format (gf>3.3.3) PNG
    to list of languages to include in the diagram (gf>3.4) all languages supported by the grammar

    Output

    By default, the output is an image in PNG format. The content-type is set to 'image/png', so the easiest way to visualize the generated image is to add an HTML <img> element which points to the URL for the visualization command, e.g.:

    <img src="http://localhost/my_grammar.pgf?command=alignment&tree=..."/>
    

    The output can also be in GIF ('image/gif'), SVG ('image/svg+xml') or GV (Graphviz) format by setting the 'format' option.

    1.3 GF Shell Service

    This service lets you execute arbitrary GF shell commands. Before you can do this, you need to use the /new command to obtain a working directory (which also serves as a session identifier) on the server, see below.

    /gfshell?dir=...&command=i+Foods.pgf
     
    /gfshell?dir=...&command=gr
    Pred (That Pizza) (Very Boring)
    /gfshell?dir=...&command=ps+-lextext+%22That+pizza+is+very+boring.%22
    that pizza is very boring .

    For documentation of GF shell commands, see the GF shell reference in the GF documentation.

    Additional cloud service

    /new
    This generates a new working directory on the server, e.g. /tmp/gfse.123456. Most of the cloud service commands require that a working directory is specified in the dir parameter. The working directory is persistent, so clients are expected to remember and reuse it. Access to previously uploaded files requires that the same working directory is used.
    /parse?path=source
    This command can be used to check GF source code for syntax errors. It also converts GF source code to the JSON representation used in GFSE (the cloud-based GF grammar editor).
    /cloud?dir=...&command=upload&path1=source1&path2=source2&...
    Upload files to be stored in the cloud. The response code is 204 if the upload was successful.
    /cloud?dir=...&command=make&path1=source1&path2=source2&...
    Upload grammar files and compile them into a PGF file. Example response:

        { "errorcode":"OK", // "OK" or "Error"
          "command":"gf -s -make FoodsEng.gf FoodsSwe.gf FoodsChi.gf",
          "output":"\n\n" // Warnings and errors from GF
        }
    /cloud?dir=...&command=remake&path1=source1&path2=source2&...
    Like command=make, except that the source parts can be left empty to reuse previously uploaded files.
    /cloud?dir=...&command=download&file=path
    Download the specified file.
    /cloud?dir=...&command=ls&ext=.pgf
    List files with the specified extension, e.g. ["Foods.pgf","Letter.pgf"].
    /cloud?dir=...&command=rm&file=path
    Remove the specified file.
    /cloud?dir=...&command=link_directories&newdir=...
    Combine server directories. This is used by GFSE to share grammars between multiple devices.
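
    A minimal sketch of this workflow from JavaScript (assuming a local GF server on the default port; that /new returns the new directory name in the response body is an assumption here, and the grammar content is a placeholder):

        const base = "http://localhost:41296";

        async function compileInCloud(source) {
          // Obtain a persistent working directory (also the session identifier).
          const dir = (await (await fetch(base + "/new", {method: "POST"})).text()).trim();
          // Upload one grammar file and compile it: parameter name = file path,
          // parameter value = file contents, as described for command=make above.
          const form = new URLSearchParams();
          form.append("dir", dir);
          form.append("command", "make");
          form.append("Hello.gf", source);
          const result = await (await fetch(base + "/cloud", {method: "POST", body: form})).json();
          console.log(result.errorcode, result.output); // "OK" or "Error", plus GF messages
          return dir; // reuse the directory to access the uploaded files later
        }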

    1.4 Examples

    GF can be used interactively from the GF Shell. Some of the functionality available in the GF shell is also available via the GF web services API.

    The GF Web Service API page describes the calls supported by the GF web service API. Below, we illustrate these calls by examples, and also show how to make these calls from JavaScript using the API defined in pgf_online.js.

    Note that pgf_online.js was initially developed with one particular web application in mind (the minibar), so the server API was incomplete. It was simplified and generalized in August 2011 to support the full API.

    In each example below, the first part shows what the call looks like in the JavaScript API defined in pgf_online.js, the second shows the corresponding URL sent to the PGF server, and the third shows the JSON (JavaScript data structures) returned by the PGF server, which is passed to the callback function supplied in the call.

    Initialization

    // Select which server and grammars to use:

    var server_options = {
        grammars_url: "http://www.grammaticalframework.org/grammars/",
        grammar_list: ["Foods.pgf"] // It's ok to skip this
    };
    var server = pgf_online(server_options);
    

    Examples

    // Get the list of available grammars

    server.get_grammarlist(callback)
    http://localhost:41296/grammars/grammars.cgi
    ["Foods.pgf","Phrasebook.pgf"]
    

    // Select which grammar to use

    server.switch_grammar("Foods.pgf")
    

    // Get list of concrete languages and other grammar info

    server.grammar_info(callback)
    http://localhost:41296/grammars/Foods.pgf
           {"name":"Foods",
            "userLanguage":"FoodsEng",
            "startcat":"Comment",
            "categories":["Comment","Float","Int","Item","Kind","Quality","String"],
            "functions":["Boring","Cheese","Delicious","Expensive","Fish","Fresh",
                               "Italian","Mod","Pizza","Pred","That","These","This","Those","Very",
                               "Warm","Wine"],
            "languages":[{"name":"FoodsBul","languageCode":""},
                                {"name":"FoodsEng","languageCode":"en-US"},
                                {"name":"FoodsFin","languageCode":""},
                                {"name":"FoodsSwe","languageCode":"sv-SE"},
                                ...]
              }
    

    // Get a random syntax tree

    server.get_random({},callback)
    http://localhost:41296/grammars/Foods.pgf?command=random
           [{"tree":"Pred (That Pizza) (Very Boring)"}]
    

    // Linearize a syntax tree

    server.linearize({tree:"Pred (That Pizza) (Very Boring)",to:"FoodsEng"},callback)
    http://localhost:41296/grammars/Foods.pgf?command=linearize&amp;tree=Pred+(That+Pizza)+(Very+Boring)&amp;to=FoodsEng
           [{"to":"FoodsEng","text":"that pizza is very boring"}]
           server.linearize({tree:"Pred (That Pizza) (Very Boring)"},callback)
    
     http://localhost:41296/grammars/Foods.pgf?command=linearize&amp;tree=Pred+(That+Pizza)+(Very+Boring)
         [{"to":"FoodsBul","text":"онази пица е много еднообразна"},
         {"to":"FoodsEng","text":"that pizza is very boring"},
         {"to":"FoodsFin","text":"tuo pizza on erittäin tylsä"},
         {"to":"FoodsSwe","text":"den där pizzan är mycket tråkig"},
         ...
         ] 
    

    // Parse a string

    server.parse({from:"FoodsEng",input:"that pizza is very boring"},callback)
    http://localhost:41296/grammars/Foods.pgf?command=parse&input=that+p...
           [{"from":"FoodsEng",
             "brackets":{"cat":"Comment","fid":10,"index":0,
             "children":[{"cat":"Item","fid":7,"index":0,
             "children":[{"token":"that"},{"cat":"Kind","fid":6,"index":0,
             "children":[{"token":"pizza"}]}]},    
            {"token":"is"},{"cat":"Quality","fid":9,"index":0,
              "children":[{"token":"very"},{"cat":"Quality","fid":8,"index":0,
              "children":[{"token":"boring"}]}]}]},
              "trees":["Pred (That Pizza) (Very Boring)"]}]
    

    // Translate to all available languages

    server.translate({from:"FoodsEng",input:"that pizza is very boring"},callback)
    ...
    

    // Translate to one language

    server.translate({input:"that pizza is very boring", from:"FoodsEng", to:"FoodsSwe"}, callback)
    http://localhost:41296/grammars/Foods.pgf?command=translate&input=th...
          [{"from":"FoodsEng",
            "brackets":{"cat":"Comment","fid":10,"index":0,
            "children":[{"cat":"Item","fid":7,"index":0,
            "children":[{"token":"that"},{"cat":"Kind","fid":6,"index":0,
            "children":  [{"token":"pizza"}]}]},{"token":"is"},{"cat":"Quality","fid":9,"index":0,
            "children":[{"token":"very"},{"cat":"Quality","fid":8,"index":0,"children":[{"token":"boring"}]}]}]},
            "translations":
            [{"tree":"Pred (That Pizza) (Very Boring)",
              "linearizations":
               [{"to":"FoodsSwe",
                  "text":"den där pizzan är mycket tråkig"}]}]}]
    

    // Get completions (what words could come next)

    server.complete({from:"FoodsEng",input:"that pizza is very "},callback)
    http://localhost:41296/grammars/Foods.pgf?command=complete&input=tha...
          [{"from":"FoodsEng", "brackets":{"cat":"_","fid":0,"index":0,
            "children":[{"cat":"Item","fid":7,"index":0,
            "children":[{"token":"that"},{"cat":"Kind","fid":6,"index":0,
            "children":[{"token":"pizza"}]}]},{"token":"is"},{"token":"very"}]},
            "completions":["boring","delicious","expensive","fresh","Italian","very","warm"],
            "text":""}]
    

    // Get info about a category in the abstract syntax

    server.browse({id:"Kind"},callback)
    http://localhost:41296/grammars/Foods.pgf?command=browse&id=Kind&...
          {"def":"cat Kind", "producers":["Cheese","Fish","Mod","Pizza","Wine"],
           "consumers":["Mod","That","These","This","Those"]}
    

    // Get info about a function in the abstract syntax

    server.browse({id:"This"},callback)
    http://localhost:41296/grammars/Foods.pgf?command=browse&id=This&...
           {"def":"fun This : Kind -> Item","producers":[],"consumers":[]}
    

    // Get info about all categories and functions in the abstract syntax

    server.browse({},callback)
    http://localhost:41296/grammars/Foods.pgf?command=browse&amp;format=json
           {"cats":{"Kind":{"def":"cat Kind",
                 "producers":["Cheese","Fish","Mod","Pizza","Wine"],
                 "consumers":["Mod","That","These","This","Those"]},
         ...},
             "funs":{"This":{"def":"fun This : Kind -&gt; Item","producers":[],"consumers":[]},
         ...}
         }
    

    // Convert an abstract syntax tree to JSON

    server.pgf_call("abstrjson",{tree:"Pred (That Pizza) (Very Boring)"},callback)
    
    http://localhost:41296/grammars/Foods.pgf?command=abstrjson&tree=Pred+(That+Pizza)+(Very+Boring)
           {"fun":"Pred","fid":4,
            "children":[{"fun":"That","fid":1,
              "children":[{"fun":"Pizza","fid":0}]},
             {"fun":"Very","fid":3,
              "children":[{"fun":"Boring","fid":2}]}]}
    

    2. MOLTO Application Grammars

    At the beginning of the project, we published the MOLTO Phrasebook as an example application grammar. For the final version of our online service, we show all the relevant GF application grammars that have been developed in the various work-packages as supporting grammars for larger applications. Each example in this collection can be used by a new GF grammar developer as a starting point that can be further extended. In this deliverable we briefly document the grammars and the online applications that use them, and give quick hints on where extensions can occur in future work.

    The Geography Grammar

    This grammar was originally developed for the semantic multilingual wiki system AceWiki-GF, as documented in Deliverable D11.3. The grammar can be used online at http://attempto.ifi.uzh.ch/acewiki-gf/.

    It currently supports 3 languages: ACE, German and Spanish, where ACE is a formal language used for automated reasoning. A 500-word geography domain vocabulary has been created to describe Europe.

    ACE is represented by two languages, Ace and Ape. Ape linearizations contain explicit lexical entries so that the ACE parser (APE) can be used to map the sentences of this grammar to OWL. The wiki shows how this mapping works.

    A snapshot of the grammar is available at http://www.molto-project.eu/biblio/software/geographypgf.

    The MOLTO Phrasebook

    The MOLTO Phrasebook was the first demonstrator of the features of the Grammatical Framework technology, online since M3 of the project's lifetime. The application grammar was designed to serve as a model of best practices. It shows a modular approach to the definition of abstract types and functions from the domain of travelers' phrasebooks, covering natural language for giving directions, ordering a meal, and greeting friends. It has categories for Citizenship, Country, Currency, Date and week Day, Digits, DrinkKind and MassKind, Languages, Greetings and many more. It covers 20 languages: Eng, Bul, Cat, Dan, Dut, Fin, Fre, Ger, Hin, Ita, Lav, Nor, Pes, Pol, Ron, Rus, Spa, Swe, Tha, Urd. It has a module that handles disambiguation in Eng and in Ron.

    The final version is online at http://www.molto-project.eu/cloud/gf-application-grammars by selecting Phrasebook.pgf as the application.

    The repository for the grammar file itself is at http://www.molto-project.eu/biblio/software/phrasebookpgf.

    The Mathematical Grammars

    MathBar.pgf

    MathBar.pgf is the application grammar developed for the mathematical natural language domain. It supports the following languages: Fre, Cat, Spa, Eng and Fin. More languages are available but have not been checked for quality. The Mathematical Grammar Library (MGL) defines a specialized language in which textual fragments are interspersed with formal fragments represented in the typesetting language LaTeX.

    The source files are distributed via svn:

    URL: svn://molto-project.eu/mgl
    Repository Root: svn://molto-project.eu
    Repository UUID: 54d65b75-f25a-4862-968f-dc0a3298bc6b
    Revision: 2432

    The compiled PGF grammar is available from http://www.molto-project.eu/biblio/software/mathbarpgf.

    Commands.pgf

    Commands.pgf is the application grammar developed for natural language I/O to the Sage computer algebra system. It translates input queries and output answers into natural language of a mathematical nature. Users can ask for computations related to arithmetic, domain and range of functions, differentiation and integration. It also supports a referential mechanism via the pronoun it, which links back to the previous result in a session of sequential computations. English, German and Spanish are currently supported.

    Dialog.pgf

    Dialog.pgf translates the natural language interactions of the word problems prototype documented in Deliverable D6.3. It is used to give hints in the student's language and to formalize the students' answers or commands as Prolog statements that can be reasoned about automatically. It is an example of how a description of a specific world situation (owning fruits, animals in a farm) can be interpreted and formalized. Catalan, English, Spanish and Swedish are currently supported, as is the programming language Prolog. SVN info for compilation from source:

    URL: svn://molto-project.eu/mgl/wproblems
    Repository Root: svn://molto-project.eu
    Repository UUID: 54d65b75-f25a-4862-968f-dc0a3298bc6b
    Revision: 2432
    GF version for compilation: Grammatical Framework (GF) version 3.4-darcs

    The version archived and deployed on the MOLTO cloud is http://www.molto-project.eu/biblio/software/dialogpgf.

    The Painting Grammar

    The work-package dealing with the domain of cultural heritage has focused on the description of museum artefacts, in particular paintings. While the description of the subject matter of a painting is an open domain, the other characteristics of a painting can be described by a constrained natural language tightly coupled with the underlying knowledge representation used by museum curators. The design of this grammar has been based on sample descriptions of paintings retrieved from the Gothenburg City Museum and has been further applied to generate descriptions of artefacts stored on public web pages, such as DBPedia.

    One major discussion has concerned the identification of entity names: museum names, as well as the names of famous painters or masterpieces, are often translated ad hoc. For such cases it is hard to create grammar-based translation rules; consider for instance Mona Lisa, in Italian often referred to as La Gioconda. The approach taken in this work package has been not to translate the entity names found in the knowledge base, while investigating whether historically there could be a given title or name that could be taken as a universally valid identifier for that entity. Since, to our knowledge, there seems to be no agreement by museum curators on unique resource identifiers (whereas, for instance, in the publishing world there have been efforts to uniquely index published material), we have adopted a naming based on the resource descriptors we retrieved in our samples. In terms of future web application building, we are aware that resource identification and/or retrieval by common name is not as sound as by unique ID.

    This grammar is also modularly designed and assembles categories that are used to represent location, material, colour, dimension, type of work, and the painter's biographical data. The most relevant feature of this grammar is the construction of a description as a sequence of phrases related to the same artefact, using referential chains to build up a coherent discourse. Please see the list of publications tagged with WP8 for further information about the comparative study of texts in the cultural heritage domain and about the background knowledge base underlying the ontology from which texts in 15 languages are generated.

    The grammar files are available on svn at svn://molto-project.eu/wp8/d8.3/grammars/

    The demo webpage is available at: http://museum.ontotext.com/

    Grammar characteristics

    The version of the grammar on display at the MOLTO Application Grammar web service (TextPainting.pgf) features:

    • The following start categories:
      - Main category: Description
      - 9 semantic categories which represent the ontology classes: Colour, Material, Museum, Painter, Painting, PaintingType, Size, Title, and Year. Of these, 5 are optional, hence the additional 'Opt' categories.
      - 3 category types: String, Int, Float
      - 1 grammatical category for creating nested colour strings: ListColour

    • Support for 15 languages: Bulgarian (Bul), Catalan (Cat), Danish (Dan), Dutch (Dut), English (Eng), Finnish (Fin), French (Fre), Hebrew (Heb), Italian (Ita), German (Ger), Norwegian (Nor), Romanian (Rom), Russian (Rus), Spanish (Spa), Swedish (Swe).

    • Generation of texts up to three sentences long, where each sentence may be constructed with different semantic categories. For example, consider the first sentence of a description:

      Forest[PAINTING] was painted by Paul Cezanne[PAINTER] in 1902[YEAR].

      Forest[PAINTING] was painted on canvas[MATERIAL] by Paul Cezanne[PAINTER] in 1902[YEAR].

    • Variation of the syntactic element that refers to the entity in sentence-initial position, e.g.:

      Forest was painted by Paul Cezanne in 1902. It[Pronoun] is painted in green and blue.

      Forest was painted by Paul Cezanne in 1902. This painting[NounPhrase] is displayed at the National Gallery of Canada.

    Restrictions of the grammar

    As mentioned above, the names of the paintings and painters have been left untranslated. Since museum names have been translated automatically, some translations are missing; names of two or three words therefore contain underscores.

    In Hebrew, texts with names that are missing translations show wrong ordering of the words in a sentence.

    The Patent Query Grammar

    This grammar is used to translate user queries into SPARQL. It contains 4 languages: English, German, French and a concrete syntax corresponding to SPARQL. Since the grammar is adapted to the patents domain, the constructors of the abstract syntax describe individual queries that depend on the domain. The SPARQL mappings are thus written in a gap-filling fashion, by specifying the query with gaps for the arguments.

    More details can be found in the deliverables released by Work-package 7.

    The sources are at svn://molto-project.eu/wp7/query/grammars.

    The Words300 Grammar

    The Words300 grammar was produced to evaluate the correctness of the multilingual translation of ACE sentences offered by the ACE-in-GF grammar. The grammar contains ~300 words from the GF resource grammar library (RGL), namely the words from the ACE word classes common noun, transitive verb and proper name. Currently, most of the RGL languages are included, altogether 21 languages. For a description of the evaluation, see D11.3.

    Note that the English sentences that this grammar produces are not always valid ACE sentences, due to "spaces in content words", which are not allowed in ACE. For example, the grammar supports "For which computer does John wait?" while ACE requires "Which computer does John wait-for?".

    The grammar can be used in a wiki at: http://attempto.ifi.uzh.ch/acewiki-gf/gf/Words300/main/

    A snapshot of the grammar is available at http://www.molto-project.eu/biblio/software/words300pgf.

    3. Sample WADL for a GF Application Grammar

    The Web Application Description Language, WADL (http://www.w3.org/Submission/wadl/), is a specification language for HTTP-based Web applications that can be read and processed automatically to generate web service clients. In combination with an API platform such as Apigee (http://apigee.com), it is possible to expose the API of a web service to developers of third-party web applications so that they can quickly integrate further services, for instance authentication, data logging, and performance monitoring.

    For a GF grammar developer, writing a WADL specification for the grammar is a quick way to expose the translation command invocation details in a machine-processable way. Any PGF-compiled GF grammar can be fed to the GF Web Service along with specific commands and query parameters to provide, for instance, parsing, linearization, and random tree generation according to the GF Web Service API. The documentation is available at http://code.google.com/p/grammatical-framework/wiki/GFWebServiceAPI. The web application running the GF web service is distributed in the regular GF distribution. A Java frontend was developed during the MOLTO project, http://www.molto-project.eu/biblio/software/gf-java-master, and is being maintained on GitHub, https://github.com/Kaljurand/GF-Java.

    The example WADL specification file for web services powered by the TextPainting.pgf grammar hosted on the Grammatical Framework cloud server and deployed on Apigee, as seen in the figure below, is available at http://www.molto-project.eu/biblio/web-service/textpaintingpgf. It exposes the GET command for retrieving the grammar information and the GET command for retrieving a random production in any of the available categories.

    The designer of the web service for translating painting descriptions might decide to expose a very specific command, for instance only the parsing of descriptions in Italian. This is possible by carefully selecting what to describe in the WADL specification, i.e. by not exposing the full generality of the grammar. Grammars that are stable only in certain categories, for instance because of increasing complexity in their modular stepwise development, can in this way be deployed while still under development, provided the only web services exposed are the stable ones.

    4. Future work

    GF-compiled grammars deployed as web services can offer valuable translation and parsing functionality to developers of online applications. With the work done during the MOLTO project we have only begun to experiment with GF-powered web services, and the results have been positive.

    To further the adoption of GF and MOLTO technologies for high-quality translation of web applications, it would be important to make the machine-processable specification of the services, for instance as WADL or SOAP, directly available as an export command in the GF Web Service API. The client applications for the web services exposed by an application grammar could then be generated automatically, allowing very fast prototyping. Software that generates web clients based on SOAP or WADL already exists.

    D10.4 MOLTO Dissemination and Exploitation Report


    Contract No.: ICT-FP7-ICT-247914 and 288317
    Project full title: MOLTO - Enlarged EU, Multilingual Online Translation
    Deliverable: 10.4
    Security (distribution level): PU
    Contractual date of delivery: M39
    Actual date of delivery: 30 May 2013
    Type: Report
    Status & version: Final
    Author(s): O. Caprotti, B. Popov, J. van Aart
    Task responsible: UGOT
    Other contributors:


    Abstract

    The final dissemination and exploitation report discusses how the MOLTO project has informed the public of its results. The industrial partners of the Consortium, Ontotext and Be Informed, are the main contributors to the exploitation plan for the technologies developed by MOLTO. Exploitation of MOLTO aims to pursue sustainability for the tools and technologies and to further their uptake.

    1. Introduction

    In the MOLTO initial plan for dissemination, we proposed to carry out the task in the following way:

    > Dissemination on conferences, symposiums and workshops will be in the areas of language technology and translation, semantic technologies, and information retrieval and will include papers, posters, exhibition booths and sponsorships (by Ontotext at web and semantic technology conferences like ISWC, WWW, SemTech), and academic/professional events such as the Information Retrieval Facility Symposium. We will also organize a set of MOLTO workshops for the expert audience, featuring invited speakers and potential users from academy and industry.

    Here we report on what has been done to popularize the work done in MOLTO and make the language translation community aware of the project.

    Additionally, this deliverable contains a plan for further exploitation of the project's results. In the longer run, as outlined in the Strategic Research Agenda for Multilingual Europe in 2020 by the META Technology Council, Language Technology is expected to enable forms of knowledge evolution, knowledge transmission, and knowledge exploitation that speed up scientific, social, and cultural development. Any exploitation of MOLTO results will have to take into account the themes of this research agenda. Some of these trends have already started. For instance, Theme 1, the translation cloud, is the trend that fits the MOLTO web services living in the cloud. Some of the MOLTO application grammars in the cloud do indeed provide "services for instantaneous reliable spoken and written translation among all European and major non-European languages".

    2. Dissemination activities

    During the lifetime of the project we have pursued many ways of informing the relevant stakeholders about the progress of the research and development of MOLTO tools. The user community for MOLTO technologies comprises academics working in the areas of computational linguistics and the semantic web, but also members of industry offering services such as translation of web pages and online content, from e-Government and business logic to cultural heritage, patents in pharmacology, and creators of resources for e-learning of mathematics.

    Below we describe the ways in which this broad user community has been targeted.

    2.1 International Conferences and Meetings

    Because of the Open Access Clause, we had to make sure that the copyright policy for the proceedings of chosen conferences and meetings would allow distribution of the publication also on the partners' Open Access Servers. This is the list of Open Access servers that are also distributing the MOLTO publications:

    Below is the list obtained from the project web pages by fetching the publications registered by the authors as Conference Papers.

    Asiya: An Open Toolkit for Automatic Machine Translation (Meta-)Evaluation, Giménez, Jesús, and Màrquez Lluís, Fifth Machine Translation Marathon, Volume 94, Le Mans, (2010)
    Comparing human perceptions of post-editing effort with post-editing operations, Koponen, Maarit, Proceedings of the Seventh Workshop on Statistical Machine Translation, June, Montréal, Canada, p.181–190, (2012)
    Computational evidence that Hindi and Urdu share a grammar but not the lexicon, Prasad, K. V. S., and Virk Shafqat, 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP), COLING 2012, (2012)
    Controlled Language for Everyday Use: the MOLTO Phrasebook, Ranta, Aarne, Enache Ramona, and Détrez Grégoire, Controlled Natural Languages Workshop (CNL 2010), Marettimo, Italy, (2011)
    Creating Linguistic Resources with GF, Ranta, Aarne, LREC 2010, 05/2010, Valletta, Malta, (2010)
    Deep evaluation of hybrid architectures: simple metrics correlated with human judgments, Labaka, Gorka, Díaz De Ilarraza Arantza, España-Bonet Cristina, Sarasola Kepa, and Màrquez Lluís, International Workshop on Using Linguistic Information for Hybrid Machine Translation, 11/2011, Barcelona, Spain, p.50-57, (2011)
    Document-level Automatic MT Evaluation based on Discourse Representations, Comelles, Elisabet, Giménez Jesús, Màrquez Lluís, Castellón Irene, and Arranz Victoria, 5th Workshop on Statistical Machine Translation (IWMT 2010), (2010)
    A Framework for Improved Access to Museum Databases in the Semantic Web, Dannélls, Dana, Damova Mariana, Enache Ramona, and Chechev Milen, Recent Advances in Natural Language Processing, 09/2011, Hissar, Bulgaria, (2011)
    Full Machine Translation for Factoid Question Answering, España-Bonet, Cristina, and Comas Pere R., EACL Workshop on Exploiting Synergies between Information Retrieval and Machine Translation, Avignon, France, (In Press)
    General Architecture of a Controlled Natural Language Based Multilingual Semantic Wiki, Kaljurand, Kaarel, Third Workshop on Controlled Natural Language (CNL 2012), 09/2012, Volume 7427, Berlin / Heidelberg, Germany, p.110–120, (2012)
    On generating coherent multilingual descriptions of museum objects from Semantic Web ontologies, Dannélls, Dana, International Conference on Natural Language Generation (INLG), 2012, (2012)
    The GF Mathematical Grammar Library, Caprotti, Olga, and Saludes Jordi, Conference on Intelligent Computer Mathematics / OpenMath Workshop, 07/2012, (2012)
    The GF Mathematics Library, Saludes, Jordi, and Xambó Sebastian, Proceedings First Workshop on CTP Components for Educational Software (THedu'11), 02/2012, Electronic Proceedings in Theoretical Computer Science, Number 79, Wrocław, Poland, p.102–110, (2011)
    How Much do Grammars Leak?, Angelov, Krasimir, COLING 2012, (Submitted)
    Hybrid Machine Translation Guided by a Rule-Based System, España-Bonet, Cristina, Labaka Gorka, Díaz De Ilarraza Arantza, Màrquez Lluís, and Sarasola Kepa, Machine Translation Summit, 09/2011, Xiamen, China, p.554-561, (2011)
    An IDE for the Grammatical Framework, Camilleri, John, Third International Workshop on Free/Open-Source Rule-Based Machine Translation (FreeRBMT 2012), 06/2012, (2012)
    MOLTO Enlarged EU - Multilingual Online Translation, Caprotti, Olga, and Ranta Aarne, 16th Annual Conference of the European Association for Machine Translation, 05/2012, Trento, Italy, (2012)
    The MOLTO Phrasebook, Angelov, Krasimir, Caprotti Olga, Enache Ramona, Hallgren Thomas, and Ranta Aarne, Third Swedish Language Technology Conference (SLTC-2010), 10/2010, Linköping, (2010)
    Multilingual Packages of Controlled Languages: An Introduction to GF, Ranta, Aarne, CNL-2010 (2nd Workshop on Controlled Natural Language), 14 September, Marettimo, Sicily, (2010)
    A Multilingual Semantic Wiki Based on Attempto Controlled English and Grammatical Framework, Kaljurand, Kaarel, and Kuhn Tobias, Proceedings of the 10th Extended Semantic Web Conference (ESWC 2013), (2013)
    Patent translation within the MOLTO project, España-Bonet, Cristina, Enache Ramona, Slaski Adam, Ranta Aarne, Màrquez Lluís, and Gonzàlez Meritxell, Workshop on Patent Translation, MT Summit XIII, 09/2011, p.70-78, (2011)
    Reason-able View of Linked Data for Cultural Heritage, Damova, Mariana, and Dannélls Dana, The Third International Conference on Software, Services & Semantic Technologies (S3T), 09/2011, Bourgas, Bulgaria, (2011)
    Robust Estimation of Feature Weights in Statistical Machine Translation, España-Bonet, Cristina, and Màrquez Lluís, 14th Annual Conference of the European Association for Machine Translation, EAMT-2010, (2010)
    Smart Paradigms and the Predictability and Complexity of Inflectional Morphology, Détrez, Grégoire, and Ranta Aarne, EACL (European Association for Computational Linguistics), 04/2012, Avignon, (2012)
    Tools for Multilingual Grammar-Based Translation on the Web, Ranta, Aarne, Angelov Krasimir, and Hallgren Thomas, ACL 2010, July 2010, Uppsala, Sweden, (2010)
    Typeful Ontologies with Direct Multilingual Verbalization, Angelov, Krasimir, and Enache Ramona, Controlled Natural Languages Workshop (CNL 2010), 09/2010, Marettimo, Italy, (2011)
    Using GF in multimodal assistants for mathematics, Archambault, Dominique, Caprotti Olga, Ranta Aarne, and Saludes Jordi, Digitization and E-Inclusion in Mathematics and Science 2012, 02/2012, (2012)

    2.2 Journal Publications

    Journal publication is a longer process than publication in conference proceedings, so one might expect it to occur after the end of a project's lifetime, as an archival medium for those results considered long-lasting and of permanent value. In MOLTO we have nevertheless already succeeded in listing the following journal publication:

    Books and proceedings have also been published; one is currently being translated into Chinese.

    The following work has appeared as chapters of books:

    Several junior members of the MOLTO team have completed their studies and written master's theses or dissertations related to tasks carried out as part of a work package. Some are continuing work that began in MOLTO as part of their thesis research. They include:

    2.3 Project's Events

    Project's Meetings

    The project meetings were held every six months and always included a day open to participants from outside the Consortium: the MOLTO Open Day. The talks delivered by the MOLTO project members targeted a general audience, with no specific background assumed beyond an interest in the goals of MOLTO. The presentations are all available from the project's web site.

    Workshops and Seminars

    Additionally the project organized focused meetings:

    • A. Ranta and R. Enache from UGOT gave a GF tutorial during the exchange visit to UHEL on May 4-5 2010.
    • GF Meets SMT 1-5 November 2010, Gothenburg, Sweden.
    • WP3 Seminar 13-14 June 2011, Helsinki, Finland.
    • FreeRBMT 2012, the Third International Workshop on Free/Open-source Rule-Based Machine Translation was organized and hosted by UGOT between 13-15 June 2012 in Gothenburg, Sweden.
    • Public Sector Seminar 19 September 2012, Utrecht, the Netherlands.
    • In-house GF training for Be Informed. A 2-day GF course (http://www.molto-project.eu/biblio/slide-presentation/gf-tutorial) was given at Be Informed (Apeldoorn) on 11-12 Dec 2012 by Kaarel Kaljurand from UZH. The course covered all the material of the existing GF tutorials and additionally presented some GF-based applications (the multilingual CNL-based semantic wiki system developed in MOLTO, as well as GF-based speech recognition grammars and smart phone applications).
    • LEMON_GF workshop on 13-14 December 2013 in Bielefeld, Germany.

    Summer Schools and Tutorials

    The GF Summer School is a biennial event; the following editions were partly sponsored by the project:

    • Second GF Summer School, Frontiers of Multilingual Technologies. 15-26 August 2011, Barcelona, Spain.
    • FORTHCOMING: Third GF Summer School, Scaling up Grammatical Resources. 18–30 August 2013. Frauenchiemsee island, Bavaria.

    A number of GF tutorials were held during the past years, during which the MOLTO tools were shown and actively used:

    2.4 Press Releases

    Press releases were issued at the beginning of the project and have already been reported in Deliverable 10.1. Each public event has been publicized by the local organizers via their official channels, so that announcements have appeared in calendars, bulletins and mailing lists.

    2.5 Liaison Activities

    Liaison with other EU-funded projects in the area of Computational Linguistics took place at international meetings where MOLTO was presented. The most relevant result to report, however, is the joint work carried out with the MONNET project after organizing a joint meeting and a joint workshop, reported above.

    On December 13 and 14, 2012, PortDial members from Bielefeld met with Aarne Ranta (Grammatical Framework), Jeroen van Grondelle, Frank Smit and Jouri Fledderman from Be Informed, and John McCrae (lemon) in order to discuss the mapping from ontology-lexica to grammars, as well as the modular combination of induced domain grammars with dialog task grammars. The meeting gave rise to new ideas for the top-down grammar induction process being implemented. Moreover, the MOLTO-MONNET cooperation crystallized in the joint project proposal 611008-ADOPT for combining the approaches, coordinated by MONNET's coordinator Paul Buitelaar; it was submitted to the FP7-ICT-2013-SME-DCA call but not granted.

    MOLTO is a member of META-NET (http://www.meta-net.eu/) and more specifically of META-Share. The MOLTO language technologies, resources, and tools are being distributed to members of the computational linguistics community under LGPL, consistent with the collaboration agreement signed with META-Share. As part of the liaison activities within META-NET, MOLTO also gave feedback for the final version of the strategy document for the META-NET agenda for Multilingual Europe 2020 (http://www.meta-net.eu/sra-en).

    In January 2012, A. Ranta presented MOLTO at Xerox Research Centre Europe, Grenoble, in a seminar that has been video recorded and is published online at http://videos.xrce.xerox.com/index.php/videos/index/618.

    MOLTO hosted the FreeRBMT conference in June 2012, with a special workshop day devoted to exploring possible cooperation between Apertium (http://www.apertium.org/) and MOLTO; results are already tangible, especially with respect to the adoption of lexicons from Apertium.

    2.6 Web Channels

    MOLTO used the World Wide Web as its main channel for continuous dissemination and archiving. The project's web site, registered at http://www.molto-project.eu, has been designed mainly to support the internal management of the project; several sections are open only to registered members of the Consortium. Gradually, as work progressed and results became available, we added public sections. We have also leveraged the possibility to be present on popular social sites to push news to readers outside the Consortium, most prominently Twitter and LinkedIn. Recently we added a Consortium-only Google community page, which could in the future help maintain informal ties among the Consortium members and those interested in the future of the MOLTO technologies. It is not yet clear for how long the URL of the project's website will be maintained, but we plan to freeze the contents shortly after the end of the project and to produce an archival version. The most important documents will also be stored as a multimedia showcase, as required in Appendix X to Annex I.

    In addition to the project's site, MOLTO has published multimedia content on:

    Screencasts for some of the MOLTO tools have appeared on Screenr (http://www.screenr.com/user/MOLTOproject).

    Events of interest have been advertised via newsletters and mailing lists (MT, EAMT) and social sites, in particular:

    Partners have featured the MOLTO work on their websites (searching link:molto-project.eu yields about 45 hits).

    The project coordinator and the work-package leaders have been reachable for questions via a contact form accessible online. Recurring questions have been answered in the FAQ (http://www.molto-project.eu/view/faq), which is edited jointly by all registered users.

    The publication list appearing on the website is an extensive reference list of the results of the project. It also includes software and other media. The RSS feed for the publications appearing in MOLTO is http://www.molto-project.eu/biblio and currently lists 224 items, many of which are the slide presentations delivered during the project's events.

    3. Exploitation Plan

    Given the general direction of the field of language technologies, as outlined in the strategic research agenda for LT2020, the exploitation of MOLTO results will focus on the high-quality translation services in the cloud. These cloud services may serve the public sector as well as a more technical audience. The case studies have shown the versatility of the MOLTO technologies in terms of domain of application, scalability, and target audience.

    Exploitation of the project's results and acquired experience also goes towards furthering the field of language technologies. This has already been demonstrated by, e.g., the work on lexicon resources, also with respect to the usability of publicly available semantic web resources.

    3.1 Project's Outcomes

    Several of the project's deliverables are of interest for further upkeep and will be maintained and developed in the future. MOLTO tools and technologies that have been released include several multilingual translation web services, grammar writing IDEs, guidelines and tutorials, a translation platform integrating the editing tools, and sample multilingual software such as a dialog system, query interfaces, and a multilingual semantic wiki. In addition, the events organized during the project's lifetime aimed at capacity building, both in academia and in industry. Young academics worked with the commercial partners on very concrete problems and had to learn how to communicate with non-experts. Similarly, the industrial partners had to identify the tasks and issues that could best be solved by asking the academic partners.

    In what follows, the commercial partners outline the areas in which these newly created cooperation ties may in future be consolidated.

    3.2 Exploitation Strategy

    We identified three different strategies concerning exploitation:

    • Open Source Strategy: the project has adopted this strategy for the release of the final products. All software and tools are available under LGPL. Some of the technologies are under continuous development, as is usual in the open source community, and can be adopted and commercially further developed by branching the repositories.

    • Spin-off Strategy: this strategy is currently under discussion; interested parties are evaluating whether to create a spin-off consultancy firm to provide GF and grammar/multilingual expertise that companies might not have readily available.

    • Commercialization Strategy: this is the strategy pursued by the commercial partners, outlined in the following sections.

    3.3 Structure of the exploitation plan

    The project members discussed and agreed upon the use of the method from Stähler(1) to develop commercial exploitation plans. This results in the following contributions from each industrial partner:

    • A company profile, which is a short description of the nature of the company: what it does; how large it is; how it makes money; and how it typically transfers research into products.
    • A list of identified exploitation opportunities

    For each promising opportunity, the method from Stähler was followed to develop a plan to exploit the outcomes. This results in the following sections, corresponding to the method phases depicted in the figure below.

    Figure: Overview of the process to plan for exploitation

    • a description of each opportunity identified
    • a description of the markets associated with each opportunity
    • a strengths-weaknesses-opportunities-threats (SWOT) analysis for each identified opportunity and market

    (1) Stähler, Patrick: Geschäftsmodelle in der digitalen Ökonomie: Merkmale, Strategien und Auswirkungen. Josef Eul Verlag, Lohmar, 2001.

    3.4 Be Informed Exploitation Plans

    Company Profile

    Be Informed is an internationally operating, independent software vendor. The Be Informed business process platform transforms administrative processes. Thanks to Be Informed's unique semantic technology and solutions, business applications become completely model-driven, allowing organizations to instantly execute on new strategies and regulations. Organizations using Be Informed often report cost savings of tens of percent. Further benefits include a much higher straight-through processing rate, leading to vastly improved productivity, and a reduction in time-to-change from months to days.

    The role of Be Informed in MOLTO is to make sure that the solutions developed in the project can indeed be readily integrated into their solutions (the Be Informed Business Process Platform in particular). Be Informed will build on its strong expertise in its domain to guide the project and make sure that the results are exploitable from a commercial point of view in the mid-term. Dissemination to, and feedback from, its client base, as part of the use case development in WP12, will increase the degree of suitability for exploitation.

    Be Informed's exploitation strategy is tightly linked to its goal of quickly commercializing MOLTO results, and calls for a rapid and continuous flow of information to its sales force, existing client base and potential future customers. In addition, as an innovative company, Be Informed plans academic talks and publications.

    Products Relevant to Opportunity

    The outcome of MOLTO is relevant to Be Informed's Business Process Platform. For both client and server product components, the translation services based on the GF prototype can offer translation support at design time as well as at runtime. This would enable several usage scenarios dealing with verbalization of customers' business models and other artefacts.

    For more detailed information about this product and its solutions see www.beinformed.com.

    Research Transfer Process

    The main approach of Be Informed Research and Innovation is based on co-innovation with customers, partners, and other third parties. These activities usually result in a working prototype. Prototypes that seem promising enough to gain traction with customers are handed over to Be Informed Product and Solution development.

    The MOLTO deliverables will be promoted to our clients and partners in the public sector. The prototype of the MOLTO multilingual verbalization component for integration with the Be Informed Business Process Platform will be made available as an optional product component.

    Relevant Trends in business domain

    In this section we present a concise overview of relevant public sector trends and views within and across European Union Member States on future public services. This background information is not only necessary to understand the societal and political context in which multilingual public sector services take place, but also to detect synergies (and potential divergences) between visions about ontology driven services, language aspects and current developments within the public sector. The presented overview is inter alia based upon recent studies by the OECD (Towards Smarter and more transparent Government, e-government status spring 2010; OECD e-Government project; 25 March 2010; GOV/PGC/EGOV(2010)) and research results from the CROSSROAD Project (A Participative Roadmap for ICT Research in Electronic Governance and Policy Modelling; a support action under the European Commission 7th Framework Programme. http://crossroad.epu.ntua.gr/the-project/objectives/FP7-ICT-4-248458).

    Within the context of this project we are dealing with public sector services that provide information and advice and perform transactions between citizens or companies and administrations. By using ontologies which contain concepts, their relations and respective rules, public sector services become decision centric and goal driven. This enables the public sector to become more agile, customer centric, efficient, effective and accountable as well.

    In this section we will use the concepts of Government and Public Sector interchangeably. Political institutions and administrative structures of countries are diverse, but regardless of their shape, they are all part of the Public Sector ecosystem that provides public sector services to citizens and companies or institutions. Governments in Europe face an increasing number of challenges such as ageing populations, immigration, climate change and globalization, further reinforced by the financial crisis. The globalization trend has limited the freedom of governments to manage their national economies, and new challenges such as immigration and an ageing population seem to fundamentally affect the scope of public sector activities. At the same time, society's expectations of public service delivery have by no means diminished, as citizens from the 1980s onwards have become more concerned with choice and service quality. The paradox faced is one of open-ended demand versus a capped or falling resource share for actual delivery. Consequently, public administrations are under constant pressure to modernize their practices to meet new societal demands with reduced budgets.

    In the Visionary Scenarios Design of the CROSSROAD Project, the researchers present a summary of the main trends with respect to ICT for governance and policy making in the wider context of an evolving public sector. They define a set of core policy trends across the governance and policy modelling domain, which also resonate with the use case settings of the MOLTO project.

    1. Greater transparency and accountability of the public sector. A demand for a more transparent and accountable government can be discerned. Many EU Member States have put transparency and accountability policies in place.
    2. Improved accessibility of public services. An increased awareness and perception of the needs and wishes of citizens results in a drive towards more choice and accessibility of public services.
    3. Quality, efficiency and effectiveness of the public sector. Many policies are aimed at delivering cheaper solutions while ensuring quality. An increased attention is given to efficiency, as in many sectors government institutions face considerable budget cuts. This trend is particularly driven by dwindling public finances.
    4. New models of governance and the emergence and active participation of new stakeholders. A trend that can be discerned in most public sector domains is the emergence of new partnerships, the involvement of intermediaries and the acknowledgement of new stakeholder roles. Citizens, civil society and advocacy groups are increasingly empowered to organise themselves and play a role in public service delivery.
    5. Stronger evidence based policy. A resurgence of governance models that value principles such as accountability, monitoring and evaluation reaffirms the principles of evidence-based policy as a necessity for making informed decisions.
    6. Citizens' empowerment, expression of diversity, choice. The role of users is re-valued in a way that acknowledges their new-found skills and growing empowerment. The principles of facilitating increased participation, user-created content, user engagement, increased independence and ownership of public services apply to all public sector domains.
    7. Improved digital competencies, bridging the digital divide. As technologies increasingly play an important role in the provision of public services, questions arise in all sectors as to the ICT skills citizens require to access those services.
    8. Promotion of independent living and self-organisation. Policy makers acknowledge that ICTs can play an important role in the inclusion of all citizens and in achieving social equity and cohesion. In many countries ICT policies aim at enhancing the independence of citizens, for instance elderly or disadvantaged groups.

    As noted above, the public sector services we deal with provide information and advice and perform transactions between citizens or companies and administrations. This type of service is decision centric by nature. It deals with rights, permissions and obligations, for instance in the domain of permits and grants. The activities that have to be supported by the services are knowledge intensive. Another characteristic is that they are event driven. This makes them perfect candidates for semantically enabled services. Ontologies are situated at the core of this kind of service.

    We have to take into account that ontology support for public services is not only positioned at the end of the service chain, where government and citizen meet each other, but throughout the whole service chain. Treating a request for a permit and deciding upon this request is based upon the same rules as getting advice on whether one is entitled to acquire the permit. So, the concepts and rules that are used in ontologies apply to citizens' interactions as well as to administrative officials' interactions. The need for localization can however differ between these two target groups. In a traditional view, public sector services are positioned at the execution and enforcement layer of the public sector infrastructure. This layer deals with policy implementation. For reasons of scoping, we will focus at this stage of the project on this policy implementation layer.

    We foresee, however, a trend in which the use of ontologies will move further upstream towards the policy-making process, since this will leverage the best outcome.

    Be Informed: Multilingual verbalization of Business Models

    Value Proposition

    The main beneficiaries of the MOLTO outcomes are domain experts using Be Informed in an international context in the public sector. These public sector services provide information and advice and perform transactions between citizens or companies and administrations. This type of service relies by nature highly on interaction and communication on the one hand and the execution of regulations on the other hand. The quality of both aspects must be guaranteed. We briefly describe scenarios of public sector actors, such as domain experts, that are confronted with localization aspects for the services they are providing or intend to provide. These scenarios are:

    1. National Government with International clients
    2. National Government cooperating Internationally
    3. National Government dealing with International Law/Policies
    4. National Government in Multilingual Countries
    5. International Government (European Union)

    In all scenarios we can see that, although policy making and implementation seem to be mostly a local (national) issue, there are very often also international issues/aspects that have to be taken into account.

    National Government with International clients

    A very common pattern in the world is the provision of public sector services in the field of immigration. Immigration services have to be provided to immigrants who want to work and/or live in another country, and to companies or organizations who want to hire labour resources from another country. A specific kind of stakeholder is the group that wants to bring family members to the country they live in. The main process is the issuing of permanent or temporary/provisional permits for admission and residence. A crucial characteristic of this process is that the rules for admission and residence change frequently and sometimes at short notice. Since immigration offices are communicating with 'the whole world', one cannot expect them to translate their services into all languages. Normally they will use the language or languages of their own country and maybe one or a few other languages that can be understood by the majority of their customers. And, in specific cases, they will want to translate a part of their information into a specific target language. This can be the case, for instance, when due to a certain incident a new group of immigrants from an individual country 'threatens' to flood the country. So they need a process that supports the translation of services into the current languages on a regular and flexible basis, and an approach to deal with incidents that require instant translations into non-current languages. In all cases it is a challenge to translate the complicated immigration laws and procedures into comprehensible services for national and international users.

    National Government cooperating Internationally

    An example of a government agency that has to cooperate internationally is the Dutch Emission Authority (NEA). Emissions trading is a flexible policy instrument which governments use to improve the living environment. In the Netherlands there are two emissions trading systems, one for emissions of carbon dioxide (CO2) and one for emissions of nitrogen oxides (NOx). Emissions trading requires an infrastructure for issuing permits and for monitoring and allocating emission allowances. Emissions trading is inevitably an international business that requires cross-boundary cooperation, information and communication. The public services of the Dutch Emission Authority must therefore be available and accessible in more than one language. In this case NEA wants to make its services also available in English.

    Trading requires international agreement on standards and preferably also on service patterns. By using a single information concept, it becomes easier to exchange information and to innovate. In such a case, the ontology-supported infrastructure of a frontrunner in the specific domain, such as NEA, could be used as a basis for internationalisation and standardisation.

    National Government dealing with International Law/Policies

    The times of splendid isolation are over (if they ever existed); we are living in a dynamic international world and an increasingly global market. One of the government parties that is affected daily by this trend is Customs. They have to deal not only with local laws, but also with common market regulations, international trade regulations, etcetera. The regulations they have to comply with, and have to enforce, change frequently, based upon incidents, new insights and political developments. And within a set of regulations, the priorities for enforcement can change too.

    Customs have to deal with international treaties about the traffic of goods between countries and the limitation thereof. For example, to import certain goods from China, one has to apply for an export license in China, which is transformed into an import permit in the country of destination. This leads to multilingual public services that are delivered in different countries of the world. Depending on the types of goods, there might be an additional import tax to protect a country's internal market from being 'flooded' with low-price goods from low-cost countries.

    In order to be able to levy additional tax on certain goods one must be able to classify these goods. The EU defined the Combined Nomenclature, which is in fact a taxonomy of goods and their codes that can be used to classify goods that enter a country. This taxonomy is available in all official languages of the European Union. The taxonomy is based on the Harmonized Commodity Description and Coding System, which is run by the World Customs Organization. The harmonized system is used by 137 countries and the European Union.

    National Government in Multilingual Countries

    Many countries are bilingual or multilingual. This means that all official publications and services have to be provided in more than one language. Often the pilot language, the language in which a document is written first, depends on the preferred language of the author. By using an ontology, the meaning of the document in the pilot language can be expressed abstractly and unambiguously in concepts and rules. These can then be translated into a particular language to express the meaning using the vocabulary and syntax of that language.
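    As a minimal illustrative sketch of this idea (the grammar and all names below are invented for this report and are not taken from an actual MOLTO grammar; in practice each GF module lives in its own file), a GF abstract syntax can capture such a rule once, with one concrete syntax per official language:

    -- Abstract syntax: the language-neutral meaning of the rule
    abstract Rule = {
      flags startcat = Statement ;
      cat Statement ; Applicant ;
      fun
        NeedsPermit : Applicant -> Statement ;
        Resident, Visitor : Applicant ;
    }

    -- English concrete syntax
    concrete RuleEng of Rule = {
      lincat Statement, Applicant = Str ;
      lin
        NeedsPermit a = a ++ "needs a residence permit" ;
        Resident = "a resident" ;
        Visitor = "a visitor" ;
    }

    -- French concrete syntax: the same tree, rendered with French vocabulary and syntax
    concrete RuleFre of Rule = {
      lincat Statement, Applicant = Str ;
      lin
        NeedsPermit a = a ++ "doit avoir un permis de séjour" ;
        Resident = "un résident" ;
        Visitor = "un visiteur" ;
    }

    Linearizing the single abstract tree NeedsPermit Resident then yields "a resident needs a residence permit" with RuleEng and "un résident doit avoir un permis de séjour" with RuleFre; no single pilot language privileges one reading of the rule.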

    Value Creation

    Be Informed captures policy in ontologies. These ontologies are used throughout the policy lifecycle: from choosing and deciding on policy, through communicating the agreed-upon policy to all stakeholders, to running the supporting applications. As a consequence, verbalizations of these ontologies could be used in a number of scenarios throughout that policy lifecycle.

    Review, Validation and Feedback of Models

    For the ontologies to be used as the basis of actual applications, it is crucial that they contain a correct representation of the requirements and constraints. Review and validation before deployment, and the ability to provide feedback on the model after deployment, are therefore very important. A natural language representation of the models can help stakeholders to carry out these tasks. Special verbalization choices might have to be made to create texts that are effective in this specific scenario.

    Text based Editing of Models

    The most effective form of business user involvement is of course allowing users to create models themselves or, often more realistically, to maintain and alter existing models. In [EKAW2010] we explored editors that use a textual metaphor to present models to the users, but that do not use typing text as the editing metaphor.

    Self Documenting Models

    Typically, systems need to be well documented for IT organizations to be able to support production use and perform maintenance. Online, navigational access to the models is often not acceptable for this purpose, and conventional documentation sets need to be generated.

    Textual UIs for Model Driven Applications

    Classically, business applications have used tables of data to present the detailed information that is available in a business process. When customers are involved in business processes, however, they often find such data hard to interpret. Verbalization into natural language can be a great way to present, for instance, process progress data to laymen, as the data can be presented in a self-explanatory way.

    Communicating Model Based Decisions

    The ontologies capturing legislation and policy are used to drive decision services, applying the policy to actual cases. The decisions taken are communicated to the stakeholders and need to be documented and explained. Verbalization of the model could be extended to verbalization of the decisions based on the models.

    Revenue Model

    The proposed exploitation path would increase revenues of existing products like the Be Informed Business Process Platform. Be Informed will offer the MOLTO verbalization engine as an optional product component. It is difficult to predict the size of the increase at this stage of development.

    Market Overview

    Ontology translation systems are usually created using general-purpose programming languages, such as LISP or Java, and the mappings between expressions in the source and target languages are neither well-documented nor explained. Integrated tooling as part of Be Informed’s Business Process Platform is at this stage unique.

    3.5 Ontotext Exploitation Plans

    Introduction

    Ontotext’s business model combines the development of products (including some open source versions) with the provision of research, consultancy and development services. Many commercial projects combine all four elements. For Ontotext, MOLTO will bring a unique opportunity to strengthen its position in the semantic technologies and knowledge-driven text analytics market, through the development and adoption of intelligent methods that support ontology-based multilinguality. This will be possible because MOLTO adds to the semantic technologies the GF formalism, which operates as an interlingua on the language level and thus localizes the ontologies in appropriate ways. More precisely, the main directions of future development will be as follows:

    • Interoperability and grounding in Linked Open Data resources and domain ontologies
    • High throughput, multilingual text processing
    • Robust cross-lingual translation of various domain data within search and retrieval services

    The business strategies will be as follows:

    • The company will put its technology (in this case, the KIM Semantic Annotation Platform, http://www.ontotext.com/kim, and publishing tools) into a stronger multilingual context.
    • Ontotext will strengthen the synergies between its semantic and world-knowledge infrastructure modules and the MT services.
    • Combining the GF model and the ontology standards would test and enrich the reasoning platform developed and maintained at Ontotext.

    Company Profile

    Ontotext AD is the strongest semantic technologies company in Europe and a world-leading supplier of core semantic technology, text mining and web mining solutions.

    • Established in 2000, today Ontotext has over 65 employees located in Bulgaria (Sofia and Varna), the USA (Fairfield, CT) and the UK (London);
    • After acquiring VC funding in 2008, Ontotext reached break-even at the end of 2010 and has since doubled its commercial revenues annually;
    • We are a global leader in semantic database engines, successfully competing with Oracle, IBM, and Microsoft in this field;
    • Our unique competences are backed by heavy investment in R&D – over the last 12 years we have invested more than 300 person-years in semantic technology. We know what works and what does not!

    We have an unmatched portfolio of world-class technology and expertise in:

    • Semantic Databases: high-performance RDF DBMS, scalable reasoning;
    • Semantic Search: text mining (IE), Information Retrieval (IR);
    • Web Mining: focused crawling, screen scraping, data fusion;
    • Linked Data Management and Data Integration.

    The main differentiator between Ontotext and other semantic technology vendors is that we deliver robust technology, proven in multiple high-profile projects that demonstrate its maturity and usability. The best example in this direction is the use of OWLIM (our RDF database engine) in the BBC FIFA World Cup 2010 website, where most of the pages were generated dynamically through queries to OWLIM – millions of requests per day and hundreds of updates per hour, handled by a cluster of a few servers. Following the success of this project, the BBC extended the use of Ontotext technology to the BBC Sport website and the London Olympics 2012 website.

    Ontotext’s clients span across several sectors:

    • Pharma: AstraZeneca, UK, and UCB, Belgium;
    • Media and publishing: BBC and Press Association, UK; and Publicis, Germany;
    • Telecommunications: Telecom Italia and Korea Telecom;
    • Archives and cultural heritage: The National Archives, UK; the British Museum; the Dutch Public Library;
    • Government: Department of Defense, USA; House of Commons and TNA, UK; Natural Resources Canada.

    Considering the substantial number of Ontotext clients in the UK, we run regular open training courses, “Semantic Technologies with OWLIM”, in London, usually scheduled roughly once per quarter.

    Products Relevant to Opportunity

    • OWLIM is an industrial-scale semantic database, using Semantic Web standards for inference and integration/consolidation of heterogeneous data.
    • KIM Platform is a semantic search engine, using text analysis to provide hybrid queries involving structured data and inference.
    • FactForge is a public service that presents a reason-able view of the web of data.
    • Linked Life Data is a platform for semantic data integration through RDF warehousing and efficient reasoning that helps to resolve conflicts in the data.
    • Publishing platform – a semantic publishing platform, ingesting and enriching thousands of news items with linked data daily; it enables publishers and third parties to explore innovative business models and alternative revenue streams.

    Research Transfer Process

    On the one hand, research goes into products in the traditional way:

    • creating prototypes in use case domains, and then
    • scaling these prototypes into systems for real usage.

    On the other hand, the technology developed within the project is applicable to other related areas of NLP services. It can either be used in stand-alone applications, or be integrated into larger and more complex architectures. Both business opportunities have significant added value.

    The first direction is exemplified by the envisaged use case domains: patents in the medical domain and artefacts in the cultural heritage domain. The second direction targets areas that rely heavily on Question Answering, Information Retrieval and MT. Such areas are: publishing, social media and pharma. The related products are highly commercial, and thus precision and relevance of the retrieved information are crucial features for the clients. The GF formalism would be useful for smoothing multilingual retrieval and translation results. It must be noted that the component shared by all targeted Ontotext products is the ontology-based knowledge that relates to LOD and multilingual settings.

    All the EU research projects that Ontotext has been involved in have led to the improvement of the current technology as well as to the creation of new products that have been exploited in commercial projects. In this way, we might view the research as an investigation, preparation and compilation phase, and the applications in industry as an adaptation, harmonization and real-setting evaluation phase. Some synergies of this kind are given below:

    1. RENDER is an ongoing project that aims at providing a comprehensive conceptual framework and technological infrastructure for enabling, supporting, managing and exploiting information diversity in Web-based environments. It also leverages very large amounts of content and metadata: news, blog and microblog streams, content and logs from Wikipedia, news archives, multimedia content and reader comments, discussion forums, etc. This data is managed by a highly scalable data management infrastructure, and enriched with machine-understandable descriptions and links referring to the Linked Open Data Cloud. This development would enable the OWLIM and KIM technologies to handle diverse data, which would widen their data coverage and management.

    2. CUBIST is an ongoing project that aims at Combining and Uniting Business Intelligence and Semantic Technologies, with a special focus on unstructured data mining. Being central to the project goals, the semantic technology supports a persistent layer – a semantic Data Warehouse. The project contributes to better Visual Analytics, whose improved characteristics would be important for providing more competitive user interfaces in industry.

    3. MediaCampaign is innovative in ontology creation for cross-media modelling of media presence and campaigns; semantic cross-market product data interlinking; and identification and tracking of new media campaigns in different media and countries. MediaCampaign focused mainly on advertisement campaigns and their impact on attitudes and opinions. Thus, the publishing services provided by Ontotext will be enriched with sentiment analysis in addition to the knowledge-based analysis, giving Ontotext a social-aware service.

    4. The NoTube project concentrated on personalized semantic news, a personalized TV guide with adaptive advertising, and Internet TV in the Social Web. It relied on the key role of semantic technologies, took community aspects into account, and was built on multilinguality. The results strengthened the personalization component of the retrieved information in commercial publishing platforms.

    5. The PHEME project (starting in October 2013) has as its main goal the development of scalable methods for Social Semantic Intelligence, across media and languages. It also aims at modeling not only facts and opinions, but also the reliability of the information sources. Additionally, PHEME focuses on cases that have become concrete and socially crucial in recent years, such as crowdsourcing, citizen journalism and bioinformatics. PHEME goes beyond official media campaigns to social network dimensions, and beyond opinions to rumour and misinformation detection. This project will lead to large-scale, social-media-ready OWLIM and KIM platforms. It will also add the identification of misinformation to their services, which would be a very valuable facility in the personalized component for end users.

    In all the productizing areas listed below, the following underlying NLP technology is assumed:

    • Multilingual semantics based question answering
    • Cross-language retrieval
    • Public Translation Service (combining GF + ML)

    Relevant Trends in the Business Domain

    With the globalization processes and the harmonization of large groups of documents in the EU, the requirements for particular data management systems are growing rapidly. Additionally, virtual space has become more populated, shared, explored and multilingual: for example, virtual tours in famous museums; virtual storage of and access to EU legislation; interactive online digests; electronic government; digital preservation storage; social networks, etc.

    For these reasons, it is not surprising that some of the most active business domains at the moment are the Cultural Heritage stakeholders (DARIAH, CLARIN, EUROPEANA); Pharma (AstraZeneca); Media Publishing (BBC, NDP, Press Association); and Social Media (the PHEME project).

    Ontotext is involved in all of the aforementioned domains through research projects and commercial projects.

    Ontotext: Productizing MOLTO Technology in Novel Areas

    Publishing platforms and Digital Journalism

    Cross-media analytics is a typical case of the business intelligence developed at Ontotext. Ontotext’s technology covers primarily (but not only) publishing agencies (such as Press Association, NDP, Oxford, etc.) and government data management (US government). From a language point of view, the company has been working systematically on commercial projects for Dutch and English. Lately, however, it has started to expand the multilingual set to Bulgarian, German, Chinese, etc. With these facts in mind, the GF formalism as well as the RDF-GF interoperability from MOLTO would be a natural extension of the information extractors, facilitating the interaction between users’ queries and their machine processing. More precisely, the following extensions are envisaged: an embedded translator service, tuned to the domain (sports, finance, politics, etc.); and an embedded converter from RDF representation to GF and then to language, and vice versa.
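    To make the envisaged RDF-to-GF converter concrete, the following is a minimal sketch (the vocabulary, names and mapping are invented for illustration and are not MOLTO's actual interoperability layer). A triple such as ex:Ajax ex:beat ex:Feyenoord can be mapped to a GF abstract tree and then linearized into language:

    -- Hypothetical abstract syntax mirroring a tiny RDF vocabulary
    abstract News = {
      flags startcat = Fact ;
      cat Fact ; Team ;
      fun
        Beat : Team -> Team -> Fact ;   -- corresponds to the property ex:beat
        Ajax, Feyenoord : Team ;        -- correspond to the RDF resources
    }

    -- English rendering; Dutch, Bulgarian, etc. would be further concrete syntaxes
    concrete NewsEng of News = {
      lincat Fact, Team = Str ;
      lin
        Beat x y = x ++ "beat" ++ y ;
        Ajax = "Ajax" ;
        Feyenoord = "Feyenoord" ;
    }

    The triple maps to the tree Beat Ajax Feyenoord, which NewsEng linearizes as "Ajax beat Feyenoord"; parsing that sentence with the same grammar recovers the tree, and hence the triple, which gives the "vice versa" direction.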

    Also, internally, the semantic annotation tool will be augmented with language localization modules that would support the annotators.

    Related markets: publishing; electronic government

    SWOT Analysis:

    • Strengths: Improvement of the existing multilingual modules; creation of new functionality for customers, such as viewing the same result in various languages; improvement of the annotation process and text analytics; better communication between the ontology and user queries.

    • Weaknesses: Domain adaptation of the MOLTO modules might be needed when addressing a new domain, or even a subdomain of a specific domain.

    • Opportunities: There is the possibility of creating a new-generation publishing platform that provides a typological core for many languages and is thus easily adaptable to new languages. Additionally, being able to view Dutch news in English within the publishing system itself, for example, would greatly benefit customers.

    • Threats: The online real-time applications might initially be unstable due to the complex architecture.

    Social Media

    In Ontotext projects, the existing LOD resources (such as Linked Life Data) are applied to different socially aware domains and across languages. For example, the entity extraction tool LUPedia as well as the linked data concept store FactForge will be used to enhance the socially marked knowledge. These modules will be extended with the language generation tool from MOLTO in order to improve the accuracy of the extracted information. This step is manageable, since the MOLTO rule-based translation technology is extended with the help of statistical approaches.

    Related markets: education, tourism

    SWOT Analysis:

    • Strengths: MOLTO gives the possibility of applying a structured approach to unstructured data, for the purpose of a good understanding of large amounts of data.

    • Weaknesses: MOLTO might support some forms of social media (publicly available) better than others (restricted).

    • Opportunities: Social media might be viewed as a network of subdomains and addressed by MOLTO technology step by step.

    • Threats: No visible possibility is foreseen at the moment for using MOLTO modules directly in sentiment and opinion analysis.

    Pharma

    Ontotext regularly participates in projects that concern health care and life sciences data management (a Life Sciences project is running now). Here the available domain ontologies are explored together with NLP processing. GF will be extremely useful, since both the prescriptive and the diagnosis languages are controlled. There is an additional level of translation here, namely from the specialized prescription and anamnesis language of doctors into the common natural language of the users.

    Related markets: Medical product sales, health care

    SWOT Analysis:

    • Strengths: MOLTO performs best in controlled and structured domains, and Pharma is a good example of such a combination. In addition, there is already a working prototype on patents in this domain.

    • Weaknesses: Pharma would be more manageable from the doctors’ production point of view than from the patient perspective, since professional language is better controlled.

    • Opportunities: Improvement of multilingual search and relevance of the search results.

    • Threats: Pharma is one of the most thoroughly elaborated domains from a processing point of view. Thus, the real added value of MOLTO remains to be tested.

    Ontotext: Productizing in the Use Case Domains: Patents and Cultural Heritage

    Patents

    Some of the commercial projects carried out within Ontotext are connected to the pharmaceutical industry. In this respect, the prototype developed in the bio-medical and pharmaceutical domains will be employed directly in the workflow processes.

    Related markets: Administration, government, science, businesses

    SWOT Analysis:

    • Strengths: The usage of patents is a common and necessary activity in industry. Thus, the structure model created for handling patents in one specific domain would be applicable to patents in other domains, too. Retrieval services in a strongly cross-lingual context would also be very attractive features for exploitation.

    • Weaknesses: If the MOLTO modules are used in another domain of patents (for example, science), some adaptation will be needed, although the patent structure itself would be stable across specific domains.

    • Opportunities: The usage of the patent service would facilitate and speed up the process of managing Pharma policies with respect to new developments in healthcare.

    • Threats: The patent service might not cover all the query requirements of the users due to the limitations of the controlled language or the incompleteness of the corpus.

    Cultural Heritage

    There are many stakeholders in this area, since the related initiatives have grown considerably lately. Here we have in mind specifically: Europeana, the British Museum, ConservationSpace and CLARIN.

    Europeana already provides search facilities. However, they cover only metadata and are not connected to ontologies. Also, the translation from one language to another is done via machine translation (MT) only, without any grammatical formalism behind it.

    The British Museum is a partner which would try the MOLTO services, profiled specially for museum objects. It already uses Ontotext’s semantic repository OWLIM, which supports semantic search, semantic RDF data sources and Web publication. MOLTO will add better search functionality as well as multilingual information extraction. Similar projects are: the Gothenburg City Museum (Sweden); the Polish Digital National Museum; and the Yale Center for British Art (USA), with its Linked Open Data publishing of the museum collection.

    The ConservationSpace project is managed by the National Gallery of Art (USA) and 7 other institutional partners from the USA, UK and Denmark. It handles the data management. MOLTO services might also contribute to the better preservation of the documents by adopting the GF formalism as a mediator between the users' queries and SPARQL queries. Similar projects are: FP7 CHARISMA: Synergy for a Multidisciplinary Approach to Conservation/Restoration; and FP7 3D-COFORM: 3D documentation and collection formation of tangible cultural heritage.

    CLARIN is a pan-European initiative, which aims at elaborating, among other services, a globally shared service for the exploration of cultural artefacts. Ontotext is a participant in this initiative and might provide the same facility to the consortium as in the above opportunity. A similar project is FP7 V-MUST: Virtual Museum Transnational Network, a Network of Excellence.
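    As a minimal sketch of this mediator idea (all names are invented; this is not the actual MOLTO grammar for the cultural heritage case), a single GF abstract syntax can be given two concrete syntaxes, one for the visitor's language and one for SPARQL, so that parsing with the first and linearizing with the second turns a natural language question into a repository query:

    -- Hypothetical abstract syntax for museum queries
    abstract Museum = {
      flags startcat = Query ;
      cat Query ; Kind ;
      fun
        AllOf : Kind -> Query ;
        Painting, Sculpture : Kind ;
    }

    -- English: what the visitor types
    concrete MuseumEng of Museum = {
      lincat Query, Kind = Str ;
      lin
        AllOf k = "show all" ++ k ;
        Painting = "paintings" ;
        Sculpture = "sculptures" ;
    }

    -- SPARQL: what the repository receives, linearized from the same tree
    concrete MuseumSPARQL of Museum = {
      lincat Query, Kind = Str ;
      lin
        AllOf k = "SELECT ?x WHERE { ?x rdf:type" ++ k ++ "}" ;
        Painting = "ex:Painting" ;
        Sculpture = "ex:Sculpture" ;
    }

    Parsing "show all paintings" with MuseumEng yields the tree AllOf Painting, which MuseumSPARQL linearizes as SELECT ?x WHERE { ?x rdf:type ex:Painting }.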

    Related markets: tourism, education

    SWOT Analysis:

    • Strengths: coverage of many languages with language-specific mediated filtering (through the GF formalism); high precision of the retrieved content due to the controlled language; easy adaptability to other areas of cultural artefacts and to other languages.

    • Weaknesses: Since one of the use cases in MOLTO considers museums, the application to other subdomains of Cultural heritage might need adaptation of the grammars and ontologies.

    • Opportunities: The service can be adopted by various virtual cultural databases and adapted to them.

    • Threats: Resources might not be available for certain languages or language variants; language generation might not be efficient enough for all languages.

    4. Conclusions

    The project's dissemination activity has focused on three major stakeholder groups: researchers in NLP, the public sector, and semantic web technologists. These have been reached by organizing events at international meetings, on online social platforms, and face to face. A major outcome of the project is the ongoing discussion between the academic developers of the MOLTO technologies and the commercial partners, based on the work carried out. This discussion concerns the future mechanisms that should be created so that the MOLTO results can be successfully adopted for exploitation. It has become clear during the case studies and the evaluation of the project that the fast-developing technologies used in MOLTO need to mature before they can be used commercially. Moreover, it would be desirable to be able to offer professional support, consultancy services and training in order to promote the uptake of the project's translation services.

    D12.2 User studies for BI's explanation engine


    Contract No.: FP7-ICT-247914
    Project full title: MOLTO - EEU Multilingual Online Translation
    Deliverable: D12.2 User studies for BI's explanation engine
    Security (distribution level): Public
    Contractual date of delivery: 31 May 2013 (M39)
    Actual date of delivery: 31 May 2013 (M39)
    Type: Report
    Status & version: Draft 0.9
    Author(s): Joris van Aart, Jouri Fledderman, Jeroen van Grondelle
    Task responsible: Be Informed
    Other contributors: Jeroen Daanen, Menno Gulpers, Emiel van Haandel, Herko ter Horst, Frank Smit, Xander Uiterlinden


    Abstract

    Attachment: D12.2 User studies for BI explanation engine.pdf (755.43 KB)