7. Back end

WP2: Grammar developer's tools

WP2 requirements

The deliverables promised from WP2:

 


ID      Title                                     Due date            Dissemination level   Nature                Publication
D2.1    GF Grammar Compiler API                   1 March, 2011       Public                Prototype
D2.2    Grammar IDE                               1 September, 2011   Public                Prototype
D2.3    Grammar tool manual and best practices    1 March, 2012       Public                Regular Publication

 


 

[this comes from the MOLTO website:]

Objectives

The objective is to develop a tool for building domain-specific grammar-based multilingual translators. This tool will be accessible to users who have expertise in the domain of translation but only limited knowledge of the GF formalism or linguistics. The tool will integrate ontologies with GF grammars to help in building an abstract syntax. For the concrete syntax, the tool will enable simultaneous work on an unlimited number of languages and the addition of new languages to a system. It will also provide linguistic resources for at least 15 languages, among which at least 12 are official languages of the EU.

Description of work

The top-level user tool is an IDE (Integrated Development Environment) for the GF grammar compiler. This IDE provides a test bench and a project management system. It is built on top of three more general techniques: the GF Grammar Compiler API (Application Programmer’s Interface), the GF-Ontology mapping (from WP4), and the GF Resource Grammar Library. The API is a set of functions used for compiling grammars from scratch and also for extending grammars on the fly. The Library is a set of wide-coverage grammars, which is maintained by an open-source project outside MOLTO but will, through MOLTO efforts, be made accessible to programmers with lower levels of linguistic expertise. Thus we rely on the available GF resource grammar library and its documentation, available through digitalgrammars.com/gf/lib. The API is also used in WP3, as a tool for limited grammar extension, mostly with lexical information but also for example-based grammar writing. UGOT designs the APIs and the IDE, coordinates work on grammars of individual languages, and compiles the documentation. UHEL contributes to terminology management and work on individual languages. UPC contributes to work on individual languages. Ontotext works on the Ontology-Grammar interface and contributes to the ontology-related part of the IDE.

Here we try to make a bit clearer what the functionalities of the WP2 tools are, and how they relate to the translator's tool.

We surmise that the grammar compiler's IDE is meant primarily for grammarian/engineer roles, i.e. for extending the system to new domains and languages.  But it may contain facilities or components which are also relevant for the translation tool. In many scenarios, we must allow the translator to extend the system, i.e. switch to some of the last four roles. Just how the translation tool is linked to the grammar IDE needs specifying.

What the average user can do to fix the translation depends on how user-friendly we can make the tools. Minimally, a translator only supplies a missing translation on the fly, and all necessary adaptation is handled by the system. Maximally, an ontology or grammar needs to be extended by hand as a separate chore, using the grammar IDE.

An author/editor/translator can be expected to translate with the given lingware. The next level of involvement is extending the translation. This may cause entries or rules to be added to a text-, company-, or domain-specific ontology/lexicon/grammar. If the tool is used in an organization, roles may be distributed to different people, and questions of division of labor and quality control (as addressed in TF) already arise.

The issue is not only, or even primarily, being able to change the grammar technically, but managing the changes. A change in the source may cause improvement in some languages and deterioration in others. The author cannot possibly check the repercussions in all languages. Assume each user site makes its own local changes: how many different versions of MOLTO lingware will there be? One for each website maintained with MOLTO? How can sites share problems and solutions? A picture of a MOLTO community not unlike the one envisaged for multilingual ontology management in TF starts to form. The challenge is analogous to ontology evolution. There are hundreds of small university ontologies in Swoogle. Quality can be created in the crowd, but there must be an organisation for it (cf. Wikipedia).

The MOLTO workflow and division of roles must be spelled out in the grammar tool manual (D2.3) and the MOLTO translation tools / workflow manual (D3.3). We should start writing these manuals now, to fix and share our ideas about the user interfaces.

 

The way disambiguation now works is that translation of a vague source against a finer-grained target generates the alternative translations, with disambiguating metatext to help choose the intended meaning (try "I love you" in http://www.grammaticalframework.org/demos/phrasebook/; compare Boitet et al.'s 1993 dialogue-based MT system Lidia, e.g. http://www.springerlink.com/content/kn8029t181090028/).

 

This facility could link to the ontology as a source of disambiguating metatext, taken either from meta comments or verbalised directly from the ontology.

 

Some of the GF 3.2 features, like parse ranking and example-based grammar generation, have consequences for front-end design as enabling technology.


WP2 evaluation

 

[11   Productivity and usability]

 

Our case studies should show that it is possible to build a completely functional high-quality translation system for a new application in a matter of months—for small domains in just days.

 

The effort to create a system dynamically applicable to an unlimited number of documents will be essentially the same as the effort it currently takes to manually translate a set of static documents.

 

The expertise needed for producing a translation system will be low, essentially amounting to the skills of an average programmer who has practical knowledge of the targeted language and of the idiomatic vocabulary and syntax of the domain of translation.

 

1. Localization of systems: the MOLTO tool for adding a language to a system can be learned in less than one day, and the speed of its use is of the same order of magnitude as translating an example text in which all the domain concepts occur.

 

The role requirements for extending the system remain quite high, not because the individual skills are demanding in themselves, but because their combination is rarely found in one person.

 

The user requirements entail an important evaluation criterion: the guidance provided by MOLTO. They should also lead to system requirements, such as online help, examples, and profiling capabilities.

 

One part of MOLTO adaptivity is meant to come from the grammar IDE. Another part should come from ontologies. While the former helps extend GF “internally”, the latter should allow bringing in semantics and vocabulary from OWL ontologies. We discuss these two parts in this order.

 

[8    Grammar engineering for new languages in DoW]

 

In the MOLTO project, grammar engineering in GF will be further improved in two ways:

•          An IDE (Integrated Development Environment), helping programmers to use the RGL and manage large projects.

•          Example-Based Grammar Writing, making it possible to bootstrap a grammar from a set of example translations.

The former tool is a standard component of any library-based software engineering methodology. The latter technique uses the large-coverage RGL for parsing translation examples, which leads to translation rule suggestions.
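To make the example-based technique concrete, the following is a minimal sketch in Python, assuming the pgf bindings of the GF runtime and a compiled wide-coverage grammar; the file name, module name and example sentence are purely illustrative, not MOLTO deliverables.

  # A minimal sketch of example-based grammar writing, assuming the pgf Python
  # bindings and a compiled RGL-based grammar; all names are illustrative.
  import pgf

  rgl = pgf.readPGF("LangEng.pgf")        # hypothetical compiled resource grammar
  eng = rgl.languages["LangEng"]          # English concrete syntax

  # Parse an example translation with the wide-coverage grammar; the best
  # abstract syntax tree is a candidate body for a domain linearization rule.
  prob, tree = next(eng.parse("this number is divisible by three"))
  print("rule suggestion:", tree)

The suggestion still has to be reviewed, which is why evaluation by native speakers remains part of the process described below.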

 

The task of building a new language resource from scratch is currently described in http://grammaticalframework.org/doc/gf-lrec-2010.pdf. As this is largely a one-shot language engineering task outside of MOLTO (MOLTO was supposed to have its basic lingware done ahead of time), it should not call for evaluation here.

 

Building a multilingual application  for a given abstract domain grammar by way of applying and extending concrete resource grammars can use a lighter process. The proposed example-based grammar writing process is described in the Phrasebook deliverable (http://www.molto-project.eu/node/1040). The tentative conclusions were:

 

• The grammarian need not be a native speaker of the language. For many languages, the grammarian need not even know the language; native informants are enough. However, evaluation by native speakers is necessary.

• Correct and idiomatic translations are possible.

• A typical development time was 2-3 person working days per language.

• Google Translate helps in bootstrapping grammars, but its output must be checked. In particular, we found it unreliable for morphologically rich languages.

• Resource grammars should give some more support, e.g. higher-level access to constructions like negative expressions, and large-scale morphological lexica.

Effort and Cost


 

Based on this case study, we roughly estimated the effort used in constructing the necessary sources for each new language and compiled the following summarizing chart.

Language    Language skills   GF skills   Informed development   Informed testing   Impact of external tools   RGL Changes   Overall effort
Bulgarian   ###               ###         -                      -                  ?                          #             ##
Catalan     ###               ###         -                      -                  ?                          #             #
Danish      -                 ###         +                      +                  ##                         #             ##
Dutch       -                 ###         +                      +                  ##                         #             ##
English     ##                ###         -                      +                  -                          -             #
Finnish     ###               ###         -                      -                  ?                          #             ##
French      ##                ###         -                      +                  ?                          #             #
German      #                 ###         +                      +                  ##                         ##            ###
Italian     ###               #           -                      -                  ?                          ##            ##
Norwegian   #                 ###         +                      -                  ##                         #             ##
Polish      ###               ###         +                      +                  #                          #             ##
Romanian    ###               ###         -                      -                  #                          ###           ###
Spanish     ##                #           -                      -                  ?                          -             ##
Swedish     ##                ###         -                      +                  ?                          -             ##

 

The Phrasebook deliverable is one simple example of what can be done to evaluate the grammar work package's promises. The results from the Phrasebook experiment may be positively biased because the test subjects were very well qualified. But this and similar tests can be repeated with more “ordinary people”, and changes in the figures can be followed as the grammar IDE develops.

 

It could be instructive to repeat exactly the same test with different subjects and compare the solutions, to see how much creativity was involved. The less variation there is, the better the chances of automating the process. Even failing that, analysis of the variant solutions could help suggest guidelines and best practices for the manual. Possible variation here also raises the issue of managing changes in a community of users.

WP4: Knowledge engineering

Ontotext's contributions to MOLTO through WP4 are:

 

• Semantic infrastructure

• Ontology-grammar interoperability

 

WP4 requirements
Semantic infrastructure

 

The semantic infrastructure in MOLTO will also act as a central multi-paradigm index for (i) conceptual models—upper-level and domain ontologies; (ii) knowledge bases; (iii) content and metadata as needed by the use cases (mathematical problems, patents, museum artefact descriptions); and provide NL-based and semantic (structured) retrieval on top of all modalities of the data modelled.

 

In addition to the traditional triple model for describing individual facts,

 

 <subject, predicate, object> 

 

the semantic infrastructure will build on quintuple-based facts,

 

 <subject, predicate, object, named graph, triple set>
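As a plain illustration of this data model (the actual ORDI/OWLIM representation will differ), a quintuple can be pictured as a record with two extra fields for provenance and grouping; all identifiers below are invented.

  # A plain-Python illustration of the quintuple fact model; identifiers are
  # invented and the real ORDI/OWLIM representation will differ.
  from collections import namedtuple

  Quintuple = namedtuple(
      "Quintuple", ["subject", "predicate", "object", "named_graph", "triple_set"])

  fact = Quintuple(
      subject="ex:patent123",
      predicate="ex:grantedIn",
      object='"2009"',
      named_graph="ex:patentOfficeGraph",   # provenance: the graph asserting the fact
      triple_set="ex:patentsUseCase",       # grouping used for metadata about metadata
  )
  print(fact)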

 

The infrastructure will include an inference engine (TRREE), a semantic database (OWLIM), a semantic data integration framework (ORDI) and a multi-paradigm semantic retrieval engine, all of which are previous work resulting from private (Ontotext) and public funding (TAO, TripCom). This approach will enable MOLTO’s baseline and use-case-driven knowledge modelling with the necessary expressivity of metadata-about-metadata descriptions for the provenance of the diverse sources of structured knowledge (upper-level, domain-specific and derived (from grammars) ontologies; thesauri; domain knowledge bases; content and its metadata).

 

From Ontotext webpages, we can guess that the infrastructure builds on the following technologies:

 

• KIM is a platform for semantic annotation, search, and analysis.

• OWLIM is the most scalable RDF database with OWL inference.

• PROTON is a top ontology developed by Ontotext.

 

Milestone MS2 says the knowledge representation infrastructure is opened for retrieval access to partners at M6. The infrastructure deliverable D4.1 is due at M8.

 

Grammar-ontology interoperability

 

[7    Grammar-ontology interoperability for translation and retrieval in DoW]

 

At the time of the TALK project, an emerging topic was the derivation of dialogue system grammars from OWL ontologies. A prototype tool for extracting GF abstract syntax modules from OWL ontologies was built by Peter Ljunglöf at UGOT. This tool was implemented as a plug-in to the Protégé system for building OWL ontologies and was intended to help programmers with an OWL background to build GF grammars. Even though this tool remained a prototype within the TALK project, it can be seen as a proof of concept for the more mature tools to be built in the MOLTO project.

 

 

A direct way to map between ontologies and GF abstract grammars is a mapping between OWL and GF syntaxes.

 

In slightly simplified terms, the OWL-to-GF mapping translates OWL’s classes to GF’s categories and OWL’s properties to GF’s functions that return propositions. As a running example in this and the next section, we will use the class of integers and the two-place property of being divisible (“x is divisible by y”). The correspondences are as follows:

 

  Class(pp:integer ...)   <==>   cat integer ;

  ObjectProperty(pp:div   <==>   fun div :
    domain(pp:integer)             integer -> integer -> prop ;
    range(pp:integer))

 

Less syntax-directed mappings may be more useful, depending on what information is relevant to pass between the two formalisms. The mapping is then also less generic, as it depends on the intended use and interpretation of the ontology. The mapping through SPARQL queries below is one example. A mapping over TF could be another one.

 

The GF-Protégé plug-in brings us to the development cost problem of translation systems. We have noticed that in the GF setting, building a multilingual translation system is equivalent to building a multilingual GF grammar, which in turn consists of two kinds of components:

•              a language-independent abstract syntax, giving the semantic model via which translation is performed;

•              for each language, a concrete syntax mapping abstract syntax trees to strings in that language.

 

In MOLTO, GF abstract syntax can also be derived from sources other than OWL (e.g. from OpenMath in the mathematical case study) or even written from scratch and then possibly translated into OWL ontologies, if the inference capabilities of OWL reasoning engines are desired. The CRM ontology (Conceptual Reference Model) used in the museum case study is already available in OWL.

 

MOLTO’s ontology-grammar interoperability engine will thus help in the construction of the abstract syntax by automatically or semi-automatically deriving it from an existing ontology. The mechanical translation between GF trees and OWL representations then forms the basis of using GF for translation in the Semantic Web context, where huge data sets become available in RDF and OWL through initiatives like Linked Open Data (LOD).

 

The interoperability between GF and ontologies will also provide humans with natural ways of interaction with knowledge based systems in multiple languages, expressing their need for information in NL and receiving the matching knowledge expressed in NL as well:

 

  Human -> NL -> GF -> ontology -> GF -> NL -> Human

 

providing an entirely new dimension to the usability of semantics-based retrieval systems, and opening extensive structured bodies of knowledge in human-understandable ways.

 

Note that the OWL-to-GF mapping also allows wider human input to GF. OWL ontologies are written by humans (at present, at least, by many more humans than GF grammars).

 

The MOLTO website gives details of what is going to be delivered first by way of ontology-GF interoperability. The first round uses a GF grammar to translate NL questions into the SPARQL query language (http://www.molto-project.eu/node/987).


 

The ontology-GF mapping here is an NL interface to PROTON ontologies, obtained by parsing (fixed) NL into (fixed) GF trees and transforming the trees into SPARQL queries to run on the ontology DB.
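As an illustration (not the actual WP4 implementation), the tree-to-query step can be thought of as rewriting a small fixed repertoire of question trees into SPARQL templates; the tree encoding, namespace and query shape below are assumptions.

  # A hypothetical sketch of rewriting a fixed GF-style question tree into a
  # SPARQL query; tree encoding, namespace and query shape are assumptions.
  def tree_to_sparql(tree):
      """Rewrite a small fixed repertoire of question trees into SPARQL."""
      head, *args = tree
      if head == "QWhichInstances":          # e.g. "which persons are there?"
          return ("PREFIX onto: <http://example.org/proton#>\n"
                  "SELECT ?x WHERE { ?x a onto:%s }" % args[0])
      raise ValueError("tree not covered by the question grammar: %r" % (tree,))

  print(tree_to_sparql(("QWhichInstances", "Person")))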

Indirectly, this does define a mapping between (certain) GF trees and RDF models, using SPARQL in the middle. SPARQL is not RDF, but a SPARQL query does retrieve an RDF model from a given dataset, so the model depends on the dataset. With an OWL reasoner thrown in, we can get OWL query results.

What WP3 had in mind is a tool to translate between OWL models and GF grammars, i.e. to convert OWL ontology content into GF abstract syntax. According to the MOLTO presentation slides (http://www.molto-project.eu/node/1008), this tool is the next to be delivered.

This was confirmed by email from Petar (https://kitwiki.csc.fi/twiki/bin/view/MOLTO/MoltoOntologyEvaluationPlanWP4).

The translation tools WP (WP3) will consider using the TermFactory (TF) multilingual ontology model and tools as middleware between the (non-linguistic) ontology and the GF grammar. The idea is to (semi)automatically match or bridge third-party ontologies to TF, a platform for collaborative development of ontology-based multilingual terminology. It then remains to define an automatic conversion between TF and GF.

The Varna meeting should adjudicate between WP3 and WP4 here.

 

A concrete subtask that arises here is to define an interface between the knowledge representation infrastructure (due November 2010) and TF (finished in the ContentFactory project at the end of 2010).

 


 

WP4 evaluation

Since the aims are related more to use cases and framework development than to enhancing the performance of existing technologies, the evaluation done during the project will be more qualitative than quantitative.

The evaluation of these features should reflect and demonstrate the multiple possibilities of GF that are gained through inter-operation with external ontologies. The evaluation of progress will exploit proof-of-concept demos and plans for further development. For further discussion, see https://kitwiki.csc.fi/twiki/bin/view/MOLTO/MoltoOntologyEvaluationPlanD91

 

 

 

WP5: Statistical and robust translation

WP5 requirements

Objectives

[From DoW] 

The goal is to develop translation methods that complement the grammar-based methods of WP3 to extend their coverage and quality in unconstrained text translation. The focus will be placed on techniques for combining GF-based and statistical machine translation. The WP7 case study on translating patent texts is the natural scenario for testing the techniques developed in this package. Existing corpora for WP7 will be used to adapt SMT and grammar-based systems to the patents domain. This research will be conducted on a variety of the project's languages (at least three).

Deliverables

Del. no   Del. title                                                  Nature     Date
D 5.1     Description of the final collection of corpora             RP         M18
D 5.2     Description and evaluation of the combination prototypes   RP         M24
D 5.3     WP5 final report: statistical and robust MT                RP, Main   M30



Description of work
[10 Robust and statistical translation methods in DoW]


The concrete objectives in this proposal around robust and statistical MT are:

  • Extend the grammar-based approach by introducing probabilistic information and confidence scored predictions.
  • Construct a GF domain grammar and a domain-adapted state-of-the-art SMT system for the Patents use case.
  • Develop combination schemes to integrate grammar-based and statistical MT systems in a hybrid approach.
  • Fulfill the previous objectives on a variety of language pairs of the project (covering three languages at least).

Most of the objectives depend on the patents corpus. Even the languages of study depend on the data that the new partner provides. In order to compensate for the delay this causes, both in WP5 and especially in WP7, we have started working on hybrid approaches. The methodology now is to develop hybrid methods independently of the domain and data sets used, so that they can later be adapted to patents.

Bilingual corpora are needed to create the necessary resources for training/adapting statistical MT systems and to extend the grammar-based paradigm with statistical information (objectives 1 and 2 above). We will compile and annotate large general-purpose bilingual and monolingual corpora for training basic SMT systems. This compilation will rely on publicly available corpora and resources for MT (e.g., the multilingual corpus with transcriptions of European Parliament sessions).

Domain-specific corpora will be needed to adapt the general-purpose SMT system to the concrete domain of application in this project (the patents case study). These corpora will come from the compilation to be made in WP7, led by Mxw.

We already have the European Parliament corpus compiled and annotated for English and Spanish. The final languages will probably be English, German, and Spanish or French; as soon as this is confirmed, the final general-purpose corpus can easily be compiled. The depth of the annotation will depend on the concrete languages and the available linguistic processors.

Combination of the grammar-based and statistical paradigms is a novel and active research line in MT. (...) We plan to explore several instantiations of the fallback approach, from simple to complex:

• Independent combination: in this case, the combination is set up as a cascade of independent processors. When grammar-based MT does not produce a complete translation, the SMT system is used to translate the input sentence. This external combination will serve as the baseline for the other combination schemes (see the sketch after this list).

• Construction of a hybrid system based on both paradigms. In this case, a more ambitious approach will be followed: constructing a truly hybrid system that incorporates an inference procedure able to deal with multiple proposed fragment translations coming from the grammar-based and SMT systems. Again we envision several variants:

• Fix translation phrases produced by the partial GF analyses in the SMT search. In this variant we assume that the partial translations given by GF are correct, so we can fix them and let SMT fill the remaining gaps and do the appropriate reordering. This hard combination is easy to apply but not very flexible.

• Use translation phrase pairs produced by the partial GF analyses, together with their probabilities, to form an extra feature model for the Moses decoder (probability of the target sentence given the source).

• Use tree fragment pairs produced by the partial GF analyses, together with their probabilities, to feed a syntax-based SMT model, such as the one by Carreras and Collins (2009). In this case the search process to produce the most probable translation is a probabilistic parsing scheme.
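The independent-combination baseline from the first bullet above can be sketched as a simple cascade; the two engine functions below are placeholders, not the real GF or Moses APIs.

  # A minimal sketch of the independent-combination baseline: try grammar-based
  # translation first, fall back to SMT when GF gives no complete translation.
  def translate_gf(sentence):
      """Placeholder for the GF engine; returns None when the grammar does not
      cover the sentence."""
      return None

  def translate_smt(sentence):
      """Placeholder for the SMT decoder (e.g. a Moses server)."""
      return "<smt translation of: %s>" % sentence

  def translate(sentence):
      gf_output = translate_gf(sentence)
      return gf_output if gf_output is not None else translate_smt(sentence)

  print(translate("The patent application was rejected."))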



The previous text describes the hybrid MT systems we consider including. The baseline is clear; in fact, one can define three baselines: a raw GF system, a raw SMT system, and the naïve combination of the two. Regarding truly hybrid systems, there is much more to explore. Here we list four approaches to be pursued:

  • Hard integration. Force fixed GF translations within a SMT system.

  • Soft integration I. Led by SMT. GF partial output, as phrase pairs, is integrated as a discriminative probability feature model in a phrase-based SMT system.

  • Soft integration II. Led by SMT. GF partial output, as tree fragment pairs, is integrated as a discriminative probability model in a syntax-based SMT system.

  • Soft integration III. Led by GF. Complement the GF translation structure with SMT options and perform a statistical search to find the final translation.

At the moment, we are able to obtain phrases and alignments from a GF-generated synthetic corpus. This is a first step towards the hard integration of both paradigms, and also towards the soft integration methods led by SMT. We are currently going deeper into the latter, as it is a domain-independent study.

In the evaluation process, these families of methods will be compared to the baseline(s) introduced above according to several automatic metrics.

WP5 evaluation

WP5 is going to have its own internal evaluation, complementary to that of WP9. Since statistical methods need fast and frequent evaluations, most of the evaluation within the package will be automatic. For that, one needs to define the corpora and the set of automatic metrics to work with.

Corpora

Statistical methods are linked to the patents data. This is the quasi-open domain where the hybridization is going to be tested. The languages of the corpus are not yet completely defined, but judging by other work on patents, they will probably be English, German, and French or Spanish.

Besides the large training corpus, we need at least two smaller data sets, one for development purposes and another one for testing. The order of magnitude of these sets is usually around 1,000 aligned segments or sentences. We expect to reach this size, but the final amount will depend on the available data.
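A minimal sketch of carving out such development and test sets from a sentence-aligned corpus is shown below; the file names and sizes are illustrative, and the split is simply random.

  # A minimal sketch of holding out dev and test sets (about 1,000 segments
  # each) from a sentence-aligned bilingual corpus; names and sizes illustrative.
  import random

  with open("corpus.src") as f_src, open("corpus.tgt") as f_tgt:
      pairs = list(zip(f_src, f_tgt))

  random.seed(0)                       # reproducible split
  random.shuffle(pairs)
  dev, test, train = pairs[:1000], pairs[1000:2000], pairs[2000:]
  print(len(train), len(dev), len(test))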

Metrics

BLEU (Papineni et al. 2002) is the de facto metric used in most machine translation evaluation. We plan to use it together with other lexical metrics such as WER or NIST in the development process of the statistical and hybrid systems.
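For instance, assuming NLTK as the scoring library (an assumption; the sentences are invented), the development loop can compute corpus-level BLEU as follows, scoring each hypothesis against a list of tokenized references.

  # A minimal sketch of automatic scoring with corpus-level BLEU, assuming NLTK.
  from nltk.translate.bleu_score import corpus_bleu

  references = [
      [["the", "patent", "was", "granted", "in", "2009"]],
      [["the", "claims", "were", "rejected"]],
  ]
  hypotheses = [
      ["the", "patent", "was", "granted", "in", "2009"],
      ["the", "claims", "were", "refused"],
  ]
  print("BLEU = %.3f" % corpus_bleu(references, hypotheses))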

Lexical metrics have the advantage of being language-independent, since most of them are based on n-gram matching. However, they are not able to capture all aspects of a language, and they have been shown not to always correlate well with human judgements. So, whenever possible, it is good practice to include syntactic and/or semantic metrics as well.

The IQmt package (http://www.lsi.upc.es/~nlp/IQMT/) provides tools for (S)MT translation quality evaluation. For a few languages, it provides metrics for this deeper analysis. At the moment, the package supports English and Spanish, but other languages are planned to be included soon. We will use IQmt for our evaluation on the supported language pairs.