D9.2 MOLTO evaluation and assessment report


Contract No.: FP7-ICT-247914
Project full title: MOLTO - Multilingual Online Translation
Deliverable: D9.2 MOLTO evaluation and assessment report
Security (distribution level): Public
Contractual date of delivery: M36
Actual date of delivery: March 2013
Type: Report
Status & version: Draft
Author(s): Jussi Rautio, Maarit Koponen
Task responsible: UHEL
Other contributors: UPC


Abstract

  • Evaluation of the results (manual, automatic, ...).
  • Evaluation of the grammars and the grammar writing process in terms of D2.3 Best practices document.

X. Grammar evaluation

The impact of MOLTO is not only about individual use cases. During the three years of the project, we have developed methods for efficient grammar writing that divide the task so that grammar experts and domain experts each do what they do best. These guidelines are documented in D2.3, Best practices.

We have conducted a grammar evaluation survey for people who have written grammars. The results of the survey and an overview of the practices are documented in Part 1.

We have also recorded the time and effort needed to correct grammars. Since the release of the first MOLTO demo (D10.2, tourist phrasebook), we have collected feedback and bug reports, and corrected the bugs. Part 2 describes these bugs and the effort that has been needed to fix them.

X.1 Grammars

The Best practices document was published in October 2012, but many of the grammars were written before that. This section first gives an overview of the best practices and then examines whether the grammars follow them.

Best practices

(This summary is copied from the document.)

To make your work reusable, and to enable a division of labour:
 Divide the grammar into a base module (syntactic) and domain extension (lexical).
To make it maximally simple to add languages:
 Consider defining the base part by a functor.
To avoid low-level hacking and guarantee grammatical correctness:
 In the concrete syntax, use only function applications and string tokens, maybe records - but no tables, no concatenation.
To guarantee that the grammar will continue to work in the future:
 Only use the API level of the resource grammar library.
For scalability:
 Choose solutions that remain stable when new languages are added.
A corollary:
 Never use lexical categories as linearization types.
A scalability tool:
 Use type synonyms and constructors rather than raw types for linearization.
To monitor your progress:
 Create a treebank for unit and regression testing, and use it often with the diagnostic tools.
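
As an illustration of the first three guidelines, the following is a minimal sketch of a functor-based grammar. It follows the pattern used in the GF tutorial and is not taken from any MOLTO grammar: the modules Foods, FoodsI, LexFoods and LexFoodsEng and all function names in them are hypothetical, while Syntax, SyntaxEng and ParadigmsEng are assumed to be the RGL API modules. Each module would live in its own file.

```gf
-- Foods.gf: abstract syntax of a hypothetical example domain
abstract Foods = {
  flags startcat = Phrase ;
  cat Phrase ; Item ; Kind ; Quality ;
  fun
    Is   : Item -> Quality -> Phrase ;
    This : Kind -> Item ;
    Wine : Kind ;
    Warm : Quality ;
}

-- LexFoods.gf: interface for the language-dependent lexicon
interface LexFoods = open Syntax in {
  oper
    wine_N : N ;
    warm_A : A ;
}

-- FoodsI.gf: the functor; only RGL API function applications,
-- no tables and no string concatenation
incomplete concrete FoodsI of Foods = open Syntax, LexFoods in {
  lincat
    Phrase  = Cl ;
    Item    = NP ;
    Kind    = CN ;   -- phrasal category, not the lexical category N
    Quality = AP ;   -- phrasal category, not the lexical category A
  lin
    Is item quality = mkCl item quality ;
    This kind       = mkNP this_Det kind ;
    Wine            = mkCN wine_N ;
    Warm            = mkAP warm_A ;
}

-- LexFoodsEng.gf: English lexicon built with smart paradigms
instance LexFoodsEng of LexFoods = open SyntaxEng, ParadigmsEng in {
  oper
    wine_N = mkN "wine" ;
    warm_A = mkA "warm" ;
}

-- FoodsEng.gf: instantiating the functor for English
concrete FoodsEng of Foods = FoodsI with
  (Syntax = SyntaxEng),
  (LexFoods = LexFoodsEng) ;
```

With this layout, adding a language should only require a new lexicon instance and a new functor instantiation; the functor itself stays untouched.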

The following tools are standard and well-tested in MOLTO’s and other applications:

  • the GF compiler and shell
  • the GF run-time for Haskell, Java, and C, as well as the web API
  • the RGL for the 15 MOLTO languages
  • the GF-Eclipse IDE
  • the use of smart paradigms for lexicon building

Phrasebook

The Phrasebook grammar has two modules. Sentences contains the phrases that can be defined by a functor over the resource grammar API. Words contains the phrases that are likely to have different implementations in different languages.

Semantic validity is handled with a simple, restrictive abstract syntax. For example, an abstract syntax function like

HowFarBy : Place -> ByTransport -> Question

guarantees that we can say "How far is the church by taxi" but not "How far is John by beer": the arguments need to be a place and a transport.
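
A minimal sketch of how this restriction works at the type level is given here. Only HowFarBy and its type come from the Phrasebook; the module name HowFarSketch, the categories Person and Drink, and the constants TheChurch, ByTaxi, John and Beer are hypothetical and chosen only to mirror the example sentences.

```gf
abstract HowFarSketch = {
  flags startcat = Question ;
  cat
    Question ; Place ; ByTransport ; Person ; Drink ;
  fun
    HowFarBy  : Place -> ByTransport -> Question ;
    TheChurch : Place ;         -- hypothetical constant
    ByTaxi    : ByTransport ;   -- hypothetical constant
    John      : Person ;        -- hypothetical constant
    Beer      : Drink ;         -- hypothetical constant
}

-- HowFarBy TheChurch ByTaxi   is a well-typed tree of category Question.
-- HowFarBy John Beer          is rejected by the type checker, because
--                             John is not a Place and Beer is not a ByTransport.
```

Because the ill-formed combination has no tree at all, it can be neither generated nor obtained by parsing, in any of the languages.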

Module structure: Common constructions with a functor

The starting point for the grammar was a test corpus of sentences that we wanted to express in the grammar. These sentences are used as documentation for the abstract syntax:

AHasAge     : Person -> Number -> Action ;    -- I am seventy years
AHasChildren: Person -> Number -> Action ;    -- I have six children
AHasName    : Person -> Name   -> Action ;    -- my name is Bond

ACE-GF

ACE-GF: based on Attempto Controlled English. (ACE is ____.)

AceWiki works on a subset of ACE (the AceWiki subset), with grammars for Cat, Dan, Dut, Eng (not ACE), Est, Fin, Fre, Ger, Ita, Lav, Nor, Pol, Ron, Rus, Spa, Swe, Urd (https://github.com/Attempto/ACE-in-GF/tree/master/grammars/acewiki_aceowl).

Grammar modules: ACE base, in addition domain lexicons (Geography).

(AceWiki also contains ordinary grammars that are not ACE; these are unrelated to the ACE grammar.)

Museum

Query grammars


Grammar evaluation survey

Questionnaire

Basic information: 

Use of development tools:

Diagnostic tools
Compilation diagnostics: 
Grammar display modes: 

Testing
Tools for generation and testing: 

RGL
Resource grammar tools:

Grammar writing
Starting point for your grammar:

Basic unit of the grammar:

Semantic control:

Module structure: 

Concrete syntax:

Analysis of answers: ....

Some things answered in "Other", not in Best practices(?):

Other method for treebanks: Haskell code to store, edit and show differences in treebanks.

Other development tool: Haskell and shell scripts generating grammars

X.2 Grammar modification

Examples of grammar modification

Case study: Phrasebook

Phrasebook was published as deliverable 10.2 in June 2010, the third month of MOLTO. Initially it translated between 14 European languages (now 20 languages) and was written by 8 authors. The authors have varied GF skills, ranging from a 2-day GF course to major developers of GF. Some of the language versions were written by people with no skills in the language, using example-based grammar writing (see the report for more information).

During the 2.5 years, we have received feedback and bug reports. The issues can be divided into Phrasebook errors and resource grammar library (RGL) errors. Both show up as errors in the application grammar, but they need to be fixed at different levels. The time spent fixing the problem and the expertise required of the grammar writer also differ between the two error types.

Feedback

Feedback has been given in various ways. There is a feedback button in the demo for anonymous feedback; this has gone to ____ (WHERE) and has been assigned to ____ (WHO). The Phrasebook demo has been shown in various presentations, and sometimes during a presentation an audience member or the presenter has noticed a problem. The problem has either been fixed by the presenter or, where the presenter lacked the time, language skills or GF skills to fix the bug, passed on to someone with the skills and time.

Initially there was no project-wide reporting system, but since autumn 2012 one has been available, set up by UHEL at http://tfs.cc/trac. Each application grammar has an owner who gets a notification about new tickets, and can fix the bug or assign the job to someone else.

Crowdsourcing is another possible source for bug detection. However, in order to profit from that we would need a large number of people browsing the site and our apps, which is not realistic. Most of the bug reports come from people already involved in MOLTO.

List of grammar issues

Here I list issues that I know of. This is not necessarily a complete list.

The difference between an application grammar issue and an RGL issue can be unclear; for instance, incorrect morphology in the application grammar may result from using the wrong RGL functions or from there being no correct RGL function in the first place. Where a correct RGL function exists but the user has chosen a wrong one, I have classified the error as an application grammar issue, as the fix has been made in the application grammar.

Application grammar issues

Spanish:

1) HowFar, HowFarFrom, HowFarBy and HowFarFromBy

  • Error: Structure of "How far" questions. Initially had a structure that was more common in Latin America and sounded weird for speakers in Spain.
  • Fix: By copying the structure from French into the application grammar.
  • Time: < 30 minutes
  • Skills: Medium GF skills (have made a mini resource and some application grammars)

2) Plane

  • Error: The word for plane (avión) had wrong gender. The word had been defined in the application grammar and not in the resource grammar.
  • Fix: Changing the gender in the application grammar, mkN "avión" masculine.
  • Time: < 5 minutes
  • Skills: Medium GF skills

3) Fish

  • Error: The word for fish was a word that means live fish, whereas the context in Phrasebook needs a word for fish as a food. The word was taken from the RGL lexicon, which has only one fish_N, and its meaning is live fish.
  • Fix: Defining the word in the application grammar, mkN "pescado".
  • Time: < 5 minutes
  • Skills: Medium GF skills

4) Adjectives ending in consonant inflect wrong

  • Error: Wrong paradigm chosen in the RGL functions.
  • Initial fix: Choose right paradigm of mkA. With smart paradigms this means choosing the right number of arguments, which in this case is 5 as opposed to 1. Applied to 8 adjectives in the application grammar
  • Time: < 30 minutes
  • Skills: Medium GF skills
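
The two lexical fixes (issues 2 and 3) amount to one-line definitions using the Spanish smart paradigms. The following sketch assumes the RGL modules SyntaxSpa and ParadigmsSpa; the module name PhrasebookLexSpa and the names plane_N and fishFood_N are hypothetical, since in Phrasebook the definitions sit directly in the concrete syntax.

```gf
resource PhrasebookLexSpa = open SyntaxSpa, ParadigmsSpa in {
  oper
    -- issue 2: gender given explicitly instead of being guessed
    plane_N : N = mkN "avión" masculine ;

    -- issue 3: fish as food, defined locally instead of reusing
    -- fish_N from the RGL lexicon (which means live fish)
    fishFood_N : N = mkN "pescado" ;
}
```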

Catalan:

1) HowFar, HowFarFrom, HowFarBy and HowFarFromBy

  • Same error as in Spanish. Due to Catalan having been copied from Spanish. Same fix.

Finnish:

1) Locative cases for geographical names

  • Error: All geographical names have the same locative case, which is wrong for some
  • Fix: Added parameters for the data structure of geographical names, so that the right locative case can be chosen.
  • Time: < 30 minutes?
  • Skills: Advanced GF skills, native speaker of Finnish

RGL errors

Spanish and Catalan:

1) Negative imperatives

  • Error: Negative imperatives were formed by using the positive imperative and adding a negation particle. They should be formed with the subjunctive mood + negation particle.
  • Fix: Created an ImpNeg function in Spanish and Catalan RGL and used it in the application grammar.
  • Time: ~1 hour
  • Skills: Medium GF skills, fluent non-native Spanish & Catalan

2) Adjectives ending in consonant inflect wrong

  • Error: Same error as application grammar issue 4 for Spanish: wrong paradigm chosen in the RGL functions.
  • Fix: Make a new smart paradigm for these adjectives that takes only 2 forms. In Catalan, a more thorough revision of the smart paradigm system was needed.
  • Time: ~1 hour in Spanish
  • Time: half day in Catalan
  • Skills: Medium GF skills, fluent non-native Spanish & Catalan

French:

1) Wrong agreement in French superlative forms

  • Error: The superlative is formed with DetNP, which only produces masculine versions.
  • Fix: Make DetNPFem for all Romance languages, and have the application grammar choose a construction based on the gender of the noun
  • Time: ~1 hour
  • Skills: Medium GF skills

Finnish:

1) Vowel harmony of possessive suffixes

  • Error: Vowel harmony of possessive suffixes not working, gives all words a back vowel variant
  • Fix: Implement new parameter for vowel harmony in the Finnish resource grammar, change cat for nouns and determiners, change functions that handle them
  • Time: ~1 day (counting the first attempt, which turned out to be too slow, and the redesign)
  • Skills: Medium GF skills

2) Wrong word forms in Finnish genitive + possessive suffix: http://tfs.cc/trac/ticket/34

3) Pronoun problems with the modal verb "must" in Finnish: http://tfs.cc/trac/ticket/23

4) Incorrect plural stem for "children" in Finnish: http://tfs.cc/trac/ticket/27

5) Translation of modal verb + a location not working for Finnish. Modal verb problems also in Italian, Catalan and Russian: http://tfs.cc/trac/ticket/15

  • All of these were corrected by a user with advanced GF skills; the total time taken was around half a day.