WP9 User requirements and evaluation - M24
Summary of progress
This WP is working on collecting evaluation plans from each site.
An extended evaluation plan (D9.1E) has been written.
Highlights and planned evaluation methods
Progress evaluation has mainly been carried out by each site during development; it would be a good idea to collect these results more systematically.
For the SMT/hybrid patent case, automatic measures (BLEU, but also others; see the UPC slides by C. España for examples) have probably mainly been used.
In developing the GF grammars, informants (native speakers of the relevant languages) have been used during the grammar writing process to check and correct output. The informants have been given output to read and have informed the developer if sentences are correct or if not, how they should be corrected.
Moving forward, the final evaluations will need to include usability of the tools as well as quality evaluation of the output. The WP9 review slides give some examples of the user communities that might be mobilized for usability evaluation and of the platforms that could be used. One open question regarding the mobilization of evaluators is how to motivate them to use the tools.
For output quality, the final evaluations will likely involve both automatic and manual methods. For automatic methods, UPC's Asiya evaluation kit offers syntactically and semantically oriented metrics in addition to purely lexical ones like BLEU, but only for a couple of languages. Since all automatic metrics rely on comparison against gold-standard human translations, such references need to be obtained for the test sets if automatic evaluation is to be used.
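As a minimal sketch of the lexical end of this spectrum, the snippet below computes corpus-level BLEU for system output against gold-standard references. It uses NLTK's BLEU implementation rather than Asiya itself, and the file names are hypothetical placeholders (one tokenized sentence per line); it only illustrates the dependence of automatic metrics on reference translations.

```python
# Minimal BLEU sketch (not Asiya): compare MT output to gold-standard
# references. File names below are hypothetical placeholders.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def read_tokenized(path):
    """Read one sentence per line and split it into whitespace tokens."""
    with open(path, encoding="utf-8") as f:
        return [line.strip().split() for line in f]

hypotheses = read_tokenized("patents.system.de")     # MT output
references = read_tokenized("patents.reference.de")  # gold-standard human translations

# corpus_bleu expects, for each sentence, a *list* of acceptable references.
score = corpus_bleu(
    [[ref] for ref in references],
    hypotheses,
    smoothing_function=SmoothingFunction().method1,
)
print("BLEU = %.3f" % score)
```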
Manual evaluation methods, on the other hand, require human evaluators. For the patent case, evaluators need sufficient understanding of the material to assess whether the translations are correct, particularly since we expect one of the strengths of the GF hybrid to be the correct handling of long formulae. Plans have therefore been made to hire professional patent translators for the languages in question to carry out the evaluation, expected to take place in June. Since Google now also provides patent translations, these will be used as a point of comparison. Measures such as the TAUS scale and fluency could be used in this case.
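As a rough sketch of how such manual judgments could be collated across systems, assume each translator judgment is stored as one CSV row with a sentence id, the system that produced the translation, and 1–5 fluency and adequacy scores. The file name, column names, system labels and scale are illustrative assumptions, not the agreed protocol.

```python
# Sketch: aggregate per-sentence manual judgments by system.
# CSV layout, column names and 1-5 scale are assumptions for illustration.
import csv
from collections import defaultdict
from statistics import mean

def summarize(path):
    scores = defaultdict(lambda: {"fluency": [], "adequacy": []})
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            scores[row["system"]]["fluency"].append(int(row["fluency"]))
            scores[row["system"]]["adequacy"].append(int(row["adequacy"]))
    for system, judgments in sorted(scores.items()):
        print(system,
              "mean fluency:", round(mean(judgments["fluency"]), 2),
              "mean adequacy:", round(mean(judgments["adequacy"]), 2))

summarize("patent_manual_evaluation.csv")  # e.g. rows for "gf-hybrid" and "google"
```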
For the museum case, one proposed manual evaluation approach is to produce museum descriptions in various languages that combine the simpler rules, e.g. "Painter painted Painting in City in Year on Canvas", have native speakers check the individual relations involved (Who painted it? What did they paint? Where? When? etc.), and combine these judgments into a measure of overall fidelity. For this, evaluators do not need to be museum experts; any native speaker of the language in question should do. An interesting description of such an approach can be found at http://www.cs.ust.hk/~dekai/library/WU_Dekai/LoWu_Acl2011.pdf. Other measures, such as fluency and the TAUS fitness scale, could also be used.
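A minimal sketch of combining the per-relation judgments into a single fidelity figure for one generated description is given below; the relation names and data structure are purely illustrative assumptions.

```python
# Illustrative sketch: overall fidelity as the fraction of relations a
# native speaker judged as correctly conveyed in the generated description.
judgments = {
    # "Painter painted Painting in City in Year on Canvas"
    "who_painted": True,   # Who painted it?
    "what_painted": True,  # What did they paint?
    "where": False,        # Where was it painted?
    "when": True,          # When was it painted?
    "material": True,      # On what material?
}

def fidelity(relation_judgments):
    """Fraction of relations judged as correctly conveyed."""
    return sum(relation_judgments.values()) / len(relation_judgments)

print("fidelity = %.2f" % fidelity(judgments))  # 0.80 for this example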
Use of resources for Period 2
Node | Professor/Manager | PostDoc | PhD Student | Research Engineer/Intern |
---|---|---|---|---|
UGOT | 0 | 0 | 0 | 0 |
UPC | 2.16 (L. Màrquez, LluísP, D. Farwell) | 0.5 (C. España) | 0 | 0 |
UHEL | 0.5 (L. Carlson) | 0 | 9 (S. Nyrkkö) | 0 |
OntoText | 0 | 0 | 0 | 2 (M. Chechev, K. Krustev) |
UZH | 0 | 0 | 0 | 0 |
BI | 0 | 0 | 0 | 0 |
Deviations from Annex I