The talks and the tutorials are open to everyone! Coffee and meals are reserved for those who registered before 2 June.
Venue: EDIT Building, Chalmers University of Technology, Gothenburg
9:30 Invited talk: Martin Kay, The New Machine Translation - Getting blood from a stone
10:30 Coffee break
11:00 Kevin Brubeck Unhammer and Trond Trosterud, Evaluating North Sámi to Norwegian assimilation RBMT
11:35 Hrvoje Peradin, A rule-based machine translation from Serbo-Croatian to Macedonian
13:30 John J. Camilleri, An IDE for the Grammatical Framework
14:05 Víctor M. Sánchez-Cartagena, Miquel Esplà-Gomis, Felipe Sánchez-Martínez and Juan Antonio Pérez-Ortiz, Choosing the correct paradigm for unknown words in rule-based machine translation systems
14:40 Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez and Juan Antonio Pérez-Ortiz, An Open-Source Toolkit for Integrating Shallow-Transfer Rules into Phrase-Based Statistical Machine Translation
15:15 Coffee break
15:45 Cristina España-Bonet, Gorka Labaka, Arantza Diaz De Ilarraza, Lluis Marquez and Kepa Sarasola, Deep evaluation of hybrid architectures: Use of different metrics in MERT weight optimization
17:15 Boat trip to Styrsö island, dinner. (Suggested route: 17:31 Chalmers tram 6, 17:49 Järntorget tram 9, 18:30 Saltholmen boat, arrival 18:43; return boat 21:44, back in town an hour later.)
9:00 Apertium hands-on tutorial (Francis Tyers)
14:00 GF resources and tools for machine translation
16:30 Concluding discussion
17:00 End of official programme
Martin Kay, The New Machine Translation - Getting blood from a stone.
There is now a new sense of excitement in the air about machine translation. After fifty years of unfulfilled promises by linguists, the field has been taken over by computer scientists and reconstructed on scientific principles. A machine translation system requires massive amounts of data. Painstaking work with native informants, and playing examples off against counterexamples, takes far too long and is too unreliable. We now extract the massive amounts of data from massive quantities of naturally occurring text by sophisticated machine-learning techniques. If you doubt the value of this approach, you have only to look at Google Translate.
We should be thankful for this new turn of events because massive amounts of data and sophisticated machine-learning techniques have a vital role to play in machine translation. But, as I will show in this talk, they are not enough to finish the job because much of the information required to build a creditable translation system cannot be extracted from examples, even in principle, however massive the number of them that one collects or how sophisticated the techniques one applies. It cannot be extracted because it is not there to be extracted. As my mother would say: "You cannot get blood from a stone".
Kevin Brubeck Unhammer and Trond Trosterud, Evaluating North Sámi to Norwegian assimilation RBMT.
We describe the development and evaluation of a rule-based machine translation assimilation system from Northern Sámi to Norwegian Bokmål, built on a combination of Free and Open Source (FOSS) resources: the Apertium platform and the Giellatekno HFST lexicon and Constraint Grammar disambiguator.
We detail the integration of these and other resources in the system along with the construction of the lexical and structural transfer, and evaluate the translation quality using various methods. Finally, some future work is suggested.
Hrvoje Peradin, A rule-based machine translation from Serbo-Croatian to Macedonian.
This paper describes the development of a one-way machine translation from Serbo-Croatian to Macedonian in the Apertium platform. Details of resources and development methods are given, as well as an evaluation and general directions for future work.
John J. Camilleri, An IDE for the Grammatical Framework.
The GF Eclipse Plugin provides an integrated development environment (IDE) for developing grammars in the Grammatical Framework (GF). Built on top of the Eclipse Platform, it aids grammar-writing by providing instant syntax checking, semantic warnings and cross-reference resolution. Inline documentation and a library browser facilitate the use of existing resource libraries, and compilation and testing of grammars are greatly improved through single-click launch configurations and an in-built test case manager for running treebank regression tests. This IDE promotes grammar-based systems by making the tasks of writing grammars and using resource libraries more efficient, and provides powerful tools to reduce the barrier to entry to GF and encourage new users of the framework.
Víctor M. Sánchez-Cartagena, Miquel Esplà-Gomis, Felipe Sánchez-Martínez and Juan Antonio Pérez-Ortiz, Choosing the correct paradigm for unknown words in rule-based machine translation systems.
Previous work on an interactive system aimed at helping non-expert users to enlarge the monolingual dictionaries of rule-based machine translation (MT) systems worked by discarding those inflection paradigms that cannot generate a set of inflected word forms validated by the user. This method, however, cannot deal with the common case where several different paradigms generate exactly the same set of inflected word forms, although with different inflection information attached. In this paper, we propose the use of an n-gram-based model of lexical categories and inflection information to select a single paradigm in cases where more than one paradigm generates the same set of word forms. Results obtained with a Spanish monolingual dictionary show that the correct paradigm is chosen for around 75% of the unknown words, which makes the resulting system (available under an open-source license) a valuable aid for non-expert users without technical linguistic knowledge who enlarge the monolingual dictionaries used in MT.
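The two-stage selection described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the add-one-smoothed bigram model, the tag labels, the paradigm names (`casa__n`, `problema__n`, `papel__n`) and the toy training data are all assumptions made for illustration. Stage one discards paradigms that cannot generate the validated forms; stage two scores the remaining paradigms by the probability of the tag sequences they would produce in context.

```python
import math
from collections import Counter

def train_bigram(tag_sequences):
    """Count tag unigrams and bigrams over tagged sentences (toy model)."""
    unigrams, bigrams = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["<s>"] + list(tags) + ["</s>"]
        unigrams.update(padded[:-1])          # history counts
        bigrams.update(zip(padded, padded[1:]))
    vocab = {t for tags in tag_sequences for t in tags} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)

def log_prob(tags, model):
    """Add-one-smoothed bigram log-probability of a tag sequence."""
    unigrams, bigrams, vocab_size = model
    padded = ["<s>"] + list(tags) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )

def filter_by_forms(candidates, validated):
    """Stage 1: keep paradigms that generate every user-validated form."""
    return {name: forms for name, forms in candidates.items()
            if validated <= set(forms)}

def choose_paradigm(candidates, contexts, model):
    """Stage 2: pick the paradigm whose tags best fit the observed contexts."""
    def score(form_to_tag):
        return sum(log_prob(left + [form_to_tag[form]] + right, model)
                   for left, form, right in contexts)
    return max(candidates, key=lambda name: score(candidates[name]))

# Hypothetical data: each paradigm maps surface forms to inflection tags.
candidates = {
    "casa__n":     {"pizarra": "n.f.sg", "pizarras": "n.f.pl"},
    "problema__n": {"pizarra": "n.m.sg", "pizarras": "n.m.pl"},
    "papel__n":    {"pizarra": "n.m.sg", "pizarraes": "n.m.pl"},
}
model = train_bigram([["det.f.sg", "n.f.sg"],
                      ["det.m.sg", "n.m.sg"],
                      ["det.f.sg", "n.f.sg"]])
compatible = filter_by_forms(candidates, {"pizarra", "pizarras"})
# One context: the unknown word follows a feminine singular determiner.
contexts = [(["det.f.sg"], "pizarra", [])]
best = choose_paradigm(compatible, contexts, model)
print(best)  # casa__n
```

In this toy example the first two paradigms generate identical surface forms, so only the tag-sequence model separates them; the feminine determiner in the context favours the feminine paradigm.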
Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez and Juan Antonio Pérez-Ortiz, An Open-Source Toolkit for Integrating Shallow-Transfer Rules into Phrase-Based Statistical Machine Translation.
In this paper, we present an open-source toolkit to enrich a phrase-based statistical machine translation system (Moses) with phrase pairs generated from the linguistic resources of a shallow-transfer rule-based machine translation system (Apertium). A system built with this toolkit was not outperformed by any other participant of the shared translation task of the Sixth Workshop on Statistical Machine Translation (WMT 11). We release the toolkit with the hope that it will be useful to other MT practitioners.
Cristina España-Bonet, Gorka Labaka, Arantza Diaz De Ilarraza, Lluis Marquez and Kepa Sarasola, Deep evaluation of hybrid architectures: Use of different metrics in MERT weight optimization.
The process of developing hybrid MT systems is guided by the evaluation method used to compare different combinations of basic subsystems. This work presents a deep evaluation experiment of a hybrid architecture that tries to get the best of both worlds, rule-based and statistical. The differences between the results of automatic and human evaluation corroborate the inappropriateness of purely lexical automatic metrics for comparing the outputs of systems that use very different translation approaches.
An examination of sentences with controversial results suggested that linguistic well-formedness in the output should be considered in evaluation. As a first step, we have therefore defined a new simple metric that combines lexical information with PoS information. This metric has shown greater agreement with human assessment than BLEU in our previous manual evaluations. We have now used this evaluation metric throughout the development cycle, and in this article we test whether it can improve MERT parameter optimization.
The results are not conclusive. Although manual evaluation shows a slight improvement when using the proposed measure in optimization compared to results obtained with BLEU, the improvement is too small to draw clear conclusions. We therefore believe that we should first focus on integrating linguistically more representative features in the hybrid system itself, so that linguistically informed metrics can do their job better. As a further step, these metrics could be tailored to the specific hybrid system.