WP3
Translator's tools and lexicon extraction
Krasimir Angelov, Lauri Carlson, Ramona Enache, Inari Listenmaa, Aarne Ranta, Shafqat Virk
University of Helsinki, University of Gothenburg
MOLTO Final Review, Luxembourg
2013-06-11

Overview

  1. Introduction
  2. Translator's tools
  3. Lexicon extraction
    1. Background
    2. Lexicon sources
    3. Showcases
    4. TermFactory demo
  4. Dissemination

Introduction

Translator's tools

Lexicon extraction

  • Core technology for large-scale translation

Translator's tools

Translator's tools

Simple Translation Tool: translation editor / bilingual document authoring tool

Presenter Notes

Simple Translation Tool is a translation editor and bilingual document authoring tool. It supports a workflow in which the translator is authorized to make structural changes to the document.

Translator's tools

MOLTO translations integrated into Pootle: traditional translation workflow

Presenter Notes

MOLTO tools in a traditional translator's workflow, in the translation platform Pootle.

Translator's tools

Demo installations at tfs.cc/pootle/ and cloud.grammaticalframework.org/translator

Demo video of Pootle

Pootle source code is available as a Git repository at tfs.cc/git/pootle.git

Lexicon extraction

Lexicon extraction

Motivations

  • Bigger grammars, new domains
  • First experiments on free translation
  • Needed especially by WP5, Statistical and Robust Parsing

Scenarios

  • Converting existing resources to GF lexicons
  • Searching and managing terms in TermFactory

Background

Types of lexicons

I. Monolingual

II. Multilingual

A. Uni-sense

B. Multi-sense

Monolingual lexicons

  • Extracted from a monolingual resource
  • Idiomatic, tailored for each language
  • Used as a resource or for a monolingual application

DictEng.gf:

    a_priori_Adv = mkAdv "a priori" ;
    aardvark_N = mkN "aardvark" ;
    ab_initio_Adv = mkAdv "ab initio" ;
    aback_Adv = mkAdv "aback" ;
    abandon_V2 = mkV2 (mkV "abandon") ;

DictGer.gf:

    a_priori_Adv = mkAdv "a priori" ;
    aachener_N = reg2N "Aachener" "Aachener" masculine ;
    aalfang_N = reg2N "Aalfang" "Aalfänge" masculine ;
    aasvogel_N = reg2N "Aasvogel" "Aasvögel" masculine ;
    abaenderbar_A = regA "abänderbar" ;

Multilingual lexicons

  • Common abstract syntax; the concrete syntaxes are translations of it
  • Used for multilingual applications

DictEngGer.gf:

    abandon_V2 = dirV2 (mkV "verlassen" "verlässt"
                            "verließ" "verließe" "verlassen") ;
    abase_V2 = dirV2 (mkV "erniedrigen" "erniedrigt"
                          "erniedrigte" "erniedrigte" "erniedrigt") ;
    abasement_N = mkN "Erniedrigung" ;

Uni-sense and multi-sense lexicons

  • Uni-sense
    • Entries by lemmas, not word senses
  • Multi-sense
    • Every distinct word sense gets an entry
    • Possible to combine with external word sense disambiguation tool
    • Example:
      LinkedDictGer:
          brother_08111676_N = reg2N "Bruder" "Brüder" masculine ;
          brother_08112052_N = reg2N "Bruder" "Brüder" masculine ;
          brother_08112265_N = reg2N "Kamerad" "Kameraden" masculine ;
      
    • Format: lemma_senseNumber_POS
      • Sense numbers independent of language
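
The identifier format above can be taken apart mechanically, for example when pairing entries with the output of an external word sense disambiguation tool. The following is a minimal sketch, not part of the MOLTO tools: the function name `parse_sense_id` and the exact pattern (numeric sense number, category tag starting with a capital letter) are assumptions based on the three example entries.

```python
import re
from typing import NamedTuple


class SenseId(NamedTuple):
    lemma: str
    sense: str  # kept as a string so leading zeros survive
    pos: str


# Assumed pattern: lemma (may itself contain underscores), numeric sense
# number, then a category tag such as N, V2 or Adv.
_SENSE_ID = re.compile(r"^(?P<lemma>.+)_(?P<sense>\d+)_(?P<pos>[A-Z][A-Za-z0-9]*)$")


def parse_sense_id(identifier: str) -> SenseId:
    """Split an identifier of the form lemma_senseNumber_POS."""
    match = _SENSE_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a sense-linked identifier: {identifier!r}")
    return SenseId(match["lemma"], match["sense"], match["pos"])


if __name__ == "__main__":
    for name in ("brother_08111676_N", "brother_08112265_N"):
        print(parse_sense_id(name))
```

Because the sense number is independent of language, the `(sense, pos)` pair could serve as a key for aligning entries across the per-language lexicons.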

Data needed for a lexicon

A GF lexicon is part of the grammar, so each entry needs:

  • base form: the uninflected form of the word (car, walk as opposed to cars, walking)
  • inflection paradigm: how the other forms are built (cat : cats, child : children)
  • valency frame: the word's arguments and their behaviour

Smart paradigms: infer the lower-level paradigm from the base form

  • Predictability does not decrease when the complexity of morphology grows (Détrez and Ranta, EACL 2012)
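
To illustrate the idea of a smart paradigm, here is a toy sketch for English noun plurals. It is not the GF Resource Grammar Library implementation: the function names `plural_en` and `mk_n` and the three heuristics are assumptions chosen for illustration. The point is that one base form is usually enough, and an irregular word supplies one extra form.

```python
from typing import Dict, Optional


def plural_en(base: str) -> str:
    """Guess the English plural from the base form (toy heuristics only)."""
    if base.endswith(("s", "x", "z", "sh", "ch")):
        return base + "es"                      # box -> boxes
    if base.endswith("y") and len(base) > 1 and base[-2] not in "aeiou":
        return base[:-1] + "ies"                # fly -> flies
    return base + "s"                           # cat -> cats


def mk_n(base: str, plural: Optional[str] = None) -> Dict[str, str]:
    """Build a two-form noun table; pass the plural only for irregular nouns."""
    return {"sg": base, "pl": plural if plural is not None else plural_en(base)}


if __name__ == "__main__":
    print(mk_n("cat"))                # regular: inferred from the base form
    print(mk_n("child", "children"))  # irregular: second form given explicitly
```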

Source formats

Annotated data

  • TermFactory RDF
    • Includes necessary information for GF grammars
    • Conversion from TF RDF to GF included in the platform
  • WordNet
  • Morphological lexicons

Source formats

Unannotated data

  • Domain ontologies
    • ~18,000 concepts extracted for patent case
  • Phrase tables
    • Phrase table: entries of (source phrase, target phrase, probability) for an SMT system
    • Learned from parallel text data
    • ~4,000 words extracted for French, ~8,000 for German patent grammars
  • Unannotated word lists
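
As a rough illustration of how word pairs might be pulled out of a phrase table, the sketch below keeps only single-word entries above a probability threshold. The on-disk layout (`source ||| target ||| probability`), the function name `lexical_pairs`, the threshold, and the sample lines are all assumptions made for this example; they are not the actual extraction pipeline or real data from the patent case.

```python
from typing import Iterable, Iterator, Tuple


def lexical_pairs(
    lines: Iterable[str], min_prob: float = 0.5
) -> Iterator[Tuple[str, str, float]]:
    """Yield (source, target, probability) for one-word-to-one-word entries."""
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 3:
            continue  # skip malformed lines
        source, target, score = fields[0], fields[1], fields[2]
        try:
            prob = float(score.split()[0])  # first score only, if several
        except (ValueError, IndexError):
            continue
        if len(source.split()) == 1 and len(target.split()) == 1 and prob >= min_prob:
            yield source, target, prob


if __name__ == "__main__":
    # Invented sample entries, for illustration only.
    sample = [
        "claim ||| Anspruch ||| 0.82",
        "the claim ||| der Anspruch ||| 0.64",
        "device ||| Gerät ||| 0.12",
    ]
    for pair in lexical_pairs(sample):
        print(pair)
```

Candidates selected this way would still need a category and an inflection paradigm before they can become GF lexicon entries.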

Format conversions

  • TermFactory RDF format supports GF conversion
    • Different rules for each language, manual work
  • Other formats: scripts written for each source
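
A per-source conversion script can be very small when the source is a plain word list with a known category. The following sketch is hypothetical (the function name `gf_entry` and the transliteration table are assumptions, not the project's scripts); it produces lines in the same shape as the DictEng.gf and DictGer.gf entries shown earlier.

```python
import re

# Assumed transliteration for identifiers, following the deck's own
# example (abänderbar -> abaenderbar_A).
_TRANSLIT = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def gf_entry(word: str, cat: str = "N", oper: str = "mkN") -> str:
    """Format one lexicon line: identifier_Cat = oper "word" ;

    The word is assumed not to contain double quotes.
    """
    word = word.strip()
    ident = word.lower().translate(_TRANSLIT)
    ident = re.sub(r"\W+", "_", ident).strip("_")  # spaces etc. become "_"
    return f'{ident}_{cat} = {oper} "{word}" ;'


if __name__ == "__main__":
    print(gf_entry("aardvark"))
    print(gf_entry("a priori", cat="Adv", oper="mkAdv"))
    print(gf_entry("abänderbar", cat="A", oper="regA"))
```

Relying on a one-argument constructor such as `mkN` only works where the smart paradigm can infer the inflection; other words need extra forms, which is where the manual, language-specific work comes in.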

Showcases

  1. Linked lexicons from WordNet, used for translation with word sense disambiguation
  2. TermFactory: term extraction from various sources, conversion to GF

Multi-sense lexicons from WordNet

  • Multilingual multi-sense lexicon from WordNets (Hindi and Universal WordNet)
  • External word sense disambiguation tool (IMS) in parsing
  • First results: slight improvement for Hindi and German
  • Results and documentation in Virk, 2013: Computational Linguistics Resources for Indo-Iranian Languages (PhD thesis, University of Gothenburg)

TermFactory

  • TermFactory: terminology management platform created at UHEL (University of Helsinki)
  • Collaborative work
  • XHTML-based editing
  • RDF format with GF-specific predicates
  • Conversion to GF lexicon

Demo: tfs.cc

Thank you!

Dissemination
