Lexicon extraction flagship

PDF generated from the Landslide: http://www.molto-project.eu/sites/default/files/Lexicon_extraction_in_MO...

Raw markdown code below.


Lexicon extraction in MOLTO

Krasimir Angelov, Lauri Carlson, Ramona Enache, Inari Listenmaa, Aarne Ranta, Shafqat Virk


Overview

  1. Introduction
  2. Types of lexicons
  3. Lexicon sources
  4. Showcases
  5. Future work

Introduction

Lexicon

Use cases


Types of lexicons

Monolingual

Multilingual


Monolingual lexicons

DictEng.gf:

    a_priori_Adv = mkAdv "a priori";
    aardvark_N = mkN "aardvark" ;
    ab_initio_Adv = mkAdv "ab initio";
    aback_Adv = mkAdv "aback";
    abactinal_A = mkA "abactinal" ;
    abandon_V2 = mkV2 (mkV "abandon");

DictGer.gf:

    a_priori_Adv = mkAdv "a priori" ;
    aachener_N = reg2N  "Aachener" "Aachener" masculine ;
    aal_N = reg2N  "Aal" "Aale" masculine ;
    aalfang_N = reg2N  "Aalfang" "Aalfänge" masculine ;
    aasvogel_N = reg2N  "Aasvogel" "Aasvögel" masculine ;
    abaenderbar_A = regA "abänderbar" ;
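
The function names in the two excerpts above appear to follow a simple convention: the lemma is lowercased, spaces become underscores, German umlauts are transliterated, and the category is appended as a suffix. A minimal Python sketch of that naming step follows. The convention is inferred from the excerpts only, and the helper name `gf_ident` and the transliteration table are assumptions, not part of the MOLTO tooling.

```python
import re

# Assumed transliteration, inferred from entries such as abaenderbar_A ("abänderbar").
TRANSLIT = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def gf_ident(lemma: str, cat: str) -> str:
    """Build a GF function name such as a_priori_Adv from a lemma and a category."""
    base = lemma.lower().translate(TRANSLIT)
    # Collapse every run of characters that is not a-z or 0-9 into one underscore.
    base = re.sub(r"[^a-z0-9]+", "_", base).strip("_")
    return f"{base}_{cat}"

def simple_entry(lemma: str, cat: str, constructor: str) -> str:
    """Format a one-argument lexicon entry, e.g. aardvark_N = mkN "aardvark" ;"""
    return f'{gf_ident(lemma, cat)} = {constructor} "{lemma}" ;'
```

For example, `simple_entry("abänderbar", "A", "regA")` reproduces the last DictGer.gf line shown above. Entries that need several forms (such as `reg2N` with a plural and a gender) would need a richer formatter than this one.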

Multilingual lexicons

DictEngGer.gf:

    abandon_V2 = dirV2 (irregV "verlassen" "verlässt"
                               "verließ" "verließe" "verlassen");
    abase_V2 = dirV2 (irregV "erniedrigen" "erniedrigt"
                             "erniedrigte" "erniedrigte" "erniedrigt");
    abasement_N = mkN "Erniedrigung";
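
Because DictEng.gf and DictEngGer.gf use the same function names (for example `abandon_V2`), entries in the two files can be aligned by name. The following Python sketch illustrates that join. It assumes the first string literal of each entry is its citation form and that no string literal contains a semicolon; the names `citation_forms` and `translation_pairs` are hypothetical.

```python
import re

# An entry is "name = body ;" and the body may span several lines.
ENTRY = re.compile(r'(\w+)\s*=\s*([^;]*);')
FIRST_STRING = re.compile(r'"([^"]*)"')

def citation_forms(gf_source: str) -> dict[str, str]:
    """Map each function name to the first string literal in its linearization."""
    forms = {}
    for name, body in ENTRY.findall(gf_source):
        match = FIRST_STRING.search(body)
        if match:
            forms[name] = match.group(1)
    return forms

def translation_pairs(source: str, target: str) -> dict[str, tuple[str, str]]:
    """Pair up citation forms for function names present in both lexicons."""
    src, tgt = citation_forms(source), citation_forms(target)
    return {name: (src[name], tgt[name]) for name in src if name in tgt}
```

Function names that occur in only one of the two files are dropped, so the result covers only the overlap between the lexicons.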

Uni-sense lexicons


Multi-sense lexicons

LinkedDictGer:

    brother_08111676_N = reg2N "Bruder" "Brüder" masculine ;
    brother_08112052_N = reg2N "Bruder" "Brüder" masculine ;
    brother_08112265_N = reg2N "Kamerad" "Kameraden" masculine ;
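
In the excerpt above, each function name combines an English lemma, an eight-digit sense identifier, and a category, so one lemma can map to several German words. The Python sketch below groups such entries by lemma. The name pattern is read off these three lines only, the helper `senses_by_lemma` is hypothetical, and treating the digits as a sense identifier is an interpretation of the slide.

```python
import re
from collections import defaultdict

# Pattern assumed from the excerpt: <lemma>_<8 digits>_<category> = <linearization> ;
ENTRY = re.compile(r'^\s*(\w+?)_(\d{8})_([A-Za-z0-9]+)\s*=\s*(.*?)\s*;\s*$')
FIRST_STRING = re.compile(r'"([^"]*)"')

def senses_by_lemma(lines):
    """Group sense-tagged entries as lemma -> [(sense id, first target form), ...]."""
    grouped = defaultdict(list)
    for line in lines:
        entry = ENTRY.match(line)
        if entry is None:
            continue  # skip lines that are not sense-tagged entries
        lemma, sense_id, _category, linearization = entry.groups()
        form = FIRST_STRING.search(linearization)
        if form:
            grouped[lemma].append((sense_id, form.group(1)))
    return dict(grouped)
```

Applied to the three lines above, the result lists three senses under `brother`, two of them linearized as "Bruder" and one as "Kamerad".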

Multi-sense lexicons

Option 1: All synonyms in abstract syntax


Multi-sense lexicons

Option 2: One synonym per word sense


Source formats

Annotated data


Source formats

Unannotated data


Combining different sources


Showcases

  1. TermFactory: term extraction and conversion to GF
  2. WordNet: High-coverage lexicons for robust parsing
  3. Phrase tables: domain-specific lexicons for hybrid systems

TermFactory

    term1:fi-koira-N_-_ont-Dog
          syn:frame "N" ;
          gf:lin "mkN str" ;
          term:hasDesignation exp1:fi-koira-N ;
          term:hasReferent ont0:Dog .

    term1:fi-aviomies-N_-_ont-Husband
          syn:frame "jonkun N" ;
          gf:lin "mkN2 (mkN str) (casePrep genitive)" ;        
          term:hasDesignation exp1:fi-aviomies-N ;
          term:hasReferent ont0:Husband .

    term1:fi-kieltää-V_-_sem-Forbid
          syn:frame "V jotakuta olemasta" ;
          gf:lin "mkV2Vf (mkV str) (casePrep partitive) infElat" ;
          term:hasDesignation exp1:fi-kieltää-V ;
          term:hasReferent sem0:Forbid .

TermFactory

Each term carries two annotations: syn:frame and gf:lin. The former is a user-friendly way to annotate valency; the latter is a GF constructor. A mapping relates the two:

    [] gf:mapping
        [ syn:frame "N" ; gf:lin "mkN str" ] ,
        [ syn:frame "jonkun N" ; 
          gf:lin "mkN2 (mkN str) (casePrep genitive)" ],
        [ syn:frame "V jotakuta olemasta" ; 
          gf:lin "mkV2Vf (mkV str) (casePrep partitive) infElat" ] .

Applying the mapping to the three terms yields these GF lexicon entries:

    ont_Dog = mkN "koira" ;
    ont_Husband = mkN2 (mkN "aviomies") (casePrep genitive) ;
    sem_Forbid = mkV2Vf (mkV "kieltää") (casePrep partitive) infElat ;
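
The conversion from frame annotations to GF entries can be sketched in Python as a template substitution. The three templates are copied from the slide; everything else is an assumption made for illustration: the function names, the rule that turns a referent such as `ont0:Dog` into `ont_Dog`, and the treatment of `str` as a placeholder for the lemma. This is not the TermFactory implementation.

```python
import re

# syn:frame -> gf:lin templates, as given in the gf:mapping above.
FRAME_TO_LIN = {
    "N": "mkN str",
    "jonkun N": "mkN2 (mkN str) (casePrep genitive)",
    "V jotakuta olemasta": "mkV2Vf (mkV str) (casePrep partitive) infElat",
}

def gf_function_name(referent: str) -> str:
    """Turn a prefixed referent like 'ont0:Dog' into 'ont_Dog' (assumed rule)."""
    prefix, local_name = referent.split(":", 1)
    return f"{prefix.rstrip('0123456789')}_{local_name}"

def gf_entry(referent: str, lemma: str, frame: str) -> str:
    """Instantiate the template for `frame`, replacing the token `str` by the lemma."""
    template = FRAME_TO_LIN[frame]  # raises KeyError for an unmapped frame
    # A callable replacement avoids backslash processing in the lemma.
    linearization = re.sub(r"\bstr\b", lambda _m: f'"{lemma}"', template)
    return f"{gf_function_name(referent)} = {linearization} ;"
```

For instance, `gf_entry("ont0:Husband", "aviomies", "jonkun N")` gives the second entry listed above. A frame missing from the mapping raises an error rather than falling back to a default constructor, which seems the safer choice for valency information.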

High-coverage lexicons for robust parsing


Work by Shafqat Mumtaz Virk and K. V. S. Prasad


Parsing and linearization with WSD


Parsing and linearization with WSD

Explanation of the graph:


Robust parsing

Work by Krasimir Angelov, Aarne Ranta

"How much do grammars leak?" and "Probabilistic Robust Parsing with Parallel Multiple Context-Free Grammars", submitted to COLING 2012 (not accepted)


Phrase tables

Work by Ramona Enache


- Ontology-based lexicon management in a multilingual translation system – a survey of use cases
- PhD thesis, Chalmers/UGOT, forthcoming: Shafqat Virk, ___
- Conference papers


Future work


Thank you!

Attachment: Lexicon_extraction_in_MOLTO.pdf (158.68 KB)