Query Technology Flagship

Query Technologies. Patents Retrieval System

Captain: Maria Mateva

Sailors: Aarne Ranta, Ramona Enache, Meritxell Gonzàlez, Jordi Saludes

Other contributors: MOLTO Consortium

Presenter Notes

SMILE! (Test!)

Outline

  • Query Language Generation with GF
  • GF - Ontology Interoperability
  • MOLTO Prototypes (with Respect to Query Technology)
  • Patents Use Case - Demo

Query Language Generation with GF

MOLTO Overview: Machine Language Generation

CNL to Ontology via GF: NL queries to SPARQL

  • Mapping rules (WP4, molto-kri)
  • SPARQL as a GF concrete language. YAQL (WP7 molto-patents, WP8 molto-cultural-heritage, WP12 verbalization component in the Be Informed Studio)
  • GF has also proved successful at translating into machine languages

Ontology to NL description/answer via GF

  • Verbalization of RDF facts by GF description grammars, covering subjects/objects and predicates
  • Semi-automatically generated answer/description grammar

GF abstract representation to Sage syntax

GF abstract representation to ACE syntax

GF-Ontology Interoperability

The task

Mapping Rules

From GF Abstract Trees to SPARQL

  • Small domain specific language for the purpose
  • A dedicated parser for it
  • Tables of names, types of entities, syntactic sugar
  • The rules are easy to write, but tedious to test and maintain
  • Used in molto-kri (WP4). Example:

    //all people and all organizations
    (QSet ?X) | single(X) && type(X) == "" && name(X) != Location -->
    construct WHERE {
    sparqlVar(name(X)) rdftype() class(name(X)) .
    sparqlVar(name(X)) property(hasAlias) sparqlVar(name(X)) ## "_alias". };

Concrete GF Grammars for SPARQL

The YAQL grammar

  • GF module which provides a basis for generating domain-specific SPARQL grammars
  • Part of WP4, used in WP7, WP8, WP12
  • Straightforward abstract syntax generation from ontology, with just the minimum of lexical types

    Kind : usually CN
    Entity : usually NP
    Property : can be VP, AP, ClSlash
    Relation : VPSlash built from V2, AP, comparatives
    

Verbalization of RDF Facts via GF

Verbalization of subject-predicate-object triples

For subjects and objects the verbalization is straightforward: all of them are represented in GF by the Entity category. We might want more categories, corresponding to classes in the ontology.

  • abstract syntax

      cat Entity;  
      fun  
          Airline_T_1 : Entity;  
          Airline_T_10 : Entity;
    
  • concrete syntax (data from the English labels)

      lincat  
            Entity = Str ;  
      lin  
            Airline_T_1 = "Japan Airlines System Corporation" ;  
            Airline_T_10 = "Cathay Pacific Airways, Ltd." ;
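
Because the entity rules above follow one fixed pattern per label, they lend themselves to generation. The following is a minimal Python sketch, not the MOLTO tooling: the function names `gf_escape` and `entity_grammar` are hypothetical, the escaping of quotes inside GF string literals is an assumption, and only the rule bodies are produced, not complete GF modules.

```python
def gf_escape(label: str) -> str:
    """Escape backslashes and double quotes (assumed GF string-literal escaping)."""
    return label.replace("\\", "\\\\").replace('"', '\\"')


def entity_grammar(labels: dict[str, str]) -> tuple[str, str]:
    """Return (abstract, concrete) rule text for a table of entity labels."""
    abstract = ["cat Entity ;", "fun"]
    concrete = ["lincat", "    Entity = Str ;", "lin"]
    for ident, label in labels.items():
        # One abstract function and one linearization per entity identifier.
        abstract.append(f"    {ident} : Entity ;")
        concrete.append(f'    {ident} = "{gf_escape(label)}" ;')
    return "\n".join(abstract), "\n".join(concrete)


# The two entities shown on the slide.
labels = {
    "Airline_T_1": "Japan Airlines System Corporation",
    "Airline_T_10": "Cathay Pacific Airways, Ltd.",
}
abstract_rules, concrete_rules = entity_grammar(labels)
print(abstract_rules)
print(concrete_rules)
```

Keeping the identifier as the dictionary key means the abstract and concrete rules cannot drift apart, which is the property a generated grammar needs.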
    

Verbalization of RDF facts

Automation of Verbalization of Predicates with GF

  • Motivation: consistency of ontology class and property names (in English)
  • Survey of 10 random public repositories with ~1700 unique predicates
  • Division of normalized predicate names into 3 groups and ~20 types
  • We suggest verbalization of the types via GF (in English)

Example

abstract syntax:

fun
    text: Entity -> Property -> Phrase;
    and: [Property] -> Property;
    activeInSector: Entity -> Property ;  
    childOrganizationOf : Entity -> Property ;  
    hasCapital : Entity -> Property ;  
    ...
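
Predicate names such as those above have to be normalized before they can be grouped. The following Python sketch shows one possible normalization step. The function names are hypothetical, and the three shapes returned by `rough_shape` are purely illustrative; they are not the 3 groups and ~20 types of the MOLTO classification.

```python
import re

# Illustrative preposition list, an assumption of this sketch.
PREPOSITIONS = {"of", "in", "by", "to", "for", "at", "from", "with", "on"}


def normalize_predicate(name: str) -> list[str]:
    """Split a camelCase or snake_case predicate name into lowercase words."""
    local = re.split(r"[#/:]", name)[-1]  # drop a namespace or prefix
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", local)
    return [w.lower() for w in words]


def rough_shape(words: list[str]) -> str:
    """Assign an illustrative shape; not the classification used in MOLTO."""
    if words and words[0] == "has":
        return "has-noun"
    if words and words[-1] in PREPOSITIONS:
        return "ends-with-preposition"
    return "other"


for predicate in ["activeInSector", "childOrganizationOf", "hasCapital"]:
    words = normalize_predicate(predicate)
    print(predicate, words, rough_shape(words))
```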

Verbalization of RDF facts. Example

RDF result triples:

In GF (data from labels):

Company_T_36 = "4Developers LLC" ;
Country_T_4 = "United States" ;

In GF (manually translated predicates):

EN:  locatedIn x = MkVPS (mkTemp presentTense simultaneousAnt) positivePol (mkVP (mkA2 (mkA "located") (mkPrep "in")) (s2e x)) ;
SV:  locatedIn x = mkVPS (mkVP (mkV2 ligga_V (mkPrep "i")) (s2e x)) ;

Verbalization:

EN: 4Developers LLC is located in United States.
SV: 4Developers LLC ligger i United States.

Note: Some triples are not verbalized (e.g. those with predicates such as "rdf:type" and "rdfs:label").
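
The flow from triples to sentences can be sketched in Python. This sketch swaps in plain string templates for the GF resource grammar calls shown above, so it illustrates only the data flow, not the grammar-based method. The input triples are assumed for illustration, since the slide's result triples are not reproduced here, and the names `LABELS`, `TEMPLATES`, `SKIPPED` and `verbalize` are hypothetical.

```python
# Labels taken from the GF linearizations on this slide.
LABELS = {"Company_T_36": "4Developers LLC", "Country_T_4": "United States"}

# String templates standing in for the manually written GF predicates.
TEMPLATES = {
    "locatedIn": {"EN": "{s} is located in {o}.", "SV": "{s} ligger i {o}."},
}

# Predicates that are not verbalized, as stated in the note.
SKIPPED = {"rdf:type", "rdfs:label"}


def verbalize(triples, lang):
    """Turn (subject, predicate, object) triples into sentences in `lang`."""
    sentences = []
    for subj, pred, obj in triples:
        if pred in SKIPPED:
            continue
        template = TEMPLATES[pred][lang]
        sentences.append(
            template.format(s=LABELS.get(subj, subj), o=LABELS.get(obj, obj))
        )
    return sentences


# Assumed input triples for illustration.
triples = [
    ("Company_T_36", "rdf:type", "Company"),
    ("Company_T_36", "locatedIn", "Country_T_4"),
]
print(verbalize(triples, "EN"))
print(verbalize(triples, "SV"))
```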

Ontology Predicates Classification

with respect to GF verbalization

  • Example classes (types) available at molto-svn
  • Example-based approach: each type is verbalized with a GF abstract and concrete pattern
  • The approach requires a final review by an expert
  • Future work:
    • Verbalization of queries for the same predicates
    • Translation into different languages via lexicons (done by UHEL and BI)
    • Automatic verbalization of all facts by means of GF
    • This provides multilinguality at low cost, but the quality achieved is not very high

MOLTO Prototypes

with Respect to Query Technology

The KRI Prototype

Queries over PROTON

  • PROTON is an upper-level ontology by Ontotext
  • PROTON has 935668 statements and 260376 entities, mostly focused on the Person, Location, Organization, Job Title concepts
  • KRI prototype: http://molto.ontotext.com/
  • A set of predefined queries in 6 languages
  • Semi-autogenerated answer grammar
  • SPARQL generation by mapping-rules
  • Verbalizing ontology facts and predicate types

Mathematics Use Case

Querying a Computer Algebra System (Sage) by NL

  • A command line tool for computing using natural language and aural replies.
  • An embedded interface in the Sage notebook. Using natural language in a Sage cell
  • GF grammar with Command and Answer categories
  • Concrete grammars for each natural language supported and an extra one for the Sage language
  • User commands are translated to a Sage expression
  • The user receives a translation from the Sage language to the target natural language

Cultural Heritage Use Case

SPARQL Generation via GF grammar

  • Prototype at: http://museum.ontotext.com
  • A more dynamic approach than in the Patents use case (WP7): a step forward in automation
  • Museum names are partially translated with DBpedia multilingual labels
  • 15 natural languages are translated to SPARQL queries
  • Painting descriptions are automatically generated from semantic results in 15 languages

Patents Retrieval System

The Tasks

  • Patents semantically enriched with annotations
  • Query biomedical patents, the related ontologies and the semantic annotations over them
  • Translate the patents into three official EU languages (EN, FR, DE) and return results to users in their preferred language
  • Prototype: http://molto-patents.ontotext.com/

Patents Retrieval System

Overview

  • 4600+ bio-medical patents semantically annotated by Ontotext
  • GATE 6 for the annotation pipeline
  • Queries in 3 natural languages (EN, FR, DE) by UGOT
  • GF 3.4 for natural language queries
  • Statistical machine translation of patents by UPC
  • Forest 1.4.1 for UI, generic RDF browser
  • OWLIM 5.3 as semantic repository
  • Apache Solr for free text search and document snippeting
  • molto-core for autocomplete (FSA)

Patents Use Case

Retrieval System Architecture

Patents Retrieval System

Latest Improvements on the Prototype

  • Improvements on the query language
    • removed query ambiguities
    • improved ontology coverage
  • GF generation of SPARQL
  • New annotations on patents metadata
  • Free text search
  • Document snippeting improvements
  • Speed optimizations
  • Statistical machine translation of semantic annotation texts
  • Statistical machine translation of the drug lexicons (coming soon)
  • UI improvements

Patents Retrieval System

Query Language. SPARQL Generation

  • About 50 different queries over 8 semantically annotated types
  • Concepts: Drug, Active ingredient, Dosage Form, Route of Administration, TE Code, Market, Applicant, Application Number, Patent Number, Publication Date, Application Date, etc.
  • Lexicons for the 8 types make it possible to ask over 20 million different questions (most of them instances of the "patents that mention X and Y" query)
  • GF-to-SPARQL example ("patents that mention X and Y")

    Abstract syntax: 
        QShowConceptXY : Concept -> Concept -> Query ; 
    Concrete NL syntax (query interface):
        QShowConceptXY c1 c2 = mentionP (mkNP and_Conj c1 c2) ; 
    Concrete SPARQL syntax:
        QShowConceptXY c1 c2 =  
            {s = "PREFIX pkm: <http://proton.semanticweb.org/protonkm#> $n  
                PREFIX psys: <http://proton.semanticweb.org/protonsys#> $n  
                CONSTRUCT { $n ?doc pkm:mentions ?x . $n  ?doc pkm:mentions ?y } $n 
                WHERE { $n  ?x psys:mainLabel " ++ c1.s ++". $n    
                    ?doc pkm:mentions ?x . $n ?y psys:mainLabel " ++ c2.s ++". $n     
                    ?doc pkm:mentions ?y . $n  }  " } ;
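
For readers who want to see the resulting query text, here is a minimal Python sketch that builds the same CONSTRUCT query from two concept labels. It is not the GF linearization itself. It assumes that `$n` in the GF rule marks a line break and that each label is inserted as a quoted string literal; the function name `mentions_query` and the labels "aspirin" and "ibuprofen" are hypothetical.

```python
def mentions_query(label_x: str, label_y: str) -> str:
    """Build the "patents that mention X and Y" query for two labels."""

    def literal(label: str) -> str:
        # Quote the label as a SPARQL string literal (assumed plain string).
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    lines = [
        "PREFIX pkm: <http://proton.semanticweb.org/protonkm#>",
        "PREFIX psys: <http://proton.semanticweb.org/protonsys#>",
        "CONSTRUCT {",
        "  ?doc pkm:mentions ?x .",
        "  ?doc pkm:mentions ?y",
        "}",
        "WHERE {",
        f"  ?x psys:mainLabel {literal(label_x)} .",
        "  ?doc pkm:mentions ?x .",
        f"  ?y psys:mainLabel {literal(label_y)} .",
        "  ?doc pkm:mentions ?y .",
        "}",
    ]
    return "\n".join(lines)


query = mentions_query("aspirin", "ibuprofen")
print(query)
```

Compared with this sketch, the GF version has the advantage that the natural-language query and the SPARQL query are linearizations of the same abstract tree, so the two cannot get out of step.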
    

Patents Retrieval System

Results

  • Translated patents
  • RDF graph with followable URIs of entities

Patents Retrieval System

Translated documents with Semantic annotations

  • Semantic annotations (now translated into all languages)
  • Patents translations

Demonstration - Patents Use Case (WP7)

Questions?

Thanks for Your Attention!

MOLTO Publications

  • A. Ranta. Implementing Programming Languages. An Introduction to Compilers and Interpreters, with an appendix coauthored by Markus Forsberg, College Publications, London, 2012
  • The Patents Retrieval Prototype in the MOLTO project, Milen Chechev, Meritxell Gonzàlez, Lluís Màrquez, Cristina España-Bonet. World Wide Web Conference 2012. 16th-20th April 2012, Lyon, France
  • CNLs for multilingual queries in MOLTO, Olga Caprotti, Milen Chechev, Ramona Enache, Meritxell Gonzàlez, Aarne Ranta, Jordi Saludes. Third Workshop on Controlled Natural Language (CNL 2012). 29–31 August 2012, Zurich, Switzerland
  • MT Techniques in a Retrieval System of Semantically Enriched Patents (submitted), Meritxell Gonzàlez, Maria Mateva, Ramona Enache, Cristina España-Bonet, Lluís Màrquez. Machine Translation Summit XIV (MTSummit 2013), 2-6 September 2013, Nice, France

Reports

  • WP7 Semantic Infrastructure & Prototype Building. Milen Chechev. Ontotext AD, June 2011, Bulgaria
  • MOLTO-Patents: recent issues, solutions and perspectives. Laura Tolosi, Maria Mateva, Ontotext AD, September 2012, Bulgaria
