3.5 Ontotext Exploitation Plans

Introduction

Ontotext’s business model combines the development of products (including some open source versions) with the provision of research, consultancy and development services. Many commercial projects combine all four elements. For Ontotext MOLTO will bring the unique opportunity to strengthen its position in the semantic technologies and knowledge-driven text analytics market, with development and adoption of intelligence methods that support ontology-based multilinguality. This will be possible due to the fact that MOLTO adds to the semantic technologies the GF formalism, which operates as an interlingua on language level and thus, localizes the ontologies in appropriate ways. More precisely, the main directions of future development will be as follows:

  • Interoperability and grounding in Linked Open Data resources and domain ontologies
  • High throughput, multilingual text processing
  • Robust cross-lingual translation of various domain data within search and retrieval services

The business strategies will be as follows:

  • The company will put its technology (in this case, the KIM Semantic Annotation Platform, http://www.ontotext.com/kim; and Publishing tools) into a stronger multilingual context.
  • OntoText will strengthen its synergies between the semantic and world-knowledge infrastructure modules, and the MT services.
  • The task of Combining the GF model and the ontology standards would test and enrich the reasoning platform, developed and maintained at Ontotext.

Company Profile

Ontotext AD is the strongest semantic technologies company in Europe and a world-leading supplier of core semantic technology, text mining and web mining solutions.

  • Established in year 2000, today Ontotext has over 65 employees located in Bulgaria (Sofia and Varna), USA (Fairfield, CT) and UK (London);
  • After acquiring VC funding in 2008, at the end of 2010 Ontotext reached break-even and since then doubles its commercial revenues annually.
  • We are global leader in semantic database engines, successfully competing with ORACLE, IBM, and Microsoft in this field;
  • Our unique competences are backed by heavy investment in R&D – over the last 12 years we have invested more than 300 person-years in semantic technology. We know what works and what does not!

We have unmatched portfolio of world-class technology and expertise in:

  • Semantic Databases: high-performance RDF DBMS, scalable reasoning;
  • Semantic Search: text-mining (IE), Information Retrieval (IR) ;
  • Web Mining: focused crawling, screen scraping, data fusion;
  • Linked Data Management and Data Integration.

The main differentiator between Ontotext and other semantic technology vendors is that we deliver robust technology, proven in multiple high-profile projects that justify its maturity and usability. The best example in this direction is the usage of OWLIM (our RDF database engine) in the BBC FIFA World Cup 2010 website where most of the pages were generated dynamically through queries to OWLIM – millions of requests per day, hundreds of updates per hour, handled by a cluster of few servers. Following the success of this project, BBC extended the use of Ontotext technology for the BBC Sport website and for the London Olympics 2012 website.

Ontotext’s clients span across several sectors:

  • Pharma: AstraZeneca, UK, and UCB, Belgium;
  • Media and publishing: BBC and Press Association, UK; and Publicis, Germany;
  • Telecommunications: Telecom Italia and Korea Telecom;
  • Archives and cultural heritage: The National Archive, UK, the British Museum, the Dutch Public Library;
  • Government: Department of Deffence, USA; and House of Commons and TNA, UK; Natural Resource Canada.

Considering the substantial number of clients of Ontotext in UK, we are running in London regular open training courses “Semantic Technologies with OWLIM”, usually scheduled at roughly once per quarter.

Products Relevant to Opportunity

  • OWLIM is an industrial-scale semantic database, using Semantic Web standards for inference and integration/consolidation of heterogeneous data.
  • KIM Platform is a semantic search engine, using text analysis to provide hybrid queries involving structured data and inference.
  • FactForge is a public service that represents a reason-able view to the web of data.
  • Linked Life Data is a platform for semantic data integration trough RDF warehousing and efficient reasoning that helps to resolve conflicts in the data.
  • Publishing platform – semantic publishing platform, ingesting and enriching thousands of news with linked data daily; enables the publishers and third-parties to explore innovative business models and alternative revenue streams.

Research Transfer Process

On the one hand, the research goes into products through the traditional ways:

  • creating prototypes in use case domains, and then
  • scaling these prototypes into systems for real usage.

On the other hand, the developed technology within the project is applicable to other related areas of NLP services application. It can be either used as stand-alone applications, or be integrated into larger and more complex architectures. Both business opportunities have significant added value.

The first direction is exemplified by the envisaged use case domains: Patents in medical domain and artefacts in Cultural heritage domain. The second direction goes to areas that apply strongly Question Answering, Information Retrieval and MT. Such areas are: Publishing, Social Media and Pharma. The related products are highly commercial and thus, precision and relevance of the retrieved information are crucial features for the clients. GF formalism would be useful for the smoothing of the multilingual retrieval and translation results. It must be noted that the component shared by all targeted products of Ontotext is the ontology-based knowledge that relates to LOD and multilingual settings.

All the EU research projects that Ontotext has been involved in, have lead to the improvement of the current technology as well as to the creation of new products, that have been explored in commercial projects. In this way, we might view the Research as an Investigation, Preparation and Compilation phase, while the applications in Industry – as Adaptation, Harmonization and Real Setting evaluation phase. Below some synergies of the aforementioned kind are given:

  1. RENDER is an ongoing project that aims at providing a comprehensive conceptual framework and technological infrastructure for enabling, supporting, managing and exploiting information diversity in Web-based environments. It also would leverage very large amounts of content and metadata: news, blog and microblog streams, content and logs from Wikipedia, news archives, multimedia content and reader comments, discussion forums, etc. This data is managed by a highly scalable data management infrastructure, and enriched with machine-understandable descriptions and links referring to the Linked Open Data Cloud. This development would lead OWLIM and KIM technologies to handle diverse data, which would widen their data coverage and management.

  2. CUBIST is an ongoing project that aims at Combining and Uniting Business Intelligence and Semantic Technologies with a special focus on unstructured data mining. Being central to the project goals, the semantic technology supports a persistent layer – a semantic Data Warehouse. The project adds to the better Visual Analytics, whose improved characteristics would be important for providing more competitive user interfaces in industry.

  3. MediaCompaign is innovative in Ontology creation for cross-media modelling of media presence and campaigns; Semantic cross-market product data interlinking; Identification and tracking of new media campaigns in different media and countries. MediaCompaign focused mainly on advertisement campaigns and their impact on attitudes and opinions. Thus, the publishing services, provided by Ontotext, will be enriched with sentiment analysis additionally to the knowledge-based analysis. Thus, Ontotext will have a social-aware service.

  4. NoTube project concentrated on personalized semantic news; personalized TV guide with adaptive advertising as well as Internet TV in the Social Web. It relied on the key role of the semantic technologies, taking into account the community aspects and is built on multilinguality. The results strengthened the personalized component in the retrieved information in commercial publishing platforms.

  5. PHEME project (will start in October 2013) has as its main goal the development of scalable methods for Social Semantic Intelligence, across media and languages. I also aims at modeling not only facts and opinions, but also the parameters of reliability of the information sources. Additionally, PHEME focuses on more concrete and socially crucial cases in recent years, such as crowdsourcing, citizen journalism and bioinformatics. PHEME goes beyond official media campaigns - to social network dimensions and beyond the opinions – to rumour and misinformation detection. This project will lead to a large-scale social media bound OWLIM and KIM platforms. Also, it will add to its services the identification of misinformation, which would be very valuable facility in the personalized component for the end users.

In all productizing areas, listed below, the following underlying NLP technology is assumed:

  • Multilingual semantics based question answering
  • Cross-language retrieval
  • Public Translation Service (combining GF + ML)

Relevant Trends in business domain

With the globalization processes and harmonization of large groups of documents in EU, the requirements for particular data management systems is rapidly growing. Additionally, virtual space has become more populated, shared, explored and multilingual. For example, virtual tours in famous museums; virtual storage and access to EU legislation; interactive online digests; electronic government; digital preservation storages; social networks etc.

For these reasons, it is not surprising that some of the most active business domains at the moment are the Cultural heritage stakeholders (DARIAH, CLARIN, EUROPEANA); Pharma (Astra Zeneca); Media Publishing (BBC, NDP, Press Association) and Social Media (Pheme project).

Ontotext is involved in all of the aforementioned domains through research projects and commercial projects.

Ontotext : Productize MOLTO Technology in novel areas

Publishing platforms and Digital Journalism

The cross-media analytics is a typical case of business intelligence, developed at Ontotext. Ontotext’s technology covers preferably (but not only) publishing agencies (such as, Press Association, NDP, Oxford, etc.) and government data management (US government). From a language point of view, the company has been working systematically on commercial project for Dutch and English. However, lately, it started to expand the multilingual set to Bulgarian, German, Chinese, etc. Having in mind these facts, GF formalism as well as RDF-GF interoperability from MOLTO would be the natural extension of the information extractors, thus facilitating the interaction between the users’ queries and their machine processing. More precisely, the following extensions are envisaged: embedded translator service, tuned to the domain (sports, finance, politics, etc.); embedded converter from RDF representation to GF and then to language, and vice versa.

Also, internally, the semantic annotation tool will be augmented with language localization modules that would support the annotators.

Related markets: publishing; electronic government

SWOT Analysis:

  • Strengths: Improvement of the existing multilingual modules; creation of new functionalities to the customers, such as viewing the same result in various languages; improving the annotation process and text analytics; better communication between the ontology and user queries.

  • Weaknesses: Domain adaptation of the MOLTO modules might be needed, when addressing a new domain or even a subdomain of a specific domain.

  • Opportunities: There might be the possibility to create a publishing platform of new generation, which provides a typological core for many languages and thus – is easily adaptable to new languages. Additionally, to see Dutch news in English within the publishing system itself, for example, would extremely facilitate the customers.

  • Threats: The online real time applications might be unstable initially due to the complex architecture.

Social Media

In Ontotext projects the existing LOD resources (such as, Linked Life Data) are applied for different socially aware domains and across languages. For example, the entity extraction tool LUPedia as well as the linked data concept store FactForge will be used in enhancing the socially marked knowledge. These modules will be extended by the language generation tool from MOLTO in order to improve the accuracy of the extracted information. This step is manageable, since the MOLTO rule-based translation technology is extended with the help of statistical approaches.

Related markets: education, tourism

SWOT Analysis:

  • Strengths: MOLTO gives the possibility of applying a structured approach to unstructured data for the purposes of good understanding of big amounts of data.

  • Weaknesses: MOLTO might support better some forms of Social Media (publicly available), while some others (restricted) – not so well.

  • Opportunities: The social media might be viewed as a network of subdomains and addressed by MOLTO technology in a step-by-step way.

  • Threats: No visible possibility is foreseen at the moment for using MOLTO modules directly in sentiment and opinion analysis.

Pharma

Ontotext regularly participates in projects that consider health care and life sciences data management (there is Life Science project running now). Here the available domain ontologies are explored together with the NLP processing. GF will be extremely useful since both the prescriptive and diagnosis languages are controlled. There is an additional level of translation here, namely: from the specialized prescription and anamnesis language of doctors into the common natural language of the users.

Related markets: Medical producs sales, health care

SWOT Analysis:

  • Strengths: MOLTO is best performing in controlled and structured domains. Pharma is a good example of such a combination. In addition, there is an already working prototype on Patents in this domain.

  • Weaknesses: Pharma would be better manageable from doctors’ production point of view, rather than from patient perspective, since professional language is better controlled.

  • Opportunities: Improvement of multilingual search and relevance of the search results.

  • Threats: Pharma is one of the well elaborated domains from a processing point of view. Thus, the real added value of MOLTO is to be tested in the future.

Ontotext : Productize in the use case domains: Patents and Cultural Heritage

Patents

Part of the commercial projects, carried out within Ontotext, are connected to pharmacies. In this respect, the developed prototype in bio-medical and pharmaceutical domains will be employed directly in the workflow processes.

Related markets: Administration, government, science, businesses

SWOT Analysis:

  • Strengths: The usage of patents is a common and necessary activity in industry. Thus, the created structure model for handling patents in one specific domain would be applicable to patents in other domans, too. Retrieval services in a strongly cross-lingual context would be also very attractive features for exploitation.

  • Weaknesses: If the MOLTO modules are used in another domain of patents (for example science) some adaptation will be needed, although the patent structure itself would be stable beyond specific domains.

  • Opportunities: The usage of the patent service would facilitate and speed up the process of managing Pharma policies with respect with new development in healthcare.

  • Threats: The patent service might not cover all the query requirements of the users due to the limitations of the controlled language or the incompleteness of the corpus.

Cultural Heritage

There are many stakeholders in this area, since lately the related initiatives have grown considerably. Here we have in mind the specific ones: Europeana, British Museum, ConservationSpace and CLARIN.

Europeana already provides search facilities. However, they cover only metadata and are not connected to ontologies. Also, the translation from one language to another is done via machine translation(MT) only, without any grammatical formalism behind it. British museum is a partner which would try the MOLTO services, profiled specially for museum objects. They already use Ontotext’s semantic repository OWLIM. This service supports semantic search, semantic RDF data sources, Web Publication. MOLTO will add to the better search functionality as well as to the multilingual information extraction. Similar projects are: Gothenburg City Museum (Sweden); Polish Digital National Museum; Yale Center for British Art (USA): Linked Open Data publishing of museum collection. ConservationSpace project is managed by the National Gallery of Art (USA) and 7 other institutional partners from the USA, UK and Denmark. It handles the data management. MOLTO services might also contribute to the better preservation of the documents through adopting the GF formalism as a mediator between the users' queries and SPARQL queries. Similar projects are: FP7 CHARISMA: Synergy for a Multidisciplinary Approach to Conservation/Restoration; FP7 3D-COFORM: 3D documentation and collection formation of tangible cultural heritage; CLARIN is a pan-European initiative, which aims at elaborating also a globally shared service, among other services, for exploration of cultural artefacts. Ontotext is a participant in this initiative. It might provide the same facility to the consortium as in the above opportunity. Similar projects: FP7 V-MUST: Virtual Museum Transnational Network, a Network of Excellence.

Related markets: tourism, education

SWOT Analysis:

  • Strengths: coverage of many languages with language specific mediated filtering (through GF formalism); high precision of the retrieved content due to the controlled language; easy adaptability to other areas of cultural artefacts and languages.

  • Weaknesses: Since one of the use cases in MOLTO considers museums, the application to other subdomains of Cultural heritage might need adaptation of the grammars and ontologies.

  • Opportunities: The service can be adopted by various virtual cultural databases and adapted to them.

  • Threats: There might not be available resources for certain languages or language variants; language generation might not be efficient enough for all languages.