WP7 Case study: patents - M39
Summary of progress
The aim of "WP7: Patents Case Study" was to create a prototype for automatic translation and multilingual retrieval of patents. The online prototype is publicly available at http://molto-patents.ontotext.com/.
This case study laid the groundwork for combining several technologies into a useful platform for multilingual patent retrieval. The main challenges addressed in the prototype are a) to translate semantically enriched patent documents, including the original mark-up, b) to design the mechanisms that enable multilingual indexing and retrieval of the patents, c) to define and develop a query language and query grammar that enable user-friendly interaction with the system, and d) to set up an online application for retrieving patent documents that serves as a testbed for our work.
The patents prototype combines semantic annotations, retrieval techniques and two different approaches to machine translation. Integrating different translation methodologies into the system has been crucial to increasing its capabilities and to enabling extended features and functionalities with respect to the preliminary version of the system.
For the massive translation of text, a statistical system has been trained and adapted to translate the text and transfer the semantic annotations into the target languages. One of the challenges in this task was to devise a mechanism for carrying the semantics of the source texts over to the target files. As a result, the patent documents are semantically enriched and translated using the statistical system. The multilingual documents are then used to feed the databases and indexes of the retrieval system. What remains as a future challenge is to use these annotations to further improve either the accuracy of the annotations or the quality of the translations.
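As an illustration of the mark-up transfer problem, the following minimal sketch translates only the text between tags and copies the tags through unchanged. It is not the mechanism used in the prototype: the function name `translate_preserving_markup` and the `translate` callable are hypothetical stand-ins for a real translation system, and translating each text segment in isolation loses context across tags, which a production system would need to handle.

```python
import re
from typing import Callable

# Matches a single mark-up tag such as <claim-text> or </b>.
TAG = re.compile(r"(<[^>]+>)")


def translate_preserving_markup(text: str, translate: Callable[[str], str]) -> str:
    """Translate the text segments of `text`, leaving tags untouched.

    `translate` is a hypothetical stand-in for a call to an MT system.
    """
    pieces = []
    # re.split with a capturing group keeps the tags in the result list.
    for part in TAG.split(text):
        if TAG.fullmatch(part) or not part.strip():
            pieces.append(part)              # tag or whitespace: copy as-is
        else:
            pieces.append(translate(part))   # plain text: send to the translator
    return "".join(pieces)


if __name__ == "__main__":
    # str.upper plays the role of the translator so the effect is easy to see.
    source = "A <claim-text>compound of <b>formula</b> I</claim-text>."
    print(translate_preserving_markup(source, str.upper))
```

In this sketch the tag inventory and their positions relative to the text segments are preserved by construction, at the cost of translating shorter fragments.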
In addition, a rule-based system was built to translate from (controlled) natural language to the semantic query language (SPARQL) in the interface. GF (the Grammatical Framework) has proved an efficient way of generating the SPARQL queries, treating SPARQL as if it were “Yet Another Query Language”. In other words, it translates a natural language query from the user’s language into SPARQL, which makes the system accessible to a broader community rather than just skilled users. This automation also facilitates the interoperability between the query grammar and the ontologies, and speeds up the development and maintenance of the querying subsystem.
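The idea of treating SPARQL as one more output language can be sketched as follows. This is a toy imitation in Python, not GF and not the prototype's query grammar: a query is parsed into a language-independent structure, which is then rendered either as natural language or as SPARQL. The single query pattern, the class `PatentsMentioning`, and the `ex:` prefix with its `Patent`, `mentions` and `label` terms are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PatentsMentioning:
    """Language-independent query: all patents that mention a given term."""
    term: str


def parse_english(query: str) -> PatentsMentioning:
    """Parse the one controlled-English pattern this sketch supports."""
    m = re.fullmatch(r"\s*show patents that mention (.+?)\s*", query,
                     flags=re.IGNORECASE)
    if m is None:
        raise ValueError(f"query not covered by the toy grammar: {query!r}")
    return PatentsMentioning(m.group(1))


def to_french(q: PatentsMentioning) -> str:
    """Render the query in French, as one possible output language."""
    return f"montrer les brevets qui mentionnent {q.term}"


def to_sparql(q: PatentsMentioning) -> str:
    """Render the same query as SPARQL, using a hypothetical ontology."""
    # Escape backslashes and quotes so the term is a valid string literal.
    term = q.term.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX ex: <http://example.org/patents#>\n"
        "SELECT ?patent WHERE {\n"
        "  ?patent a ex:Patent ;\n"
        "          ex:mentions ?entity .\n"
        f'  ?entity ex:label "{term}" .\n'
        "}"
    )


if __name__ == "__main__":
    q = parse_english("Show patents that mention aspirin")
    print(to_french(q))
    print(to_sparql(q))
```

Because both renderings start from the same structure, adding a user language or changing the ontology vocabulary only touches one rendering function, which is the maintenance benefit the paragraph above describes.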
Finally, the patents prototype is not directly comparable with the interfaces offered by the European Patent Office, because they were conceived for different purposes. Nonetheless, the MOLTO patents prototype demonstrates that a patent retrieval system that addresses multilingualism by means of automatic translation techniques is commercially viable.
Highlights
The preliminary version of the prototype, described in Deliverable 7.1, contained only original patent documents in its databases, and the system was available only in English and French.
A complete version of the prototype, described in Deliverable 7.2, also included resources for German, as well as patent documents translated with the Statistical Machine Translation (SMT) system trained on the domain and described in Deliverable 5.2.
The new features introduced with respect to previous versions of the prototype are:
1. A new process for statistical translation of patents that transfers the semantic annotations and the original mark-up of the source documents to the target language.
2. The development of the patent translator API, which makes it possible to integrate the translation system into remote applications, such as online patent translation in the GF cloud.
3. Updates to the retrieval architecture that improve the response time, such as snippeting.
4. A new querying approach for SPARQL generation based on the automation of grammar-ontology interoperability, driven by the Grammatical Framework.
5. A new query grammar for the biomedical patents domain, improved in terms of coverage and compliance with the patent domain ontology behind the information retrieval system.
6. New functionalities in the user interface that improve the usability of the application, such as free-text search as a back-off mechanism for the query language.
7. Some updates to the online user interface that address usability aspects and further functionalities.
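The free-text back-off mentioned in the list above can be sketched as follows: the query is first offered to the query-language parser, and only if it is not covered does the system fall back to a plain keyword search. The documents, identifiers, query pattern and function names below are toy assumptions, not the prototype's actual data or API.

```python
import re
from typing import Dict, List, Optional, Tuple

# Toy document collection with made-up identifiers.
DOCS: Dict[str, str] = {
    "EP-1": "Pharmaceutical composition comprising aspirin",
    "EP-2": "Method for producing a vaccine",
}


def parse_query(query: str) -> Optional[str]:
    """Return the search term if the query fits the toy query language."""
    m = re.fullmatch(r"patents that mention (\w+)", query.strip(),
                     flags=re.IGNORECASE)
    return m.group(1).lower() if m else None


def structured_search(term: str) -> List[str]:
    """Stand-in for a query against the semantic index: exact word match."""
    return sorted(doc_id for doc_id, text in DOCS.items()
                  if term in re.findall(r"\w+", text.lower()))


def free_text_search(query: str) -> List[str]:
    """Back-off: documents containing every word of the query."""
    words = re.findall(r"\w+", query.lower())
    return sorted(doc_id for doc_id, text in DOCS.items()
                  if words and all(w in text.lower() for w in words))


def search(query: str) -> Tuple[str, List[str]]:
    """Try the query language first and fall back to free-text search."""
    term = parse_query(query)
    if term is not None:
        return "structured", structured_search(term)
    return "free-text", free_text_search(query)


if __name__ == "__main__":
    print(search("patents that mention aspirin"))
    print(search("vaccine method"))
```

Returning the mode alongside the results lets an interface tell the user which kind of search was actually run.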
Deviations from Annex I
The main objectives of the work package have been fulfilled:
i) create a commercially viable prototype of a system for MT and retrieval of patents in the biomedical and pharmaceutical domains,
ii) allow translation of patent abstracts and claims in at least 3 languages, and
iii) expose several cross-language retrieval paradigms on top of them.
This work package started six months late because the WP leader, Matrixware, left the MOLTO Consortium during Month 3.
After the rescheduling, the tasks related to this work package were kept up to date according to the calendar.
It was agreed to delay the final version of the prototype until M36 due to multiple dependencies on other work packages.
The new calendar made it possible to incorporate the latest developments (grammar and ontology interoperability from WP4 and hybrid translation from WP5) into the final demoed applications.