WP7 Case study: patents - M24

Summary of progress

During this period, WP7 has taken a step forward in the development of the prototype: the Patent MT and Retrieval Beta Prototype was first released in M21, and the final version of D7.1 has been delivered.

Due to the incremental development of the prototype, most of the tasks run until M27, when the final prototype must be delivered. The following paragraphs describe the progress of these tasks:

  • 7.2 Patent Corpora
  • 7.3 Grammars for the patent domain
  • 7.4 Ontologies and Document Indexation
  • 7.5 Patents Retrieval System
  • 7.6 Machine Translation Systems
  • 7.7 Prototype (User Interface)

In relation to Task 7.2, the EPO provided a parallel corpus of patents, of which only 66 patents belong to the biomedical domain. We downloaded an alternative corpus of 7,705 documents directly from their website (i.e. publicly available). The content of these documents can be summarized as follows: 4,274 of the 7,705 documents have claims (6M lines). Of these, 2,058 are trilingual (3M lines), 2,116 have claims written only in English, 66 have claims only in German (260K lines), and 34 only in French (88K lines). No documents have any other combination of languages.
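The breakdown above can be checked with a few lines of Python. This is only a sketch for verifying the arithmetic: the counts are those reported in the paragraph, while the variable names and dictionary keys are illustrative labels, not identifiers from the project.

```python
# Document counts as reported in the paragraph above.
total_documents = 7705
with_claims = 4274

# Language coverage of the claims; the keys are illustrative labels only.
claims_by_language = {
    "trilingual": 2058,
    "english_only": 2116,
    "german_only": 66,
    "french_only": 34,
}

# The four groups should account for every document that has claims,
# since no other language combination occurs in the corpus.
covered = sum(claims_by_language.values())
assert covered == with_claims

# Derived figures that the report does not state explicitly.
without_claims = total_documents - with_claims
trilingual_share = claims_by_language["trilingual"] / with_claims

print(f"documents without claims: {without_claims}")
print(f"trilingual share of documents with claims: {trilingual_share:.1%}")
```

The four language groups add up exactly to the 4,274 documents with claims, which is consistent with the statement that no other language combination occurs.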

Regarding Task 7.4 and Task 7.5, the ontologies, indexes, databases and retrieval engines have been set up for the specific domain, using the patent documents described above. The semantic annotation process is carried out by a GATE pipeline on the English texts. We are working on exporting the annotations during the translation process so that they can also be shown in the French and German texts.

As for Task 7.3 and Task 7.6, grammar development and the adaptation of SMT to the domain are being carried out jointly with the WP5 tasks. Grammars have been developed for English and French, and will subsequently be developed for German.

Finally, regarding Task 7.7, the interface provides three different ways to access the system: the controlled language, SPARQL, and terms in the index. In the future we will add free-text queries and a combination of free text with the controlled language.

Highlights

A fully functional version of the prototype has been available at http://molto-patents.ontotext.com/ since M21. The demo allows querying the system in English and French. The patents in the database have original text in English, French and German.

The retrieval system can be queried in three different ways. The NL-based interface allows the user to query the system in English and French using written natural language. The SPARQL interface, more suitable for advanced users, allows precise browsing of the repository using SPARQL queries. The keyword-based visual browsing interface uses the RelFinder tool, in which the user can search for keywords using the autocomplete functionality. The results of the RelFinder search are visualised as graphs.
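As an illustration of the SPARQL access route, the following Python sketch builds a simple keyword query and the corresponding request URL. Everything specific in it is an assumption made for the example: the `/sparql` endpoint path, the use of `rdfs:label` on the indexed resources, and the sample keyword are not taken from the prototype, and the repository may expose a different schema. No request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical endpoint path; the report only gives the demo's base URL.
ENDPOINT = "http://molto-patents.ontotext.com/sparql"


def build_query(keyword: str, limit: int = 10) -> str:
    """Build a SELECT query for resources whose label contains the keyword.

    Assumes (hypothetically) that resources carry an rdfs:label.
    """
    # Escape backslashes and double quotes for the SPARQL string literal.
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?resource ?label WHERE {\n"
        "  ?resource rdfs:label ?label .\n"
        f'  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{escaped}")))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(query: str) -> str:
    """Encode the query as a GET request, as in the SPARQL protocol."""
    return ENDPOINT + "?" + urlencode({"query": query})


query = build_query("aspirin", limit=5)  # the keyword is only an example
print(query)
print(request_url(query))
```

A query of this kind returns matching resources and their labels. The actual class and property names would have to be taken from the ontologies loaded in the repository.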

The results view displays the list of classes from the ontologies that match the query and the list of patent documents indexed under the matching criteria. The interface also provides links to the semantically annotated documents and to the original patents. The view of an annotated document highlights the words in the text that are related to a semantic item, with colors assigned according to the type of semantic annotation. The right side of the page lists the semantic types present in the text and their colors.

A paper about the patent retrieval system was accepted at the WWW2012 conference, to be held in April.

Use of resources for Period 2

Node      Professor/Manager  PostDoc                       PhD Student    Research Engineer/Intern
UGOT      0                  0                             3 (R. Enache)  3 (A. Slaski)
UPC       0                  7.5 (M. Gonzales, C. España)  0              0
UHEL      0                  0                             0
OntoText  0                  0                             0              8.8 (M. Chechev, M. Damova, V. Zhikov, I. Kabakov)

Deviations from Annex I

In general, we are achieving the objectives of the WP7 tasks within the timeframe. However, due to several issues related to the gathering of the corpora, the databases of the retrieval system do not yet include automatic translations of the patent documents, only real translations. This issue directly affects the annotation process of Task 7.5, but it does not imply a delay for the prototype as a whole. We expect the automatic translations and annotations to be included in the final prototype.