WP7 Case study: patents - M30
Summary of progress
Due to the incremental development of the prototype, most of the tasks have extended until M30, when the final prototype is being completed.
The next lines describe the progress of the following tasks:
- 7.2 Patent Corpora
- 7.3 Grammars for the patent domain
- 7.4 Ontologies and Document Indexation
- 7.5 Patents Retrieval System
- 7.6 Machine Translation Systems
- 7.7 User Interface
In relation to Task 7.2, the patents downloaded from the EPO website have been automatically translated and semantically annotated. The complete collection of files is available in the MOLTO repository and consists of 1) the original patent documents, 2) the English version of the patent documents with semantic annotations, and 3) the automatic translations of claims, abstracts and descriptions. These documents constitute the main content of the retrieval databases.
As for Task 7.4 and Task 7.5, the ontologies, indexes and databases have been updated with the new dataset of documents.
Regarding Task 7.6, we designed a process for patent translation that builds a translated document with the same XML structure as the original patent. As a result, the interface of the prototype can show the translated patents using the same user-friendly view as for the original ones. The translation of the documents is a pipeline of five steps. First, the patent files are preprocessed in order to extract the text contained in the sections in a structured manner (step 1). Then, the formatting marks inline with the text are replaced by placeholders (step 2). Next, the resulting text is segmented and tokenized as required by the translation system (step 3). The raw text is then translated using the SMT system (step 4). Finally, the translated text is post-processed in order to recover the original structure of the document (step 5), including the original formatting, claims enumeration and images.
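The five steps above can be illustrated with a minimal Python sketch. It is not the MOLTO implementation: the function names, the `__PHn__` placeholder format, the naive sentence splitter and the toy XML layout are all assumptions made for illustration, and the SMT system of step 4 is represented by a callable passed in by the caller. Claims enumeration and images are not handled here.

```python
import re
import xml.etree.ElementTree as ET

# All names and the placeholder format below are hypothetical.
TAG_RE = re.compile(r"<[^>]+>")                   # inline formatting marks
TOKEN_RE = re.compile(r"__PH\d+__|\w+|[^\w\s]")   # placeholder, word, or punctuation
PH_RE = re.compile(r"__PH(\d+)__")


def extract_sections(xml_text):
    """Step 1: return (tag, inner markup) for each section under the root."""
    root = ET.fromstring(xml_text)
    sections = []
    for section in root:
        # ET.tostring includes each child's tail text, so nothing is lost.
        inner = (section.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in section
        )
        sections.append((section.tag, inner))
    return sections


def mask_markup(text):
    """Step 2: replace each inline tag with a numbered placeholder."""
    tags = []

    def repl(match):
        tags.append(match.group(0))
        return f" __PH{len(tags) - 1}__ "

    return TAG_RE.sub(repl, text), tags


def segment_and_tokenize(text):
    """Step 3: naive sentence split, then word and punctuation tokens."""
    sentences = re.split(r"(?<=[.;])\s+", text.strip())
    return [TOKEN_RE.findall(s) for s in sentences if s]


def restore(sentences, tags):
    """Step 5: detokenize and put the original tags back."""
    text = " ".join(" ".join(tokens) for tokens in sentences)
    text = PH_RE.sub(lambda m: tags[int(m.group(1))], text)
    text = re.sub(r"(<[^/][^>]*>) ", r"\1", text)  # no space after an opening tag
    text = re.sub(r" (</[^>]+>)", r"\1", text)     # no space before a closing tag
    return re.sub(r" ([.,;:])", r"\1", text)       # no space before punctuation


def translate_patent(xml_text, translate):
    """Run steps 1-5; `translate` maps a token list to a token list (step 4)."""
    root_tag = ET.fromstring(xml_text).tag
    parts = []
    for tag, inner in extract_sections(xml_text):
        masked, tags = mask_markup(inner)
        sentences = segment_and_tokenize(masked)
        translated = [translate(tokens) for tokens in sentences]
        parts.append(f"<{tag}>{restore(translated, tags)}</{tag}>")
    return f"<{root_tag}>{''.join(parts)}</{root_tag}>"


def fake_translate(tokens):
    """Stand-in for the SMT system: upper-cases words, keeps placeholders."""
    return [t if t.startswith("__PH") else t.upper() for t in tokens]


source = (
    "<patent><abstract>A <b>rotor</b> blade.</abstract>"
    "<claim>The blade of claim 1; it is <i>curved</i>.</claim></patent>"
)
result = translate_patent(source, fake_translate)
print(result)
```

Because the placeholders pass through the translation step as single tokens, the output keeps the same element structure as the input, which is what lets an interface reuse one view for original and translated documents. A real system would also need to preserve element attributes and handle placeholders that the translator reorders or drops.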
Regarding Task 7.3, the query grammars have been refactored using the set of primitives defined in the Query Library developed in WP4. Consequently, the English and French versions of the patent query grammar were adapted to the new structure, and the German version has been developed from scratch. The new grammar is equivalent to the old one; the difference is that it relies on the primitive query-building functions defined in the Query Library. Developing a grammar with the Query Library requires less linguistic knowledge: the main work is selecting the set of primitives suited to the task. Compared with the previous patent query grammar, the new one has fewer constructions because it is built on top of the Query Library. As a consequence, the constructions are also more natural and the number of malformed constructions has decreased considerably. The current grammar consists of 31 patterns and is able to parse/generate 359 query constructions in English, 111 in French and 147 in German.
Finally, regarding Task 7.7, the interface has been updated with the German version of the query grammars. In addition, some basic tests have been carried out at two levels in order to assess the functionality of the prototype. First, several usability deficiencies of the interface have been corrected, namely in the examples on the main page, the language selection and the visualization of the results in French and German. Second, we studied the underlying logic of the queries and the expected results, so that the system now returns more appropriate and accurate results.
Deliverable 7.2 gives a detailed description of the modules and their functionalities.
Highlights
- A fully functional version of the final prototype is available by M30.
- The demo allows for querying the system in English, French and German.
- The patents in the database have original text in English, French and German, together with the translated documents.
- A fully completed pipeline for patent document translation is in place.
- The new query library and its application to the patents use case were presented at the Third Workshop on Controlled Natural Language (CNL 2012), held in Zurich at the end of August. http://attempto.ifi.uzh.ch/site/cnl2012/
Deviations from Annex I
In general, we are achieving the objectives related to WP7. However, Deliverable 7.2, planned for M27, has been delayed to M30 due to several issues related to the gathering of the corpora, the pre- and post-processing of the documents and the integration of the new query library. We also carried out several basic tests in order to assess the behavior of the prototype in terms of query results and user interaction; these revealed several deficiencies, which have been corrected. Since D7.2 has been postponed, D7.3 is delayed accordingly from M33 to M36.