The work will start with the provision of user requirements (WP9) and the preparation of a parallel patent corpus (EPO) to fuel the training of statistical MT (UPC). In parallel, UGOT will work on grammars covering the domain and subsequently, together with UPC, apply hybrid MT (WP2, WP5) to abstracts and claims.
Ontotext will provide a semantic infrastructure loaded with existing structured data sets (WP4) from the patent domain (IPC, patent ontology, bio-medical and pharmaceutical knowledge bases, e.g. LLD). Based on the use case requirements, Ontotext will build a prototype (D7.1, D7.2) exposing multiple cross-lingual retrieval paradigms and MT of patent sections.
The accuracy will be regularly evaluated through both automatic (e.g. BLEU scoring) and human-based (e.g. TAUS) means (WP9).
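As an illustration of the automatic evaluation mentioned above, the following is a minimal sketch of a BLEU-style score. It assumes a single reference translation, whitespace tokenisation, uniform weights over 1- to 4-grams and no smoothing; the function names `ngram_counts` and `sentence_bleu` are hypothetical and are not part of the project tooling, which may well rely on an established scoring implementation instead.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Single-reference, unsmoothed BLEU for one sentence pair (illustrative only)."""
    cand = candidate.split()  # assumption: plain whitespace tokenisation
    ref = reference.split()
    if not cand:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: a candidate n-gram is credited at most as often
        # as it occurs in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any empty order zeroes the score
        log_precision_sum += math.log(matched / total)

    # Brevity penalty: penalise candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    hypothesis = "the claim describes a device"
    reference = "the claim describes a new device"
    print(round(sentence_bleu(hypothesis, reference), 4))
```

In practice BLEU is usually reported at corpus level, aggregating n-gram counts over all segments before taking the precisions; the sentence-level form here is only meant to show the clipped precision and brevity penalty that make up the metric.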
Task List
The work package is split into 9 major tasks as follows:
Task 7.1 User Requirements and Scenarios (Task Lead: UPC)
Task 7.2 Patent corpora (Task Lead: UPC)
Task 7.3 Grammars for the patent domain (Task Lead: UGOT)
Task 7.4 Ontologies and document indexation (Task Lead: Ontotext)
Task 7.5 Prototype (Task Lead: Ontotext)
Task 7.6 SMT and Hybrid MT (Task Lead: UPC)
Task 7.7 Prototype (user interface) (Task Lead: Ontotext)
Task 7.8 Human evaluation (Task Lead: TBD)
Task 7.9 Patent Case Study: Final Report (Task Lead: UPC)
Month 10-15 plan
Task 7.2 starts in M10 and is due to provide a first set of corpora at the end of M16. The final revision depends on the availability of the EPO data.
Task 7.3 starts in M10 and is due to provide a preliminary report at the end of M16.
Month 16-21 plan
Task 7.1 starts at M15 and is due to provide a preliminary version at the beginning of M17.
Task 7.3 will produce a more complete report by the beginning of M19.
Task 7.4 starts at M16 and is due to provide a description of the type of queries at the end of M16.
Task 7.5 starts at M16 and is due to provide a description of the Prototype architecture at the end of M16.
Task 7.6 starts along with WP5 and will produce an SMT baseline for the patents prototype.
The patents case study comprises two basic scenarios: online patent retrieval and patent
translation. In this prototype we tackle these two scenarios separately, as shown
in Figure 1, even though they can be viewed as a single multilingual patent retrieval
paradigm. In the future, we plan to study how to automate the reciprocal inputs between
the two processes, i.e. the annotation of translations and the translation of semantically
annotated documents.
From a general perspective, two user roles may be defined in this case study: end-users
looking for information related to patents, and editors adding new patent documents
to a hypothetical repository.