
WP13 - Enhancing patent data acquisition


The broad objective is to enhance the quality of patent-related information in the life sciences by extracting and/or enriching data from patents. Concretely, we aim to:

  • Develop expert tools to help patent applicants annotate biological sequences
  • Expand text mining tools to extract relevant chemical and literature information from patents to populate specialized, freely accessible databases

Task 1: Online sequence submission software

  • Within patents, biological sequences are to be presented in a structured manner (INSDC XML). Biological sequences disclosed in patents are frequently poorly annotated and need to be re-submitted, which creates a burden for the applicant, the EPO and the scientific community at large.
  • The aim is to integrate an online sequence submission tool and expert system into the EPO’s epoline environment. (EPO Online Services have been designed to allow applicants, attorneys and other users to conduct their business with the European Patent Office electronically in a state-of-the-art secure environment, protected by smart card or username/password access.) We will first provide detailed specifications and requirements (including integration into the EPO's production environment) and report those as D13.1. The web-based expert system will enable the submission of sequences to a dedicated secure server. Verification of the data will be interactive and immediate. EMBL’s verification criteria will apply as far as possible. The final product will be delivered after 36 months (D13.04).
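
To make the idea of interactive, immediate verification more concrete, here is a minimal sketch in Python of the kind of check a submission tool might run on a nucleotide sequence. It is an illustration only: the function name `check_sequence` and the two checks shown are assumptions, not EMBL's actual verification criteria. The symbol set is the standard IUPAC nucleotide alphabet.

```python
# IUPAC nucleotide codes (bases plus ambiguity symbols).
NUCLEOTIDE_CODES = set("ACGTURYSWKMBDHVN")


def check_sequence(residues, declared_length):
    """Return a list of human-readable problems; an empty list means
    the (illustrative) checks passed. Hypothetical helper, not EMBL's rules."""
    problems = []
    # Ignore whitespace and case so pasted, wrapped sequences are accepted.
    seq = "".join(residues.split()).upper()
    if not seq:
        problems.append("sequence is empty")
    if len(seq) != declared_length:
        problems.append(
            f"declared length {declared_length} but found {len(seq)} residues"
        )
    bad = sorted(set(seq) - NUCLEOTIDE_CODES)
    if bad:
        problems.append("unexpected symbols: " + ", ".join(bad))
    return problems


if __name__ == "__main__":
    print(check_sequence("acgt nnac", 8))
    print(check_sequence("ACGX", 5))
```

Returning a list of messages rather than raising on the first error lets a web form show the applicant every problem at once, which suits the interactive use described above.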

Task 2: Text mining, data extraction and database population

The second main development assignment will continue the text mining work initiated during Felics. Within Felics, the EPO’s ambitious aim was to extract the names of chemical compounds disclosed in patents. We will continue this work and expand it towards the extraction of chemical compounds disclosed as TIFF images. Resolved compounds will populate ChEBI. The extraction algorithms will mainly be enhanced by the following:

  • Improve OCR: enhance OCR output using an open-source tool like CAPTCHA, so that post-processing is improved. (For instance, IUPAC names are long and conform to a grammar, so OCR errors in them can be corrected.) We will access error probabilities/confidence values from within the OCR framework to make the processed data more accurate.
  • Further improve and customize OSCAR, software for chemical name extraction that was used within the Felics program.
  • Develop OSRA, a tool designed to convert graphical representations of chemical structures, as they appear in journal articles and patent documents, into machine-readable form. It needs algorithmic enhancements and better software packaging.
    For this task the three environments will be developed in parallel. Initial specifications and testing will be reported in D13.02. The final delivery will occur at month 36 (D13.05).
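
The OCR point above combines two ideas: per-character confidence values and the fact that chemical names follow a grammar. A minimal Python sketch of that combination follows. Everything in it is assumed for illustration: the tiny fragment vocabulary stands in for a real IUPAC grammar, the confusion table is hypothetical, and the input format (a list of character/confidence pairs) is not taken from any particular OCR framework.

```python
from itertools import product

# Hypothetical, tiny vocabulary of name fragments; a real system would
# use a proper grammar of chemical nomenclature.
FRAGMENTS = {"meth", "eth", "prop", "but", "an", "ol", "yl", "e", "2", "-", "chloro"}

# Hypothetical table of common OCR confusions.
CONFUSIONS = {"1": "l", "l": "1", "0": "o", "o": "0", "c": "e", "e": "c"}


def parses(name, fragments=FRAGMENTS):
    """Return True if `name` can be segmented entirely into known fragments."""
    reachable = [True] + [False] * len(name)
    for end in range(1, len(name) + 1):
        for start in range(end):
            if reachable[start] and name[start:end] in fragments:
                reachable[end] = True
                break
    return reachable[-1]


def correct(chars, threshold=0.8):
    """`chars` is a list of (character, confidence) pairs.
    Substitutions are tried only at low-confidence positions; the first
    candidate that parses is returned, otherwise the text is left unchanged."""
    options = []
    for ch, conf in chars:
        if conf < threshold and ch in CONFUSIONS:
            options.append((ch, CONFUSIONS[ch]))
        else:
            options.append((ch,))
    for candidate in product(*options):
        text = "".join(candidate)
        if parses(text):
            return text
    return "".join(ch for ch, _ in chars)


if __name__ == "__main__":
    ocr = [(c, 0.99) for c in "ethano"] + [("1", 0.4)]
    print(correct(ocr))
```

Restricting substitutions to low-confidence positions keeps the candidate set small; with many uncertain characters the number of candidates grows exponentially, so a production approach would need pruning or a probabilistic parser.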

Task 3: Cross referencing

In their descriptions, patents contain cross references to scientific articles. Again using text mining, we aim to extract relevant cross references to prior art literature and establish hyperlinks to those papers, to be delivered as D13.06 (shared with WP14). Extracted information will also populate a database of cross references. Finally, we will continue to apply text mining techniques to enrich sequence annotations.
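
As a rough illustration of pattern matching for literature cross references, the Python sketch below pulls citations of the form "Author et al., Journal volume:pages (year)" out of running text. The pattern and the function name `extract_citations` are assumptions made for this example; real patent texts cite literature in many more styles, and resolving a match to a hyperlink would need a separate lookup step that is not shown.

```python
import re

# Illustrative pattern for one common citation style only.
CITATION = re.compile(
    r"(?P<author>[A-Z][A-Za-z'-]+) et al\.,? "
    r"(?P<journal>[A-Z][A-Za-z. ]+?),? "
    r"(?P<volume>\d+)\s*[:,]\s*(?P<first_page>\d+)(?:-\d+)?"
    r"\s*\((?P<year>(?:19|20)\d{2})\)"
)


def extract_citations(text):
    """Return one dict per citation found, with author, journal,
    volume, first_page and year as strings."""
    return [m.groupdict() for m in CITATION.finditer(text)]


if __name__ == "__main__":
    sample = (
        "as described by Jones et al., Nature 321:522-525 (1986) "
        "and in Sambrook et al., Mol. Cell. Biol. 5: 1234 (1985)."
    )
    for citation in extract_citations(sample):
        print(citation)
```

The extracted fields (journal, volume, first page, year) are the ones typically needed to look a paper up in a bibliographic database.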

  • We will assess the best text-mining tools, such as pattern matching aided by machine learning methods (e.g. hidden Markov models or support vector machines) and NLP (Natural Language Processing), including entity identification (using a dictionary-based approach). We anticipate using a combination of these techniques to optimize the outcome. The strategy will be reported in D13.03.
  • Detection of relevant information will require semi-automatic, iterative analysis of a well-annotated set of publications (test set) and comparison with the outcomes of the automated methods specifically developed for patent literature.
  • The result will consist of detailed annotations of sequences and patent texts, revealing new relationships between sets of patents, biological sequences and literature.
  • This information will then be incorporated into publicly available databases. Quality will be measured by the average increase in the number of annotations per entry in the sequence database and the average number of newly created hyperlinks.
    The cross-links will enhance the service functionality in WP4.
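
The first quality measure named above can be computed directly from annotation counts taken before and after enrichment. The Python sketch below shows one possible reading; the function name, the dictionary layout (entry identifier mapped to annotation count) and the choice to average over all entries present after enrichment are assumptions, since the text does not define the measure more precisely.

```python
def average_annotation_increase(before, after):
    """Mean number of annotations gained per entry.
    `before` and `after` map entry identifiers to annotation counts;
    entries missing from `before` are counted as starting from zero."""
    if not after:
        return 0.0
    gained = sum(count - before.get(entry, 0) for entry, count in after.items())
    return gained / len(after)


if __name__ == "__main__":
    before = {"A": 2, "B": 0}
    after = {"A": 5, "B": 1, "C": 2}
    print(average_annotation_increase(before, after))
```

The average number of newly created hyperlinks could be computed the same way once the unit of averaging (per entry or per patent) is fixed.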

1. This work will require regular visits to the EBI from the EPO – we anticipate six per year.

2. EPO policy is to lease computers for externally funded tasks of this nature.

3. Software licences for software on the leased computers will need to be acquired for the project.

Workpackage Leader
Stephane Nauche