The Cinderella of Biological Data Integration: Addressing Some of the Challenges of Entity and Relationship Mining from Patent Sources

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 6254)

Abstract

Most of the global corpus of medicinal chemistry data is published only in patents. However, extracting these data from patent documents and subsequently integrating them with literature and database sources poses unique challenges. This work presents an investigation of an extensive full-text patent resource, including automated name-to-chemical-structure conversion, licensed by AstraZeneca via a consortium arrangement with IBM. Our initial focus was identifying protein targets in patent titles linked to extracted bioactive compounds. We benchmarked target recognition strategies against target-assay-compound relationships manually curated from patents by GVKBIO. By analysing word frequencies and protein names, we assessed the false-negative problem of targets not specified in titles and the false-positive problem of non-target proteins appearing in titles. We also examined the time signals for selected target and non-target names by year of patent publication. Our results exemplify problems, and some solutions, for extracting data from this source.
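
The title-based analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the stop-word list, the target lexicon, the patent titles, and the curated links below are all hypothetical stand-ins, and matching is reduced to single-token lookup. It counts word frequencies across titles and then compares the targets found in each title with a curated set, reporting false negatives (curated targets absent from the title) and false positives (protein names in the title that are not curated targets).

```python
import re
from collections import Counter

# All data below are hypothetical illustrations, not taken from the paper.
STOP_WORDS = {"of", "and", "as", "for", "the", "in", "their", "use"}
TARGET_LEXICON = {"renin", "albumin", "kinase"}  # assumed protein-name list

titles = {
    "P1": "Piperidine derivatives as renin inhibitors",
    "P2": "Novel heterocyclic compounds and their use",
    "P3": "Albumin formulations of kinase inhibitors",
}
# Assumed stand-in for manually curated patent-to-target links.
curated = {"P1": {"renin"}, "P2": {"renin"}, "P3": {"kinase"}}


def tokens(title):
    """Lower-case a title, split it into alphanumeric tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", title.lower())
            if w not in STOP_WORDS]


def word_frequencies(titles):
    """Count how often each non-stop word occurs across all titles."""
    counts = Counter()
    for title in titles.values():
        counts.update(tokens(title))
    return counts


def compare_with_curation(titles, curated, lexicon):
    """Return (false_negatives, false_positives) keyed by patent id."""
    false_neg, false_pos = {}, {}
    for pid, title in titles.items():
        found = set(tokens(title)) & lexicon
        expected = curated.get(pid, set())
        if expected - found:
            false_neg[pid] = expected - found   # target not named in title
        if found - expected:
            false_pos[pid] = found - expected   # non-target protein in title
    return false_neg, false_pos


freqs = word_frequencies(titles)
false_neg, false_pos = compare_with_curation(titles, curated, TARGET_LEXICON)
```

In this toy data, P2 has a curated target that its title never mentions, and P3 names a protein in its title that is not the curated target, which are the two error types the abstract discusses. Real protein names are often multi-word and have many synonyms, so single-token lookup is only a simplification.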

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Suriyawongkul, I., Southan, C., Muresan, S. (2010). The Cinderella of Biological Data Integration: Addressing Some of the Challenges of Entity and Relationship Mining from Patent Sources. In: Lambrix, P., Kemp, G. (eds) Data Integration in the Life Sciences. DILS 2010. Lecture Notes in Computer Science, vol 6254. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15120-0_9

  • DOI: https://doi.org/10.1007/978-3-642-15120-0_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15119-4

  • Online ISBN: 978-3-642-15120-0

  • eBook Packages: Computer Science, Computer Science (R0)
