Abstract
Most of the global corpus of medicinal chemistry data is published only in patents. However, extracting these data from patent documents and then integrating them with literature and database sources poses unique challenges. This work investigates an extensive full-text patent resource, which includes automated name-to-chemical-structure conversion and is licensed by AstraZeneca via a consortium arrangement with IBM. Our initial focus was identifying protein targets in patent titles linked to extracted bioactive compounds. We benchmarked target recognition strategies against target-assay-compound relationships manually curated from patents by GVKBIO. By analysing word frequencies and protein names, we assessed the false-negative problem of targets not specified in titles and the false-positive problem of non-target proteins appearing in titles. We also examined the time signals for selected target and non-target names by year of patent publication. Our results exemplify problems, and some solutions, for extracting data from this source.
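The title-based benchmarking described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the stop-word list, the target dictionary, the patent identifiers, the titles, and the curated relationships are all hypothetical stand-ins, and simple dictionary lookup is assumed in place of whatever recognition strategies the paper actually evaluated.

```python
import re
from collections import Counter

# Hypothetical stop words and target-name dictionary (illustration only).
STOP_WORDS = {"of", "as", "and", "for", "the", "in", "with", "their", "use"}
TARGET_NAMES = {"renin", "bace", "albumin"}

# Hypothetical patent titles keyed by a made-up identifier.
titles = {
    "P1": "Pyrimidine derivatives as renin inhibitors",
    "P2": "Novel heterocyclic compounds and their use",
    "P3": "Albumin-binding formulations of novel compounds",
}

# Hypothetical curated targets per patent (stand-in for a manual benchmark).
curated = {"P1": {"renin"}, "P2": {"bace"}, "P3": set()}


def title_words(title):
    """Lowercase the title, split on non-alphanumerics, drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", title.lower())
            if w not in STOP_WORDS]


def word_frequencies(all_titles):
    """Count how often each non-stop word occurs across all titles."""
    counts = Counter()
    for title in all_titles:
        counts.update(title_words(title))
    return counts


def predict_targets(title):
    """Return the dictionary target names that appear in the title."""
    return {w for w in title_words(title) if w in TARGET_NAMES}


def score(all_titles, benchmark):
    """Tally true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for patent_id, title in all_titles.items():
        predicted = predict_targets(title)
        truth = benchmark.get(patent_id, set())
        tp += len(predicted & truth)   # target named in title and curated
        fp += len(predicted - truth)   # protein in title, not the target
        fn += len(truth - predicted)   # curated target absent from title
    return tp, fp, fn


tp, fp, fn = score(titles, curated)
print("TP, FP, FN:", tp, fp, fn)
print(word_frequencies(titles.values()).most_common(3))
```

In this toy data, P2 shows the false-negative case (the curated target is not named in the title) and P3 the false-positive case (a non-target protein appears in the title).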
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Suriyawongkul, I., Southan, C., Muresan, S. (2010). The Cinderella of Biological Data Integration: Addressing Some of the Challenges of Entity and Relationship Mining from Patent Sources. In: Lambrix, P., Kemp, G. (eds) Data Integration in the Life Sciences. DILS 2010. Lecture Notes in Computer Science, vol 6254. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15120-0_9
DOI: https://doi.org/10.1007/978-3-642-15120-0_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15119-4
Online ISBN: 978-3-642-15120-0
eBook Packages: Computer Science, Computer Science (R0)