Computational Neuroscience
Automated cognome construction and semi-automated hypothesis generation

https://doi.org/10.1016/j.jneumeth.2012.04.019

Abstract

Modern neuroscientific research stands on the shoulders of countless giants. PubMed alone contains more than 21 million peer-reviewed articles, with 40,000–50,000 more published every month. Understanding the human brain, cognition, and disease will require integrating facts from dozens of scientific fields spread amongst millions of studies locked away in static documents, making any such integration daunting at best. The future of scientific progress will be aided by bridging the gap between the millions of published research articles and modern databases such as the Allen Brain Atlas (ABA). To that end, we have analyzed the text of over 3.5 million scientific abstracts to find associations between neuroscientific concepts. From the literature alone, we show that we can blindly and algorithmically extract a “cognome”: relationships between brain structure, function, and disease. We demonstrate the potential of data-mining and cross-platform data-integration with the ABA by introducing two methods for semi-automated hypothesis generation. By analyzing statistical “holes” and discrepancies in the literature we can find understudied or overlooked research paths. That is, we have added a layer of semi-automation to a part of the scientific process itself. This is an important step toward fundamentally incorporating data-mining algorithms into the scientific method in a manner that is generalizable to any scientific or medical field.

Highlights

► Understanding the human brain, cognition, and disease will require integrating millions of facts from dozens of fields.
► The peer-reviewed neuroscientific literature contains millions of articles, making any such integration daunting.
► Text-mining the peer-reviewed literature allows us to automatically and statistically identify relationships between neuroscientific concepts.
► We introduce an algorithm that identifies possible new hypotheses.
► We have added a layer of semi-automation to a part of the scientific process itself by finding statistical anomalies in the peer-reviewed literature.
► We combined our data with the massive gene expression database made public with the Allen Brain Atlas to uncover biases in neuroscientific research.

Introduction

The scientific method begins with a hypothesis about our reality that can be tested via experimental observation. Hypothesis formation is iterative, building off prior scientific knowledge. Before one can form a hypothesis, one must have a thorough understanding of previous research to ensure that the path of inquiry is founded upon a stable base of established facts. But how can a researcher perform a thorough, unbiased literature review when over one million scientific articles are published annually (Björk et al., 2008)? The rate of scientific discovery has outpaced our ability to integrate knowledge in an unbiased, principled fashion. One solution may be via automated information aggregation (Akil et al., 2001). In this manuscript we show that, by calculating associations between concepts in the peer-reviewed literature, we can algorithmically synthesize scientific information and use that knowledge to help formulate plausible low-level hypotheses.

Neuroscience is a particularly complex discipline that relies upon expertise from many disparate fields (Akil et al., 2001). The aim of neuroscience is to understand relationships between brain, behavior, and disease; yet no one person or group can possibly unify all neuroscientific understanding into a coherent framework. In this paper, we show that the literature contains a hidden network of connected facts that, by definition, recapitulate known neuroscientific relationships. Neuroanatomical, behavioral, and disease associations can be quantified and visualized to speed research and education or to discover understudied research paths (Yarkoni et al., 2010; Wren et al., 2004; Bilder et al., 2009). Rather than allowing our limited ability to review the entire scientific literature to bias our hypotheses, we can algorithmically integrate millions of scientific research papers in a principled fashion.

To accomplish this, we used a co-occurrence algorithm to calculate the pair-wise association index (AI) between neuroscientific terms (and their synonyms) contained within more than 3.5 million papers indexed in PubMed (see Section 2). The primary assumption is that the frequency with which terms appeared together across the titles or abstracts of manuscripts is proportional to their probability of association. That is, we assumed an underlying structure within the peer-reviewed neuroscientific literature that we could leverage to our advantage. We conceive of our system as a proof-of-concept tool for knowledge discovery limited only by the size and quality of the inputs. We believe that, in its current state, when combined with the website search and visualization system we created to accompany it (http://www.brainscanr.com), it acts as a more sophisticated complement to normal PubMed searches. Furthermore, it provides, for the first time, a method for quantifying the relationship between disparate neuroscientific concepts, paving the way for researchers to incorporate statistical decision making into their future research.
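The co-occurrence idea described above can be illustrated with a small sketch. This excerpt does not give the formula for the association index, so the Jaccard index used here (documents mentioning both terms divided by documents mentioning either) is an assumption chosen for illustration, as are the function names, the toy abstracts, and the naive substring matching.

```python
from itertools import combinations


def association_index(docs_a, docs_b):
    """Jaccard index between two sets of document IDs (an assumed AI formula)."""
    union = docs_a | docs_b
    if not union:
        return 0.0
    return len(docs_a & docs_b) / len(union)


def build_associations(abstracts, dictionary):
    """Compute a pair-wise association index for every pair of dictionary terms.

    abstracts:  dict mapping document ID -> title/abstract text
    dictionary: dict mapping term -> list of synonyms
    """
    hits = {}
    for term, synonyms in dictionary.items():
        phrases = [term.lower()] + [s.lower() for s in synonyms]
        # Naive substring match; a real pipeline would need proper tokenization.
        hits[term] = {
            doc_id
            for doc_id, text in abstracts.items()
            if any(p in text.lower() for p in phrases)
        }
    return {
        (a, b): association_index(hits[a], hits[b])
        for a, b in combinations(sorted(dictionary), 2)
    }


# Toy data, invented purely for illustration.
abstracts = {
    1: "The hippocampus supports episodic memory.",
    2: "Memory deficits follow hippocampal formation damage.",
    3: "The amygdala is involved in fear.",
    4: "Fear memory depends on the amygdala.",
}
dictionary = {
    "hippocampus": ["hippocampal formation"],
    "memory": [],
    "amygdala": [],
    "fear": [],
}

ai = build_associations(abstracts, dictionary)
for pair, value in sorted(ai.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

In this toy corpus the pair ("amygdala", "fear") scores highest because the two terms always appear together, while ("amygdala", "hippocampus") scores zero because they never share an abstract.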

Section snippets

Data collection

We populated a dictionary with phrases for 124 brain regions, 291 cognitive functions, and 47 diseases. Brain region names and associated synonyms were selected from BrainInfo (2007) (Bowden et al., 2007), Neuroscience Division, National Primate Research Center, University of Washington (Bowden and Dubach, 2003). Cognitive functions were obtained from the Cognitive Atlas (http://www.cognitiveatlas.org/) (Poldrack et al., 2011). Disease names were taken from the National Institute of Neurological Disorders and Stroke (http://www.ninds.nih.gov/). The initial population of the

Cognome construction

In order to reconstruct this cognome, we calculated the probability of association between each term and every other term, giving an association matrix of size (n² − n)/2 (Fig. 1 and Section 2). Once the full association matrix was calculated we constructed a full brain connectivity graph (Fig. 2 and Supporting Fig. S1), limited only by the dictionary used to define the search terms (see Supporting List 1 for the full list). We find relatively strong associations between all brain region terms.
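The matrix size quoted above is the number of unordered pairs among n terms, (n² − n)/2, since the association between A and B is the same as between B and A and self-pairs are excluded. The short check below assumes, purely for illustration, that n is the sum of the dictionary counts given in the Data collection snippet (124 + 291 + 47); the final dictionary used in the study may differ.

```python
from itertools import combinations


def n_unique_pairs(n):
    """Number of unordered term pairs, i.e. entries above the matrix diagonal."""
    return (n * n - n) // 2


# Assumed term count: brain regions + cognitive functions + diseases.
n_terms = 124 + 291 + 47

pairs = n_unique_pairs(n_terms)

# Cross-check the closed form against explicit enumeration of pairs.
assert pairs == sum(1 for _ in combinations(range(n_terms), 2))

print(n_terms, pairs)  # 462 106491
```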

Discussion

In this manuscript we demonstrate that, by mining the peer-reviewed literature for associations between neuroscientific terms, we can recapitulate known scientific relationships. Furthermore, we introduce an algorithm for semi-automated hypothesis generation that can be used to speed research discovery. Although the current analysis is restricted to a limited dictionary of terms, the association and visualization methods are applicable to any search term or phrase found in PubMed, meaning that

Acknowledgements

We thank Curtis Chambers for technical assistance, and Leon Deouell, Amitai Shenhav, Avgusta Shestyuk, Kirstie Whitaker, and many brainSCANr beta testers for technical discussions. BV is funded by the National Institute of Neurological Disorders and Stroke (NS21135-22S1) and a National Institutes of Health Institutional Research and Academic Career Development Award (IRACDA)(GM081266-05).

References (30)

  • D.M. Bowden et al. NeuroNames 2002. Neuroinformatics (2003)
  • D.M. Bowden et al. Creating neuroscience ontologies. Methods Mol Biol (2007)
  • U. Dirnagl et al. Fighting publication bias: introducing the negative results section. J Cereb Blood Flow Metab (2010)
  • Editors. A critical look at connectomics. Nat Neurosci (2010)
  • J.A. Evans et al. Metaknowledge. Science (2011)

Cited by (27)

    • Social Media, Open Science, and Data Science Are Inextricably Linked

      2017, Neuron
      Citation Excerpt :

      Thus, the goal of PLOS was not simply to open access of scientific results to everyone, but to create a publishing system whose constituent publications could themselves be a source of data to be mined. While still nascent, a number of neuroscientific publications have done just that, often with the goal of automating meta-analyses (Yarkoni et al., 2011) or hypothesis generation (Voytek, 2016; Voytek and Voytek, 2012). Issues of open access and open data ignited the field in 2005 when John Ioannidis published “Why Most Published Research Findings Are False.”

    • LinkRbrain: Multi-scale data integrator of the brain

      2015, Journal of Neuroscience Methods
      Citation Excerpt :

      Brainscanner was developed in order to analyze abstracts from scientific papers published in peer-reviewed journals. It measures lexical associations between neuroscientific concepts, then extracts relationships between brain structures, functions, and diseases (Voytek and Voytek, 2010). Other tools such as the Genetic Association Database (GAD) (Becker et al., 2004), GeneCard (Safran et al., 2010), and GeneBank (Benson et al., 2005) collect, standardize, and archive valuable genetic information.

    • Text-Mining and Neuroscience

      2012, International Review of Neurobiology
      Citation Excerpt :

      Because the financial and time costs associated with developing a large curated document collection are often prohibitive, researchers will sometimes perform automated association mining, in which textual features are extracted from a large collection of input documents and used either to further one's understanding of the relationships between the documents themselves or to develop hypotheses that can be investigated on their own. Voytek and Voytek (2012), for example, used co-occurrences of brain region mentions, cognitive functions, and brain-related diseases to demonstrate that known relationships can be extracted in an automated and scalable way by using clustering algorithms. Importantly, they were able to extend this approach to semi-automatically generate hypotheses regarding “holes” in the literature associations between brain structure and function, or function and disease which are likely to exist, but lack support in the literature.

    • Towards a biologically annotated brain connectome

      2023, Nature Reviews Neuroscience