Computational Neuroscience
Automated cognome construction and semi-automated hypothesis generation
Highlights
► Understanding the human brain, cognition, and disease will require integrating millions of facts from dozens of fields.
► The peer-reviewed neuroscientific literature contains millions of articles, making any such integration daunting.
► Text-mining the peer-reviewed literature allows us to automatically and statistically identify relationships between neuroscientific concepts.
► We introduce an algorithm that identifies possible new hypotheses.
► We have added a layer of semi-automation to a part of the scientific process itself by finding statistical anomalies in the peer-reviewed literature.
► We combined our data with the massive gene expression database made public with the Allen Brain Atlas to uncover biases in neuroscientific research.
Introduction
The scientific method begins with a hypothesis about our reality that can be tested via experimental observation. Hypothesis formation is iterative, building off prior scientific knowledge. Before one can form a hypothesis, one must have a thorough understanding of previous research to ensure that the path of inquiry is founded upon a stable base of established facts. But how can a researcher perform a thorough, unbiased literature review when over one million scientific articles are published annually (Björk et al., 2008)? The rate of scientific discovery has outpaced our ability to integrate knowledge in an unbiased, principled fashion. One solution may be via automated information aggregation (Akil et al., 2001). In this manuscript we show that, by calculating associations between concepts in the peer-reviewed literature, we can algorithmically synthesize scientific information and use that knowledge to help formulate plausible low-level hypotheses.
Neuroscience is a particularly complex discipline that relies upon expertise from many disparate fields (Akil et al., 2001). The aim of neuroscience is to understand relationships between brain, behavior, and disease; yet no one person or group can possibly unify all neuroscientific understanding into a coherent framework. In this paper, we show that the literature contains a hidden network of connected facts that, by definition, recapitulate known neuroscientific relationships. Neuroanatomical, behavioral, and disease associations can be quantified and visualized to speed research and education or to discover understudied research paths (Yarkoni et al., 2010; Wren et al., 2004; Bilder et al., 2009). Rather than allowing our limited ability to review the entire scientific literature to bias our hypotheses, we can algorithmically integrate millions of scientific research papers in a principled fashion.
To accomplish this, we used a co-occurrence algorithm to calculate the pair-wise association index (AI) between neuroscientific terms (and their synonyms) contained within more than 3.5 million papers indexed in PubMed (see Section 2). The primary assumption is that the frequency with which terms appeared together across the titles or abstracts of manuscripts is proportional to their probability of association. That is, we assumed an underlying structure within the peer-reviewed neuroscientific literature that we could leverage to our advantage. We conceive of our system as a proof-of-concept tool for knowledge discovery limited only by the size and quality of the inputs. We believe that, in its current state, when combined with the website search and visualization system we created to accompany it (http://www.brainscanr.com), it acts as a more sophisticated complement to normal PubMed searches. Furthermore, it provides, for the first time, a method for quantifying the relationship between disparate neuroscientific concepts, paving the way for researchers to incorporate statistical decision making into their future research.
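The co-occurrence idea described above can be illustrated with a minimal sketch. The exact formula for the association index is not given in this excerpt, so the code below assumes a Jaccard-style ratio (papers mentioning both terms divided by papers mentioning either term). The toy abstracts, the small dictionary, and the helper names `papers_mentioning` and `association_index` are all hypothetical and stand in for the PubMed corpus and the full term list.

```python
from typing import Dict, List, Set


def papers_mentioning(synonyms: List[str], abstracts: List[str]) -> Set[int]:
    """Indices of abstracts containing any synonym (case-insensitive substring match)."""
    lowered = [s.lower() for s in synonyms]
    return {
        i for i, text in enumerate(abstracts)
        if any(s in text.lower() for s in lowered)
    }


def association_index(a: Set[int], b: Set[int]) -> float:
    """ASSUMED Jaccard-style index: |papers with both| / |papers with either|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up toy corpus standing in for PubMed titles and abstracts.
abstracts = [
    "The hippocampus supports episodic memory.",
    "Memory consolidation and the hippocampal formation.",
    "Amygdala responses to fear.",
    "Fear conditioning depends on the amygdala and hippocampus.",
]

# Each term maps to itself plus its synonyms (illustrative only).
dictionary: Dict[str, List[str]] = {
    "hippocampus": ["hippocampus", "hippocampal formation"],
    "memory": ["memory"],
    "amygdala": ["amygdala"],
    "fear": ["fear"],
}

hits = {term: papers_mentioning(syns, abstracts) for term, syns in dictionary.items()}

for x, y in [("hippocampus", "memory"), ("amygdala", "fear"), ("memory", "fear")]:
    print(f"{x} / {y}: {association_index(hits[x], hits[y]):.2f}")
```

Pooling synonyms before counting means a paper is counted once per term no matter how many of that term's synonyms it uses, which keeps the index bounded between 0 and 1.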
Section snippets
Data collection
We populated a dictionary with phrases for 124 brain regions, 291 cognitive functions, and 47 diseases. Brain region names and associated synonyms were selected from BrainInfo (2007) (Bowden et al., 2007), Neuroscience Division, National Primate Research Center, University of Washington (Bowden and Dubach, 2003). Cognitive functions were obtained from the Cognitive Atlas (http://www.cognitiveatlas.org/) (Poldrack et al., 2011). Disease names were taken from the National Institute of Neurological Disorders and Stroke website (http://www.ninds.nih.gov/). The initial population of the
Cognome construction
To reconstruct this cognome, we calculated the probability of association between each term and every other term, giving a square term-by-term association matrix (Fig. 1 and Section 2). Once the full association matrix was calculated, we constructed a full brain connectivity graph (Fig. 2 and Supporting Fig. S1), limited only by the dictionary used to define the search terms (see Supporting List 1 for the full list). We find relatively strong associations between all brain region terms.
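As a rough sketch of this step, the snippet below builds a symmetric term-by-term matrix from per-term sets of paper indices and then keeps only the pairs above a cutoff as graph edges. The Jaccard-style index, the 0.5 threshold, the toy paper sets, and the function names are assumptions made for illustration; the excerpt does not specify how the published graph was thresholded or drawn.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def association_matrix(hits: Dict[str, Set[int]]) -> Tuple[List[str], List[List[float]]]:
    """Build a symmetric matrix of pairwise association values.

    `hits` maps each term to the set of paper indices that mention it.
    The diagonal is left at 0.0 because self-association is not of interest.
    """
    terms = sorted(hits)
    n = len(terms)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        a, b = hits[terms[i]], hits[terms[j]]
        union = a | b
        # ASSUMED Jaccard-style index: shared papers / papers mentioning either term.
        value = len(a & b) / len(union) if union else 0.0
        matrix[i][j] = matrix[j][i] = value
    return terms, matrix


def edges_above(
    terms: List[str], matrix: List[List[float]], threshold: float
) -> List[Tuple[str, str, float]]:
    """Return (term, term, weight) edges whose weight meets a hypothetical cutoff."""
    return [
        (terms[i], terms[j], matrix[i][j])
        for i, j in combinations(range(len(terms)), 2)
        if matrix[i][j] >= threshold
    ]


# Made-up paper sets for four terms.
toy_hits = {
    "amygdala": {2, 3},
    "fear": {2, 3},
    "hippocampus": {0, 1, 3},
    "memory": {0, 1},
}

terms, matrix = association_matrix(toy_hits)
for a, b, w in edges_above(terms, matrix, threshold=0.5):
    print(f"{a} -- {b} (weight {w:.2f})")
```

Only the upper triangle is computed and then mirrored, since the association between two terms does not depend on their order.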
Discussion
In this manuscript we demonstrate that, by mining the peer-reviewed literature for associations between neuroscientific terms, we can recapitulate known scientific relationships. Furthermore, we introduce an algorithm for semi-automated hypothesis generation that can be used to speed research discovery. Although the current analysis is restricted to a limited dictionary of terms, the association and visualization methods are applicable to any search term or phrase found in PubMed, meaning that
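One way to picture semi-automated hypothesis generation is as a search for statistical gaps in the association data: two terms that are each strongly associated with a shared third term but only weakly associated with each other. The sketch below implements that reading. The selection rule, both thresholds, the placeholder term names, and the association values are assumptions for illustration only, not the criteria or data used in the paper.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Assoc = Dict[FrozenSet[str], float]


def weight(assoc: Assoc, a: str, b: str) -> float:
    """Association between two terms; unlisted pairs count as 0.0."""
    return assoc.get(frozenset((a, b)), 0.0)


def candidate_hypotheses(
    terms: List[str], assoc: Assoc, strong: float = 0.5, weak: float = 0.1
) -> List[Tuple[str, str, List[str]]]:
    """Find weakly linked pairs (a, c) that share a strongly linked bridge term b.

    `strong` and `weak` are hypothetical cutoffs chosen for this example.
    """
    found = []
    for a, c in combinations(sorted(terms), 2):
        if weight(assoc, a, c) > weak:
            continue  # the pair is already linked in the literature
        bridges = sorted(
            b for b in terms
            if b not in (a, c)
            and weight(assoc, a, b) >= strong
            and weight(assoc, b, c) >= strong
        )
        if bridges:
            found.append((a, c, bridges))
    return found


# Placeholder terms with made-up association values.
toy_terms = ["region_X", "function_Y", "disease_Z"]
toy_assoc: Assoc = {
    frozenset(("region_X", "function_Y")): 0.8,
    frozenset(("function_Y", "disease_Z")): 0.7,
    frozenset(("region_X", "disease_Z")): 0.02,
}

for a, c, bridges in candidate_hypotheses(toy_terms, toy_assoc):
    print(f"Possible understudied link: {a} <-> {c} (via {', '.join(bridges)})")
```

The output is a list of candidates for a researcher to evaluate, which is what makes the process semi-automated rather than fully automated.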
Acknowledgements
We thank Curtis Chambers for technical assistance, and Leon Deouell, Amitai Shenhav, Avgusta Shestyuk, Kirstie Whitaker, and many brainSCANr beta testers for technical discussions. BV is funded by the National Institute of Neurological Disorders and Stroke (NS21135-22S1) and a National Institutes of Health Institutional Research and Academic Career Development Award (IRACDA)(GM081266-05).
References (30)
- et al. The relationship between study design, results, and reporting of randomized clinical trials of HIV infection. J Control Clin Trials (1997)
- et al. Dynamic Neuroplasticity after Human Prefrontal Cortex Damage. Neuron (2010)
- et al. Cognitive neuroscience 2.0: building a cumulative science of human brain function. Trends Cogn Sci (2010)
- et al. Challenges and opportunities in mining neuroscience data. Science (2001)
- et al. CoPub mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics (2005)
- et al. Publication bias and dissemination of clinical research. J Natl Cancer Inst (1989)
- et al. Cognitive ontologies for neuropsychiatric phenomics research. Cognit Neuropsychiatry (2009)
- et al. Global annual volume of peer reviewed scholarly articles and the share available via different open access options (2008)
- et al. A proposal for a coordinated effort for the determination of brainwide neuroanatomical connectivity in model organisms at a mesoscopic scale. PLoS Comput Biol (2009)
- et al. The small world of psychopathology. PLoS One (2011)
- NeuroNames 2002. Neuroinformatics
- Creating neuroscience ontologies. Methods Mol Biol
- Fighting publication bias: introducing the negative results section. J Cereb Blood Flow Metab
- A critical look at connectomics. Nat Neurosci
- Metaknowledge. Science
Cited by (27)
What's next? Forecasting scientific research trends
2024, Heliyon
Social Media, Open Science, and Data Science Are Inextricably Linked
2017, Neuron
Citation Excerpt: Thus, the goal of PLOS was not simply to open access of scientific results to everyone, but to create a publishing system whose constituent publications could themselves be a source of data to be mined. While still nascent, a number of neuroscientific publications have done just that, often with the goal of automating meta-analyses (Yarkoni et al., 2011) or hypothesis generation (Voytek, 2016; Voytek and Voytek, 2012). Issues of open access and open data ignited the field in 2005 when John Ioannidis published “Why Most Published Research Findings Are False.”
LinkRbrain: Multi-scale data integrator of the brain
2015, Journal of Neuroscience Methods
Citation Excerpt: Brainscanner was developed in order to analyze abstracts from scientific papers published in peer-reviewed journals. It measures lexical associations between neuroscientific concepts, then extracts relationships between brain structures, functions, and diseases (Voytek and Voytek, 2010). Other tools such as the Genetic Association Database (GAD) (Becker et al., 2004), GeneCard (Safran et al., 2010), and GeneBank (Benson et al., 2005) collect, standardize, and archive valuable genetic information.
Text-Mining and Neuroscience
2012, International Review of Neurobiology
Citation Excerpt: Because the financial and time costs associated with developing a large curated document collection are often prohibitive, researchers will sometimes perform automated association mining, in which textual features are extracted from a large collection of input documents and used either to further one's understanding of the relationships between the documents themselves or to develop hypotheses that can be investigated on their own. Voytek and Voytek (2012), for example, used co-occurrences of brain region mentions, cognitive functions, and brain-related diseases to demonstrate that known relationships can be extracted in an automated and scalable way by using clustering algorithms. Importantly, they were able to extend this approach to semi-automatically generate hypotheses regarding “holes” in the literature associations between brain structure and function, or function and disease which are likely to exist, but lack support in the literature.
Towards a biologically annotated brain connectome
2023, Nature Reviews Neuroscience
Automated meta-analysis of the event-related potential (ERP) literature
2022, Scientific Reports