Automated cognome construction and semi-automated hypothesis generation

doi:10.1016/j.jneumeth.2012.04.019

Journal of Neuroscience Methods

Volume 208, Issue 1, 30 June 2012, Pages 92-100

https://doi.org/10.1016/j.jneumeth.2012.04.019

Abstract

Modern neuroscientific research stands on the shoulders of countless giants. PubMed alone contains more than 21 million peer-reviewed articles, with 40,000–50,000 more published every month. Understanding the human brain, cognition, and disease will require integrating facts from dozens of scientific fields spread amongst millions of studies locked away in static documents, making any such integration daunting, at best. The future of scientific progress will be aided by bridging the gap between the millions of published research articles and modern databases such as the Allen Brain Atlas (ABA). To that end, we have analyzed the text of over 3.5 million scientific abstracts to find associations between neuroscientific concepts. From the literature alone, we show that we can blindly and algorithmically extract a “cognome”: relationships between brain structure, function, and disease. We demonstrate the potential of data-mining and cross-platform data-integration with the ABA by introducing two methods for semi-automated hypothesis generation. By analyzing statistical “holes” and discrepancies in the literature we can find understudied or overlooked research paths. That is, we have added a layer of semi-automation to a part of the scientific process itself. This is an important step toward fundamentally incorporating data-mining algorithms into the scientific method in a manner that is generalizable to any scientific or medical field.

Highlights

► Understanding the human brain, cognition, and disease will require integrating millions of facts from dozens of fields.
► The peer-reviewed neuroscientific literature contains millions of articles, making any such integration daunting.
► Text-mining the peer-reviewed literature allows us to automatically and statistically identify relationships between neuroscientific concepts.
► We introduce an algorithm that identifies possible new hypotheses.
► We have added a layer of semi-automation to a part of the scientific process itself by finding statistical anomalies in the peer-reviewed literature.
► We combined our data with the massive gene expression database made public with the Allen Brain Atlas to uncover biases in neuroscientific research.

Introduction

The scientific method begins with a hypothesis about our reality that can be tested via experimental observation. Hypothesis formation is iterative, building off prior scientific knowledge. Before one can form a hypothesis, one must have a thorough understanding of previous research to ensure that the path of inquiry is founded upon a stable base of established facts. But how can a researcher perform a thorough, unbiased literature review when over one million scientific articles are published annually (Björk et al., 2008)? The rate of scientific discovery has outpaced our ability to integrate knowledge in an unbiased, principled fashion. One solution may be via automated information aggregation (Akil et al., 2001). In this manuscript we show that, by calculating associations between concepts in the peer-reviewed literature, we can algorithmically synthesize scientific information and use that knowledge to help formulate plausible low-level hypotheses.

Neuroscience is a particularly complex discipline that relies upon expertise from many disparate fields (Akil et al., 2001). The aim of neuroscience is to understand relationships between brain, behavior, and disease; yet, no one person or group can possibly unify all neuroscientific understanding into a coherent framework. In this paper, we show that the literature contains a hidden network of connected facts that, by definition, recapitulate known neuroscientific relationships. Neuroanatomical, behavioral, and disease associations can be quantified and visualized to speed research and education or to discover understudied research paths (Yarkoni et al., 2010; Wren et al., 2004; Bilder et al., 2009). Rather than allowing our limited ability to review the entire scientific literature to bias our hypotheses, we can algorithmically integrate millions of scientific research papers in a principled fashion.

To accomplish this, we used a co-occurrence algorithm to calculate the pair-wise association index (AI) between neuroscientific terms (and their synonyms) contained within more than 3.5 million papers indexed in PubMed (see Section 2). The primary assumption is that the frequency with which terms appeared together across the titles or abstracts of manuscripts is proportional to their probability of association. That is, we assumed an underlying structure within the peer-reviewed neuroscientific literature that we could leverage to our advantage. We conceive of our system as a proof-of-concept tool for knowledge discovery limited only by the size and quality of the inputs. We believe that, in its current state, when combined with the website search and visualization system we created to accompany it (http://www.brainscanr.com), it acts as a more sophisticated complement to normal PubMed searches. Furthermore, it provides, for the first time, a method for quantifying the relationship between disparate neuroscientific concepts, paving the way for researchers to incorporate statistical decision making into their future research.
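The co-occurrence idea described above can be illustrated with a minimal sketch. The excerpt does not give the formula for the association index, so the Jaccard-style ratio used here (abstracts mentioning both terms divided by abstracts mentioning either) is an assumption, as are the function names, the toy abstracts, and the naive substring matching of synonyms.

```python
from itertools import combinations

# Toy stand-ins for PubMed titles/abstracts (hypothetical text, not real records).
ABSTRACTS = [
    "The hippocampus supports episodic memory.",
    "Memory deficits in Alzheimer's disease involve the hippocampus.",
    "The amygdala is central to fear conditioning.",
    "Hippocampal formation lesions impair memory.",
    "Working memory relies on prefrontal cortex.",
]

# Each dictionary term maps to the synonyms searched for (hypothetical entries).
DICTIONARY = {
    "hippocampus": ["hippocampus", "hippocampal formation"],
    "memory": ["memory"],
    "amygdala": ["amygdala"],
}


def find_hits(abstracts, synonyms):
    """Return the indices of abstracts that mention any of the synonyms."""
    lowered = [s.lower() for s in synonyms]
    return {
        i for i, text in enumerate(abstracts)
        if any(s in text.lower() for s in lowered)
    }


def association_index(hits_a, hits_b):
    """Jaccard-style index (an assumed formula): |A and B| / |A or B|."""
    union = hits_a | hits_b
    return len(hits_a & hits_b) / len(union) if union else 0.0


def association_table(abstracts, dictionary):
    """Compute the index for every unordered pair of dictionary terms."""
    hits = {term: find_hits(abstracts, syns) for term, syns in dictionary.items()}
    return {
        (a, b): association_index(hits[a], hits[b])
        for a, b in combinations(sorted(hits), 2)
    }


if __name__ == "__main__":
    for pair, score in association_table(ABSTRACTS, DICTIONARY).items():
        print(pair, round(score, 2))
```

In this toy corpus, "hippocampus" and "memory" co-occur in three of the four abstracts that mention either term, while "amygdala" never co-occurs with the other two. A real pipeline would query PubMed and handle word boundaries and synonym overlap more carefully than substring matching does.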

Section snippets

Data collection

We populated a dictionary with phrases for 124 brain regions, 291 cognitive functions, and 47 diseases. Brain region names and associated synonyms were selected from BrainInfo (2007) (Bowden et al., 2007), Neuroscience Division, National Primate Research Center, University of Washington (Bowden and Dubach, 2003). Cognitive functions were obtained from the Cognitive Atlas (http://www.cognitiveatlas.org/) (Poldrack et al., 2011). Disease names were taken from the National Institute of Neurological Disorders and Stroke (http://www.ninds.nih.gov/). The initial population of the

Cognome construction

In order to reconstruct this cognome, we calculated the probability of association between each term and every other term, giving an association matrix with $(n^{2} - n) / 2$ unique pairwise entries for $n$ terms (Fig. 1 and Section 2). Once the full association matrix was calculated we constructed a full brain connectivity graph (Fig. 2 and Supporting Fig. S1), limited only by the dictionary used to define the search terms (see Supporting List 1 for the full list). We find relatively strong associations between all brain region terms.
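As a rough check on scale, the pair count can be worked through for the dictionary sizes given under Data collection. This assumes that $n$ is exactly the sum of the listed brain regions, cognitive functions, and diseases; the final dictionary size after any curation is not stated in this excerpt.

$$
n = 124 + 291 + 47 = 462, \qquad \frac{n^{2} - n}{2} = \frac{462 \cdot 461}{2} = 106{,}491
$$

Under that assumption, roughly one hundred thousand pairwise associations would need to be computed.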

Discussion

In this manuscript we demonstrate that, by mining the peer-reviewed literature for associations between neuroscientific terms, we can recapitulate known scientific relationships. Furthermore, we introduce an algorithm for semi-automated hypothesis generation that can be used to speed research discovery. Although the current analysis is restricted to a limited dictionary of terms, the association and visualization methods are applicable to any search term or phrase found in PubMed, meaning that

Acknowledgements

We thank Curtis Chambers for technical assistance, and Leon Deouell, Amitai Shenhav, Avgusta Shestyuk, Kirstie Whitaker, and many brainSCANr beta testers for technical discussions. BV is funded by the National Institute of Neurological Disorders and Stroke (NS21135-22S1) and a National Institutes of Health Institutional Research and Academic Career Development Award (IRACDA) (GM081266-05).

References (30)

J.P. Ioannidis et al. The relationship between study design, results, and reporting of randomized clinical trials of HIV infection. J Control Clin Trials (1997)
B. Voytek et al. Dynamic Neuroplasticity after Human Prefrontal Cortex Damage. Neuron (2010)
T. Yarkoni et al. Cognitive neuroscience 2.0: building a cumulative science of human brain function. Trends Cogn Sci (2010)
H. Akil et al. Challenges and opportunities in mining neuroscience data. Science (2001)
B.T.F. Alako et al. CoPub mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics (2005)
C.B. Begg et al. Publication bias and dissemination of clinical research. J Natl Cancer Inst (1989)
R.M. Bilder et al. Cognitive ontologies for neuropsychiatric phenomics research. Cognit Neuropsychiatry (2009)
B. Björk et al. Global annual volume of peer reviewed scholarly articles and the share available via different open access options (2008)
J.W. Bohland et al. A proposal for a coordinated effort for the determination of brainwide neuroanatomical connectivity in model organisms at a mesoscopic scale. PLoS Comput Biol (2009)
D. Borsboom et al. The small world of psychopathology. PLoS One (2011)
D.M. Bowden et al. NeuroNames 2002. Neuroinformatics (2003)
D.M. Bowden et al. Creating neuroscience ontologies. Methods Mol Biol (2007)
U. Dirnagl et al. Fighting publication bias: introducing the negative results section. J Cereb Blood Flow Metab (2010)
Editors. A critical look at connectomics. Nat Neurosci (2010)
J.A. Evans et al. Metaknowledge. Science (2011)

Cited by (27)

What's next? Forecasting scientific research trends
2024, Heliyon
Scientific research trends and interests evolve over time. The ability to identify and forecast these trends is vital for educational institutions, practitioners, investors, and funding organizations. In this study, we predict future trends in scientific publications using heterogeneous sources, including historical publication time series from PubMed, research and review articles, pre-trained language models, and patents. We demonstrate that scientific topic popularity levels and changes (trends) can be predicted five years in advance across 40 years and 125 diverse topics, including life-science concepts, biomedical, anatomy, and other science, technology, and engineering topics. Preceding publications and future patents are leading indicators for emerging scientific topics. We find the ratio of reviews to original research articles informative for identifying increasing or declining topics, with declining topics having an excess of reviews. We find that language models provide improved insights and predictions into temporal dynamics. In temporal validation, our models substantially outperform the historical baseline. Our findings suggest that similar dynamics apply across other scientific and engineering research topics. We present SciTrends, a user-friendly webtool for predicting future publication trends for any topic covered in PubMed.
Social Media, Open Science, and Data Science Are Inextricably Linked
2017, Neuron
Citation Excerpt :
Thus, the goal of PLOS was not simply to open access of scientific results to everyone, but to create a publishing system whose constituent publications could themselves be a source of data to be mined. While still nascent, a number of neuroscientific publications have done just that, often with the goal of automating meta-analyses (Yarkoni et al., 2011) or hypothesis generation (Voytek, 2016; Voytek and Voytek, 2012). Issues of open access and open data ignited the field in 2005 when John Ioannidis published “Why Most Published Research Findings Are False.”
LinkRbrain: Multi-scale data integrator of the brain
2015, Journal of Neuroscience Methods
Citation Excerpt :
Brainscanner was developed in order to analyze abstracts from scientific papers published in peer-reviewed journals. It measures lexical associations between neuroscientific concepts, then extracts relationships between brain structures, functions, and diseases (Voytek and Voytek, 2010). Other tools such as the Genetic Association Database (GAD) (Becker et al., 2004), GeneCard (Safran et al., 2010), and GeneBank (Benson et al., 2005) collect, standardize, and archive valuable genetic information.
LinkRbrain is an open-access web platform for multi-scale data integration and visualization of human brain data. This platform integrates anatomical, functional, and genetic knowledge produced by the scientific community.
The linkRbrain platform has two major components: (1) a data aggregation component that integrates multiple open databases into a single platform with a unified representation; and (2) a website that provides fast multi-scale integration and visualization of these data and makes the results immediately available.
LinkRbrain allows users to visualize functional networks or/and genetic expression over a standard brain template (MNI152). Interrelationships between these components based on topographical overlap are displayed using relational graphs. Moreover, linkRbrain enables comparison of new experimental results with previous published works.
Previous tools and studies illustrate the opportunities of data mining across multiple tiers of neuroscience and genetic information. However, a global systematic approach is still missing to gather cognitive, topographical, and genetic knowledge in a common framework in order to facilitate their visualization, comparison, and integration.
LinkRbrain is an efficient open-access tool that affords an integrative understanding of human brain function.
Text-Mining and Neuroscience
2012, International Review of Neurobiology
Citation Excerpt :
Because the financial and time costs associated with developing a large curated document collection are often prohibitive, researchers will sometimes perform automated association mining, in which textual features are extracted from a large collection of input documents and used either to further one's understanding of the relationships between the documents themselves or to develop hypotheses that can be investigated on their own. Voytek and Voytek (2012), for example, used co-occurrences of brain region mentions, cognitive functions, and brain-related diseases to demonstrate that known relationships can be extracted in an automated and scalable way by using clustering algorithms. Importantly, they were able to extend this approach to semi-automatically generate hypotheses regarding “holes” in the literature associations between brain structure and function, or function and disease which are likely to exist, but lack support in the literature.
The wealth and diversity of neuroscience research are inherent characteristics of the discipline that can give rise to some complications. As the field continues to expand, we generate a great deal of data about all aspects, and from multiple perspectives, of the brain, its chemistry, biology, and how these affect behavior. The vast majority of research scientists cannot afford to spend their time combing the literature to find every article related to their research, nor do they wish to spend time adjusting their neuroanatomical vocabulary to communicate with other subdomains in the neurosciences. As such, there has been a recent increase in the amount of informatics research devoted to developing digital resources for neuroscience research. Neuroinformatics is concerned with the development of computational tools to further our understanding of the brain and to make sense of the vast amount of information that neuroscientists generate (French & Pavlidis, 2007). Many of these tools are related to the use of textual data. Here, we review some of the recent developments for better using the vast amount of textual information generated in neuroscience research and publication and suggest several use cases that will demonstrate how bench neuroscientists can take advantage of the resources that are available.
Towards a biologically annotated brain connectome
2023, Nature Reviews Neuroscience
Automated meta-analysis of the event-related potential (ERP) literature
2022, Scientific Reports

Computational Neuroscience
Automated cognome construction and semi-automated hypothesis generation