
Methods

Volume 74, 1 March 2015, Pages 36-46

Question answering for Biology

https://doi.org/10.1016/j.ymeth.2014.10.023

Abstract

Biologists often pose queries to search engines and biological databases to obtain answers related to ongoing experiments. This is known to be a time-consuming, and sometimes frustrating, task in which more than one query is posed and many databases are consulted to arrive at possible answers for a single fact. Question answering offers an alternative to this process by allowing queries to be posed as questions, by integrating various resources of different natures and by returning an exact answer to the user. We survey the current solutions on question answering for Biology, present an overview of the methods which are usually employed and give insights on how to boost the performance of systems in this domain.

Introduction

When planning or analyzing experiments, scientists look for related and previous findings in the literature to obtain external evidence on current observations. For instance, biologists often seek information regarding genes/proteins (biomarkers) expressed in a particular cell or tissue of a particular organism in the scope of a particular disease. Finding published answers to such questions requires dealing with a variety of synonyms for the genes and diseases and posing queries to different databases and search engines. Further, it often involves screening hundreds of publications or data records returned for the queries.

The task of searching for relevant information in a collection of documents, such as web pages using search engines or scientific publications using PubMed1, is generally called information retrieval (IR) [1]. In IR, queries are usually expressed in terms of keywords, and answering a query does not usually take into account synonyms, i.e., when a certain concept has more than one name, and homonyms, i.e., when the same name refers to more than one concept. IR systems typically return a list of documents potentially relevant to the query, including related metadata (e.g., journal name and year of publication) and snippets of text matching the query keywords.
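The synonym problem described above can be made concrete with a toy keyword-matching ranker. The two-document corpus and the scoring scheme below are purely illustrative assumptions, not taken from any real IR system:

```python
# Minimal sketch of keyword-based retrieval: documents are ranked by
# exact keyword overlap, so synonymous names are missed entirely.
# Corpus and query are invented for illustration.

def keyword_score(query_terms, document):
    """Count how many query keywords occur verbatim in the document."""
    words = document.lower().split()
    return sum(1 for term in query_terms if term.lower() in words)

corpus = [
    "TP53 regulates the cell cycle and functions as a tumor suppressor",
    "p53 is involved in apoptosis and DNA repair",
]

query = ["p53", "apoptosis"]
scores = [keyword_score(query, doc) for doc in corpus]
# The first document mentions the same gene under its synonym "TP53",
# yet plain keyword matching scores it zero for the query term "p53".
print(scores)  # [0, 2]
```

The first document is about the very gene the user asked for, but under a different name, so a purely lexical system ranks it as irrelevant; this is exactly the gap that terminology-aware QA systems aim to close.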

In contrast to IR, question answering (QA) [2], [3] aims to support finding information in a collection of unstructured and structured data, e.g., texts and databases, respectively. Furthermore, QA systems take questions expressed in natural language (e.g., English) and generate a precise answer by linguistically and semantically processing both the questions and the data sources under consideration. In particular, a question answering system is distinguished from IR systems in three main aspects (cf. Table 1): (1) queries can be posed in natural language instead of keywords; (2) results do not consist of passages but are generated according to what has been specifically requested, be it a single answer or a short summary; (3) answers are based on the integration of data from textual documents as well as from a variety of knowledge sources.

The first aspect aims to facilitate usage for non-IR experts, i.e., users do not need to be concerned about how to best pose a query to receive a precise answer. For instance, when asking about the participation of a certain gene in a pathway, e.g., the gene p53 in the WNT-signaling cascade, users would usually write both terms in the search field of a search engine. If the answer is not found in any of the top-ranked documents, the user could consider entering synonyms for both the gene (e.g., “TP53”) and the pathway (e.g., “WNT signaling pathway”). To cope with this problem, some IR systems allow the use of ontological terms instead of keywords for a more precise retrieval of relevant documents. For instance, GoPubmed2 automatically suggests candidate terms from the Medical Subject Headings (MeSH) and the Gene Ontology while keywords are being typed. However, understanding ontological concepts is not straightforward for scientists unfamiliar with them. The use of natural language is a more intuitive way to ask for information, by posing questions (how, what, when, where, which, who, etc.) or requests (show me, tell me, etc.). For instance, for the example above, users could simply write the question “Is p53 part of the WNT-signaling cascade?”. Of course, allowing free-form questions requires very advanced natural language processing (NLP) techniques.
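The query-expansion workaround described above can be sketched as follows. The synonym table is a hand-made illustration, not a real terminology resource such as MeSH or the Gene Ontology:

```python
# Sketch of synonym-based query expansion: each user term is replaced
# by the OR of its known aliases before the query is submitted.
# The SYNONYMS dictionary is an invented stand-in for a terminology.

SYNONYMS = {
    "p53": ["p53", "TP53", "tumor protein p53"],
    "WNT-signaling cascade": ["WNT-signaling cascade", "WNT signaling pathway"],
}

def expand_query(terms):
    """Build a Boolean query string with each term expanded to its synonyms."""
    clauses = []
    for term in terms:
        alternatives = SYNONYMS.get(term, [term])  # fall back to the term itself
        clauses.append("(" + " OR ".join(f'"{a}"' for a in alternatives) + ")")
    return " AND ".join(clauses)

print(expand_query(["p53", "WNT-signaling cascade"]))
# ("p53" OR "TP53" OR "tumor protein p53") AND ("WNT-signaling cascade" OR "WNT signaling pathway")
```

A QA system performs this kind of expansion internally, which is why the user can type either name of the gene and still reach the same evidence.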

The second characteristic of QA systems is to provide precise answers instead of only presenting potentially relevant documents. When using IR systems, figuring out the answer to a query requires reading the documents returned by the system. QA systems strive to simply return the answer “No” to the question above, along with a list of references that give evidence for this answer. This requires QA systems to perform a deep linguistic analysis of both the question and the potentially relevant passages, also considering the meaning of terms. Not only must synonyms, hypernyms and hyponyms be considered during answer construction, but entities should also be disambiguated whenever necessary, such as figuring out whether the word “WNT” refers to part of the pathway name or to a mention of a member of the WNT gene family.
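The disambiguation step mentioned above can be illustrated with a minimal context-cue sketch. The two senses and their cue words are invented for illustration and do not reflect any published method:

```python
# Toy context-based disambiguation for an ambiguous mention such as
# "WNT": choose between a "pathway" sense and a "gene" sense by
# counting overlapping cue words. Cue sets are illustrative only.

SENSE_CUES = {
    "pathway": {"signaling", "cascade", "pathway"},
    "gene": {"gene", "family", "expression", "member"},
}

def disambiguate(mention_context):
    """Pick the sense whose cue words overlap the context the most."""
    tokens = set(mention_context.lower().split())
    return max(SENSE_CUES, key=lambda sense: len(SENSE_CUES[sense] & tokens))

print(disambiguate("the WNT signaling cascade regulates development"))  # pathway
print(disambiguate("WNT5A is a member of the WNT gene family"))         # gene
```

Real systems replace the hand-made cue sets with statistical models trained on annotated corpora, but the principle is the same: the surrounding words decide which entity the mention refers to.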

Third, QA is not limited to textual resources and can include the integration of data resources by converting natural language questions into an appropriate query language for searching for answers in databases, for instance over RDF triples [4]. Data extracted from different sources need to be assembled into a coherent single answer by exploring interlinks, dealing with contradictions and joining equal or equivalent answers. Currently, the conversion of biomedical natural language questions to RDF triples is being evaluated in the BioASQ challenge (cf. Section 3.2.1) and question answering over linked data for three biomedical databases is being assessed in one of the Question Answering over Linked Data (QALD) shared tasks (cf. Section 3.2.4). Further, a prototype of the LODQA system [5] (cf. Section 3.1.2) converts questions to SPARQL queries for submission to the BioPortal endpoint.
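As a rough illustration of translating a parsed yes/no question into SPARQL, in the spirit of (but not taken from) systems such as LODQA, one might generate an ASK query from the linked entities. All URIs below are hypothetical placeholders, not real BioPortal identifiers:

```python
# Sketch of question-to-SPARQL conversion for a yes/no question:
# after entity linking has mapped the question to subject, predicate
# and object URIs, an ASK query tests whether the triple holds.

def to_sparql_ask(subject_uri, predicate_uri, object_uri):
    """Build a SPARQL ASK query testing whether a single triple holds."""
    return (
        "ASK WHERE { "
        f"<{subject_uri}> <{predicate_uri}> <{object_uri}> . "
        "}"
    )

# "Is p53 part of the WNT-signaling cascade?" after entity linking:
query = to_sparql_ask(
    "http://example.org/gene/TP53",            # hypothetical URI for p53
    "http://example.org/rel/memberOfPathway",  # hypothetical relation URI
    "http://example.org/pathway/WNT",          # hypothetical pathway URI
)
print(query)
```

An ASK query returns a plain boolean from the endpoint, which maps directly onto the “Yes”/“No” answer a QA system wants to present; questions asking for entities would instead be mapped to SELECT queries with a variable in place of one URI.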

The technology behind information systems has evolved from simple Boolean keyword-based queries to complex linguistic processing of both the question and textual passages. Fig. 1 shows an overview of the evolution of these techniques and illustrates the complexity of question answering systems. Many of the current information systems available for querying PubMed implement some of these techniques (cf. survey in [6]).

Question answering has been successfully applied in other domains; examples of such systems are START3 and Wolfram Alpha4. Recent interest in question answering has also been motivated by IBM’s Watson system [7], which beat human participants in a game show. Various researchers advocate that QA systems can provide many benefits to the biological domain and expect that these systems can boost scientific productivity [8]. Indeed, a study carried out with physicians showed that they do trust the answers provided by QA systems [9]. However, the Life Sciences also pose many challenges to QA systems, especially: (1) a highly complex and large terminology, (2) exponential growth of data and hundreds of on-line databases, and (3) a high degree of contradictions. Often, answering a question requires not only identifying relevant facts in a single document or database, but also merging parts of the answers from distinct sources. Nevertheless, research on question answering in Biology is still scarce, in contrast to the medical domain (cf. Section 4).

The first community-based challenge which included a task related to biomedical question answering took place in 2006 and 2007 and consisted of the evaluation of passage retrieval restricted to topics related to Genomics (cf. TREC Genomics in Section 3.2.3). Later on, in 2012 and 2013, the Question Answering for Machine Reading Evaluation (QA4MRE) Alzheimer Disease challenge assessed systems on the machine reading task, which consists of multiple-choice questions related to a single document (cf. Section 3.2.2). The use of RDF in biomedical QA tasks is currently being evaluated in the QALD challenge (cf. Section 3.2.4), and the most comprehensive challenge related to biomedical QA so far, BioASQ (cf. Section 3.2.1), has been running since 2013.

In this work, we present an overview of question answering systems and techniques for the biological domain. In the next section, we give an overview of the most common components of a question answering system. Section 3 describes current systems and results obtained in shared tasks. Section 4 discusses the state of the art of QA systems for the medical domain and gives insights on which improvements could be achieved in the biological field in the near future.

Section snippets

Fundamental techniques

Question answering systems are usually composed of three steps [10], [11] (Fig. 2): question processing, candidate processing and answer processing. The first step receives the input entered by the user, i.e., a natural language question, and includes pre-processing of the question, identification of the question type and of the type of answer required (e.g., the entity type) and building an input to the next step. In the candidate processing step, relevant documents, passages or raw data
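The three-step architecture outlined above can be sketched as a toy pipeline. The stopword list, keyword-overlap retrieval and yes/no heuristic below are trivial stand-ins for the statistical and linguistic modules a real system would use:

```python
# Toy sketch of the three-step QA pipeline: question processing,
# candidate processing, and answer processing. All components are
# illustrative stand-ins, not a real system's modules.

STOPWORDS = {"is", "part", "of", "the", "a", "in"}

def process_question(question):
    """Question processing: expected answer type plus content keywords."""
    answer_type = "yes/no" if question.lower().startswith(("is", "does", "are")) else "entity"
    words = question.rstrip("?").replace("-", " ").split()
    keywords = [w for w in words if w.lower() not in STOPWORDS]
    return answer_type, keywords

def retrieve_candidates(keywords, corpus):
    """Candidate processing: keep passages mentioning any keyword."""
    return [p for p in corpus if any(k.lower() in p.lower() for k in keywords)]

def extract_answer(answer_type, candidates):
    """Answer processing: reduce the candidate set to a single answer."""
    if answer_type == "yes/no":
        return "Yes" if candidates else "No"
    return candidates[0] if candidates else None

corpus = ["p53 participates in the WNT signaling pathway."]  # toy evidence
atype, keywords = process_question("Is p53 part of the WNT-signaling cascade?")
answer = extract_answer(atype, retrieve_candidates(keywords, corpus))
print(answer)  # Yes
```

Each stage consumes the previous stage's output, which is why the three steps can be studied and improved independently, as the survey does in the following sections.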

State-of-the-art

Question answering for Biology is a very recent topic and current development focuses on the improvement of underlying algorithms and on the evaluation and comparison of systems. Currently, only one system (cf. Section 3.1.1) and one prototype (cf. Section 3.1.2) are actually functioning. Work has specifically been boosted by two community-based shared tasks (cf. Sections 3.2.1 BioASQ, 3.2.2 QA4MRE Alzheimer Disease). In this section we present a brief summary on the available systems, the

Discussion

We presented an overview on the architecture of question answering systems, on the common methods used for individual steps during answer processing and on the state of the art within the biological domain. Mature solutions are still scarce but improvements in this field are promising according to various past and present challenges. In this section we discuss aspects which we find necessary to improve performance and resources which are still underused. For comparison, we also give an overview

Conclusions

In this survey we presented an overview on question answering for Biology. We discussed the desired functionalities that such a system should provide to users, as opposed to classical information retrieval systems, e.g., PubMed. We provided an overview on the methods behind previous solutions on this field, including practical examples for a better understanding of the processes. Finally, we discussed the current state of the art on question answering systems for the biological domain,

References (65)

  • H. Yu, J. Biomed. Inform. (2007)
  • S.J. Athenikos et al., Comput. Methods Programs Biomed. (2010)
  • Y. Cao, J. Biomed. Inform. (2011)
  • C.D. Manning et al., Introduction to Information Retrieval (2008)
  • L. Hirschman et al., Nat. Lang. Eng. (2001)
  • M. Bauer et al., Human Genomics (2012)
  • V. Lopez, Scaling up question-answering to linked data
  • K.B. Cohen, J. dong Kim, Evaluation of SPARQL query generation from natural language questions, in: Natural Language...
  • Z. Lu, Database (2011)
  • D.A. Ferrucci, AI Magazine (2010)
  • J.D. Wren, Bioinformatics (2011)
  • D. Jurafsky et al., Speech and Language Processing (2013)
  • C.F. Baker, C.J. Fillmore, J.B. Lowe, The Berkeley Framenet project, in: Proceedings of the 36th Annual Meeting of the...
  • A. Dolbey, M. Ellsworth, J. Scheffczyk, BioFrameNet: a domain-specific FrameNet extension with links to biomedical...
  • R. Tsai, BMC Bioinform. (2007)
  • G.A. Miller, Commun. ACM (1995)
  • J. Hakenberg et al., Bioinformatics (2008)
  • R. Leaman et al., Bioinformatics (2013)
  • A.R. Aronson et al., J. Am. Med. Inform. Assoc. (2010)
  • T. Rocktäschel et al., Bioinformatics (2012)
  • M. Gerner et al., BMC Bioinform. (2010)
  • T. Nunes et al., Bioinformatics (2013)
  • T.G.O. Consortium, Nucleic Acids Res. (2013)
  • J.W. Ely, Br. Med. J. (2000)
  • C. Schardt et al., BMC Med. Inform. Decis. Mak. (2007)
  • M. Gleize, et al., Selecting answers with structured lexical expansion and discourse relations: LIMSI's participation at...
  • P. Thomas et al., Nucleic Acids Res. (2012)
  • N.F. Noy, Nucleic Acids Res. (2009)
  • R. Huang et al., Natural language question answering over RDF data
  • D. Weissenborn, G. Tsatsaronis, M. Schroeder, Answering factoid questions in the biomedical domain, in: A.-C.N. Ngomo,...
  • M.-D. Olvera-Lobo et al., J. Inform. Sci. (2011)
  • C.-Y. Lin, Rouge: a package for automatic evaluation of summaries, in: Proc. ACL Workshop on Text Summarization...