
Methods

Volume 74, 1 March 2015, Pages 36-46

Question answering for Biology

https://doi.org/10.1016/j.ymeth.2014.10.023

Abstract

Biologists often pose queries to search engines and biological databases to obtain answers related to ongoing experiments. This is known to be a time-consuming, and sometimes frustrating, task in which more than one query is posed and many databases are consulted to arrive at possible answers for a single fact. Question answering offers an alternative to this process by allowing queries to be posed as questions, by integrating various resources of different natures and by returning an exact answer to the user. We survey the current solutions on question answering for Biology, present an overview of the methods which are usually employed and give insights on how to boost the performance of systems in this domain.

Introduction

When planning or analyzing experiments, scientists look for related and previous findings in the literature to obtain external evidence on current observations. For instance, biologists often seek information regarding genes/proteins (biomarkers) expressed in a particular cell or tissue of a particular organism in the scope of a particular disease. Finding published answers to such questions requires dealing with a variety of synonyms for the genes and diseases and posing queries to different databases and search engines. Further, it often involves screening hundreds of publications or data records returned for the queries.

The task of searching for relevant information in a collection of documents, such as web pages using search engines or scientific publications using PubMed1, is generally called information retrieval (IR) [1]. In IR, queries are usually expressed in terms of keywords, and answering a query does not usually take into account synonyms, i.e., when a certain concept has more than one name, and homonyms, i.e., when the same name refers to more than one concept. IR systems typically return a list of documents potentially relevant to the query, including related metadata (e.g., journal name and year of publication) and snippets of text matching the query keywords.
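The synonym problem described above can be made concrete with a toy keyword-matching ranker. The two-document corpus and the scoring scheme below are purely illustrative assumptions, not taken from any real IR system:

```python
# Minimal sketch of keyword-based retrieval: documents are ranked by
# exact keyword overlap, so synonymous names are missed entirely.
# Corpus and query are invented for illustration.

def keyword_score(query_terms, document):
    """Count how many query keywords occur verbatim in the document."""
    words = document.lower().split()
    return sum(1 for term in query_terms if term.lower() in words)

corpus = [
    "TP53 regulates the cell cycle and functions as a tumor suppressor",
    "p53 is involved in apoptosis and DNA repair",
]

query = ["p53", "apoptosis"]
scores = [keyword_score(query, doc) for doc in corpus]
# The first document mentions the same gene under its synonym "TP53",
# yet plain keyword matching scores it zero for the query term "p53".
print(scores)  # [0, 2]
```

The first document is about the very gene the user asked for, but under a different name, so a purely lexical system ranks it as irrelevant; this is exactly the gap that terminology-aware QA systems aim to close.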

In contrast to IR, question answering (QA) [2], [3] aims to support finding information in a collection of unstructured and structured data, e.g., texts and databases, respectively. Furthermore, QA systems take questions expressed in natural language (e.g., English) and generate a precise answer by linguistically and semantically processing both the questions and the data sources under consideration. In particular, a question answering system is distinguished from IR systems in three main aspects (cf. Table 1): (1) queries can be posed in natural language instead of keywords; (2) results do not consist of passages but are generated according to what has been specifically requested, be it a single answer or a short summary; (3) answers are based on the integration of data from textual documents as well as from a variety of knowledge sources.

The first aspect aims to facilitate usage for non-IR experts, i.e., users do not need to be concerned about how to best pose a query to receive a precise answer. For instance, when asking about the participation of a certain gene in a pathway, e.g., the gene p53 in the WNT-signaling cascade, users would usually write both terms in the search field of a search engine. If the answer is not found in any of the top-ranked documents, the user could consider entering synonyms for both the gene (e.g., “TP53”) and the pathway (e.g., “WNT signaling pathway”). To cope with this problem, some IR systems allow the use of ontological terms instead of keywords for a more precise retrieval of relevant documents. For instance, GoPubmed2 automatically suggests candidate terms from the Medical Subject Headings (MeSH) and the Gene Ontology while keywords are being typed. However, understanding ontological concepts is not straightforward for scientists unfamiliar with them. The use of natural language is a more intuitive way to ask for information, by posing questions (how, what, when, where, which, who, etc.) or requests (show me, tell me, etc.). For instance, for the example above, users could simply write the question “Is p53 part of the WNT-signaling cascade?”. Of course, allowing free-form questions requires very advanced natural language processing (NLP) techniques.
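The query-expansion workaround described above can be sketched as follows. The synonym table is a hand-made illustration, not a real terminology resource such as MeSH or the Gene Ontology:

```python
# Sketch of synonym-based query expansion: each user term is replaced
# by the OR of its known aliases before the query is submitted.
# The SYNONYMS dictionary is an invented stand-in for a terminology.

SYNONYMS = {
    "p53": ["p53", "TP53", "tumor protein p53"],
    "WNT-signaling cascade": ["WNT-signaling cascade", "WNT signaling pathway"],
}

def expand_query(terms):
    """Build a Boolean query string with each term expanded to its synonyms."""
    clauses = []
    for term in terms:
        alternatives = SYNONYMS.get(term, [term])  # fall back to the term itself
        clauses.append("(" + " OR ".join(f'"{a}"' for a in alternatives) + ")")
    return " AND ".join(clauses)

print(expand_query(["p53", "WNT-signaling cascade"]))
# ("p53" OR "TP53" OR "tumor protein p53") AND ("WNT-signaling cascade" OR "WNT signaling pathway")
```

A QA system performs this kind of expansion internally, which is why the user can type either name of the gene and still reach the same evidence.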

The second characteristic of QA systems is to provide precise answers instead of only presenting potentially relevant documents. When using IR systems, figuring out the answer to a query requires reading the documents returned by the system. QA systems strive to simply return the answer “No” to the question above, along with a list of references that give evidence for this answer. This requires QA systems to perform a deep linguistic analysis of both the question and the potentially relevant passages, also considering the meaning of terms. Not only must synonyms, hypernyms and hyponyms be considered during answer construction, but entities should also be disambiguated whenever necessary, such as figuring out whether the word “WNT” refers to part of the pathway name or to a mention of a member of the WNT gene family.
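The disambiguation step mentioned above can be illustrated with a minimal context-cue sketch. The two senses and their cue words are invented for illustration and do not reflect any published method:

```python
# Toy context-based disambiguation for an ambiguous mention such as
# "WNT": choose between a "pathway" sense and a "gene" sense by
# counting overlapping cue words. Cue sets are illustrative only.

SENSE_CUES = {
    "pathway": {"signaling", "cascade", "pathway"},
    "gene": {"gene", "family", "expression", "member"},
}

def disambiguate(mention_context):
    """Pick the sense whose cue words overlap the context the most."""
    tokens = set(mention_context.lower().split())
    return max(SENSE_CUES, key=lambda sense: len(SENSE_CUES[sense] & tokens))

print(disambiguate("the WNT signaling cascade regulates development"))  # pathway
print(disambiguate("WNT5A is a member of the WNT gene family"))         # gene
```

Real systems replace the hand-made cue sets with statistical models trained on annotated corpora, but the principle is the same: the surrounding words decide which entity the mention refers to.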

Third, QA is not limited to textual resources and can include the integration of data resources by converting natural language questions into an appropriate query language for searching for answers in databases, for instance over RDF triples [4]. Data extracted from different sources need to be assembled into a coherent single answer by exploring interlinks, dealing with contradictions and joining equal or equivalent answers. Currently, the conversion of biomedical natural language questions to RDF triples is being evaluated in the BioASQ challenge (cf. Section 3.2.1) and question answering over linked data for three biomedical databases is being assessed in one of the Question Answering over Linked Data (QALD) shared tasks (cf. Section 3.2.4). Further, a prototype of the LODQA system [5] (cf. Section 3.1.2) converts questions to SPARQL queries for submission to the BioPortal endpoint.
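As a rough illustration of translating a parsed yes/no question into SPARQL, in the spirit of (but not taken from) systems such as LODQA, one might generate an ASK query from the linked entities. All URIs below are hypothetical placeholders, not real BioPortal identifiers:

```python
# Sketch of question-to-SPARQL conversion for a yes/no question:
# after entity linking has mapped the question to subject, predicate
# and object URIs, an ASK query tests whether the triple holds.

def to_sparql_ask(subject_uri, predicate_uri, object_uri):
    """Build a SPARQL ASK query testing whether a single triple holds."""
    return (
        "ASK WHERE { "
        f"<{subject_uri}> <{predicate_uri}> <{object_uri}> . "
        "}"
    )

# "Is p53 part of the WNT-signaling cascade?" after entity linking:
query = to_sparql_ask(
    "http://example.org/gene/TP53",            # hypothetical URI for p53
    "http://example.org/rel/memberOfPathway",  # hypothetical relation URI
    "http://example.org/pathway/WNT",          # hypothetical pathway URI
)
print(query)
```

An ASK query returns a plain boolean from the endpoint, which maps directly onto the “Yes”/“No” answer a QA system wants to present; questions asking for entities would instead be mapped to SELECT queries with a variable in place of one URI.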

The technology behind information systems has evolved from simple Boolean keyword-based queries to complex linguistic processing of both the question and textual passages. Fig. 1 shows an overview of the evolution of these techniques and illustrates the complexity of question answering systems. Many of the current information systems available for querying PubMed implement some of these techniques (cf. survey in [6]).

Question answering has been successfully applied in other domains; examples of such systems are START3 and Wolfram Alpha4. Recent interest in question answering has also been motivated by IBM’s Watson system [7], which beat human participants in a game show. Various researchers advocate that QA systems can provide many benefits to the biological domain and expect that these systems can boost scientific productivity [8]. Indeed, a study carried out with physicians showed that they do trust the answers provided by QA systems [9]. However, the Life Sciences also pose many challenges to QA systems, especially: (1) a highly complex and large terminology, (2) exponential growth of data and hundreds of on-line databases, and (3) a high degree of contradictions. Often, answering a question requires not only identifying relevant facts in a single document or database, but also merging parts of the answers from distinct sources. Nevertheless, research on question answering in Biology is still scarce, in contrast to the medical domain (cf. Section 4).

The first community-based challenge which included a task related to biomedical question answering took place in 2006 and 2007 and consisted of the evaluation of passage retrieval restricted to topics related to Genomics (cf. TREC Genomics in Section 3.2.3). Later on, in 2012 and 2013, the Question Answering for Machine Reading Evaluation (QA4MRE) Alzheimer Disease challenge assessed systems on the machine reading task, which consists of multiple-choice questions related to a single document (cf. Section 3.2.2). The use of RDF in biomedical QA tasks is currently being evaluated in the QALD challenge (cf. Section 3.2.4), and the most comprehensive challenge related to biomedical QA so far, BioASQ (cf. Section 3.2.1), has been running since 2013.

In this work, we present an overview of question answering systems and techniques for the biological domain. In the next section, we give an overview of the most common components of a question answering system. Section 3 describes current systems and results obtained in shared tasks. Section 4 discusses the state of the art of QA systems for the medical domain and gives insights on which improvements could be achieved in the biological field in the near future.

Section snippets

Fundamental techniques

Question answering systems are usually composed of three steps [10], [11] (Fig. 2): question processing, candidate processing and answer processing. The first step receives the input entered by the user, i.e., a natural language question, and includes pre-processing of the question, identification of the question type and of the type of answer required (e.g., the entity type) and building an input to the next step. In the candidate processing step, relevant documents, passages or raw data
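The three-step architecture outlined above can be sketched as a toy pipeline. The stopword list, keyword-overlap retrieval and yes/no heuristic below are trivial stand-ins for the statistical and linguistic modules a real system would use:

```python
# Toy sketch of the three-step QA pipeline: question processing,
# candidate processing, and answer processing. All components are
# illustrative stand-ins, not a real system's modules.

STOPWORDS = {"is", "part", "of", "the", "a", "in"}

def process_question(question):
    """Question processing: expected answer type plus content keywords."""
    answer_type = "yes/no" if question.lower().startswith(("is", "does", "are")) else "entity"
    words = question.rstrip("?").replace("-", " ").split()
    keywords = [w for w in words if w.lower() not in STOPWORDS]
    return answer_type, keywords

def retrieve_candidates(keywords, corpus):
    """Candidate processing: keep passages mentioning any keyword."""
    return [p for p in corpus if any(k.lower() in p.lower() for k in keywords)]

def extract_answer(answer_type, candidates):
    """Answer processing: reduce the candidate set to a single answer."""
    if answer_type == "yes/no":
        return "Yes" if candidates else "No"
    return candidates[0] if candidates else None

corpus = ["p53 participates in the WNT signaling pathway."]  # toy evidence
atype, keywords = process_question("Is p53 part of the WNT-signaling cascade?")
answer = extract_answer(atype, retrieve_candidates(keywords, corpus))
print(answer)  # Yes
```

Each stage consumes the previous stage's output, which is why the three steps can be studied and improved independently, as the survey does in the following sections.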

State-of-the-art

Question answering for Biology is a very recent topic and current development focuses on the improvement of underlying algorithms and on the evaluation and comparison of systems. Currently, only one system (cf. Section 3.1.1) and one prototype (cf. Section 3.1.2) are actually functioning. Work has specifically been boosted by two community-based shared tasks (cf. Sections 3.2.1 BioASQ, 3.2.2 QA4MRE Alzheimer Disease). In this section we present a brief summary on the available systems, the

Discussion

We presented an overview on the architecture of question answering systems, on the common methods used for individual steps during answer processing and on the state of the art within the biological domain. Mature solutions are still scarce but improvements in this field are promising according to various past and present challenges. In this section we discuss aspects which we find necessary to improve performance and resources which are still underused. For comparison, we also give an overview

Conclusions

In this survey we presented an overview on question answering for Biology. We discussed the desired functionalities that such a system should provide to users, as opposed to classical information retrieval systems, e.g., PubMed. We provided an overview on the methods behind previous solutions on this field, including practical examples for a better understanding of the processes. Finally, we discussed the current state of the art on question answering systems for the biological domain,

References (65)

  • H. Yu, J. Biomed. Inform. (2007)
  • S.J. Athenikos et al., Comput. Methods Programs Biomed. (2010)
  • Y. Cao, J. Biomed. Inform. (2011)
  • C.D. Manning et al., Introduction to Information Retrieval (2008)
  • L. Hirschman et al., Nat. Lang. Eng. (2001)
  • M. Bauer et al., Human Genomics (2012)
  • V. Lopez, Scaling up question-answering to linked data
  • K.B. Cohen, J. dong Kim, Evaluation of SPARQL query generation from natural language questions, in: Natural Language...
  • Z. Lu, Database (2011)
  • D.A. Ferrucci, AI Magazine (2010)
  • J.D. Wren, Bioinformatics (2011)
  • D. Jurafsky et al., Speech and Language Processing (2013)
  • C.F. Baker, C.J. Fillmore, J.B. Lowe, The Berkeley Framenet project, in: Proceedings of the 36th Annual Meeting of the...
  • A. Dolbey, M. Ellsworth, J. Scheffczyk, BioFrameNet: a domain-specific FrameNet extension with links to biomedical...
  • R. Tsai, BMC Bioinform. (2007)
  • G.A. Miller, Commun. ACM (1995)
  • J. Hakenberg et al., Bioinformatics (2008)
  • R. Leaman et al., Bioinformatics (2013)
  • A.R. Aronson et al., J. Am. Med. Inform. Assoc. (2010)
  • T. Rocktäschel et al., Bioinformatics (2012)
  • M. Gerner et al., BMC Bioinform. (2010)
  • T. Nunes et al., Bioinformatics (2013)
  • T.G.O. Consortium, Nucleic Acids Res. (2013)
  • J.W. Ely, Br. Med. J. (2000)
  • C. Schardt et al., BMC Med. Inform. Decis. Mak. (2007)
  • M. Gleize, et al., Selecting answers with structured lexical expansion and discourse relations: LIMSI's participation at...
  • P. Thomas et al., Nucleic Acids Res. (2012)
  • N.F. Noy, Nucleic Acids Res. (2009)
  • R. Huang et al., Natural language question answering over RDF data
  • D. Weissenborn, G. Tsatsaronis, M. Schroeder, Answering factoid questions in the biomedical domain, in: A.-C.N. Ngomo,...
  • M.-D. Olvera-Lobo et al., J. Inform. Sci. (2011)
  • C.-Y. Lin, Rouge: a package for automatic evaluation of summaries, in: Proc. ACL Workshop on Text Summarization...