Using OWL to model biological knowledge

https://doi.org/10.1016/j.ijhcs.2007.03.006

Abstract

Much has been written about the facilities for ontology building and reasoning offered for ontologies expressed in the Web Ontology Language (OWL). Less has been written about how the modelling requirements of different areas of interest are met by OWL-DL's underlying model of the world. In this paper we use the disciplines of biology and bioinformatics to reveal the requirements of a community that both needs and uses ontologies. We use a case study of building an ontology of protein phosphatases to show how OWL-DL's model can capture a large proportion of the community's needs. We demonstrate how Ontology Design Patterns (ODPs) can work around inherent limitations of this model. We give examples of relationships between more than two instances, lists, and exceptions, and conclude by illustrating what OWL-DL and its underlying description logic cannot handle, either in theory or because of a lack of implementation. Finally, we present a research agenda that, if fulfilled, would help ensure OWL's wider uptake in the life science community.

Introduction

In this paper we investigate the ontological needs of biology and the associated discipline of bioinformatics. Much has been written about what knowledge representation languages such as the description logic (DL) variant of the Web Ontology Language (OWL) can offer domain experts in terms of modelling facilities (Dean et al., 2002). Much less has been written about what particular domains need to capture in such modelling languages. In this paper, we will put forth the knowledge modelling requirements of biology and bioinformatics. This will highlight the limits of modern description logics (DLs) as knowledge representation languages. The expressive restrictions of DLs are well known (Baader et al., 2003, Chapter 1); in this article, however, we take the perspective of the needs of a domain rather than that of a computer science research agenda.

OWL-DL is underpinned by a DL (Baader et al., 2003), a fragment of first order logic. This means that an OWL-DL ontology is expressed in a formalism with well-defined semantics and over which automated reasoning can take place. We will describe OWL-DL's use in this context and how it captures biology and bioinformatics domain knowledge in ontologies. One major question to be asked is whether the logical approach followed by OWL-DL suits the description of the natural world, with all its complexities and inconsistencies.

Bioinformatics is the use of computational and mathematical techniques to store, manage and analyse biological data to answer biological problems (Kaminski, 2000). At the centre of bioinformatics is the analysis of DNA and protein sequences. Its goal is to characterise nucleic acid sequences (genes) and their products, primarily proteins. Biology, however, is unlike physics and much of chemistry in that—although it contains many laws and models—few of these are reduced to a mathematical form. It is not possible to take a protein's sequence of amino acids, apply some formula, and derive a set of characteristics such as location, functionality, forms of modification, regulation, etc.

Instead of mathematical laws, bioinformaticians use similarity. The central dogma of bioinformatics is that if an uncharacterised sequence is sufficiently similar to one that has been characterised, then the understanding can be transferred from the characterised to the uncharacterised. Many tools are provided for comparing sequences against databases of other sequences (Attwood and Miller, 2001). This search for similarity is, however, not simply done on the basis of some statistical measures. A good bioinformatician will use all the facts recorded about the entity and the nature of the matches between the sequences in order to infer any biological relationship. This is why both biology and bioinformatics have been characterised as a “knowledge based discipline” (Baker et al., 1999).
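The transfer-by-similarity idea can be sketched in a few lines of Python. This is only a toy illustration: the identity measure, the 0.8 threshold, the function names and the example sequences are all assumptions made here, and the sketch deliberately ignores the wider recorded knowledge that, as noted above, a good bioinformatician also brings to bear.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def transfer_annotation(query, characterised, threshold=0.8):
    """Return the annotation of the most similar characterised sequence,
    or None when no sequence reaches the (assumed) threshold."""
    best_label, best_score = None, 0.0
    for seq, label in characterised.items():
        score = percent_identity(query, seq)
        if score >= threshold and score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: sequence -> annotation of a characterised protein.
known = {"MKVLA": "phosphatase-like", "AAAAA": "other"}
```

Calling `transfer_annotation("MKVLA", known)` returns the annotation of the identical characterised sequence, whereas a query that resembles nothing in `known` yields `None`, leaving the sequence uncharacterised.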

As a consequence of needing to record this knowledge in a consistent and computationally amenable form, ontologies of various kinds have become very important in bioinformatics (The Gene Ontology Consortium, 2000, Stevens et al., 2003). Molecular biologists wish to describe and record a wide range of knowledge items. These include, but are not limited to:

  • names of things;

  • classifications (such as species);

  • the size (absolute and ranges, both real and integers), shape and numbers of things;

  • functions, processes and behaviours of things;

  • structure and substance (atoms, molecules, tissues, etc.);

  • evidence (both experimental and literature) for facts about the world;

  • patterns (regular expressions in sequences indicative of some feature, etc.);

  • parts of things to describe anatomy, composition of molecules and assemblies of molecules, etc.;

  • the order of things and their transformation, such as life cycle stages, metabolic pathway reactions, exons in genes;

  • degree of match and similarity of things.

The biology community has recognised a need for ontologies. OWL is a recommendation for the representation of ontologies. It is pertinent, therefore, to examine OWL's ability to fulfil the ontological needs of the biological domain. As we will see, OWL-DL has its limitations in meeting these goals. The motivation for using it to build ontologies of molecular biology is, however, strong. OWL's ability to model incomplete, irregular knowledge fits well with our incomplete, irregular knowledge of biology. OWL-DL's computational qualities of consistency checking and classification are also invaluable in creating coherent and useful ontological models of a very complex domain (Rector et al., 2001, Wroe et al., 2003).

In Section 2 we describe the approach followed by OWL-DL and its modelling constructs and the application of automated reasoning to OWL-DL ontologies. This section can be skipped by those familiar with OWL-DL. We then present a protein family as a case study for ontological modelling in Section 3 and an ontology of that family in Section 4 to describe what can be straightforwardly captured in OWL-DL. Then, in Section 5 we show how some of the limitations of OWL-DL can be circumvented with the use of Ontology Design Patterns. Finally, in Section 6 we discuss what cannot be captured in OWL-DL and use Section 7 to provide a general discussion of the limitations of OWL-DL to represent knowledge in the life sciences.

Section snippets

The OWL-DL model of the World

DLs are a decidable fragment of first order logic and thus have a well-defined, two-valued semantics, i.e., they allow us to express what is universally true (Baader et al., 2003). In OWL-DL, the basic unit of an ontology is a class, which represents a set of individuals, its instances. Moreover, we consider properties, which represent (binary) relations between individuals. Individuals, together with the information about which individual is an instance of which class, and how the individuals
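The set-based reading of classes, properties and individuals described in this section can be made concrete with a small Python sketch. Every name in it (`Protein`, `hasDomain`, `p1` and so on) is hypothetical, and the code evaluates a single hand-written interpretation only; a DL reasoner, by contrast, draws conclusions that hold in all models of an ontology.

```python
# Toy interpretation: every name below is hypothetical.
classes = {
    "Protein": {"p1", "p2"},          # a class is a set of individuals
    "PhosphataseDomain": {"d1"},
    "Domain": {"d1", "d2"},
}
properties = {
    # a property is a binary relation, i.e. a set of (subject, object) pairs
    "hasDomain": {("p1", "d1"), ("p2", "d2")},
}


def some_values_from(prop, filler):
    """Individuals with at least one `prop` successor in `filler`
    (the reading of an existential restriction in this interpretation)."""
    return {s for (s, o) in properties[prop] if o in filler}


def is_subclass_in_model(sub, sup):
    """Subset check in THIS interpretation only, not logical entailment."""
    return sub <= sup
```

Here `some_values_from("hasDomain", classes["PhosphataseDomain"])` picks out the individuals that have some phosphatase domain, which in this toy interpretation is just `p1`.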

A knowledge case study

Proteins are divided into broad functional classifications called families. Protein phosphatases and protein kinases are two families that control the phosphorylation events in a cell (Alberts et al., 1989). Biologists classify phosphatases according to their functionality and evolutionary relationships to each other. Tertiary structure units

A phosphatase family ontology

We developed a phosphatase ontology to help semi-automatically support a phosphatase protein family database (Wolstencroft et al., 2005a, Wolstencroft et al., 2005c) and to automatically classify proteins found in a genome (Wolstencroft et al., 2005b, Wolstencroft et al., 2006).

In this ontology, the classes of phosphatase were defined in terms of their p-domain composition. Fig. 3 shows how the p-domain composition of each protein can be sufficient to recognise to which phosphatase sub-family
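Recognising a sub-family from p-domain composition can be illustrated with a short Python sketch. It is not the ontology itself: the sub-family names, the domain names and the "at least n domains of this kind" reading of a definition are assumptions made for illustration, and in the real system the matching is performed by a DL reasoner over OWL class definitions rather than by hand-written code.

```python
from collections import Counter

# Hypothetical definitions: sub-family -> minimum number of each p-domain.
DEFINITIONS = {
    "SubfamilyA": {"CatalyticDomain": 1},
    "SubfamilyB": {"CatalyticDomain": 1, "TransmembraneDomain": 1},
}


def classify(domains):
    """Return, in sorted order, every sub-family whose domain requirements
    are met by the given list of p-domains."""
    composition = Counter(domains)  # missing domains count as zero
    return sorted(
        name
        for name, required in DEFINITIONS.items()
        if all(composition[d] >= n for d, n in required.items())
    )
```

A protein described as `["CatalyticDomain", "TransmembraneDomain"]` satisfies both hypothetical definitions, while one with only a transmembrane domain satisfies neither.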

Using OWL with Ontology Design Patterns

In this section, we will concentrate on those limitations of OWL that can be worked around by using Ontology Design Patterns (ODPs). We do not exhaustively explore ODPs for OWL, but illustrate how they can use OWL-DL's current expressivity to work around some of the inherent
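One limitation named in the abstract is relationships between more than two instances, which binary properties cannot state directly. A commonly used workaround is to reify the relationship, that is, to represent it as an individual in its own right. The Python sketch below shows the general shape of that idea under stated assumptions; it is not the paper's own pattern, and all identifiers (`BindingEvent`, `hasProtein`, `proteinX` and so on) are hypothetical.

```python
# All identifiers here are hypothetical, chosen only for illustration.
triples = set()


def add(subject, prop, obj):
    """Record one binary assertion, the only kind a property can express."""
    triples.add((subject, prop, obj))


# The relationship itself becomes an individual...
add("binding1", "type", "BindingEvent")
# ...and each participant is attached to it by an ordinary binary property.
add("binding1", "hasProtein", "proteinX")
add("binding1", "hasLigand", "ligandY")
add("binding1", "hasLocation", "membrane")


def participants(event):
    """Collect the role fillers attached to a reified relationship."""
    return {p: o for (s, p, o) in triples if s == event and p != "type"}
```

The three-way relationship is recovered by asking for `participants("binding1")`, at the cost of introducing an extra individual and one property per role.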

The boundary of the OWL World

In this section, we cross the boundary of the OWL-DL view of the world and explore aspects of biology that OWL-DL cannot represent. Some of these aspects cannot be expressed in OWL-DL or any decidable description logic, because they are known to lead to undecidability, semantic problems, or currently unmanageable computational complexity. Other aspects cannot be expressed in OWL-DL, but it is known that an extension of OWL-DL with the corresponding expressive means would be possible. These

Discussion

In this paper, we have explored the ontological requirements posed by biology and bioinformatics and how well OWL-DL's model matches those requirements. There are obviously large areas of the world of biology that can be represented using OWL-DL with great success. It is possible to create OWL-DL descriptions of molecular biology that are both ontologically good and useful in driving applications. Yet, it is relatively easy to find features of biology that do not fit into this strict, universal

References (37)

  • T. Attwood et al. Which craft is best in bioinformatics? Computers and Chemistry (2001)
  • B. Alberts et al. Molecular Biology of the Cell (1989)
  • J.N. Andersen et al. Structural and evolutionary relationships among protein tyrosine phosphatase domains. Molecular and Cellular Biology (2001)
  • Baader, F., Hanschke, P., 1991. A schema for integrating concrete domains into concept languages. In: Proceedings of...
  • Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.F. (Eds.), 2003. The Description Logic...
  • P. Baker et al. An ontology for bioinformatics applications. Bioinformatics (1999)
  • S. Bechhofer et al. The OWL instance store: system description
  • D. Calvanese et al. Reasoning in expressive description logics
  • T.R. Cech. Self-splicing of group I introns. Annual Review of Biochemistry (1990)
  • Dean, M., Connolly, D., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-Schneider, P.F., Stein,...
  • Eiter, T., Lukasiewicz, T., Schindlauer, R., Tompits, H., 2004. Combining answer set programming with description...
  • Sirin, E., Parsia, B., 2004. Pellet: an OWL DL reasoner. In: Haarslev, V., Möller, R. (Eds.), Proceedings of the International...
  • E. Gamma et al. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional Computing Series (1995)
  • V. Haarslev et al. RACER system description

  • Holger Knublauch, A.R., Musen, M., 2004. Editing description logic ontologies with the protege-owl plugin. In:...
  • Horrocks, I. FacT++., web site,...
  • Horrocks, I., Kutz, O., Sattler, U., 2005. The irresistible SRIQ. In: OWL: Experiences and Directions (Workshop),...
  • N. Kaminski. Bioinformatics. A user's perspective. American Journal of Respiratory Cell and Molecular Biology (2000)
1 Mikel Egaña Aranguren is funded by an EPSRC studentship.

2 Katy Wolstencroft's work on the phosphatase ontology was funded by an MRC studentship.

3 Matthew Horridge and Nick Drummond are funded by JISC - Semantic Grid and Autonomic Computing Programme grant.
