Modelling knowledge strategy for solving the DNA sequence annotation problem through CommonKADS methodology

https://doi.org/10.1016/j.eswa.2012.12.088

Abstract

Finding the genes that exist within a DNA sequence and assigning them biological features and functions is one of the biggest challenges of Genomics. This task, called annotation, has to be as accurate and reliable as possible, because the resulting information will be used in further research. Ideally, each sequence should be annotated and validated by a human expert, who has the knowledge to infer the most appropriate annotation. Nevertheless, the huge amount of genomic data produced by the new sequencing technologies makes this practice unfeasible. Developing expert systems that are able to annotate sequences automatically and emulate the expert's involvement at certain key points of the process would enhance annotation quality. In this work, the CommonKADS methodology is applied in a novel way for this purpose. It is used to structure and model the knowledge required to build an expert system able to deal with the functional part of sequence annotation, i.e. establishing the biological purpose of the sequence. This approach provides the first general framework for the aforementioned problem, which can be easily extended to related issues.

Highlights

► Functional annotation aims to predict the biological function of a DNA sequence. ► We modelled the knowledge required to functionally annotate a DNA sequence through CommonKADS. ► We designed and implemented an expert system prototype based on the model created. ► The expert system is based on rules obtained during the knowledge elicitation process.

Introduction

With the advent of new sequencing technologies, organisms and even complete ecosystems can be sequenced at low cost and in short periods of time. Generating genomic data is no longer a problem, but processing, analysing and applying these data in a meaningful and useful way still is (Friedberg, 2006). The huge amount of genomic data slows down or even prevents the execution of any process that needs human intervention during the processing or analysis stages.

Gene annotation is one of the major challenges of Genomics. This task consists of finding the genes that exist within a DNA sequence and assigning them biological features, such as the name of the protein they code for or the biological process they are involved in. Structural annotation locates the known genes, genetic markers and other landmarks. Functional annotation aims to predict the biological function of genes and proteins. This paper considers only functional annotation.

The annotation assigned to a sequence should be as accurate and reliable as possible, since it could be used in further biological, medical, or pharmaceutical research. Moreover, uploading annotated sequences to public databases and using this information to annotate other sequences is a common practice in the Bioinformatics community. Therefore, a misannotation could be propagated to future annotations (Friedberg, 2006).

The process of annotating a sequence involves the execution of pipelines composed of many Bioinformatics programs. Highly skilled professionals analyse their outputs and, based on their knowledge of Biology and Biochemistry, infer the most appropriate information for each sequence. Although manual annotation is trustworthy and accurate, the process is extremely time-consuming, labour-intensive, and expensive, and is therefore not suitable for large volumes of data. To avoid the drawbacks of manual annotation, automated annotation methods need to be employed for the vast majority of genomes (Edwards, Stajich, & Hansen, 2009).

This approach sacrifices some annotation quality to obtain results faster and more cheaply by removing the experts. However, given the potential impact of misannotations, its results have to be manually revised (or curated), which recreates the manual bottleneck.

One possible solution to this problem is to design an Expert System (ES) that is capable of emulating human expertise during the annotation process. However, developing such a system is a laborious and complex task that requires a deep comprehension of the application domain. Moreover, it requires eliciting the relevant knowledge and providing it in a format suitable for automated processing.
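As a rough illustration of what "a format suitable for automated processing" could look like, the sketch below encodes elicited expert judgements as IF-THEN rules over similarity-search hits. It is only a minimal sketch under stated assumptions: the names `Hit`, `Rule`, `RULES` and `annotate`, the fields of a hit, and every threshold are hypothetical and are not taken from the paper or from any particular annotation pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    """One similarity-search result for a query sequence (hypothetical fields)."""
    description: str  # annotation text of the matched database entry
    identity: float   # percent identity, 0-100
    coverage: float   # percent of the query covered by the match, 0-100
    curated: bool     # True if the matched entry was manually reviewed


@dataclass(frozen=True)
class Rule:
    """An elicited IF-THEN rule: a condition on a hit plus a confidence label."""
    name: str
    condition: Callable[[Hit], bool]
    confidence: str


# Illustrative rules only; the thresholds are invented for this sketch.
# The list order encodes priority: earlier rules are preferred.
RULES: List[Rule] = [
    Rule("curated_strong_match",
         lambda h: h.curated and h.identity >= 90 and h.coverage >= 80,
         "high"),
    Rule("strong_match",
         lambda h: h.identity >= 90 and h.coverage >= 80,
         "medium"),
    Rule("weak_match",
         lambda h: h.identity >= 40 and h.coverage >= 50,
         "low"),
]


def annotate(hits: List[Hit]) -> Optional[Tuple[str, str, str]]:
    """Return (description, confidence, rule name) for the first rule that fires.

    Rules are tried in priority order; within a rule, hits are tried in the
    order given. Returns None when no rule fires for any hit.
    """
    for rule in RULES:
        for hit in hits:
            if rule.condition(hit):
                return hit.description, rule.confidence, rule.name
    return None


if __name__ == "__main__":
    hits = [
        Hit("hypothetical protein", identity=95.0, coverage=90.0, curated=False),
        Hit("DNA polymerase III subunit alpha", identity=92.0, coverage=85.0, curated=True),
    ]
    print(annotate(hits))
```

Keeping the rules as data, separate from the `annotate` loop that applies them, is one way to let new or revised expert knowledge be added without touching the inference code; whether the authors' system is organised this way is not stated in the text above.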

Obtaining the knowledge required to solve the studied problem is a crucial phase in the development of any Knowledge-Based System (KBS), and of an ES in particular. The success of the system depends to a large extent on the accuracy of the information acquired. There are knowledge elicitation techniques that can facilitate this process, such as structured interviews, protocol analysis, and laddered grids (Burton, Shadbolt, Rugg, & Hedgecock, 1990).

The development of Knowledge-Based Systems (KBS) requires the ability to understand and structure the knowledge in order to incorporate it into the ES. Knowledge Engineering provides methodologies that facilitate this process and consequently help system design and implementation. CommonKADS (Schreiber et al., 2000) is a leading methodology for structuring knowledge. This methodology is a flexible and powerful tool that can be employed in any context-based problem.

This work uses the CommonKADS approach to analyse and structure the knowledge required to develop an ES for functionally annotating DNA sequences without taking the genomic context into account. As far as we can ascertain, this is the first formal description of the application of this methodology in the bioinformatics field. We therefore present a novel general framework for the functional annotation problem that can be adapted to different pipelines and extended to related matters. The knowledge employed here was extracted from experts in Biology and Bioinformatics using the structured interview elicitation technique.

The paper is organised as follows. Section 2 provides the background needed to understand the annotation problem, and describes the current state of the art in tools for annotation. The presentation of the proposed approach starts in Section 3 with an overview of the CommonKADS methodology and the knowledge elicitation techniques applied. Section 4 describes the knowledge modelling process for the aforementioned problem, while Section 5 characterises the main requirements for the related system. Finally, Section 6 discusses the conclusions and future work.

Section snippets

Biological background

The hereditary information of all living beings, with the exception of some viruses, is stored in a macromolecule called Deoxyribonucleic Acid (DNA). This molecule consists of two long complementary strands, linked through hydrogen bonds and twisted around each other, forming a double helix. The strands are mainly composed of smaller molecules called nucleotides. Each nucleotide, in turn, consists of a deoxyribose sugar, one phosphate and one of the four nitrogen-rich bases: Adenine (A), Guanine (G),

CommonKADS

CommonKADS (Schreiber et al., 2000) is a flexible methodology that offers a set of tools to model a KBS, covering not only the knowledge needed but also its context and purpose. Through this methodology, it is possible to identify the strategy that best fits the problem to be solved. In addition, it establishes the methodological basis for tackling the problem in a general way, so that the same basis can be applied to any similar problem, regardless of its complexity. Other benefits of using

Knowledge model

The main focus of this work is to organise and structure the knowledge needed to design the proposed ES. Since one of the major challenges of Knowledge Engineering is to discover structures that are capable of modelling the knowledge in a schematic way, methodologies that are able to accomplish this goal excel in this field.

CommonKADS, as depicted in Fig. 2, supports a Knowledge Model (KM) that facilitates the structuring of a knowledge-intensive information-processing task. This model is structured

Analysis and implementation

The system for functional annotation will be a rule-based ES according to the schema proposed in Fig. 6. The previous sections describe the acquisition of the knowledge about the application domain, and model the generic task to be carried out by that system. Its rules were obtained through interviews held with experts, and complemented with the knowledge engineer's prior knowledge of the domain. The development of the system also requires specifying additional requirements such as its

Conclusion and future works

The technological advances in the last ten years have increased the volume of biological data generated all over the world. Nowadays, the challenge is how to process, analyse, store and use these data in an efficient way. A key example of this situation is the gene annotation process, one of the most difficult tasks of Genomics, which has to identify and characterise the genes in a DNA sequence.

Ideally, the gene annotation process should be carried out by human experts, who evaluate each

Acknowledgements

This work has been supported by Grant CSD2007-00002 Consolider-Ingenio 2010, MICINN (Spain).

Thanks are due to the anonymous referees for their very valuable comments and suggestions.

References (34)

  • A. Burton et al.

    The efficacy of knowledge elicitation techniques: A comparison across domains and levels of expertise

    Knowledge Acquisition

    (1990)
  • R. Akerkar et al.

    Knowledge-based systems

    (2010)
  • A. Alonso et al.

    Ingeniería del conocimiento

    (2004)
  • S.F. Altschul et al.

    Gapped BLAST and PSI-BLAST: A new generation of protein database search programs

    Nucleic Acids Research

    (1997)
  • E.M. Awad et al.

    Knowledge management

    (2004)
  • D.A. Benson et al.

    GenBank

    Nucleic Acids Research

    (2011)
  • B. Berger et al.

    Protein folding in the hydrophobic–hydrophilic (HP) model is NP-complete

  • F.C. Bernstein et al.

    European Journal of Biochemistry

    (1977)
  • P. Crescenzi et al.

    On the complexity of protein folding

  • V. Curwen et al.

    The Ensembl automatic gene annotation system

    Genome Research

    (2004)
  • Durbin, R., Haussler, D., Stein, L., Lewis, S., & Krogh, A. (2011). GFF (general feature format) specifications...
  • D. Edwards et al.

    Bioinformatics tools and applications

    (2009)
  • R. Finn et al.

    The Pfam protein families database

    Nucleic Acids Research

    (2010)
  • P. Flicek et al.

    Ensembl 2011

    Nucleic Acids Research

    (2011)
  • I. Friedberg

    Automated protein function prediction – The genomic challenge

    Briefings in Bioinformatics

    (2006)
  • J.C. Giarratano et al.

    Expert systems: Principles and programming

    (1998)
  • P. Gouret et al.

    Figenix: Intelligent automation of genomic annotation: Expertise integration in a new software platform

    BMC Bioinformatics

    (2005)