Modelling knowledge strategy for solving the DNA sequence annotation problem through CommonKADS methodology
Highlights
► Functional annotation aims to predict the biological function of a DNA sequence.
► We modelled the knowledge required to functionally annotate a DNA sequence through CommonKADS.
► We designed and implemented an expert system prototype based on the resulting model.
► The expert system is based on rules obtained during the knowledge elicitation process.
Introduction
With the advent of new sequencing technologies, organisms and even complete ecosystems can be sequenced at low cost and in short periods of time. Generating genomic data is no longer a problem, but processing, analysing and applying these data in a meaningful and useful way still is (Friedberg, 2006). The huge amount of genomic data slows down or even prevents the execution of any process that requires human intervention during the processing or analysis stages.
Gene annotation is one of the major challenges of Genomics. This task consists in finding the genes that exist within a DNA sequence and assigning them biological features, such as the name of the protein they code for or the biological process they are involved in. Structural annotation locates the known genes, genetic markers and other landmarks. Functional annotation aims to predict the biological function of genes and proteins. This paper considers only functional annotation.
The annotation assigned to a sequence should be as accurate and reliable as possible, since it could be used in further biological, medical, or pharmaceutical research. Moreover, uploading annotated sequences to public databases and using this information to annotate other sequences is a common practice in the Bioinformatics community. Therefore, a misannotation could be propagated to future annotations (Friedberg, 2006).
Annotating a sequence involves running pipelines composed of many Bioinformatics programs. Highly skilled professionals analyse their outputs and, based on their knowledge of Biology and Biochemistry, infer the most appropriate information for each sequence. Although manual annotation is reliable and accurate, it is extremely time-consuming, labour-intensive, and expensive, and is therefore not suitable for large volumes of data. In order to avoid the drawbacks of manual annotation, automated annotation methods need to be employed for the vast majority of genomes (Edwards, Stajich, & Hansen, 2009).
The automated approach sacrifices some annotation quality to obtain results faster and more cheaply by removing the experts. However, given the potential impact of misannotations, its results have to be manually revised (or curated), which recreates the manual bottleneck.
One possible solution to this problem is to design an Expert System (ES) that is capable of emulating human expertise during the annotation process. However, developing such a system is a laborious and complex task that requires a deep comprehension of the application domain. Moreover, it requires eliciting the relevant knowledge and providing it in a format suitable for automated processing.
Obtaining the knowledge required to solve the studied problem is a crucial phase in the development of any Knowledge-Based System (KBS), and of an ES in particular. The success of the system depends to a large extent on the accuracy of the information acquired. There are knowledge elicitation techniques that can facilitate this process, such as structured interviews, protocol analysis, and laddered grids (Burton, Shadbolt, Rugg, & Hedgecock, 1990).
The development of a KBS requires the ability to understand and structure the knowledge in order to incorporate it into the ES. Knowledge Engineering provides methodologies that facilitate this process and consequently help with system design and implementation. CommonKADS (Schreiber et al., 2000) is a leading methodology for structuring knowledge. It is a flexible and powerful tool that can be employed in any context-based problem.
This work uses the CommonKADS approach to analyse and structure the knowledge required to develop an ES for functionally annotating DNA sequences without taking into account the genomic context. As far as we can ascertain, this is the first formal description of the application of this methodology in the bioinformatics field. We therefore present a novel general framework for the functional annotation problem that can be adapted to different pipelines and extended to related matters. The knowledge employed here was elicited through structured interviews with experts in Biology and Bioinformatics.
The paper is organised as follows. Section 2 provides the background needed to understand the annotation problem, and describes the current state of the art in tools for annotation. The presentation of the proposed approach starts in Section 3 with an overview of the CommonKADS methodology and the knowledge elicitation techniques applied. Section 4 describes the knowledge modelling process for the aforementioned problem, while Section 5 characterises the main requirements for the related system. Finally, Section 6 discusses the conclusions and future work.
Section snippets
Biological background
The hereditary information of all living beings, with the exception of some viruses, is stored in a macromolecule called Deoxyribonucleic Acid (DNA). This molecule consists of two long complementary strands, linked through hydrogen bonds and twisted around each other to form a double helix. The strands are mainly composed of smaller molecules called nucleotides. Each nucleotide, in turn, consists of a deoxyribose sugar, one phosphate and one of the four nitrogen-rich bases: Adenine (A), Guanine (G),
CommonKADS
CommonKADS (Schreiber et al., 2000) is a flexible methodology that offers a set of tools to model a KBS with regard not only to the knowledge needed but also to its context and purpose. Through this methodology, it is possible to identify the strategy that best fits the problem to be solved. In addition, it establishes the methodological bases to tackle the problem in a general way, allowing these bases to be applied to any similar problem, independently of its complexity. Other benefits of using
Knowledge model
The main focus of this work is to organise and structure the knowledge needed to design the proposed ES. Since one of the major challenges of Knowledge Engineering is to discover structures that are capable of modelling the knowledge in a schematic way, methodologies that accomplish this goal excel in this field.
CommonKADS, as depicted in Fig. 2, supports a Knowledge Model (KM) that facilitates the structuring of a knowledge-intensive information-processing task. This model is structured
Analysis and implementation
The system for functional annotation will be a rule-based ES according to the schema proposed in Fig. 6. The previous sections describe the acquisition of the knowledge about the application domain, and model the generic task to be carried out by that system. Its rules were obtained through interviews held with experts, and complemented with the knowledge engineer's prior knowledge of the domain. The development of the system also requires specifying additional requirements such as its
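To make the notion of a rule-based ES concrete, the following minimal Python sketch shows a forward-chaining loop over condition/action rules. It is an illustration only and not the system described in this paper: the rule names, the fact keys (best_hit_evalue, best_hit_description, similarity, annotation, confidence) and the 1e-20 threshold are hypothetical assumptions chosen for the example, not rules elicited from the experts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Working memory: evidence gathered for one sequence (hypothetical keys).
Facts = Dict[str, object]


@dataclass
class Rule:
    name: str
    condition: Callable[[Facts], bool]   # when the rule applies
    action: Callable[[Facts], None]      # what it adds to working memory


def run_rules(facts: Facts, rules: List[Rule]) -> List[str]:
    """Forward chaining: fire each rule at most once until nothing new applies."""
    fired: List[str] = []
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.name not in fired and rule.condition(facts):
                rule.action(facts)
                fired.append(rule.name)
                changed = True
    return fired


# Hypothetical rules; the 1e-20 threshold is an assumption, not an elicited value.
RULES = [
    Rule("strong_hit",
         lambda f: f.get("best_hit_evalue", 1.0) <= 1e-20,
         lambda f: f.update(similarity="strong")),
    Rule("transfer_annotation",
         lambda f: f.get("similarity") == "strong" and "best_hit_description" in f,
         lambda f: f.update(annotation=f["best_hit_description"], confidence="high")),
    Rule("no_evidence",
         lambda f: "similarity" not in f and "annotation" not in f,
         lambda f: f.update(annotation="hypothetical protein", confidence="low")),
]

if __name__ == "__main__":
    evidence = {"best_hit_evalue": 1e-50,
                "best_hit_description": "DNA polymerase III subunit alpha"}
    print(run_rules(evidence, RULES))
    print(evidence["annotation"], evidence["confidence"])
```

Keeping the rules as data, separate from the inference loop, mirrors the usual separation between knowledge base and inference engine, so rules can be added or revised without touching the control code.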
Conclusion and future work
The technological advances in the last ten years have increased the volume of biological data generated all over the world. Nowadays, the challenge is how to process, analyse, store and use these data in an efficient way. A key example of this situation is the gene annotation process, one of the most difficult tasks of Genomics, which has to identify and characterise the genes in a DNA sequence.
Ideally, the gene annotation process should be carried out by human experts, who evaluate each
Acknowledgements
This work has been supported by Grant CSD2007-00002 Consolider-Ingenio 2010, MICINN (Spain).
Thanks are due to the anonymous referees for their very valuable comments and suggestions.
References (34)
- et al. The efficacy of knowledge elicitation techniques: A comparison across domains and levels of expertise. Knowledge Acquisition (1990)
- et al. Knowledge-based systems (2010)
- et al. Ingeniería del conocimiento (2004)
- et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research (1997)
- et al. Knowledge management (2004)
- et al. GenBank. Nucleic Acids Research (2011)
- et al. Protein folding in the hydrophobic–hydrophilic (HP) is NP-complete
- et al. European Journal of Biochemistry (1977)
- et al. On the complexity of protein folding
- et al. The Ensembl automatic gene annotation system. Genome Research (2004)
- Bioinformatics tools and applications
- The Pfam protein families database. Nucleic Acids Research
- Ensembl 2011. Nucleic Acids Research
- Automated protein function prediction – The genomic challenge. Briefings in Bioinformatics
- Expert systems: Principles and programming
- Figenix: Intelligent automation of genomic annotation: Expertise integration in a new software platform. BMC Bioinformatics