Elsevier

Computers in Biology and Medicine

Volume 69, 1 February 2016, Pages 144-151

On fuzzy semantic similarity measure for DNA coding

https://doi.org/10.1016/j.compbiomed.2015.12.017

Highlights

  • The FSSM coding scheme centers on codon clustering and genetic code context.

  • FSSM exploits natural characteristics of nucleotides in codons.

  • FSSM reveals a strong correlation between nucleotides in codons.

  • FSSM attains a significant enhancement in coding region identification.

Abstract

A coding measure scheme numerically translates a DNA sequence into a time-domain signal for the identification of protein coding regions. A number of coding measure schemes based on numerology, geometry, fixed mapping, statistical characteristics and chemical attributes of nucleotides have been proposed in recent decades. Such schemes lack the biologically meaningful aspects of nucleotide data and hence do not significantly discriminate coding regions from non-coding regions.

This paper presents a novel fuzzy semantic similarity measure (FSSM) coding scheme centered on FSSM codon clustering and the genetic code context of nucleotides. FSSM exploits certain natural characteristics of nucleotides, i.e. their appearance as unique combinations of triplets, their preservation of a special structure and occurrence, and their ability to own and share density distributions in codons. The nucleotides' fuzzy behaviors, semantic similarities and defuzzification based on the center of gravity of nucleotides reveal a strong correlation between nucleotides in codons. The proposed FSSM coding scheme attains a significant enhancement in coding region identification, i.e. 36–133% compared with existing coding measure schemes, tested over more than 250 benchmarked and randomly taken DNA datasets of different organisms.

Introduction

Deoxyribonucleic acid (DNA) is the core material in living species responsible for growth and the genetic transfer of traits [1], [2]. Genes are the segments of a DNA sequence that encode proteins, and proteins perform most of the functions in organisms. The coding regions (exons) are sequences of nucleotides that actually code for protein, while non-coding regions (introns) do not [3], [4]. Coding region identification is hampered by 1/f background noise, which blurs the boundaries between the two regions and makes coding regions hard to distinguish from non-coding regions. Digital signal processing (DSP) approaches have been widely used in genome sequence analysis, especially for protein coding region identification [5], [6]. Solutions relying on DSP approaches require nucleotide bases to be represented as numerical values and nucleotide sequences to be translated into time-domain signals [7], [8], [9]. It has been observed [10], [11], [12] that the coding measure scheme contributes significantly to the efficient discrimination of protein coding regions from non-coding regions by suppressing 1/f noise. A large number of such schemes have been proposed by researchers in recent decades.
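As an illustration of how a DSP approach can use such a time-domain signal, the sketch below measures the discrete Fourier transform power at the period-3 frequency, a commonly used indicator of coding regions. It is a minimal sketch and not the method of this paper: the function names and the toy sequences are assumptions made for illustration, and a binary indicator mapping is used only because it is the simplest choice.

```python
import cmath

def indicator(seq, base):
    """Binary signal: 1 where `base` occurs in `seq`, 0 elsewhere."""
    return [1 if ch == base else 0 for ch in seq.upper()]

def period3_power(signal):
    """Power of the DFT at angular frequency 2*pi/3 (bin N/3 when N is a multiple of 3)."""
    total = sum(x * cmath.exp(-2j * cmath.pi * i / 3) for i, x in enumerate(signal))
    return abs(total) ** 2

def total_period3_power(seq):
    """Sum of the period-3 power over the four per-base indicator signals."""
    return sum(period3_power(indicator(seq, b)) for b in "ACGT")

# Toy inputs (hypothetical): a strongly codon-periodic sequence and a flat one.
periodic = "ATG" * 30
flat = "A" * 90
print(total_period3_power(periodic), total_period3_power(flat))
```

In practice the power is usually computed in a sliding window along the sequence, so that peaks can be localized; the window length is a further assumption not specified here.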

The first coding measure scheme, the "Tetrahedron mapping scheme", was proposed by Silverman et al. [13]. Later, Voss [7] presented another coding measure scheme called the "Binary indicator sequence". This scheme was employed [14], [15], [16] in DNA sequence analysis using digital signal processing approaches. Certain drawbacks of the Voss coding scheme [7], which appeared as frequency leakage, were addressed by Nair et al. [8], who replaced the four binary indicator sequences with a single sequence called the "EIIP indicator sequence". The electron–ion interaction pseudopotential (EIIP) expresses the energy of delocalized electrons in amino acids and nucleotides. This coding measure was widely employed [9], [17] to address issues in DNA sequence analysis and coding region identification. The significant results reported by the authors of [8] with the EIIP indicator sequence were further improved by M.K. Hota and V.K. Srivastava [17], who proposed a new indicator sequence named the "Complex indicator sequence". This scheme has been reviewed [10], [11], [12] as being as significant as EIIP in reducing the computational overhead of the Voss coding scheme for coding region identification.
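The two mappings described above can be sketched as follows. The four binary indicator sequences follow the Voss definition, and the EIIP values are the commonly quoted nucleotide values (A 0.1260, C 0.1340, G 0.0806, T 0.1335); the function names are assumptions made for this illustration.

```python
# Commonly quoted EIIP values for the four nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def voss_indicators(seq):
    """Four binary indicator sequences, one per base (Voss mapping)."""
    seq = seq.upper()
    return {b: [1 if ch == b else 0 for ch in seq] for b in "ACGT"}

def eiip_sequence(seq):
    """Single numeric sequence obtained by replacing each base with its EIIP value."""
    return [EIIP[ch] for ch in seq.upper()]

dna = "ATGC"
print(voss_indicators(dna))
print(eiip_sequence(dna))
```

The Voss mapping yields four signals per sequence, whereas the EIIP mapping yields one, which is the source of the reduced computational overhead mentioned above.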

The coding measure schemes proposed so far are mainly based on fixed mapping methods [18], [19], [20], chemical property based methods [21], [22] and statistical property based methods [23], [24], [25]. Researchers now see the need to address inter-nucleotide distance and genetic code context (e.g. nucleotide/amino acid information, compositions and densities of nucleotides, and their relevant positions and orders in codons) when proposing DNA coding measure schemes. Yin et al. [10] introduced a numerical representation of DNA sequences based on the genetic code context in the sequences. The authors mapped the 20 amino acids to 20 unique complex numbers, whose real and imaginary parts were derived from specific characteristics of the amino acids, and proposed a coding measure scheme on that basis. In the same direction, the most recent work is that of Mujiono et al. [26], who considered inter-nucleotide distance with hierarchical clustering to identify fractal patterns in DNA sequences. The authors numerically mapped the DNA sequence to a digital signal in order to calculate inter-nucleotide distance.

Coding measure schemes based on genetic code context would significantly help DNA sequence analysis and coding region identification [12]. We could not find satisfactory work on DNA coding measure schemes based on genetic code context. Similarly, despite an exhaustive literature review of clustering approaches for codon clustering, only a few publications that employed clustering algorithms were found. Such approaches (a few examples are [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39]) used distance as a similarity measure to cluster DNA/RNA sequences, and none of them addresses codon clustering. Mostly hierarchical or K-Means algorithms and their variations have been used to classify or cluster sequences. Clustering approaches based on such algorithms compute the distance (Euclidean/Edit/Minkowski) between a datum and the centroid of a cluster. Since nucleotides in codons have special biological properties and preserve their unique structure, common clustering algorithms based on distance as a similarity measure are of little use for codon clustering.
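A small example may help show why a plain distance measure can be a poor fit for codons. The sketch below uses Hamming distance, chosen here only as a simple stand-in for the distance measures named above, and a three-codon excerpt of the standard genetic code; the function name and the choice of codons are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

# Excerpt of the standard genetic code (three codons only).
CODON_TO_AA = {"CTT": "Leu", "TTA": "Leu", "TAA": "Stop"}

# CTT and TTA both encode leucine, yet they are farther apart than
# TTA and the stop codon TAA.
print(hamming("CTT", "TTA"), CODON_TO_AA["CTT"], CODON_TO_AA["TTA"])
print(hamming("TTA", "TAA"), CODON_TO_AA["TTA"], CODON_TO_AA["TAA"])
```

Here the synonymous pair is at distance 2 while the leucine and stop codons are at distance 1, so a centroid-distance criterion would group codons against their biological meaning.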


Methodology

We propose a novel fuzzy semantic similarity measure (FSSM) coding scheme centered on FSSM codon clustering and the genetic code context of nucleotides. FSSM captures genetically meaningful characteristics of nucleotides in codons, i.e. the nucleotides' density distribution, the specific positions of nucleotides in codons and the nucleotides' usage in terms of their distribution.
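The abstract mentions defuzzification based on the center of gravity. The generic center-of-gravity (centroid) rule for a discrete fuzzy set is the membership-weighted mean of the candidate values, and it can be sketched as below. This is only the textbook rule: the membership values and candidate values are hypothetical, and the membership functions actually used by FSSM are not given in this excerpt.

```python
def centroid_defuzzify(values, memberships):
    """Center-of-gravity defuzzification: sum(mu_i * x_i) / sum(mu_i)."""
    if len(values) != len(memberships):
        raise ValueError("values and memberships must have equal length")
    total = sum(memberships)
    if total == 0:
        raise ValueError("at least one membership degree must be non-zero")
    return sum(v * m for v, m in zip(values, memberships)) / total

# Hypothetical candidate values and membership degrees, for illustration only.
crisp = centroid_defuzzify([1.0, 2.0, 3.0], [0.2, 0.5, 0.3])
print(crisp)
```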

Results and discussions

The performance of different coding measure schemes for coding region identification has been evaluated at the nucleotide level. The following evaluation measures have been employed:

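\[ \text{Discrimination measure } (D) = \frac{\text{Lowest amplitude of exon}}{\text{Highest amplitude of intron}} \tag{1} \]
\[ \text{Sensitivity } (S_n) = \frac{TP}{TP + FN} \tag{2} \]
\[ \text{Specificity } (S_p) = \frac{TP}{TP + FP} \tag{3} \]
\[ \text{Prediction accuracy } (P) = \frac{TP + TN}{TP + FP + TN + FN} \tag{4} \]
\[ \text{Approximate correlation } (AC) = (ACP - 0.5) \times 2 \tag{5} \]
where,
\[ ACP = \frac{1}{4}\left(\frac{TP}{TP + FN} + \frac{TP}{TP + FP} + \frac{TN}{TN + FN} + \frac{TN}{TN + FP}\right) \tag{6} \]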

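Here TP, TN, FP and FN denote the nucleotide-level counts of true positives, true negatives, false positives and false negatives, respectively.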
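The evaluation measures in Eqs. (1)–(6) can be computed directly from the four nucleotide-level counts and the spectral amplitudes. The sketch below is a minimal illustration; the function names and the example counts are assumptions, and specificity follows the definition in Eq. (3), i.e. TP/(TP+FP).

```python
def discrimination(exon_amplitudes, intron_amplitudes):
    """Eq. (1): lowest exon amplitude divided by highest intron amplitude."""
    return min(exon_amplitudes) / max(intron_amplitudes)

def evaluation_measures(tp, fp, tn, fn):
    """Eqs. (2)-(6) from true/false positive and negative nucleotide counts."""
    sn = tp / (tp + fn)                      # sensitivity, Eq. (2)
    sp = tp / (tp + fp)                      # specificity as defined in Eq. (3)
    p = (tp + tn) / (tp + fp + tn + fn)      # prediction accuracy, Eq. (4)
    acp = 0.25 * (tp / (tp + fn) + tp / (tp + fp)
                  + tn / (tn + fn) + tn / (tn + fp))   # Eq. (6)
    ac = (acp - 0.5) * 2                     # approximate correlation, Eq. (5)
    return {"Sn": sn, "Sp": sp, "P": p, "ACP": acp, "AC": ac}

# Hypothetical counts and amplitudes, for illustration only.
print(evaluation_measures(tp=40, fp=10, tn=40, fn=10))
print(discrimination([4.0, 6.0], [2.0, 1.0]))
```

A value of D greater than 1 means that every exon peak exceeds every intron peak, so a single threshold separates the two regions.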

Conclusion

This paper presented a novel fuzzy semantic similarity measure (FSSM) codon clustering algorithm for a DNA coding measure scheme. The FSSM algorithm exploited nucleotides' fuzzy membership and their natural characteristics in terms of fuzzy behaviors and semantic similarities, revealing a strong correlation between them. The FSSM coding measure scheme achieved a significant enhancement in coding region identification compared with other existing coding measure schemes. More than 250 benchmarked

Conflict of interests

The authors have no conflicts of interest.

References (47)

  • R.F. Voss, Evolution of long-range fractal correlations and 1/f noise in DNA base sequences, Phys. Rev. Lett. (1992)
  • S. Nair et al., A coding measure scheme employing electron-ion interaction pseudopotential (EIIP), Bioinformation (2006)
  • M.K. Hota, V.K. Srivastava, DSP technique for gene and exon prediction taking complex indicator sequence, in: Proceedings...
  • C. Yin, S. Yau, Numerical representation of DNA sequences based on genetic code context and its applications in...
  • M. Akhtar et al., Signal processing in sequence analysis: advances in eukaryotic gene prediction, IEEE J. Sel. Top. Signal Process. (2008)
  • Hon Keung Kwan, Swarna bai Arniker, Numerical representation of DNA Sequences, in: Proceedings of International...
  • D.G. Grandhi, C. Vijay Kumar, 2-Simplex mapping for identifying the protein coding regions in DNA, in: Proceedings of...
  • J.P. Mena-Chalco et al., Identification of protein coding regions using the modified Gabor-wavelet transform, IEEE/ACM Trans. Comput. Biol. Bioinform. (2007)
  • Changchuan Yin et al., Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence, J. Theor. Biol. (2007)
  • M.K. Hota, V.K. Srivastava, DSP technique for gene and exon prediction taking EIIP indicator sequence, in: Proceedings...
  • B. Demeler et al., Neural network optimization for E. coli promoter prediction, Nucleic Acids Res. (1991)
  • P. Lio et al., Finding pathogenicity islands and gene transfer events in genome data, Bioinformatics (2000)
  • R. Ranawana et al., A neural network based multi-classifier system for gene identification in DNA sequence, Neural Comput. Appl. (2005)