On fuzzy semantic similarity measure for DNA coding
Introduction
Deoxyribonucleic acid (DNA) is the core material in living species, responsible for growth and the genetic transfer of traits [1], [2]. Genes are the segments of a DNA sequence that encode proteins, and proteins perform most of the functions in organisms. The coding regions (exons) are sequences of nucleotides that actually code for protein, while non-coding regions (introns) do not [3], [4]. Identification of coding regions is tightly coupled with 1/f background noise, which blurs the boundaries between the two kinds of region and severely hinders reliable discrimination of coding from non-coding regions. Digital signal processing (DSP) approaches have been widely used in genome sequence analysis, especially for identifying protein coding regions [5], [6]. Solutions relying on DSP require the nucleotide bases to be represented as numerical values and the nucleotide sequences to be translated into time-domain signals [7], [8], [9]. It has been observed [10], [11], [12] that the coding measure scheme contributes significantly to efficient identification of protein coding regions by suppressing 1/f noise. A large number of such schemes have been proposed by researchers in the recent decade.
The first coding measure scheme, the "Tetrahedron mapping scheme", was proposed by Silverman et al. [13]. Later, Voss [7] presented another coding measure scheme, the "Binary indicator sequence", which was employed [14], [15], [16] in DNA sequence analysis using digital signal processing approaches. Drawbacks of the Voss coding scheme [7], which appeared as frequency leakage, were addressed by Nair et al. [8], who replaced the four binary indicator sequences with a single sequence called the "EIIP indicator sequence". The electron–ion interaction pseudopotential (EIIP) is the energy of delocalized electrons in amino acids and nucleotides. This coding measure was widely employed [9], [17] for DNA sequence analysis and coding regions identification. The results obtained with the EIIP indicator sequence [8] were further improved by M.K. Hota and V.K. Srivastava [17], who proposed a new indicator sequence named the "Complex indicator sequence". This coding measure scheme has been reviewed [10], [11], [12] as being as significant as EIIP in reducing the computational overhead involved in adopting the Voss coding scheme for coding regions identification.
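The difference between the four-track Voss mapping and the single-track EIIP mapping can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen for illustration only; the EIIP values are those commonly quoted for the four nucleotides and should be checked against [8] before any real use.

```python
# Illustrative sketch only: function names are hypothetical, not taken from
# the cited papers. EIIP values are the commonly quoted nucleotide values
# (assumption: verify against the original reference before relying on them).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def voss_indicators(seq):
    """Voss mapping: one binary indicator sequence per nucleotide.

    The track for base b holds 1 where the sequence has b and 0 elsewhere,
    so a sequence of length N becomes four signals of length N.
    """
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in "ACGT"}


def eiip_indicator(seq):
    """EIIP mapping: a single numerical sequence of length N.

    Each base is replaced by its EIIP value, so only one signal has to be
    processed instead of four.
    """
    return [EIIP[s] for s in seq.upper()]


dna = "ATGCGT"
tracks = voss_indicators(dna)   # four binary signals
signal = eiip_indicator(dna)    # one real-valued signal
print(tracks["G"])
print(signal)
```

With the Voss mapping a downstream transform has to be applied to four signals, whereas the EIIP mapping needs it only once, which is the reduction in computational overhead referred to above.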
Coding measure schemes proposed so far are mainly based on fixed mapping methods [18], [19], [20], chemical property based methods [21], [22] and statistical property based methods [23], [24], [25]. Researchers now see the need to address inter-nucleotide distance and genetic code context (e.g. nucleotide/amino acid information, compositions and densities of nucleotides, their positions and order in codons, etc.) when proposing DNA coding measure schemes. Yin et al. [10] introduced a numerical representation of DNA sequences based on the genetic code context in the sequences. The authors mapped the 20 amino acids to 20 unique complex numbers, whose real and imaginary parts were derived from specific characteristics of the amino acids, and proposed a coding measure scheme on that basis. In the same direction, the most recent work is by Mujiono et al. [26], who considered inter-nucleotide distance with hierarchical clustering to identify fractal patterns in DNA sequences. The authors employed a numerical mapping of the DNA sequence to a digital signal to calculate the inter-nucleotide distance.
Coding measure schemes based on genetic code context would significantly help DNA sequence analysis and coding regions identification [12]. We could not find satisfactory work on a DNA coding measure scheme based on genetic code context. Similarly, despite an exhaustive review of the literature on clustering approaches for codon clustering, only a few publications that employed clustering algorithms were found. Such approaches (a few examples are [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39]) used distance as a similarity measure to cluster DNA/RNA sequences; none of them addresses codon clustering. Mostly hierarchical or K-Means algorithms and their variations have been used for sequence classification/clustering. Clustering approaches based on such algorithms compute the distance (Euclidean/Edit/Minkowski) between a datum and the centroid of a cluster. Since the nucleotides in codons have special biological properties and preserve their unique structure, common clustering algorithms based on distance as a similarity measure are not well suited to codon clustering.
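The distance measures named above can be sketched in a few lines of Python. This is a generic illustration under our own assumptions, not the procedure of any cited work; the helper names are hypothetical. The codon example at the end relies only on the standard genetic code (GCT and GCC both encode alanine, GAT encodes aspartic acid).

```python
# Generic illustration of distance-based similarity; helper names are
# hypothetical and not taken from the cited clustering papers.

def minkowski(x, y, p=2):
    """Minkowski distance of order p (p=2 is Euclidean, p=1 is Manhattan)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def edit_distance(s, t):
    """Levenshtein edit distance between two strings (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Distance between a datum and a cluster centroid, as in K-Means.
cluster = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
print(minkowski([4.0, 5.0], centroid(cluster)))

# Synonymous and non-synonymous codon pairs get the same edit distance.
print(edit_distance("GCT", "GCC"), edit_distance("GCT", "GAT"))
```

Both codon pairs differ by one substitution, so the edit distance cannot tell the synonymous pair (GCT, GCC) from the non-synonymous pair (GCT, GAT). This is one simple way to see why a purely distance-based similarity loses the positional meaning of nucleotides within a codon.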
Methodology
We propose a novel fuzzy semantic similarity measure (FSSM) coding scheme centered on FSSM codon clustering and the genetic code context of nucleotides. FSSM captures genetically meaningful characteristics of nucleotides in codons, i.e. the nucleotides' density distribution, the specific positions of nucleotides in codons and the nucleotides' usage in terms of their distribution.
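To make the kind of codon-level characteristics named above more concrete, the following Python sketch computes the relative frequency of each nucleotide at each of the three codon positions. It is not the FSSM algorithm, whose details are not given in this section; the function names and the choice of simple relative frequencies are assumptions made purely for illustration.

```python
# Hypothetical illustration of position-specific nucleotide statistics in
# codons. This is NOT the FSSM algorithm; names and the use of plain relative
# frequencies are assumptions for illustration only.
from collections import Counter


def split_codons(seq):
    """Split a sequence into consecutive codons, dropping any trailing
    one or two bases that do not form a full triplet."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def positional_density(seq):
    """Return a list of three dicts, one per codon position, giving the
    relative frequency of A, C, G and T at that position."""
    codons = split_codons(seq)
    if not codons:
        raise ValueError("sequence must contain at least one full codon")
    n = len(codons)
    result = []
    for pos in range(3):
        counts = Counter(codon[pos] for codon in codons)
        result.append({base: counts[base] / n for base in "ACGT"})
    return result


density = positional_density("ATGGCTGCC")  # codons: ATG, GCT, GCC
for pos, freqs in enumerate(density, start=1):
    print(pos, freqs)
```

Statistics of this kind depend on where a nucleotide sits inside the codon, which is information that a plain per-base numerical mapping does not carry.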
Results and discussions
Performance evaluation of the different coding measure schemes for coding regions identification has been performed at the nucleotide level. In this context, the following evaluation measures, defined in Eqs. (1)-(6), have been employed.
Conclusion
This paper presented a novel fuzzy semantic similarity measure (FSSM) codon clustering algorithm for a DNA coding measure scheme. The FSSM algorithm exploited the nucleotides' fuzzy membership and their natural characteristics in terms of fuzzy behaviors and semantic similarities, revealing a strong correlation between them. The FSSM coding measure scheme achieved a significant enhancement in coding regions identification compared with other existing coding measure schemes. More than 250 benchmarked...
Conflict of interests
The authors declare no conflict of interest.
References (47)
- et al., A measure of DNA periodicity, J. Theor. Biol. (1986)
- et al., Fourier and wavelet transform analysis, a tool for visualizing regular patterns in DNA sequences, J. Theor. Biol. (2000)
- et al., Domains, motifs and clusters in the protein universe, Curr. Opin. Chem. Biol. (2003)
- et al., Identification of protein-coding regions in DNA sequences using a time–frequency filtering approach, Genom. Proteom. Bioinform. (2011)
- B. Alberts, A. Johnson, J. Lewis, Portions of DNA sequence are transcribed into RNA, 4th edition, in: Molecular Biology...
- Genomic signal processing, IEEE Signal Process. Mag. (2001)
- et al., DNA Computing Models (2008)
- et al., Structure–function studies of the RNA polymerase II elongation complex, Acta Crystallogr. (2009)
- et al., The structure of DNA in the nucleosome core, Nature (2003)
- B. Alberts, A. Johnson, A.J. Lewis, DNA replication mechanisms, 4th edition, in: Molecular Biology of the Cell,...