Alignment-free distance measure based on return time distribution for sequence analysis: Applications to clustering, molecular phylogeny and subtyping
Graphical abstract
Highlights
► Novel alignment-free method is proposed and applied for molecular phylogeny. ► Method also finds its application in viral sero-/geno-typing. ► Method is robust, scalable and faster for the genomic sequence analysis. ► Enables compact representation of genomic sequences. ► Provides a new way for pattern recognition and data mining.
Introduction
Genome sequencing has entered a new phase with the advent of third generation sequencing technologies. As a result, a large volume of molecular sequence data is made available. Bioinformatics provides various tools, techniques and approaches to decipher the function(s) using sequence and structure analyses with an objective to link genotype and phenotype. Clustering and molecular phylogeny have always remained one of the key activities to study molecular evolution and facilitate classification of taxonomic units (Bohlin et al., 2008). It has resulted in the discovery of new species (Junglen et al., 2009) and reorganization of genera, subfamilies and families (Ghabrial and Nibert, 2009, Kolaskar and Naik, 2000, Vellinga et al., 2003). It has also helped in the understanding of sequence–structure–function relationships (Chen et al., 2004). Phylogenetic analysis of related genes (Martin and McInerney, 2009) and phylogenetic profiling are now extensively used in gene annotation and inference of the metabolic networks (Wu et al., 2006). Various statistical measures for the analyses of nucleic acid and protein sequences have also been proposed to discern, classify, and assess the significance between sequences (Karlin and Ghandour, 1985). Sequence alignment algorithms have been developed to identify the sites of conservation, variation, and indels to assess the similarities between sequences (Doolittle, 1996, Doolittle, 1990), which serves as a basis for molecular phylogeny analysis (MPA). Most commonly used methods of MPA require computation of distances between sequences based on their alignments. Distance based MPA methods include Unweighted Pair Group Method with Arithmetic Mean (UPGMA), Transformed Distance (TD) method, Fitch and Margoliash’s (FM) method, Neighborliness method, Neighbor-Joining (N-J) (Farris, 1972, Fitch and Margoliash, 1967, Saitou and Nei, 1987, Sattath and Tversky, 1977, Sneath and Sokal, 1973), etc. Other methods for MPA are character based and include Maximum parsimony and Maximum likelihood (Felsenstein, 1996, Yang and Rannala, 1997). Both groups of methods require multiple sequence alignment (MSA) (Thompson et al., 1994) as a starting point of analysis. The resultant tree topology is validated using bootstrap analysis. The implementation of these methods is available in the form of software packages (Gilbert, 2004), in public domain. PHYLIP (http://evolution.genetics.washington.edu/phylip.html), MEGA (Kumar et al., 2008), MrBayes (Ronquist and Huelsenbeck, 2003), etc. are the commonly used phylogeny inference packages.
However, alignment based MPA methods are computationally intensive and hence can handle limited sequence data. Memory requirements for sequence alignments increases exponentially with increasing data (Stamatakis et al., 2004). Therefore, molecular phylogeny studies using whole genome sequences become rather difficult. Furthermore, the heuristic appeal and introduction of gaps in biological sequences during alignment are also questioned at times as they generate equiprobable optimal alignment solutions leading to subsequent statistical variations in inferred phylogenies (Batzoglou, 2005, Dwivedi and Gadagkar, 2009, Morrison and Ellis, 1997, Nick, 1998, Phillips et al., 2000, Rosenberg, 2005, Wheeler, 1995, Wong et al., 2008). It has also been found that as the alignment accuracy decreases, topological correctness of trees also decreases (Ogdenw and Rosenberg, 2006). To overcome the limitations of alignment procedures, alignment-free methods for sequence analysis have been proposed. The extensive review of these is available in (Vinga and Almeida, 2003). Effective analyses of large DNA sequences that are independent of alignments require some form of statistical summarization to facilitate numerical computations (Chaudhuri and Das, 2002). Some of the approaches include the use of odd ratios of observed to expected frequencies (Karlin and Burge, 1995), chaos game representation (CGR) of DNA sequences in the form of fractal images (Deschavanne et al., 1999) and relative oligonucleotide (up to 14-mers) frequencies for comparative sequence analysis (Weinel et al., 2002). Statistical methods based on oligonucleotide frequencies have been used to test similarities in bacterial and archaeal genomes (Bohlin et al., 2008). Machine learning approaches have also been applied in the same direction. Neural network based approach was used to classify DNA of unknown organisms by using tetra nucleotide frequency distributions (Abe et al., 2003). Hidden Markov models were used for sequence alignment and tree reconstruction (Edgar and Sjolander, 2003). Markov chain model based tetra nucleotide Zero-Order Markov (ZOMs) were compared with di-nucleotide ZOMs by Karlin and co-workers (Karlin and Burge, 1995), for 16s rRNA trees to investigate which of the methods is more consistent with the phylogenetic signal (Pride et al., 2003). Recent alignment-free methods proposed for MPA are based on K-word frequencies, string composition vector, feature frequency profiles, hidden Markov models, Kolmogorov complexity, information theory, etc. (Dai and Wang, 2008, Reyes-Prieto et al., 2011, Fu et al., 2005, Li et al., 2001, Qi et al., 2004b, Sims et al., 2009, Vinga and Almeida, 2003).
However, the k-tuple distance methods reported earlier are based only on word frequencies and ignored the relative order of its components, which is known to incorporate important information. In this paper, a new alignment-free distance measure based on return time distributions (RTDs) of the k-mers is proposed. The attempt is to make efficient use of the information content of the genomes in the form of ‘return times’ of k-mers, which takes into account the frequency as well as relative order resulting in the derivation of a novel distance measure. Technique of RTD is a routinely used in stochastic modeling and theory of stochastic processes in the study of physical, chemical and biological systems (Ciupe et al., 2009, Ding and Yang, 1995, Harris et al., 2002, Nair and Mahalakshmi, 2005, Piero, 2007). ‘Return time’ is defined as the time required by the system to visit the same state from which it started without visiting it in between epoch. The ‘return time’ in the context of nucleotide sequence is the number of nucleotides in between the reappearance of a particular nucleotide(s).
The inter-mononucleotide distances have also been used for genome analyses (Afreixo et al., 2009) with the assumption that the inter nucleotide distances follow geometric distributions. The proposed RTD based method does not assume any distributional forms for RTDs. A prototype of this method is reported recently (Kolekar et al., 2010). Phylogenetic tree reported in this study is deposited in TreeBASE, StudyID: S11226 (Morell, 1996). A short communication in the form of a poster abstract, describing case study of subtyping of Dengue viruses (DENVs) is archived in Nature Precedings (http://precedings.nature.com/documents/5590/version/1). The RTD-based method was then generalized to demonstrate its applications for whole genome based MPA and virus subtyping (Kolekar et al., 2010). The method was also successfully applied for genotyping of Mumps viruses based on sequences of small hydrophobic (SH) gene (Kolekar et al., 2011).
Molecular phylogenetics of viruses has been reviewed extensively (Grace and Jonathan, 2002). It has been reported (Belshaw et al., 2007, Holmes, 2009) that RNA viruses provide a model system to study general evolutionary relationship of genome attributes (Holmes and Grenfell, 2009). Hence, the proposed method was validated for its application in MPA using the viruses that belong to the family Flaviviridae. The members of Flaviviridae offer opportunity to validate performance of RTD based method for clustering, phylogeny as well as subtyping of DENVs using genomic sequences, as an important application to diagnosis of Dengue infections.
Section snippets
Datasets
The following datasets were compiled and used to validate performance of proposed RTD-based alignment-free distance measure for clustering, molecular phylogeny analysis (MPA), serotyping and genotyping. We use the collective term “subtyping” for sero-/geno-typing of DENVs, which are classified into four serotypes (1–4) and each serotype is further subdivided into multiple genotypes.
Results and discussion
The RTD provides a compact representation of genomic sequences and is easy for both, computation and implementation. For example, the RTDs for Cytosine (C) nucleotide for Dengue virus type 1 (Genus: Flavivirus; GenBank accession no. U88537), Bovine viral diarrhea virus 2-C413 (Genus: Pestivirus; GenBank accession no. AF002227), Hepatitis C virus subtype 1a (Genus: Hepacivirus; GenBank accession no. AF009606) and GB virus C (Genus: Pegivirus; GenBank accession no. AF070476) are summarized using
Conclusions
A novel alignment-free method based on return time distribution is designed and validated for applications in molecular phylogeny and subtyping of viruses. The case study of the family Falviviridae indicates that RTD is capable of identifying the patterns of recurrence of nucleotides at varying levels of sequence similarity. It can be applied for phylogenetic analysis of viral families where viral species diverge up to 70% and for the subtyping of viruses where strains are closely related
Acknowledgments
This work was supported by Department of Biotechnology, Government of India under Centre Of Excellence (COE) Grant to Bioinformatics Centre, University of Pune, Pune (India). PSK acknowledges the BioInformatics National Certification (BINC) fellowship awarded by Department of Biotechnology, Government of India. The authors of LifePrint and CVTree methods are gratefully acknowledged for providing server configurations. Assistance provided by Ms. Pratibha Nadkarni and Ms. Smita Saxena towards
References (90)
Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods
Methods Enzymol.
(1996)- et al.
Stochastic modelling of the particle residence time distribution in circulating fluidised bed risers
Chem. Eng. Sci.
(2002) - et al.
Molecular genotyping of dengue viruses by phylogenetic analysis of the sequences of individual genes
J. Virol. Methods
(2008) - et al.
Multiple sequence alignment in phylogenetic analysis
Mol. Phylogenet. Evol.
(2000) - et al.
Genetic analysis of Dengue 3 virus subtype III 5’ and 3’ non-coding regions
Virus Res.
(2008) - et al.
Evolutionary history of dengue virus type 4: insights into genotype phylodynamics
Infect. Genet. Evol.
(2011) - et al.
Molecular evolution of dengue viruses: contributions of phylogenetics to understanding the history and epidemiology of the preeminent arboviral disease
Infect. Genet. Evol.
(2009) - et al.
Multiple recombinant dengue type 1 viruses in an isolate from a dengue patient
J. Gen. Virol.
(2007) - et al.
Informatics for unveiling hidden genome signatures
Genome Res.
(2003) - et al.
Emergence of dengue virus type 4 genotype IIA in Malaysia
J. Gen. Virol.
(2002)
Genome analysis with inter-nucleotide distances
Bioinformatics
Methods for subtyping and molecular comparison of human viral genomes
Clin. Microbiol. Rev.
The many faces of sequence alignment
Brief. Bioinform.
The evolution of genome compression and genomic novelty in RNA viruses
Genome Res.
Reliability and applications of statistical methods based on oligonucleotide frequencies in bacterial and archaeal genomes
BMC Genom.
MatGAT: an application that generates similarity/identity matrices using protein or DNA sequences
BMC Bioinform.
SWORDS: a statistical tool for analysing large DNA sequences
J. Biosci.
Identification of a recombinant dengue virus type 1 with 3 recombination regions in natural populations in Guangdong province, China
Arch. Virol.
Molecular evolution and structure–function relationships of crotoxin-like and asparagine-6-containing phospholipases A2 in pit viper venoms
Biochem. J.
The dynamics of T-cell receptor repertoire diversity following thymus transplantation for DiGeorge anomaly
PLoS Comput. Biol.
Isolation of a novel species of flavivirus and a new strain of Culex flavivirus (Flaviviridae) from a natural mosquito population in Uganda
J. Gen. Virol.
Comparison study on k-word statistical measures for protein: from sequence to ‘sequence space’
BMC Bioinform.
Genomic signature: characterization and classification of species assessed by chaos game representation of sequences
Mol. Biol. Evol.
Distribution of the first return time in fractional Brownian motion and its application to the study of on–off intermittency
Phys. Rev. E
Molecular evolution: computer methods for macromolecular sequence analysis
Methods Enzymol.
Molecular evolution: computer analysis of protein and nucleic acid sequences
Methods Enzymol.
Phylogenetic inference under varying proportions of indel-induced alignment gaps
BMC Evol. Biol.
SATCHMO: sequence alignment and tree construction using hidden Markov models
Bioinformatics
Identification of GBV-D, a novel GB-like Flavivirus from old world frugivorous bats (Pteropus giganteus) in Bangladesh
PLoS Pathog.
LifePrint: a novel k-tuple distance method for construction of phylogenetic trees
Adv. Appl. Bioinform. Chem.
Estimating phylogenetic trees from distance matrices
Am. Nat.
Construction of phylogenetic trees
Science
Serotype-specific differences in the risk of dengue hemorrhagic fever: an analysis of data collected in Bangkok, Thailand from 1994 to 2006
PLoS Negl. Trop. Dis.
Alignment-free biomolecular sequence comparison method
J. Biomed. Eng.
Victorivirus, a new genus of fungal viruses in the family Totiviridae
Arch. Virol.
Bioinformatics software resources
Brief. Bioinform.
The application of molecular phylogenetics to the analysis of viral genome diversity and evolution
Rev. Med. Virol.
RNA virus genomics: a world of possibilities
J. Clin. Invest.
Discovering the phylodynamics of RNA viruses
PLoS Comput. Biol.
Chimeric dengue type 2 (vaccine strain PDK-53)/dengue type 1 virus as a potential candidate dengue type 1 virus vaccine
J. Virol.
A new flavivirus and a new vector: characterization of a novel flavivirus isolated from uranotaenia mosquitoes from a tropical rain forest
J. Virol.
Dinucleotide relative abundance extremes: a genomic signature
Trends Genet.
Comparative statistics for DNA and protein sequences: single sequence analysis
Proc. Natl. Acad. Sci. USA
Online identification of viruses
J. Microbiol. Immunol. Infect.
‘Inter-arrival time’ inspired algorithm and its application in clustering and molecular phylogeny
AIP Conf. Proc.
Cited by (46)
KINN: An alignment-free accurate phylogeny reconstruction method based on inner distance distributions of k-mer pairs in biological sequences
2023, Molecular Phylogenetics and EvolutionPEER: A direct method for biosequence pattern mining through waits of optimal k-mers
2020, Information SciencesA novel genome signature based on inter-nucleotide distances profiles for visualization of metagenomic data
2017, Physica A: Statistical Mechanics and its ApplicationsCitation Excerpt :Based on tetranucleotide frequencies of fragments of genome, Laczny et al. [16] reported a scalable approach for the visualization of metagenomic data via nonlinear dimension reduction. Alignment-free methods has been increasingly used for genome comparison, in particular for genome-based phylogeny reconstruction, including information-based analysis [17,18], singular value decomposition (SVD) [19,20], principal component analysis [21], the ZL compression distance [22] which is based on the algorithm of Ziv and Lempel [23], fractal methods [24–27], Markov model [28–32], dynamical language model [33–35], the average common substring (ACS) distance [36], the maximum unique matches (MUM) distance [37,38], feature frequency profiles (FFP) [39–41], distance measure based on return time distribution [42], method using spaced-word frequencies [43,44]. The inter-nucleotide distance [45] is a novel DNA numerical profile to explore the correlation structure of DNA.
Genetic diversity and evolution of dengue virus serotype 3: A comparative genomics study
2017, Infection, Genetics and EvolutionThe Picornaviridae Family: Knowledge Gaps, Animal Models, Countermeasures, and Prototype Pathogens
2023, Journal of Infectious Diseases