Key Points
- Promoter prediction software can succeed for the ∼50% of genes with CpG islands or for genes with abundant transcript data.
- Predictions of individual transcription-factor binding sites (TFBSs) are unreliable owing to the promiscuous binding of transcription factors.
- Comparative genome sequence analysis (phylogenetic footprinting) can eliminate up to 90% of false binding-site predictions; however, true sites are still obscured by the false predictions.
- Analysis of clusters of TFBSs in cis-regulatory modules can generate reliable predictions of regulatory regions.
- New methods are emerging to improve the detection of sequences that regulate gene transcription.
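The cluster-based strategy in the key points can be illustrated with a minimal sketch. It assumes that candidate binding-site positions have already been predicted by some profile scan, and it reports regions in which several predictions fall close together. The function name `find_site_clusters`, the 200-bp window and the three-site minimum are illustrative assumptions, not parameters taken from any published tool.

```python
from bisect import bisect_left, bisect_right


def find_site_clusters(hit_positions, window=200, min_sites=3):
    """Return (start, end, count) for regions dense in predicted sites.

    hit_positions: sequence coordinates of predicted binding sites
    (hypothetical input; any profile scanner could supply them).
    window and min_sites are illustrative defaults, not published values.
    """
    positions = sorted(hit_positions)
    clusters = []
    for i, start in enumerate(positions):
        # Index one past the last hit inside [start, start + window).
        j = bisect_left(positions, start + window, lo=i)
        count = j - i
        if count < min_sites:
            continue
        end = positions[j - 1]
        if clusters and start <= clusters[-1][1]:
            # This window overlaps the previous cluster: extend and recount.
            prev_start, prev_end, _ = clusters[-1]
            new_end = max(prev_end, end)
            lo = bisect_left(positions, prev_start)
            hi = bisect_right(positions, new_end)
            clusters[-1] = (prev_start, new_end, hi - lo)
        else:
            clusters.append((start, end, count))
    return clusters


if __name__ == "__main__":
    hits = [100, 150, 180, 900, 2000, 2050, 2120, 2190, 5000]
    for start, end, count in find_site_clusters(hits):
        print(f"cluster {start}-{end}: {count} predicted sites")
```

Isolated predictions (here the hits at 900 and 5000) are discarded, which reflects the idea that a lone predicted site is weak evidence while a dense group of sites is more likely to mark a regulatory module.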
Abstract
The compilation of multiple metazoan genome sequences and the deluge of large-scale expression data have combined to motivate the maturation of bioinformatics methods for the analysis of sequences that regulate gene transcription. Historically, these bioinformatics methods have been plagued by poor predictive specificity, but new bioinformatics algorithms that accelerate the identification of regulatory regions are drawing disgruntled users back to their keyboards. However, these new approaches and software are not without problems. Here, we introduce the purpose and mechanisms of the leading algorithms, with a particular emphasis on metazoan sequence analysis. We identify key issues that users should take into consideration in interpreting the results and provide an online training example to help researchers who wish to test online tools before taking an independent foray into the bioinformatics of transcription regulation.
Acknowledgements
W.W.W. is supported by a grant from the Canadian Institutes of Health Research.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Glossary
- ORTHOLOGY: Two sequences are orthologous if they share a common ancestor and are separated by speciation.
- PHYLOGENETIC FOOTPRINTING: An approach that seeks to identify conserved regulatory elements by comparing genomic sequences between related species.
- MACHINE LEARNING: The ability of a program to learn from experience, that is, to modify its execution on the basis of newly acquired information. In bioinformatics, neural networks and Markov chain Monte Carlo methods are well-known examples.
- NEURAL NETWORK: A machine-learning technique that simulates a network of communicating nerve cells.
- CAGE: (Cap analysis of gene expression). The high-throughput sequencing of concatemers of DNA tags that are derived from the initial nucleotides at the 5′ end of mRNAs.
- SAGE: (Serial analysis of gene expression). A method for quantitative and simultaneous analysis of a large number of transcripts. Short sequence tags are isolated, concatenated and cloned; their sequencing reveals a gene-expression pattern that is characteristic of the tissue or cell type from which the tags were isolated.
- LOCAL ALIGNMENT: The detection of local similarities between two sequences.
- GLOBAL ALIGNMENT: The alignment of two sequences over their full length.
- NEEDLEMAN–WUNSCH ALGORITHM: A commonly used algorithm in bioinformatics that produces a global alignment of two sequences. The term 'global' refers to alignments across the entirety of the sequences. The algorithm returns an optimal alignment, in which 'optimal' refers to the highest possible score under a specific scoring system. The algorithm is computationally demanding, restricting its direct application to sequences of modest length.
- HIDDEN MARKOV MODEL: (HMM). A probabilistic model for the recognition of patterns in DNA or protein sequences. HMMs represent a system as a set of discrete states and as transitions between those states. Each transition has an associated probability, which can be readily derived from training sets, such as alignments of known examples of a pattern. HMMs are valuable because they enable a search or alignment algorithm to be built on firm probabilistic bases.
- FUTILITY THEOREM: The authors' assertion that essentially all predicted transcription-factor (TF) binding sites that are generated with models for the binding of individual TFs will have no functional role.
- SELEX: (Systematic evolution of ligands by exponential enrichment). A set of laboratory procedures for the identification of representative sets of ligands for a protein. In the case of DNA-binding proteins, the protein is mixed with a pool of double-stranded oligonucleotides that contain a random core of nucleotides flanked by specific sequences. The protein in complex with bound DNA is recovered and the ligands are subsequently amplified by PCR. The recovered oligonucleotides are sequenced and analysed to reveal the binding specificity of the protein.
- INFORMATION CONTENT: A measure of nucleotide conservation at a position, based on information theory.
- PSEUDOCOUNT: A small correction that is added to the observed nucleotide counts when estimating probabilities, to compensate for small sample sizes (that is, few binding sites).
- HOMOTYPIC CLUSTER: A cluster of similar transcription-factor (TF) binding sites, often binding the same TF.
- BAYESIAN [METHOD]: A statistical method of combining the likelihood with additional information to produce an overall estimate of the strength of a piece of evidence.
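The glossary entries for information content and pseudocount can be made concrete with a short sketch. It assumes a small set of aligned, equal-length binding sites and a uniform background of 0.25 per base. The function names, the example sites and the choice to spread a single pseudocount evenly over the four bases are illustrative assumptions, and the information-content formula omits any small-sample correction.

```python
import math

BASES = "ACGT"


def position_frequencies(sites, pseudocount=1.0):
    """Per-position base frequencies from aligned sites.

    The pseudocount is spread evenly over the four bases (an assumed,
    commonly used scheme) so that unseen bases keep a nonzero probability.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    n = len(sites)
    freqs = []
    for col in range(length):
        column = [site[col] for site in sites]
        freqs.append({
            base: (column.count(base) + pseudocount / 4) / (n + pseudocount)
            for base in BASES
        })
    return freqs


def information_content(freqs):
    """Bits per position: 2 + sum(p * log2(p)), assuming a uniform background."""
    return [
        2.0 + sum(p * math.log2(p) for p in col.values() if p > 0)
        for col in freqs
    ]


def log_odds_score(freqs, window, background=0.25):
    """Sum of per-position log2(frequency / background) for one candidate site."""
    return sum(math.log2(col[base] / background) for col, base in zip(freqs, window))


if __name__ == "__main__":
    # Hypothetical aligned sites, for illustration only.
    sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
    freqs = position_frequencies(sites, pseudocount=1.0)
    print([round(bits, 2) for bits in information_content(freqs)])
    print(round(log_odds_score(freqs, "TATAAA"), 2))
    print(round(log_odds_score(freqs, "GGGGGG"), 2))
```

Without a pseudocount, a base that never appears at a position has frequency zero, so its log-odds term is undefined. With only a handful of known sites that would wrongly rule out any candidate carrying that base, which is the small-sample problem the pseudocount is meant to address.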
Cite this article
Wasserman, W., Sandelin, A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5, 276–287 (2004). https://doi.org/10.1038/nrg1315
DOI: https://doi.org/10.1038/nrg1315