doi:10.1016/j.gde.2005.02.004
Copyright © 2005 Elsevier Ltd All rights reserved.
Computational methods for transcriptional regulation
Eric D Siggia
Center for Studies in Physics and Biology, The Rockefeller University, 1230 York Avenue, New York, NY 10021, USA
Available online 26 February 2005.
How is the information from a thousand gene-expression arrays, the location of more than two hundred regulatory factors, and nine sequenced genomes to be integrated into a global view of the regulatory network in budding yeast? Computational methods that fit incomplete noisy data provide the outlines of regulatory pathways, but the errors are not quantified. In the fly, embryonic patterning has proved amenable to computational prediction, but only when the DNA-binding preferences of the relevant factors are taken into account. In both these model organisms, simply restricting attention to regulatory sequences that align with related species (i.e. ‘conserved’) discards much information regarding what is functional.
Table 1.

Interspecies conservation for a set of 407 experimentally footprinted binding sites upstream of 194 genes in S. cerevisiae. Noncoding sequence upstream of each gene was obtained for all the sensu stricto species [35• and 37] and aligned according to the method presented by Morgenstern [71]. Each site was then scored by three different methods* for its overlap with sequence that aligned with the other species (i.e. was conserved). The sites were then randomized in position, and the overlap rescored in a consistent way. The similarity between the second and third columns shows that much of the conservation that is interpreted as evidence for functionality is to be expected by chance. The last two rows show that only about half of the sites would be recovered if one demanded conservation.
*For the continuous scoring method, the score is the sum, over all bases in the site, of the number of species aligned with S. cerevisiae. For the stringent scoring method, a site was counted as conserved if at least 75% of its bases were aligned in at least 3 other species. For the permissive method, at least 2 other species had to show 75% sequence conservation. If fewer than the minimum number of species were available, then they all had to be aligned for conservation. The maximum binary score is 407 (the number of sites). Imposing 50% base conservation increases the recovery of known sites and the number of random sites by 10%. The randomization was done so as to preserve the distribution in position relative to transcription start. The binding sites were taken from the study by Zhu and Zhang [72] and filtered for overlaps.
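The three scoring rules in the footnote can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for Table 1: the input format (one per-base list of booleans for each other species), the function names, and the per-species reading of the 75% rule (each counted species must itself align with at least 75% of the site's bases) are all hypothetical choices. Treating a site with no available species as not conserved is likewise an assumption.

```python
from typing import Dict, List

# Hypothetical input format: for one binding site, map each other species
# to a per-base list of booleans (True = that base of the S. cerevisiae
# site is aligned with the species). Species without sequence are omitted.
Alignment = Dict[str, List[bool]]


def continuous_score(aligned: Alignment) -> int:
    """Sum, over all bases in the site, of the number of species aligned."""
    return sum(sum(per_base) for per_base in aligned.values())


def is_conserved(aligned: Alignment, min_species: int,
                 min_fraction: float = 0.75) -> bool:
    """Binary score: True if enough species align with enough of the site.

    min_species=3 corresponds to the stringent rule and min_species=2 to
    the permissive rule. If fewer species are available than min_species,
    all available species must pass.
    """
    available = len(aligned)
    if available == 0:
        # Assumption: with no other species there is no evidence of conservation.
        return False
    required = min(min_species, available)
    passing = sum(
        1 for per_base in aligned.values()
        if per_base and sum(per_base) / len(per_base) >= min_fraction
    )
    return passing >= required


# Made-up 8-base site aligned against three other species.
example_site = {
    "species_A": [True] * 8,
    "species_B": [True] * 6 + [False] * 2,
    "species_C": [True] * 4 + [False] * 4,
}
print(continuous_score(example_site))             # 18
print(is_conserved(example_site, min_species=3))  # False (stringent)
print(is_conserved(example_site, min_species=2))  # True (permissive)
```

Summing `is_conserved` over all sites gives the binary score, whose maximum is the number of sites. Passing `min_fraction=0.5` corresponds to the 50% base-conservation variant mentioned in the footnote.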
Table 2.

Recovery of known yeast binding motifs from genome-wide interspecies comparisons. Predictions in column 3 are calculated using the methodology set out by Kellis et al. (Table 2 [35•]), compared with predictions from a single genome in column 4 [41]. Only the top and bottom 12 entries from [35•] are shown. The results are comparable.