doi:10.1016/j.compbiolchem.2003.12.004
Copyright © 2004 Elsevier Ltd. All rights reserved.
The difficult interpretation of transcriptome data: the case of the GATC regulatory network
Alessandra Riva a, Marie-Odile Delorme a, Tony Chevalier b, Nicolas Guilhot b, Corinne Hénaut b and Alain Hénaut a
a CNRS, Laboratoire Génome et Informatique, Tour Evry 2, 523 Place des Terrasses, 91034, Evry cedex, France
b METabolic EXplorer S.A., Biopôle Clermont-Limagne, 63 360, Saint-Beauzire, France
Received 22 November 2003;
Revised 12 December 2003;
accepted 12 December 2003.
Available online 4 May 2004.
Abstract
Genomic analyses of part of the Escherichia coli chromosome had suggested the existence of a GATC-regulated network. This has recently been confirmed by a transcriptome analysis. Two hypotheses about the molecular control mechanism have been proposed: (i) the GATC network regulation is caused by the presence of GATC clusters within the coding sequences; the regulation is the direct consequence of the clusters’ hemi-methylation and therefore their elevated melting temperature; (ii) the regulation is caused by the presence of GATCs in the non-coding 500 bp upstream regions of the affected genes; it is the consequence of an interaction with a regulatory protein such as Fnr or CAP. An analysis of the transcriptome data has not allowed us to decide between the two hypotheses. We have therefore taken a classic genomic approach, analyzing the statistical distribution of GATC along the chromosome, using a realistic model of the chromosome as the theoretical reference. We observe no particular distribution of GATC in the non-coding upstream regions; however, we confirm the presence of GATC clusters within the genes. To verify that the particular distribution observed in E. coli is not a statistical artefact but has a physiological role, we have carried out the same analysis on Salmonella, on the hypothesis that the genes containing a GATC cluster should be largely the same in the two bacteria. This has indeed been observed, showing that the genes containing a GATC cluster are part of a regulation network. The present work is a case study demonstrating that the analysis of transcriptome data does not always make it possible to identify the primary cause of an observed phenomenon; on the other hand, a classic genomic approach combined with a comparative study of related genomes may allow this identification.
Author Keywords: GATC; Statistics; Transcriptome analysis; Comparative genomics; Fnr
Fig. 1. Comparison of the GATC distribution in the real and the virtual chromosomes. The distance between successive GATCs is displayed for the real chromosome and for the virtual chromosome. The distribution in the real chromosome is characterized by a strong periodicity of three, reflecting the fact that 85% of the chromosome is made up of coding sequences. The virtual chromosome follows this distribution closely.
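The quantity plotted in Fig. 1, the distance between successive GATCs, can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the sequence is assumed to be a plain string of one strand, and nothing here reproduces the authors' actual software or their virtual-chromosome model.

```python
from collections import Counter


def gatc_positions(seq: str, motif: str = "GATC") -> list[int]:
    """Return the 0-based start positions of every occurrence of the motif."""
    seq = seq.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def successive_distances(positions: list[int]) -> list[int]:
    """Distances (in bp) between each motif start and the next one."""
    return [b - a for a, b in zip(positions, positions[1:])]


def distance_mod3_counts(seq: str) -> Counter:
    """Count distances by their remainder modulo 3.

    A periodicity of three would show up as an excess in one residue class.
    """
    return Counter(d % 3 for d in successive_distances(gatc_positions(seq)))
```

Applied to a real and a simulated chromosome, the two histograms of distances could then be compared side by side, which is the kind of comparison the caption describes.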
Fig. 2. The data contained in Table 1 are displayed in the form of histograms. See the legend of Table 1 for details. (a) The GATC distribution in the 500 bp upstream regions of the genes as a function of their sensitivity to the dam genotype (column D). (b) The GATC distribution in the 500 bp upstream regions of the genes as a function of their sensitivity to the dam genotype (column E).
Fig. 3. The theoretical and observed GATC distribution in the 500 bp upstream regions of the genes. The data contained in Table 3 are displayed in the form of a histogram. See the legend of Table 3 for details.
Fig. 4. The theoretical and observed GATC distribution in the 500 bp upstream regions of the genes, as a function of the kind of gene. The data contained in Table 4 are displayed in the form of a histogram. See the legend of Table 4 for details.
Table 1. The GATC distribution in the 500 bp upstream regions of the genes as a function of their sensitivity to the dam genotype

We have analyzed the GATC distribution in the 500 bp upstream region of each gene. To establish whether there is a relationship between the number of upstream GATCs and the change in expression level as a function of the dam genotype, the genes have been grouped according to their sensitivity to the dam genotype: in Table 1a according to Oshima et al. (2002) and in Table 1b according to our criterion (see Methods for details). In both cases no relationship is present (P-value=21.5 and 30.4%, respectively). Note: the upstream region containing 18 GATCs corresponds to the region of the origin of replication. For easier interpretation, the data of Tables 1a and 1b are displayed as histograms in Fig. 2a and b, respectively.
Table 2. The relationship between Fnr and the sensitivity to the dam genotype

We have identified all the Fnr consensus sequences present in E. coli and found a total of 22. Fifteen lie in upstream regions whilst seven are found within coding sequences. We introduce the concept of a “genetic structure” controlled by Fnr: if an Fnr consensus sequence lies upstream of or within an isolated gene, we consider only this gene to be regulated by Fnr and therefore to be the “genetic structure”. If, however, an Fnr consensus sequence lies upstream of an operon or in a coding sequence within an operon, we consider the entire operon to be one genetic structure, regulated by Fnr. We consider a genetic structure to be sensitive to the dam genotype if at least one gene belonging to the structure is sensitive. For the sensitivity to the dam genotype we refer to Column D and Column E; the results are displayed in Tables 2a and 2b, respectively. Table 2a: There is no correlation between the presence of an Fnr consensus sequence and the sensitivity to the dam genotype (P-value=10.7%). Table 2b: A slight correlation can be detected (P-value=0.1%). However, it should be noted that the numbers involved are very small and that even in Table 2b, only one third of the genes possessing an upstream Fnr consensus sequence are dam-sensitive.
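The “genetic structure” rule in the legend above can be written down as a short sketch. The gene names, the operon list and the set of dam-sensitive genes below are hypothetical placeholders, not data from the article; only the two rules themselves (isolated gene versus whole operon, and "sensitive if at least one member is sensitive") come from the legend.

```python
def genetic_structure(gene, operons):
    """Return the genetic structure for an Fnr site lying upstream of or within `gene`.

    If the gene belongs to an operon, the whole operon is the structure;
    otherwise the structure is the isolated gene alone.
    """
    for operon in operons:
        if gene in operon:
            return frozenset(operon)
    return frozenset({gene})


def structure_is_dam_sensitive(structure, dam_sensitive_genes):
    """A structure is dam-sensitive if at least one of its genes is sensitive."""
    return any(g in dam_sensitive_genes for g in structure)


# Hypothetical example data (placeholder names, not taken from the article).
example_operons = [("geneA", "geneB", "geneC")]
example_dam_sensitive = {"geneC"}
```

With these placeholders, an Fnr site in front of `geneA` would be assigned the whole three-gene operon, and that structure would count as dam-sensitive because `geneC` is.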
Table 3. The theoretical and observed GATC distribution in the 500 bp upstream regions of the genes

The theoretical frequencies of the GATCs in the upstream regions have been calculated under the hypothesis of a binomial distribution with a per-position probability of 7174/(4019×500) (7174 GATCs, 4019 regions of 500 bp). The theoretical frequency is compared with the frequency actually observed. For easier interpretation, the data of Table 3 are displayed as a histogram in Fig. 3. The GATC distribution is not binomial (P-value=1.4×10⁻⁶). There is a slight excess of genes with no upstream GATC and a slight excess of genes containing three to five upstream GATCs.
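The theoretical frequencies described in this legend can be sketched as follows, assuming each of the 500 positions of a region is an independent trial with probability 7174/(4019×500). The function names are hypothetical, and the statistical test the authors used to obtain their P-value is not specified here, so none is implemented.

```python
from math import comb

N_GATC = 7174       # GATCs counted in all upstream regions (from the legend)
N_REGIONS = 4019    # number of 500 bp upstream regions (from the legend)
REGION_LEN = 500    # length of each region in bp


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_region_counts(n_gatc, n_regions, region_len, max_k):
    """Expected number of regions containing k GATCs, for k = 0..max_k."""
    p = n_gatc / (n_regions * region_len)
    return [n_regions * binomial_pmf(k, region_len, p) for k in range(max_k + 1)]


expected = expected_region_counts(N_GATC, N_REGIONS, REGION_LEN, max_k=20)
```

Each entry of `expected` would be set against the observed number of regions with that many GATCs; the mean under this model is 7174/4019, roughly 1.8 GATCs per region.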
Table 4. The theoretical and observed GATC distribution in the 500 bp upstream regions of the genes, as a function of the kind of gene

We distinguish three kinds of upstream regions: regions upstream of an isolated gene, upstream of the first gene of an operon, or upstream of a gene within an operon. In the first two cases, the upstream regions are non-coding sequences, whilst the upstream region of a gene within an operon corresponds to a coding sequence. Three theoretical frequencies of the GATCs in the upstream regions have been calculated, each under the hypothesis of a binomial distribution: (a) for non-coding regions upstream of an isolated gene, with a probability of 2986/(1813×500) (2986 GATCs, 1813 regions of 500 bp), (b) for non-coding regions upstream of the first gene of an operon, with a probability of 1414/(831×500) and (c) for coding regions upstream of a gene within an operon, with a probability of 2774/(1375×500). For easier interpretation, the data of Table 4 are displayed as a histogram in Fig. 4. In all three cases the GATC distribution does not differ significantly from a binomial distribution (P-value equal to 4.3, 4.8 and 4.7%, respectively).
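The three probabilities quoted in this legend can be worked through numerically. The sketch below uses only the counts given in the legend; the category labels and function names are hypothetical. It also checks that the three categories add up to the totals used for Table 3 (7174 GATCs, 4019 regions).

```python
REGION_LEN = 500  # bp per upstream region

# (number of GATCs, number of regions) per kind of upstream region, from the legend
CATEGORIES = {
    "isolated_gene": (2986, 1813),
    "first_gene_of_operon": (1414, 831),
    "gene_within_operon": (2774, 1375),
}


def per_site_probability(n_gatc: int, n_regions: int, region_len: int = REGION_LEN) -> float:
    """Binomial per-position probability: GATCs divided by total positions."""
    return n_gatc / (n_regions * region_len)


def summarize(categories):
    """Per-position probability and mean GATCs per region for each category."""
    summary = {}
    for name, (n_gatc, n_regions) in categories.items():
        p = per_site_probability(n_gatc, n_regions)
        summary[name] = {"p": p, "mean_per_region": p * REGION_LEN}
    return summary


summary = summarize(CATEGORIES)
total_gatc = sum(g for g, _ in CATEGORIES.values())
total_regions = sum(r for _, r in CATEGORIES.values())
```

Under these figures the coding regions upstream of a gene within an operon carry about 2.0 GATCs per region on average, against about 1.6 to 1.7 for the two non-coding categories, which is why pooling them as in Table 3 mixes regions with different underlying rates.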