doi:10.1016/S0378-1119(02)01206-4
Copyright © 2002 Elsevier Science B.V. All rights reserved.
Pentamer vocabularies characterizing introns and intron-like intergenic tracts from Caenorhabditis elegans and Drosophila melanogaster
Emanuele Bultrini (a), Elisabetta Pizzi (a), Paolo Del Giudice (b) and Clara Frontali (a)
a Laboratorio di Biologia Cellulare, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161, Rome, Italy
b Laboratorio di Fisica, Istituto Superiore di Sanità, Rome, Italy
Received 4 July 2002; revised 15 November 2002; accepted 4 December 2002.
Received by G. Pesole
Available online 24 January 2003.
Abstract
Overall compositional properties at the level of bases, dinucleotides and longer oligos characterize genomes of different species. In Caenorhabditis elegans, using recurrence analysis, we recognized the existence of a long-range correlation in the oligonucleotide usage of introns and intergenic regions. Through correlation analysis, this is confirmed here to be a genome-wide property of C. elegans non-coding portions. We then investigate the possibility of extracting a typical vocabulary through statistical analysis of experimentally confirmed introns of sufficient length (>1 kb), deprived of known splice signals, the focus being on distributed lexical features rather than on localized motifs. Lexical preferences typical of introns could be exposed using principal component analysis of pentanucleotide frequency distributions, both in C. elegans and in Drosophila melanogaster. In either species, the introns' pentamer preferences are largely shared by intergenic tracts. The pentamer vocabularies extracted for the two species exhibit interesting symmetry properties and overlap in part. A more extensive investigation of the interspecies relationship at the level of oligonucleotide preferences in non-coding regions, not related by sequence similarity, might form the basis of new approaches for the study of the evolutionary behaviour of these regions.
Author Keywords: Introns; Caenorhabditis elegans; Drosophila melanogaster; Linguistic properties
Abbreviations: PCA, principal component analysis; PC1, first principal component; PC2, second principal component; ORF, open reading frame
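The correlation analysis summarized in the abstract compares pentamer frequency distributions between non-overlapping 1 kb windows using the Pearson coefficient r. A minimal sketch of that comparison follows. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, pentamers are counted at overlapping positions, and words containing characters other than A, C, G, T are skipped (the source does not specify either choice).

```python
from itertools import product
from math import sqrt

K = 5  # word length (pentamers)
PENTAMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 4**5 = 1024 words
INDEX = {word: i for i, word in enumerate(PENTAMERS)}


def pentamer_frequencies(seq):
    """Relative frequencies of the 1024 pentamers in seq.

    Assumption: overlapping counting; words with non-ACGT symbols are ignored.
    """
    counts = [0] * len(PENTAMERS)
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(PENTAMERS)
    return [c / total for c in counts]


def windows(seq, size=1000):
    """Cut seq into non-overlapping windows of the given size (tail discarded)."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # undefined for a constant vector; 0.0 is a convention here
    return sxy / sqrt(sxx * syy)
```

Given two supersequences, one would compute `pentamer_frequencies` for each window returned by `windows` and collect `pearson` over all window pairs to build a histogram of r values.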
Fig. 1. Histograms representing populations of Pearson correlation coefficients, r, between pentamer frequency distributions for all possible pairs of non-overlapping 1 kb windows cut along intron, exon and intergenic C. elegans supersequences. (a) Intron windows against themselves and against their randomized versions; (b) intron windows against intergenic I (real and randomized) windows; (c) same as (b) for chromosome II; (d) exon windows against themselves and against their randomized versions; (e) exon windows against intergenic I (real and randomized) windows; (f) intron windows against real and randomized exon windows. Comparisons involving randomized data are given as dotted curves.
Fig. 2. PCA. Pentanucleotide frequency distributions for the experimentally-confirmed 256 introns and 67 exons longer than 1 kb from C. elegans, and for the set of randomized introns (see Section 2 for the elimination of splice signals from introns) were pooled and subjected to ‘blind’ PCA. The figure shows a scatter plot in the plane of the first two principal components (PC1 and PC2). Labels (full circles: introns; open circles: exons; crosses: randomized introns) were added a posteriori. Inset: PC1 values from the main figure are plotted against the G+C content of the corresponding sequences.
Fig. 3. Symmetry scatter plots for C. elegans introns longer than 1 kb and their randomized counterparts: (a) the 512 pairs of reverse complementary pentamers are indicated by dots having as co-ordinates the frequencies (expressed as percent values) of either member of the pair in each of the 241 introns examined; (b) same as (a) for the 241 randomized intron sequences; (c) same as (a) after exclusion of the 12 pairs appearing in C. elegans introns' vocabulary; (d) same as (c) for the randomized intron sequences.
Fig. 4. Vocabulary usage in different regions of the C. elegans genome. Distributions according to the content, N, in vocabulary words are given for the populations of 200 bp non-overlapping windows from the intron (white), intergenic I (light grey), intergenic II (dark grey) and exon (black) supersequences. In the inset the same sets of windows are analyzed in terms of the ratio N/n, n being the number of different vocabulary words present in the window.
Fig. 5. PCA for the experimentally confirmed 87 introns and 37 exons longer than 1 kb from D. melanogaster, and for the set of randomized introns. Symbols are as in Fig. 2.
Fig. 6. Vocabulary usage in different regions of the D. melanogaster genome. Distributions according to the content, N, in vocabulary words are given for the populations of 200 bp non-overlapping windows from the intron (white), intergenic I (light grey), intergenic II (dark grey) and exon (black) supersequences. Inset as in Fig. 4.
Fig. 7. Projection (see text) of the D. melanogaster intron set (black stars) onto the C. elegans PC1/PC2 plane of Fig. 2, reproduced in grey.
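The quantities named in the captions of Figs. 3, 4 and 6 can be illustrated with a short sketch. It enumerates the 512 pairs of reverse complementary pentamers (no pentamer is its own reverse complement, since the middle base would have to be self-complementary) and computes, for one window, the content N in vocabulary words, the number n of different vocabulary words present, and the ratio N/n. The function names and the example vocabulary are hypothetical, and treating N as a count of overlapping occurrences is an assumption; the actual vocabularies are those of Table 1.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reverse complement of a DNA word written with A, C, G, T."""
    return word.translate(COMPLEMENT)[::-1]


def reverse_complement_pairs(k=5):
    """All unordered pairs {word, reverse complement} for words of length k."""
    pairs = set()
    for p in product("ACGT", repeat=k):
        word = "".join(p)
        pairs.add(tuple(sorted((word, reverse_complement(word)))))
    return sorted(pairs)


def vocabulary_usage(window, vocabulary, k=5):
    """Return (N, n, N/n) for one window.

    N: occurrences of vocabulary words (assumption: overlapping positions).
    n: number of different vocabulary words present in the window.
    """
    occurrences = 0
    present = set()
    for i in range(len(window) - k + 1):
        word = window[i:i + k]
        if word in vocabulary:
            occurrences += 1
            present.add(word)
    n = len(present)
    ratio = occurrences / n if n else 0.0
    return occurrences, n, ratio


# Hypothetical two-word vocabulary, for illustration only.
example_vocabulary = {"AAAAA", "TTTTT"}
example = vocabulary_usage("AAAAAATTTTT", example_vocabulary)
```

Applied to successive 200 bp windows of a supersequence, `vocabulary_usage` yields the per-window values whose distributions are described in the captions.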
Table 1. Introns' pentamer vocabularies

Pairs of reverse complementary pentamers (ranked according to PC2 loadings) are listed separately from unpaired ones. Asterisks mark pentamers significantly different from tetramer-based expectation. Pentamers appearing in both lists are in bold.
Table 2. Base composition and skewness in C. elegans introns and exons (>1 kb)
