Prediction and characterization of noncoding RNAs in C. elegans by integrating conservation, secondary structure, and high-throughput sequencing and array data
- Zhi John Lu1,2,10,
- Kevin Y. Yip1,2,3,10,
- Guilin Wang4,
- Chong Shou1,
- LaDeana W. Hillier5,
- Ekta Khurana1,2,
- Ashish Agarwal2,6,
- Raymond Auerbach1,
- Joel Rozowsky1,2,
- Chao Cheng1,2,
- Masaomi Kato7,
- David M. Miller8,
- Frank Slack7,
- Michael Snyder9,
- Robert H. Waterston5,
- Valerie Reinke4 and
- Mark B. Gerstein1,2,6,11
- 1 Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA;
- 2 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA;
- 3 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong;
- 4 Department of Genetics, Yale University, New Haven, Connecticut 06520, USA;
- 5 Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA;
- 6 Department of Computer Science, Yale University, New Haven, Connecticut 06511, USA;
- 7 Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06824, USA;
- 8 Department of Cell and Developmental Biology, Vanderbilt University, Nashville, Tennessee 37232, USA;
- 9 Departments of Developmental Biology and Genetics, Stanford University Medical Center, Stanford, California 94305, USA
- 10 These authors contributed equally to this work.
Abstract
We present an integrative machine learning method, incRNA, for whole-genome identification of noncoding RNAs (ncRNAs). It combines a large amount of expression data, RNA secondary-structure stability, and evolutionary conservation at the protein and nucleic-acid levels. Using the incRNA model and data from the modENCODE consortium, we separate known C. elegans ncRNAs from coding sequences and other genomic elements with a high level of accuracy (97% AUC on an independent validation set) and find more than 7000 novel ncRNA candidates, of which more than 1000 are located in the intergenic regions of the C. elegans genome. Based on the validation set, we estimate that 91% of the approximately 7000 novel ncRNA candidates are true positives. We then analyze 15 novel ncRNA candidates by RT-PCR and detect expression for 14. In addition, we characterize the properties of all the novel ncRNA candidates and find that they have distinct expression patterns across developmental stages and tend to use novel RNA structural families. We also find that they are often targeted by specific transcription factors (∼59% of intergenic novel ncRNA candidates). Overall, our study identifies many new potential ncRNAs in C. elegans and provides a method that can be adapted to other organisms.
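As a reading aid for the two accuracy figures quoted in the abstract, the following is a minimal sketch of how an AUC can be computed from classifier scores and how a true-positive rate translates into an expected count. It is not the incRNA implementation: the function names and the example scores are hypothetical, and only the numbers 7000 and 91% come from the abstract.

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative example.
    Ties count as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def expected_true_positives(n_candidates, true_positive_fraction):
    """Expected number of correct predictions among the candidates."""
    return round(n_candidates * true_positive_fraction)


# Hypothetical classifier scores (assumption, not data from the paper):
# higher means "more ncRNA-like".
known_ncrna_scores = [0.9, 0.8, 0.75, 0.4]
background_scores = [0.7, 0.3, 0.2, 0.1]

print(auc(known_ncrna_scores, background_scores))

# Figures quoted in the abstract: about 7000 candidates, 91% estimated correct.
print(expected_true_positives(7000, 0.91))
```

Under the abstract's estimate, roughly 6370 of the approximately 7000 candidates would be expected to be genuine ncRNAs.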
Footnotes
- 11 Corresponding author. E-mail mark.gerstein@yale.edu.
- [Supplemental material is available for this article. All data sets, prediction results, and the prediction software are available at http://incrna.gersteinlab.org/.]
- Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.110189.110.
- Received May 7, 2010.
- Accepted December 3, 2010.
- Copyright © 2011 by Cold Spring Harbor Laboratory Press
Freely available online through the Genome Research Open Access option.