RESEARCH SUMMARY

How much of our genome is functional? A surprising finding from the completion of the human genome sequence, over a decade ago, was that classical genes encoding proteins make up less than 2% of our DNA. Proteins are produced via RNA, which serves as a messenger molecule conveying the genetic blueprints encoded in DNA to the ribosome. It is now evident that the majority of the genome is transcribed to RNA at some point and that non-protein-coding RNAs (ncRNAs) are involved in a wide variety of cellular processes. However, ncRNAs seldom display evidence of genetic sequence conservation, an evolutionary hallmark of molecular function. This is consistent with observations that less than 10% of the human genome is conserved across vertebrates.

Here, we compared the genomic sequences of 35 mammals for evidence of conserved, higher-order structural features that are characteristic of functional RNA molecules. We employed computational tools to identify patterns of genetic mutations indicative of conserved RNA structure. These patterns suggest that upwards of 30% of our genome undergoes natural purifying selection at this level. Our findings expose millions of novel functional genomic regions that will assist researchers in uncovering the precise molecular mechanisms underlying complex diseases, development, and evolution.

SCIENTIFIC ABSTRACT

Evolutionarily conserved RNA secondary structures are a robust indicator of purifying selection and, consequently, molecular function. Evaluating their genome-wide occurrence through comparative genomics has consistently been plagued by high false-positive rates and divergent predictions. We present a novel benchmarking pipeline aimed at calibrating the precision of genome-wide scans for consensus RNA structure prediction. The benchmarking data obtained from two refined structure prediction algorithms, RNAz and SISSIz, were then analyzed to fine-tune the parameters of an optimized workflow for genomic sliding window screens. When applied to consistency-based multiple genome alignments of 35 mammals, our approach confidently identifies over 4 million evolutionarily constrained RNA structures using a conservative sensitivity threshold that entails historically low false discovery rates for such analyses (5-22%). These predictions comprise 13.6% of the human genome, 88% of which fall outside any known sequence-constrained element, suggesting that a large proportion of the mammalian genome is functional. As an example, our findings identify both known and novel conserved RNA structure motifs in the long non-coding RNA MALAT1. This study provides an extensive set of functional transcriptomic annotations that will assist researchers in uncovering the precise mechanisms underlying the developmental ontologies of higher eukaryotes.
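
To make the idea of a genomic sliding window screen and its false discovery rate more concrete, the following is a minimal Python sketch. It is not the pipeline used in this study and does not call RNAz or SISSIz. The function names, the window size, and the step are hypothetical, and estimating the false discovery rate as the ratio of hits in a control screen (for example, on shuffled alignments) to hits in the native screen is an assumption made here for illustration only.

```python
from typing import Iterator, List, Tuple


def sliding_windows(
    alignment: List[str], size: int, step: int
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start_column, sub_alignment) for each window.

    `alignment` is a list of equal-length aligned sequences (one per species).
    `size` and `step` are illustrative parameters, not values from the study.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not alignment:
        return
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    # An alignment shorter than one window still yields a single window.
    last_start = max(length - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, [row[start:start + size] for row in alignment]


def empirical_fdr(native_hits: int, control_hits: int) -> float:
    """Estimate FDR as control hits divided by native hits (assumed scheme)."""
    if native_hits < 0 or control_hits < 0:
        raise ValueError("hit counts must be non-negative")
    if native_hits == 0:
        return 0.0
    return min(control_hits / native_hits, 1.0)


if __name__ == "__main__":
    # Toy three-species alignment; real screens use genome-scale alignments.
    toy = ["ACGUACGUAC", "ACGU-CGUAC", "ACGAACGUAC"]
    for start, window in sliding_windows(toy, size=4, step=2):
        print(start, window)
    print(empirical_fdr(native_hits=100, control_hits=5))
```

In a real screen, each window would be passed to a structure prediction tool and scored. The sketch covers only the windowing and the ratio used to summarize false positives.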

Accepted for publication in Nucleic Acids Research on 16 June 2013.