SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework
Computational methods for predicting evolutionarily conserved rather than thermodynamic RNA structures have recently attracted increased interest. These methods are indispensable not only for elucidating the regulatory roles of known RNA transcripts, but also for predicting RNA genes. It has been notoriously difficult to devise them to make the best use of the available data and to predict high-quality RNA structures that may also contain pseudoknots. We introduce a novel theoretical framework for co-estimating an RNA secondary structure including pseudoknots, a multiple sequence alignment, and an evolutionary tree, given several RNA input sequences. We also present an implementation of the framework in a new computer program, called SimulFold, which employs a Bayesian Markov chain Monte Carlo method to sample from the joint posterior distribution of RNA structures, alignments, and trees. We use the new framework to predict RNA structures, and comprehensively evaluate the quality of our predictions by comparing our results to those of several other programs. We also present preliminary data that show SimulFold's potential as an alignment and phylogeny prediction method. SimulFold overcomes many conceptual limitations that current RNA structure prediction methods face, introduces several new theoretical techniques, and generates high-quality predictions of conserved RNA structures that may include pseudoknots. It is thus likely to have a strong impact, both on the field of RNA structure prediction and on a wide range of data analyses.
Statistical evidence for conserved, local secondary structure in the coding regions of eukaryotic mRNAs and pre-mRNAs
Owing to the degeneracy of the genetic code, protein-coding regions of mRNA sequences can harbour more than only amino acid information. We search the mRNA sequences of 11 human protein-coding genes for evolutionarily conserved secondary structure elements using RNA-Decoder, a comparative secondary structure prediction program that is capable of explicitly taking the known protein-coding context of the mRNA sequences into account. We detect well-defined, conserved RNA secondary structure elements in the coding regions of the mRNA sequences and show that base-paired codons strongly correlate with sparse codons. We also investigate the role of repetitive elements in the formation of secondary structure and explain the use of alternate start codons in the caveolin-1 gene by a conserved secondary structure element overlapping the nominal start codon. We discuss the functional roles of our novel findings in regulating the gene expression on mRNA level. We also investigate the role of secondary structure on the correct splicing of the human CFTR gene. We study the wild-type version of the pre-mRNA as well as 29 variants with synonymous mutations in exon 12. By comparing our predicted secondary structures to the experimentally determined splicing efficiencies, we find with weak statistical significance that pre-mRNAs with high-splicing efficiencies have different predicted secondary structures than pre-mRNAs with low-splicing efficiencies.
Moments of the Boltzmann distribution for RNA secondary structures
We here present a dynamic programming algorithm which is capable of calculating arbitrary moments of the Boltzmann distribution for RNA secondary structures. We have implemented the algorithm in a program called RNA-VARIANCE and investigate the difference between the Boltzmann distribution of biological and random RNA sequences. We find that the minimum free energy structure of biological sequences has a higher probability in the Boltzmann distribution than random sequences. Moreover, we show that the free energies of biological sequences have a smaller variance than random sequences and that the minimum free energy of biological sequences is closer to the expected free energy of the rest of the structures than that of random sequences. These results suggest that biologically functional RNA sequences not only require a thermodynamically stable minimum free energy structure, but also an ensemble of structures whose free energies are close tothe minimum free energy.