Publications of Miklós, I.

Kim H, Toroczkai Z, Erdős PL, Miklós I, Székely LA. Degree-based graph construction. Journal of Physics A: Mathematical and Theoretical. 2009;42(39):392001.

Degree-based graph construction

Degree-based graph construction is a ubiquitous problem in network modelling (Newman et al 2006 The Structure and Dynamics of Networks (Princeton Studies in Complexity) (Princeton, NJ: Princeton University Press), Boccaletti et al 2006 Phys. Rep. 424 175), ranging from social sciences to chemical compounds and biochemical reaction networks in the cell. This problem includes existence, enumeration, exhaustive construction and sampling questions with aspects that are still open today. Here we give necessary and sufficient conditions for a sequence of nonnegative integers to be realized as a simple graph's degree sequence, such that a given (but otherwise arbitrary) set of connections from an arbitrarily given node is avoided. We then use this result to present a swap-free algorithm that builds all simple graphs realizing a given degree sequence. In a wider context, we show that our result provides a greedy construction method to build all the f-factor subgraphs (Tutte 1952 Can. J. Math. 4 314) embedded within K_n \ S_k, where K_n is the complete graph and S_k is a star graph centred on one of the nodes.
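The construction result above generalizes the classical Havel-Hakimi greedy procedure to the case where a prescribed set of connections from one node is forbidden. As background, here is a minimal sketch of the classical, unrestricted Havel-Hakimi test for graphical degree sequences; it is not the paper's algorithm, only the greedy starting point the paper extends:

```python
def is_graphical(degrees):
    """Havel-Hakimi test: can 'degrees' be realized by a simple graph?"""
    seq = sorted(degrees, reverse=True)
    while seq:
        d = seq.pop(0)          # take the largest remaining degree demand
        if d == 0:
            return True         # all remaining demands are zero
        if d > len(seq):
            return False        # not enough other nodes left to connect to
        for i in range(d):      # greedily connect to the d next-largest nodes
            seq[i] -= 1
            if seq[i] < 0:
                return False
        seq.sort(reverse=True)
    return True
```

The paper's swap-free enumeration rests on necessary and sufficient conditions of exactly this greedy flavour, extended so that a star of forbidden edges is avoided at every step.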

Streamlining and Large Ancestral Genomes in Archaea Inferred with a Phylogenetic Birth-and-Death Model

Homologous genes originate from a common ancestor through vertical inheritance, duplication, or horizontal gene transfer. Entire homolog families spawned by a single ancestral gene can be identified across multiple genomes based on protein sequence similarity. The sequences, however, do not always reveal conclusively the history of large families. To study the evolution of complete gene repertoires, we propose here a mathematical framework that does not rely on resolved gene family histories. We show that so-called phylogenetic profiles, formed by family sizes across multiple genomes, are sufficient to infer principal evolutionary trends. The main novelty in our approach is an efficient algorithm to compute the likelihood of a phylogenetic profile in a model of birth-and-death processes acting on a phylogeny. We examine known gene families in 28 archaeal genomes using a probabilistic model that involves lineage- and family-specific components of gene acquisition, duplication, and loss. The model enables us to consider all possible histories when inferring statistics about archaeal evolution. According to our reconstruction, most lineages are characterized by a net loss of gene families. Major increases in gene repertoire have occurred only a few times. Our reconstruction underlines the importance of persistent streamlining processes in shaping genome composition in Archaea. It also suggests that early archaeal genomes were as complex as typical modern ones, and even show signs, in the case of the methanogenic ancestor, of an extremely large gene repertoire.

Efficient Sampling of Parsimonious Inversion Histories with Application to Genome Rearrangement in Yersinia

Inversions are among the most common mutations acting on the order and orientation of genes in a genome, and polynomial-time algorithms exist to obtain a minimal length series of inversions that transform one genome arrangement to another. However, the minimum length series of inversions (the optimal sorting path) is often not unique as many such optimal sorting paths exist. If we assume that all optimal sorting paths are equally likely, then statistical inference on genome arrangement history must account for all such sorting paths and not just a single estimate. No deterministic polynomial algorithm is known to count the number of optimal sorting paths nor sample from the uniform distribution of optimal sorting paths. Here, we propose a stochastic method that uniformly samples the set of all optimal sorting paths. Our method uses a novel formulation of parallel Markov chain Monte Carlo. In practice, our method can quickly estimate the total number of optimal sorting paths. We introduce a variant of our approach in which short inversions are modeled to be more likely, and we show how the method can be used to estimate the distribution of inversion lengths and breakpoint usage in pathogenic Yersinia pestis. The proposed method has been implemented in a program called "MC4Inversion." We draw comparison of MC4Inversion to the sampler implemented in BADGER and a previously described importance sampling (IS) technique. We find that on high-divergence data sets, MC4Inversion finds more optimal sorting paths per second than BADGER and the IS technique and simultaneously avoids bias inherent in the IS technique.
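The elementary move along these sorting paths is the signed inversion (reversal). As a hedged illustration of one such step (the list-of-signed-integers representation is an assumption, not the paper's data structure):

```python
def reverse_segment(perm, i, j):
    """Apply an inversion to the signed permutation perm on positions i..j:
    the segment is reversed and every sign inside it is flipped."""
    segment = [-g for g in reversed(perm[i:j + 1])]
    return perm[:i] + segment + perm[j + 1:]
```

Applying the same inversion twice restores the original permutation, which is why a sorting path between two genome arrangements can be traversed in either direction.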

Miklós I, Mélykúti B, Swenson K. The Metropolized Partial Importance Sampling MCMC Mixes Slowly on Minimum Reversal Rearrangement Paths. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2009.

The Metropolized Partial Importance Sampling MCMC Mixes Slowly on Minimum Reversal Rearrangement Paths

Markov chain Monte Carlo has been the standard technique for inferring the posterior distribution of genome rearrangement scenarios under a Bayesian approach. We present here a negative result on the rate of convergence of the generally used Markov chains. We prove that the relaxation time of the Markov chains walking on the optimal reversal sorting scenarios might grow exponentially with the size of the signed permutations, namely, with the number of synteny blocks.

Stochastic models of sequence evolution including insertion--deletion events

Comparison of sequences that have descended from a common ancestor, based on an explicit stochastic model of substitutions, insertions and deletions, has risen to prominence in the last decade. Making statements about the positions of insertions and deletions (indels) is central in sequence and genome analysis and is called alignment. This statistical approach is conceptually and computationally harder than competing approaches that choose an alignment according to some optimality criterion, but it has major practical advantages for testing evolutionary hypotheses and for parameter estimation. Basic dynamic programming approaches allow the analysis of up to 4-5 sequences; MCMC techniques can bring this to about 10-15 sequences. Beyond this, different or heuristic approaches must be used. Besides the computational challenges, increasing realism in the underlying models is presently being addressed. A recent development that has been especially fruitful is combining statistical alignment with the problem of sequence annotation, i.e. making statements about the function of each nucleotide or amino acid. So far gene finding, protein secondary structure prediction and regulatory signal detection have been tackled within this framework. Much progress can be reported, but major challenges clearly remain if this approach is to be central in the analysis of large incoming sequence data sets.

BigFoot: Bayesian alignment and phylogenetic footprinting with MCMC

We have previously combined statistical alignment and phylogenetic footprinting to detect conserved functional elements without assuming a fixed alignment. Considering a probability-weighted distribution of alignments removes sensitivity to alignment errors, properly accommodates regions of alignment uncertainty, and increases the accuracy of functional element prediction. Our method utilized standard dynamic programming hidden Markov model algorithms to analyze up to four sequences. RESULTS: We present a novel approach, implemented in the software package BigFoot, for performing phylogenetic footprinting on greater numbers of sequences. We have developed a Markov chain Monte Carlo (MCMC) approach which samples both sequence alignments and locations of slowly evolving regions. We implement our method as an extension of the existing StatAlign software package and test it on well-annotated regions controlling the expression of the even-skipped gene in Drosophila and the alpha-globin gene in vertebrates. The results exhibit how adding additional sequences to the analysis has the potential to improve the accuracy of functional predictions, and demonstrate how BigFoot outperforms existing alignment-based phylogenetic footprinting techniques. CONCLUSION: BigFoot extends a combined alignment and phylogenetic footprinting approach to analyze larger amounts of sequence data using MCMC. Our approach is robust to alignment error and uncertainty and can be applied to a variety of biological datasets. The source code and documentation are publicly available for download from http://www.stats.ox.ac.uk/~satija/BigFoot/

Dynamics of Genome Rearrangement in Bacterial Populations

Whole-genome sequencing has revealed that organisms exhibit extreme variability in chromosome structure. One common type of chromosome structure variation is genome arrangement variation: changes in the ordering of genes on the chromosome. Not only do we find differences in genome arrangement across species, but in some organisms, members of the same species have radically different genome arrangements. We studied the evolution of genome arrangement in pathogenic bacteria from the genus Yersinia. The Yersinia exhibit substantial variation in genome arrangement both within and across species. We reconstructed the history of genome rearrangement by inversion in a group of eight Yersinia, and we statistically quantified the forces shaping their genome arrangement evolution. In particular, we discovered an excess of rearrangement activity near the origin of chromosomal replication and found evidence for a preferred configuration for the relative orientations of the origin and terminus of replication. We also found real inversions to be significantly shorter than expected. Finally, we discovered that no single reconstruction of inversion history is parsimonious with respect to the total number of inversion mutations, but on average, reconstructed genome arrangements favor “balanced” genomes—where the replication origin is positioned opposite the terminus on the circular chromosome.

How reliably can we predict the reliability of protein structure predictions?

BACKGROUND: Comparative methods have been the standard techniques for in silico protein structure prediction. The prediction is based on a multiple alignment that contains both reference sequences with known structures and the sequence whose unknown structure is predicted. Intensive research has been done to improve the quality of multiple alignments, since misaligned parts of the multiple alignment yield misleading predictions. However, sometimes all methods fail to predict the correct alignment, because the evolutionary signal is too weak to find the homologous parts, due to the large number of mutations that separate the sequences. RESULTS: Stochastic sequence alignment methods define a posterior distribution of possible multiple alignments. They can highlight the most likely alignment and, beyond that, give posterior probabilities for each alignment column. We made a comprehensive study on the HOMSTRAD database of structural alignments, predicting secondary structures in four different ways. We showed that alignment posterior probabilities correlate with the reliability of secondary structure predictions, though the strength of the correlation differs between protocols. The correspondence between the reliability of secondary structure predictions and alignment posterior probabilities is closest to the identity function when the secondary structure posterior probabilities are calculated from the posterior distribution of multiple alignments. The largest deviation from the identity function is obtained when predicting secondary structures from a single optimal pairwise alignment. We also showed that alignment posterior probabilities correlate with the 3D distances between the C-alpha atoms of amino acids in superimposed tertiary structures. CONCLUSION: Alignment posterior probabilities can be used to detect, a priori, errors in comparative models at the sequence alignment level.

StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees

Motivation: Bayesian analysis is one of the most popular methods in phylogenetic inference. The most commonly used methods fix a single multiple alignment and consider only substitutions as phylogenetically informative mutations, though alignments and phylogenies should be inferred jointly, as insertions and deletions also carry informative signal. Methods addressing these issues have been developed only recently, and so far there has been no user-friendly program with a graphical interface that implements them. Results: We have developed an extendable software package in the Java programming language that samples from the joint posterior distribution of phylogenies, alignments and evolutionary parameters by applying the Markov chain Monte Carlo method. The package also offers tools for efficient on-the-fly summarization of the results. It has a graphical interface to configure, start and supervise the analysis, to track the status of the Markov chain and to save the results. The background model for insertions and deletions can be combined with any substitution model. It is easy to add new substitution models to the software package as plugins. The samples from the Markov chain can be summarized in several ways, and new postprocessing plugins may also be installed. Availability: The code is available from http://phylogeny-cafe.elte.hu/StatAlign/ Contact: miklosi@ramet.elte.hu

Miklós I. Untitled. In: Iványi A, editor. Algorithms of Informatics. Budapest: Mondat Kiadó; 2007. p. 973-1011.
Miklós I. Statistical multiple alignment. In: Kao M-Y, editor. Encyclopedia of algorithms. New York NY: Springer; 2007.

SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework

Computational methods for predicting evolutionarily conserved rather than thermodynamic RNA structures have recently attracted increased interest. These methods are indispensable not only for elucidating the regulatory roles of known RNA transcripts, but also for predicting RNA genes. It has been notoriously difficult to devise them to make the best use of the available data and to predict high-quality RNA structures that may also contain pseudoknots. We introduce a novel theoretical framework for co-estimating an RNA secondary structure including pseudoknots, a multiple sequence alignment, and an evolutionary tree, given several RNA input sequences. We also present an implementation of the framework in a new computer program, called SimulFold, which employs a Bayesian Markov chain Monte Carlo method to sample from the joint posterior distribution of RNA structures, alignments, and trees. We use the new framework to predict RNA structures, and comprehensively evaluate the quality of our predictions by comparing our results to those of several other programs. We also present preliminary data that show SimulFold's potential as an alignment and phylogeny prediction method. SimulFold overcomes many conceptual limitations that current RNA structure prediction methods face, introduces several new theoretical techniques, and generates high-quality predictions of conserved RNA structures that may include pseudoknots. It is thus likely to have a strong impact, both on the field of RNA structure prediction and on a wide range of data analyses.

Csűrös M, Miklós I. A probabilistic model for gene content evolution with duplication, loss, and horizontal transfer. In: Research in computational molecular biology. Vol 3909. Berlin: Springer; 2006. p. 206-20. (Lecture Notes in Comput. Sci.; vol 3909).
Miklós I, Paige TB, Ligeti P. Efficient sampling of transpositions and inverted transpositions for Bayesian MCMC. In: Algorithms in bioinformatics. Vol 4175. Berlin: Springer; 2006. p. 174-85. (Lecture Notes in Comput. Sci.; vol 4175).

Efficient sampling of transpositions and inverted transpositions for Bayesian MCMC

The evolutionary distance between two organisms can be determined by comparing the order of appearance of orthologous genes in their genomes. Beyond the numerous parsimony approaches that try to obtain the shortest sequence of rearrangement operations sorting one genome into the other, Bayesian Markov chain Monte Carlo methods were introduced a few years ago. The computational time for convergence in the Markov chain is the product of the number of steps needed in the Markov chain and the computational time needed to perform one MCMC step. Therefore, faster methods for making one MCMC step can reduce the mixing time of an MCMC in terms of computer running time. We introduce two efficient algorithms for characterizing and sampling transpositions and inverted transpositions for Bayesian MCMC. The first algorithm characterizes the transpositions and inverted transpositions by the number of breakpoints the mutations change in the breakpoint graph; the second characterizes the mutations by the change in the number of cycles. Both algorithms run in O(n) time, where n is the size of the genome. This is a significant improvement over the previously available brute-force method with O(n³) running time and memory usage.
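For illustration, the breakpoint count these characterizations track can be stated, for an unsigned permutation measured against the identity, as follows (a simplified sketch; the paper works with signed permutations and the breakpoint graph, and achieves the O(n) characterization without recounting from scratch):

```python
def breakpoints(perm):
    """Count adjacencies broken relative to the identity permutation,
    with framing elements 0 and n+1 added at the two ends."""
    ext = [0] + perm + [len(perm) + 1]
    # a breakpoint is any pair of neighbours that are not consecutive
    return sum(1 for a, b in zip(ext, ext[1:]) if b - a != 1)
```

A transposition can change this count by at most three, which is why tracking the change per mutation, rather than recounting, pays off inside an MCMC step.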

Lunter G, Drummond AJ, Miklós I, Hein J. Statistical alignment: recent progress, new applications, and challenges. In: Statistical methods in molecular evolution. New York: Springer; 2005. p. 375-405. (Stat. Biol. Health).
Miklós I, Hein J. Genome Rearrangement in Mitochondria and Its Computational Biology. In: Genome Rearrangement in Mitochondria and Its Computational Biology. Berlin: Springer; 2005. p. 85-96. (Lecture Notes in Computer Science).

Genome Rearrangement in Mitochondria and Its Computational Biology

In the first part of this paper, we investigate gene orders of closely related mitochondrial genomes to study the properties of mutations rearranging genes in mitochondria. Our conclusion is that the evolution of mitochondrial genomes is more complicated than recent methods assume, and that stochastic modelling is necessary for its deeper understanding and more accurate inference. The second part is a review of Markov chain Monte Carlo approaches to the stochastic modelling of genome rearrangement, which seem to be the only computationally tractable way to approach this problem. We introduce the concept of partial importance sampling, which yields a class of Markov chains that are efficient both in terms of mixing and of computational time. We also give a list of open algorithmic problems whose solution might help improve the efficiency of partial importance samplers.

Bayesian coestimation of phylogeny and sequence alignment

BACKGROUND: Two central problems in computational biology are the determination of the alignment and phylogeny of a set of biological sequences. The traditional approach to this problem is to first build a multiple alignment of these sequences, followed by a phylogenetic reconstruction step based on this multiple alignment. However, alignment and phylogenetic inference are fundamentally interdependent, and ignoring this fact leads to biased and overconfident estimations. Whether the main interest is in sequence alignment or phylogeny, a major goal of computational biology is the co-estimation of both. RESULTS: We developed a fully Bayesian Markov chain Monte Carlo method for coestimating phylogeny and sequence alignment, under the Thorne-Kishino-Felsenstein model of substitution and single nucleotide insertion-deletion (indel) events. In our earlier work, we introduced a novel and efficient algorithm, termed the "indel peeling algorithm", which includes indels as phylogenetically informative evolutionary events, and resembles Felsenstein's peeling algorithm for substitutions on a phylogenetic tree. For a fixed alignment, our extension analytically integrates out both substitution and indel events within a proper statistical model, without the need for data augmentation at internal tree nodes, allowing for efficient sampling of tree topologies and edge lengths. To additionally sample multiple alignments, we here introduce an efficient partial Metropolized independence sampler for alignments, and combine these two algorithms into a fully Bayesian co-estimation procedure for the alignment and phylogeny problem. Our approach results in estimates for the posterior distribution of evolutionary rate parameters, for the maximum a posteriori (MAP) phylogenetic tree, and for the posterior decoding alignment. Estimates for the evolutionary tree and multiple alignment are augmented with confidence estimates for each node height and alignment column. Our results indicate that the patterns in reliability broadly correspond to structural features of the proteins, and thus provide biologically meaningful information that is not present in the usual point estimate of the alignment. Our methods can handle input data of moderate size (10-20 protein sequences, each 100-200 residues long), which we analyzed overnight on a standard 2 GHz personal computer. CONCLUSION: Joint analysis of multiple sequence alignments, evolutionary trees and additional evolutionary parameters can now be done within a single coherent statistical framework.

Statistical alignment of retropseudogenes and their functional paralogs via a common ancestor

A model is introduced for the sequence evolution of a retropseudogene and its functional paralog from a common protein-coding ancestor. The model accounts for substitutions, insertions and deletions, and combines nucleotide- and codon-level mutation models. We give dynamic programming algorithms for calculating the likelihood of homology between two sequences in the model, for obtaining the most probable alignment, and for computing the posterior probability of ancestral codons. Our method has several advantages, including the possibility of inferring evolutionary parameters, hypothesis testing, and quantitative assessment of the alignment. The algorithms were implemented in Java, and the performance of the methods is demonstrated by analyzing the evolution of human cytochrome c.

Statistical alignment of retropseudogenes and their functional paralogs

We describe a model for the sequence evolution of a processed pseudogene and its paralog from a common protein-coding ancestor. The model accounts for substitutions, insertions, and deletions and combines nucleotide- and codon-level mutation models. We give a dynamic programming method for calculating the likelihood of homology between two sequences in the model and describe the accompanying alignment algorithm. We also describe how ancestral codons can be computed when the same gene produced multiple pseudogene homologs. We apply our methods to the evolution of human cytochrome c.

A linear memory algorithm for Baum-Welch training

BACKGROUND: Baum-Welch training is an expectation-maximisation algorithm for training the emission and transition probabilities of hidden Markov models in a fully automated way. It can be employed as long as a training set of annotated sequences is known, and provides a rigorous way to derive parameter values which are guaranteed to be at least locally optimal. For complex hidden Markov models such as pair hidden Markov models and very long training sequences, even the most efficient algorithms for Baum-Welch training are currently too memory-consuming. This has so far effectively prevented the automatic parameter training of hidden Markov models that are currently used for biological sequence analyses. RESULTS: We introduce the first linear space algorithm for Baum-Welch training. For a hidden Markov model with M states, T free transition and E free emission parameters, and an input sequence of length L, our new algorithm requires O(M) memory and O(L M T_max (T + E)) time for one Baum-Welch iteration, where T_max is the maximum number of states that any state is connected to. The most memory-efficient algorithm until now was the checkpointing algorithm with O(log(L) M) memory and O(log(L) L M T_max) time requirement. Our novel algorithm thus renders the memory requirement completely independent of the length of the training sequences. More generally, for an n-hidden Markov model and n input sequences of length L, the memory requirement of O(log(L) L^(n-1) M) is reduced to O(L^(n-1) M) while the running time is changed from O(log(L) L^n M T_max + L^n (T + E)) to O(L^n M T_max (T + E)). An added advantage of our new algorithm is that a reduced time requirement can be traded for an increased memory requirement and vice versa, such that for any c ∈ {1, ..., T + E}, a time requirement of L^n M T_max c incurs a memory requirement of L^(n-1) M (T + E - c). CONCLUSION: For the large class of hidden Markov models used, for example, in gene prediction, whose number of states does not scale with the length of the input sequence, our novel algorithm can thus be both faster and more memory-efficient than any of the existing algorithms.
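The O(M) memory idea is visible already in the plain forward algorithm, which needs only the current column of the dynamic programming table; the paper's contribution is achieving the same independence from L for the Baum-Welch expectation counts, which is substantially harder. A minimal sketch of the forward pass only (the dict-based emission tables are an assumption of this sketch):

```python
def forward_likelihood(init, trans, emit, seq):
    """HMM forward algorithm keeping a single DP column, so memory is
    O(M) in the number of states and independent of len(seq).
    init[s]: start probability, trans[r][s]: transition probability,
    emit[s]: dict mapping symbol -> emission probability."""
    M = len(init)
    col = [init[s] * emit[s][seq[0]] for s in range(M)]
    for x in seq[1:]:
        col = [emit[s][x] * sum(col[r] * trans[r][s] for r in range(M))
               for s in range(M)]
    return sum(col)          # total probability of the sequence
```

Baum-Welch needs, in addition to this likelihood, expected transition and emission counts, which is where the extra (T + E) factor in the paper's time bound comes from.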

Statistical evidence for conserved, local secondary structure in the coding regions of eukaryotic mRNAs and pre-mRNAs

Owing to the degeneracy of the genetic code, protein-coding regions of mRNA sequences can harbour more than only amino acid information. We search the mRNA sequences of 11 human protein-coding genes for evolutionarily conserved secondary structure elements using RNA-Decoder, a comparative secondary structure prediction program that is capable of explicitly taking the known protein-coding context of the mRNA sequences into account. We detect well-defined, conserved RNA secondary structure elements in the coding regions of the mRNA sequences and show that base-paired codons strongly correlate with sparse codons. We also investigate the role of repetitive elements in the formation of secondary structure and explain the use of alternate start codons in the caveolin-1 gene by a conserved secondary structure element overlapping the nominal start codon. We discuss the functional roles of our novel findings in regulating the gene expression on mRNA level. We also investigate the role of secondary structure on the correct splicing of the human CFTR gene. We study the wild-type version of the pre-mRNA as well as 29 variants with synonymous mutations in exon 12. By comparing our predicted secondary structures to the experimentally determined splicing efficiencies, we find with weak statistical significance that pre-mRNAs with high-splicing efficiencies have different predicted secondary structures than pre-mRNAs with low-splicing efficiencies.

Miklós I, Meyer IM, Nagy B. Moments of the Boltzmann distribution for RNA secondary structures. Bulletin of Mathematical Biology. 2005;67(5):1031-47.

Moments of the Boltzmann distribution for RNA secondary structures

We here present a dynamic programming algorithm which is capable of calculating arbitrary moments of the Boltzmann distribution for RNA secondary structures. We have implemented the algorithm in a program called RNA-VARIANCE and investigate the difference between the Boltzmann distributions of biological and random RNA sequences. We find that the minimum free energy structure of biological sequences has a higher probability in the Boltzmann distribution than that of random sequences. Moreover, we show that the free energies of biological sequences have a smaller variance than those of random sequences, and that the minimum free energy of biological sequences is closer to the expected free energy of the rest of the structures than that of random sequences. These results suggest that biologically functional RNA sequences require not only a thermodynamically stable minimum free energy structure, but also an ensemble of structures whose free energies are close to the minimum free energy.
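The moments in question can be stated directly for an explicitly enumerated ensemble; the point of the paper is computing them by dynamic programming over the exponentially large structure set without enumeration. A brute-force sketch that merely defines the quantities (the RT value is an illustrative assumption):

```python
import math

def boltzmann_moments(energies, RT=0.6):
    """Mean and variance of free energy under the Boltzmann distribution
    p(s) proportional to exp(-E(s)/RT), over an explicitly listed ensemble.
    The paper obtains these moments without enumerating the structures."""
    weights = [math.exp(-e / RT) for e in energies]
    Z = sum(weights)                       # partition function
    probs = [w / Z for w in weights]
    mean = sum(p * e for p, e in zip(probs, energies))
    var = sum(p * (e - mean) ** 2 for p, e in zip(probs, energies))
    return mean, var
```

Because lower-energy structures get exponentially more weight, the Boltzmann mean sits closer to the minimum free energy than the unweighted average of the ensemble.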

A New Multivariate Approach to Studying Temporal Changes of Vegetation

We emphasize the necessity of a complex approach to evaluating vegetation change at various levels of abstraction. The analytical steps include comparisons at the data, derived variable, distance, ordination and classification levels. A variety of data randomization methods incorporated in testing the significance of changes in raw data are introduced and compared. It is shown that these are true alternatives to Procrustean comparisons, which offer an apparently unfortunate choice in the presence/absence case. We propose a new method, called adjacency analysis, to evaluate nearest neighbor relationships among quadrats and thereby detect temporal trends that may remain unrevealed if attention is paid only to full distance structures. As an illustration, we evaluate compositional and structural changes in the rock grassland vegetation of the Sas-hegy Nature Reserve (Budapest, Hungary), intensively sampled by quadrats in 1977 and 2000. Permutation tests show that differences between the 2 years are much smaller than expected by chance alone. Such overall stability in community structure, however, does not mean that minor aspects of the vegetation pattern are invariant over the years. Changes in life form and seed mass spectra are explained by the fluctuation of hemicryptophytes and the slight but detectable expansion of annuals and woody species. The classification is slightly rearranged in time, with clearly detectable within-cluster changes, also depicted in ordination scattergrams.

Rearrangement of ecological data matrices via Markov chain Monte Carlo simulation

Block clustering and seriation reveal the underlying structure of ecological data sets by rearranging the rows and the columns of the data table or data matrix, usually representing species and sample sites, respectively. Classical approaches to this problem rely upon a goodness criterion optimized through an iterative algorithm, or utilize a simultaneous classification or ordination of species and sites. The new procedure introduced here does not strive for a single optimal rearrangement. Instead, it generates a series of probability distributions, the Boltzmann distributions, over a large set of solutions that includes both the optimal and suboptimal solutions. The procedure is governed by a hypothetical parameter, T, called the "temperature". At the value zero, only the best rearrangements (if there are many) have nonzero probabilities. As the value of this parameter increases, less optimal rearrangements become more frequent in the associated series of distributions. When T approaches infinity, the distribution becomes uniform over all the possible matrix rearrangements. We propose a Markov chain Monte Carlo (MCMC) method which converges reasonably fast to this distribution for any value of T. This chain provides a sample of matrices that can be characterized via several statistics based on the Boltzmann distribution. The relevance of the method is demonstrated using ecological data. We illustrate how much extra information can be gained from suboptimal solutions that may have biological meaning not revealed by the best solution. Although the objective is to give a distribution of solutions rather than a single optimal solution, the new method can actually outperform heuristic searching algorithms in finding the best arrangement. We provide source code for the MCMC method in the C language, which can be compiled under many operating systems (Windows, Linux/Unix, Macintosh OS) and used in command line mode. The Linux/Unix version operates in interactive mode, gives a graphical output of the results, and is available through a web server interface in PHP 4.3, which can also be installed on personal computers.
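The core of such a sampler is the generic Metropolis step targeting a Boltzmann distribution at temperature T, sketched below; `score` and `propose` are placeholder callables (assumptions of this sketch), whereas the paper's actual proposals rearrange rows and columns of the data matrix:

```python
import math
import random

def metropolis_step(state, score, propose, T, rng=random):
    """One Metropolis step targeting p(state) proportional to
    exp(score(state)/T): propose a rearrangement and accept it with
    probability min(1, exp((score(new) - score(old))/T))."""
    candidate = propose(state)
    delta = score(candidate) - score(state)
    if delta >= 0 or rng.random() < math.exp(delta / T):
        return candidate
    return state
```

With the convention that higher score means more probable (for energies, take score = -E), large T accepts almost everything, giving the uniform limit, while T near zero concentrates the chain on the best rearrangements, exactly the family of distributions described above.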

Statistical alignment: recent progress, new applications and challenges

Two papers by Thorne, Kishino, and Felsenstein in the early 1990s provided a basis for performing alignment within a statistical framework. Here we review progress and associated challenges in the investigation of models of insertions and deletions in biological sequences stemming from this early work. In the last few years this approach to sequence analysis has experienced a renaissance and recent progress has given this methodology the potential for becoming a practical research tool. The advantages of a statistical approach to alignment include the possibility of parameter inference, hypothesis testing, and assessment of uncertainty, none of which are possible using the score-based methods that currently predominate. Recent progress in statistical alignment includes better models, the extension of pairwise alignment algorithms to many sequences, faster algorithms, and the increased use of MCMC methods to handle practical problems. In this chapter we illustrate the statistical approach to multiple sequence alignment on a series of increasingly large data sets.

Miklós I. Bioinformatikai algoritmusok [Bioinformatics algorithms]. In: Iványi A, editor. Informatikai algoritmusok [Algorithms of informatics]. Budapest: Eötvös Kiadó; 2004. p. 538-79.

Co-transcriptional folding is encoded within RNA genes

Most of the existing RNA structure prediction programs fold a completely synthesized RNA molecule. However, within the cell, RNA molecules emerge sequentially during the directed process of transcription. Dedicated experiments with individual RNA molecules have shown that RNA folds while it is being transcribed and that its correct folding can also depend on the proper speed of transcription. METHODS: The main aim of this work is to study if and how co-transcriptional folding is encoded within the primary and secondary structure of RNA genes. In order to achieve this, we study the known primary and secondary structures of a comprehensive data set of 361 RNA genes as well as a set of 48 RNA sequences that are known to differ from the originally transcribed sequence units. We detect co-transcriptional folding by defining two measures of directedness which quantify the extent of asymmetry between alternative helices that lie 5' and those that lie 3' of the known helices with which they compete. RESULTS: We show with statistical significance that co-transcriptional folding strongly influences RNA sequences in two ways: (1) alternative helices that would compete with the formation of the functional structure during co-transcriptional folding are suppressed, and (2) the formation of transient structures which may serve as guidelines for the co-transcriptional folding pathway is encouraged. CONCLUSIONS: These findings have a number of implications for RNA secondary structure prediction methods and the detection of RNA genes.

Miklós I. ParIS Genome Rearrangement server. Bioinformatics. 2004;21(6):817-20. Abstract

ParIS Genome Rearrangement server

ParIS Genome Rearrangement is a web server for a Bayesian analysis of unichromosomal genome pairs. The underlying model allows inversions, transpositions and inverted transpositions. The server generates a Markov chain using a Partial Importance Sampler technique, and samples trajectories of mutations from this chain. The user can specify several marginalizations of the posterior: the posterior distribution of the number of mutations needed to transform one genome into another, the length distribution of mutations, and the number of mutations that have occurred at a given site. Both text and graphical outputs are available. We provide a limited server, a downloadable unlimited server that can be installed locally on any Linux/Unix operating system, and a database of mitochondrial gene orders.

Miklós I, Lunter GA, Holmes I. A "Long Indel" Model For Evolutionary Sequence Alignment. Molecular Biology and Evolution. 2004;21(3):529-40. Abstract

A "Long Indel" Model For Evolutionary Sequence Alignment

We present a new probabilistic model of sequence evolution, allowing indels of arbitrary length, and give sequence alignment algorithms for our model. Previously implemented evolutionary models have allowed (at most) single-residue indels or have introduced artifacts such as the existence of indivisible "fragments." We compare our algorithm to these previous methods by applying it to the structural homology dataset HOMSTRAD, evaluating the accuracy of (1) alignments and (2) evolutionary time estimates. With our method, it is possible (for the first time) to integrate probabilistic sequence alignment, with reliability indicators and arbitrary gap penalties, in the same framework as phylogenetic reconstruction. Our alignment algorithm requires that we evaluate the likelihood of any specific path of mutation events in a continuous-time Markov model, with the event times integrated out. To this effect, we introduce a "trajectory likelihood" algorithm (Appendix A). We anticipate that this algorithm will be useful in more general contexts, such as Markov chain Monte Carlo simulations.
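The core quantity in such a trajectory likelihood — the probability of a fixed sequence of jump events in a continuous-time Markov chain, with the event times integrated out over a total time T — has a classical closed form when the exit rates of the visited states are pairwise distinct (the hypoexponential density). The sketch below illustrates only that special-case integral under an assumed set of distinct rates; it is not the paper's Appendix A algorithm, which handles the general case.

```python
import math

def holding_time_integral(exit_rates, T):
    """Integral over all splits of total time T into holding times
    t_0..t_n (one per visited state) of prod_i exp(-R_i * t_i),
    i.e. the event times integrated out. Closed form, valid only for
    pairwise-distinct exit rates R_i:
        sum_i exp(-R_i T) / prod_{j != i} (R_j - R_i)
    """
    total = 0.0
    for i, Ri in enumerate(exit_rates):
        denom = 1.0
        for j, Rj in enumerate(exit_rates):
            if j != i:
                denom *= (Rj - Ri)
        total += math.exp(-Ri * T) / denom
    return total

def trajectory_likelihood(event_rates, exit_rates, T):
    """Likelihood of one specific sequence of mutation events in time T:
    the product of the rates of the chosen events, times the integral
    over the unobserved event times."""
    prefactor = 1.0
    for q in event_rates:
        prefactor *= q
    return prefactor * holding_time_integral(exit_rates, T)
```

With a single state (no events) this reduces to exp(-R0·T), and with two states it reproduces the familiar two-exponential convolution, which makes the formula easy to sanity-check.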

Randomization of Presence-Absence Matrices: Comments and New Algorithms

Randomization of presence-absence data matrices with fixed row and column totals is an important tool in ecological research wherever the significance of data-based statistics (e.g., species association measures) is to be evaluated. In the current literature of numerical ecology, however, there has been no algorithm that randomizes moderately large matrices in short time such that equidistribution of results is guaranteed. We show that a simple modification of the swap algorithm, called here the "trial-swap method," satisfies the requirement for equidistribution. Since this is relatively slow, we suggest two fast algorithms that, combined with the trial-swap method, produce all possible results with equal chance. The three procedures are illustrated using actual examples taken from bird biogeography and vegetation ecology.
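The swapping step underlying the trial-swap idea — drawing a random 2x2 submatrix and exchanging its entries only when it forms a checkerboard, while counting every draw as a step whether or not a swap occurs — can be sketched as follows. This is a minimal illustration with hypothetical function names, not the authors' published implementation, and it omits the two fast companion algorithms mentioned in the abstract.

```python
import random

def trial_swap(matrix, trials=10000, seed=0):
    """Trial-swap randomization of a 0/1 matrix with fixed row and
    column totals. Each trial picks two rows and two columns at random;
    the 2x2 submatrix is swapped only when it is a checkerboard
    (10/01 or 01/10). Counting failed trials as steps is what gives
    the chain a uniform stationary distribution."""
    rng = random.Random(seed)
    m = [row[:] for row in matrix]  # work on a copy
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(trials):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        a, b = m[r1][c1], m[r1][c2]
        c, d = m[r2][c1], m[r2][c2]
        if a == d and b == c and a != b:  # checkerboard submatrix
            m[r1][c1], m[r1][c2] = b, a
            m[r2][c1], m[r2][c2] = d, c
    return m
```

Each checkerboard swap changes which cells hold the 1s without altering any row or column sum, so the margins of the output always match those of the input.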

Miklós I, Drummond A, Lunter G, Hein J. Bayesian Phylogenetic Inference under a Statistical Insertion-Deletion Model. In: Algorithms in Bioinformatics. Berlin: Springer; 2003. p. 228-44. (Lecture Notes in Computer Science). Abstract

Bayesian Phylogenetic Inference under a Statistical Insertion-Deletion Model

A central problem in computational biology is the inference of phylogeny given a set of DNA or protein sequences. Currently, this problem is tackled stepwise, with phylogenetic reconstruction dependent on an initial multiple sequence alignment step. However these two steps are fundamentally interdependent. Whether the main interest is in sequence alignment or phylogeny, a major goal of computational biology is the co-estimation of both. Here we present a first step towards this goal by developing an extension of the Felsenstein peeling algorithm. Given an alignment, our extension analytically integrates out both substitution and insertion–deletion events within a proper statistical model. This new algorithm provides a solution to two important problems in computational biology. Firstly, indel events become informative for phylogenetic reconstruction, and secondly phylogenetic uncertainty can be included in the estimation of insertion-deletion parameters. We illustrate the practicality of this algorithm within a Bayesian Markov chain Monte Carlo framework by demonstrating it on a non-trivial analysis of a multiple alignment of ten globin protein sequences.

Lunter GA, Miklós I, Song YS, Hein J. An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. Journal of Computational Biology. 2003;10(6):869-89. Abstract

An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees

We present an efficient algorithm for statistical multiple alignment based on the TKF91 model of Thorne, Kishino, and Felsenstein (1991) on an arbitrary k-leaved phylogenetic tree. The existing algorithms use a hidden Markov model approach, which requires at least O(√(5^k)) states and leads to a time complexity of O(5^k·L^k), where L is the geometric mean sequence length. Using a combinatorial technique reminiscent of inclusion/exclusion, we are able to sum away the states, thus improving the time complexity to O(2^k·L^k) and considerably reducing memory requirements. This makes statistical multiple alignment under the TKF91 model a definite practical possibility in the case of a phylogenetic tree with a modest number of leaves.

Miklós I. MCMC genome rearrangement. Bioinformatics. 2003;19(Suppl 2):ii130-ii137. Abstract

MCMC genome rearrangement

We introduce a Markov chain Monte Carlo method for genome rearrangement based on a stochastic model of evolution, which can estimate the number of different evolutionary events needed to sort a signed permutation. The performance of the method was tested on simulated data, and the estimated numbers of different types of mutations were reliable. Human and Drosophila mitochondrial data were also analysed with the new method. The mixing time of the Markov chain is short both in terms of CPU time and number of proposals.

Algorithm for statistical alignment of two sequences derived from a Poisson sequence length distribution

The statistical approach to DNA sequence evolution involves the stochastic modelling of the substitution, insertion and deletion processes. Substitution has been modelled by finite-state Markov processes for more than three decades. Modelling of the insertion and deletion processes is far less mature, and the most recent model has a serious drawback: it assumes a geometric equilibrium distribution of sequence lengths, which contradicts biological knowledge. An algorithm is presented that computes the joint probability of two sequences that evolved in a non-reversible way from a Poisson sequence length distribution.

Miklós I. Algorithm for statistical alignment of two sequences derived from a Poisson sequence length distribution. Discrete Applied Mathematics. The Journal of Combinatorial Algorithms, Informatics and Computational Sciences. 2003;127(1):79-84.

An improved algorithm for statistical alignment of sequences related by a star tree

The insertion-deletion model developed by Thorne, Kishino and Felsenstein (1991, J. Mol. Evol., 33, 114-124; the TKF91 model) provides a statistical framework for the alignment of two sequences. The statistical alignment of a set of sequences related by a star tree is a generalization of this model. The known algorithm computes the probability of a set of such sequences in O(l^(2k)) time, where l is the geometric mean of the sequence lengths and k is the number of sequences. An improved algorithm is presented whose running time is only O(2^(2k)·l^k).

Resemblance coefficients and the horseshoe effect in principal coordinates analysis

Although principal coordinates analysis is one of the most widely used ordination methods in ecology, no study had been undertaken as yet on the combined effect of gradient type and resemblance coefficient on the results. We examine the performance of principal coordinates analysis with different choices of the resemblance function and different types of a single underlying gradient. Whereas unimodal species response to long gradients always leads to horseshoe (or arch)-shaped configurations in the first two dimensions, the converse is not true; curvilinear arrangements cannot generally be explained by the Gaussian model. Several resemblance coefficients widely used in ecology produce paradoxical arches from perfectly linear data. Species richness changes alone may also lead to a horseshoe for even more distance functions, with the noted exception of Manhattan metric. The appearance of arches is a mathematical necessity in these cases; true artifacts are introduced only if distances are treated inappropriately before eigenanalysis. Examples illustrate that similar configurations (curves and even circles) may arise from very different data structures; therefore the shape of the point scatter is insufficient by itself to identify background ecological phenomena. The horseshoe effect may be diminished and eigenvalue extraction may be made more efficient if input measures are raised to high powers; but this operation is recommended only in combination with standard analyses, as part of a comparative approach. We derive a new distance function, implying standardization by species totals, from the chi-square distance. We found that this function improves gradient recovery when there is unimodal species response and some species have their optima outside the range of study.
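For reference, the standard chi-square distance from which the new, species-total-standardized function is derived can be sketched as follows. This is a minimal illustration of the familiar correspondence-analysis formula between two rows of an abundance table, not the paper's new coefficient.

```python
import math

def chi_square_distance(matrix, i, k):
    """Standard chi-square distance between rows (e.g. sites) i and k
    of an abundance table: row profiles are standardized by row totals
    and weighted inversely by column (e.g. species) totals, scaled by
    the grand total. Rows with proportional profiles are at distance 0."""
    grand = sum(sum(row) for row in matrix)
    col_totals = [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]
    ri, rk = sum(matrix[i]), sum(matrix[k])
    s = 0.0
    for j, cj in enumerate(col_totals):
        if cj > 0:  # skip empty species columns
            diff = matrix[i][j] / ri - matrix[k][j] / rk
            s += diff * diff / cj
    return math.sqrt(grand * s)
```

The inverse column-total weighting is what down-weights abundant species; the paper's proposed modification replaces this standardization with one based on species totals.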

Miklós I, Toroczkai Z. An improved model for statistical alignment. In: Algorithms in Bioinformatics (Århus, 2001). Berlin: Springer; 2001. p. 1-10. (Lecture Notes in Computer Science; vol 2149).

Epigenetic inheritance, genetic assimilation and speciation

Epigenetic inheritance systems enable environmentally induced phenotypes to be transmitted between generations. Jablonka and Lamb (1991, 1995) proposed that these systems have a substantial role during speciation. They argued that divergence of isolated populations may be first triggered by the accumulation of (heritable) phenotypic differences that are later followed and strengthened by genetic changes. The plausibility of this idea is examined in this paper. First, we discuss the “exploratory” behaviour of an epigenetic inheritance system on a one-peak adaptive landscape. If a quantitative trait is far from the optimum, then it is advantageous to induce heritable phenotypic variation. Conversely, if the genotypes get closer to the peak, it is more favorable to canalize the phenotypic expression of the character. This process would lead to genetic assimilation. Next we show that the divergence of heritable epigenetic marks acts to reduce or to eliminate the genetic barrier between two adaptive peaks. Therefore, an epigenetic inheritance system can increase the probability of transition from one adaptive state to another. Peak shift might be initiated by (i) slight changes in the inducing environment or by (ii) genetic drift of the genes controlling epigenetic variability. Remarkably, drift-induced transition is facilitated even if phenotypic variation is not heritable. A corollary of our thesis is that evolution can proceed through suboptimal phenotypic states, without passing through a deep adaptive valley of the genotype. We also consider the consequences of this finding on the dynamics and mode of reproductive isolation.