close
Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2015 Aug 1;31(15):2489-96.
doi: 10.1093/bioinformatics/btv185. Epub 2015 Apr 2.

SPARSE: quadratic time simultaneous alignment and folding of RNAs without sequence-based heuristics

Affiliations

SPARSE: quadratic time simultaneous alignment and folding of RNAs without sequence-based heuristics

Sebastian Will et al. Bioinformatics. .

Abstract

Motivation: RNA-Seq experiments have revealed a multitude of novel ncRNAs. The gold standard for their analysis based on simultaneous alignment and folding suffers from extreme time complexity of [Formula: see text]. Subsequently, numerous faster 'Sankoff-style' approaches have been suggested. Commonly, the performance of such methods relies on sequence-based heuristics that restrict the search space to optimal or near-optimal sequence alignments; however, the accuracy of sequence-based methods breaks down for RNAs with sequence identities below 60%. Alignment approaches like LocARNA that do not require sequence-based heuristics, have been limited to high complexity ([Formula: see text] quartic time).

Results: Breaking this barrier, we introduce the novel Sankoff-style algorithm 'sparsified prediction and alignment of RNAs based on their structure ensembles (SPARSE)', which runs in quadratic time without sequence-based heuristics. To achieve this low complexity, on par with sequence alignment algorithms, SPARSE features strong sparsification based on structural properties of the RNA ensembles. Following PMcomp, SPARSE gains further speed-up from lightweight energy computation. Although all existing lightweight Sankoff-style methods restrict Sankoff's original model by disallowing loop deletions and insertions, SPARSE transfers the Sankoff algorithm to the lightweight energy model completely for the first time. Compared with LocARNA, SPARSE achieves similar alignment and better folding quality in significantly less time (speedup: 3.7). At similar run-time, it aligns low sequence identity instances substantially more accurate than RAF, which uses sequence-based heuristics.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Example alignment A of two RNA sequences A and B together with (non-crossing) structures S={a1,a2,a3,a4,a5} and T={b1,b2,b3,b4,b5}. We highlight the positions in the loop closed by a1 and in the loop closed by b2. The base pair a1 is the parent of the highlighted positions in A and of a2. The base pair a1 closes a 2-loop; a2, a 1-loop and a3, a multiloop. The latter is a 3-loop, since a3 has two inner base pairs (a4 and a5.) Note that the structure alignment triple (A,S,T) covers the external base pairs a1,a3,b1 and b2; as well as the inner base pairs of the two multiloops. Finally, (A,S,T) deletes the entire 2-loop of a1 and inserts the entire 2-loop of b2
Fig. 2.
Fig. 2.
Recursions of the novel lightweight alignment algorithm PARSE
Fig. 3.
Fig. 3.
(A) Example alignment. Due to stacking effects, the probability of a base pair a3 in the loop closed by a2 (loop of single structure) is much higher than its probability in the loop closed by a1 [loop of the consensus structure {(a1,b1),(a3,b2)}]. (B) Computing represented entries in the sparsified algorithm. We show the matrix Mab; the rounded bars left and on top of the matrix symbolize the represented positions for a and b; the gray areas contain the represented entries for Mab. In our example, the entry Mab(i,k) recurses (solid arrows) to unrepresented entries Mab(i1,k1),Mab(i,k1), Mab(i1,k) and Mab(a1L1,b1L1) (white boxes); the latter via matching base pairs; their left ends correspond to the dashed box at (a1L,b1L). The numbers 1-4 at the arrow heads refer to the respective recursion case. The unrepresented entries are computed from represented entries (dashed arrows to black boxes), each in constant time
Fig. 4.
Fig. 4.
Alignment quality (measured by SPS) at different sequence identities for pairwise alignments (Bralibase 2.1 set k2). The curves are lowess curves (Cleveland, 1981) through data points for each benchmark instance. The thin lines visualize the distribution of scores by estimating the respective instance averages above and below of the main lowess curve
Fig. 5.
Fig. 5.
Structure prediction quality measured by MCC within different ranges of average pairwise sequence identity (APSI) shown as boxplots. (Bralibase 2.1 set k2) whiskers are extended up to one interquartile range from the boxes

References

    1. Bernhart S.H., et al. (2008) RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9, 474. - PMC - PubMed
    1. Chambers J.M., et al. (1983) Graphical Methods for Data Analysis . Wadsworth, Belmont, CA.
    1. Clark M.B., et al. (2011) The reality of pervasive transcription. PLoS Biol, 9, e1000625; discussion e1001102. - PMC - PubMed
    1. Cleveland W.S. (1981) Lowess: a program for smoothing scatterplots by robust locally weighted regression. Am. Stat., 35, 54.
    1. Do C.B., et al. (2008) A max-margin model for efficient simultaneous alignment and folding of RNA sequences. Bioinformatics, 24, i68–i76. - PMC - PubMed

Publication types