
Efficient mapping of accurate long reads in minimizer space with mapquik.

Ekim B, Sahlin K, Medvedev P, Berger B, Chikhi R

Genome Res 33 (7) 1188-1197 [2023-07-00; online 2023-06-30]

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps (fundamental bottlenecks to read mapping) for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled from not only minimizer-space seeding but also a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
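The seeding idea in the abstract (k consecutively sampled minimizers as a seed, with only reference-unique k-min-mers indexed) can be illustrated with a toy Python sketch. This is not the mapquik implementation: the function names, the CRC32 hash, the density-threshold sampling rule, and the parameter values are all assumptions chosen for illustration, and the pseudochaining step is not shown.

```python
import zlib
from collections import Counter


def sample_minimizers(seq, l=5, density=0.3):
    """Return (position, l-mer) pairs whose hash falls below a threshold.

    Assumption: an l-mer is sampled when its CRC32 hash is below
    density * 2**32, so selection depends only on the l-mer itself.
    """
    threshold = int(density * 2**32)
    picked = []
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        if zlib.crc32(lmer.encode()) < threshold:
            picked.append((i, lmer))
    return picked


def k_min_mers(seq, k=3, l=5, density=0.3):
    """Return (k-min-mer, start position) for every run of k consecutive minimizers."""
    mins = sample_minimizers(seq, l, density)
    return [
        (tuple(lmer for _, lmer in mins[j:j + k]), mins[j][0])
        for j in range(len(mins) - k + 1)
    ]


def build_unique_index(reference, k=3, l=5, density=0.3):
    """Index only the k-min-mers that occur exactly once in the reference."""
    kmms = k_min_mers(reference, k, l, density)
    counts = Counter(key for key, _ in kmms)
    return {key: pos for key, pos in kmms if counts[key] == 1}


def seed_matches(read, index, k=3, l=5, density=0.3):
    """Return (read position, reference position) pairs for k-min-mers found in the index."""
    return [
        (read_pos, index[key])
        for key, read_pos in k_min_mers(read, k, l, density)
        if key in index
    ]
```

Because each index entry points to a single reference location, every seed hit is unambiguous. In this sketch, a read that is an exact substring of the reference yields hits that all share the same offset between read and reference positions.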

Kristoffer Sahlin

SciLifeLab Fellow

PubMed 37399256

DOI 10.1101/gr.277679.123

Crossref 10.1101/gr.277679.123

pmc: PMC10538364
pii: gr.277679.123
medline: 9509184

