Automated Removal of Non-homologous Sequence Stretches with PREQUAL.

Irisarri I, Burki F, Whelan S

Methods in molecular biology (Clifton, N.J.) 2231 (-) 147-162 [2020-12-09; online 2020-12-09]

Large-scale multigene datasets used in phylogenomics and comparative genomics often contain sequence errors inherited from source genomes and transcriptomes. These errors typically manifest as stretches of non-homologous characters and derive from sequencing, assembly, and/or annotation errors. The lack of automatic tools to detect and remove sequence errors leads to the propagation of these errors in large-scale datasets. PREQUAL is a command-line tool that identifies and masks regions with non-homologous adjacent characters in sets of unaligned homologous sequences. PREQUAL uses a fully probabilistic approach based on pair hidden Markov models. On the front end, PREQUAL is user-friendly and simple to use while also allowing full customization to adjust filtering sensitivity. It is primarily aimed at amino acid sequences but can also handle protein-coding nucleotide sequences. PREQUAL is computationally efficient and shows high sensitivity and accuracy. In this chapter, we briefly introduce the motivation for PREQUAL and its underlying methodology, then describe basic and advanced usage, and conclude with some notes and recommendations. PREQUAL fills an important gap in the current bioinformatics toolkit for phylogenomics, contributing toward increased accuracy and reproducibility in future studies.

Fabien Burki

SciLifeLab Fellow

PubMed 33289892

DOI 10.1007/978-1-0716-1036-7_10

Crossref 10.1007/978-1-0716-1036-7_10