High-performance Processing of Next-generation Sequencing Data on CUDA-enabled GPUs

Author: Felix Kallenborn
Release: 2024



With technological advances in genomics and sequencing, processing the vast amounts of generated data is becoming increasingly challenging. Software for processing large-scale datasets of sequencing reads may take hours to days to complete, even on high-end workstations, which motivates new approaches to faster, high-performance applications. In contrast to traditional CPU-based software, algorithms that exploit the massively parallel many-core architecture and fast memory of GPUs can potentially deliver the desired performance in many fields.

In this thesis, we introduce two novel GPU-accelerated applications, CARE and CAREx, for two common steps in sequence processing pipelines: error correction and read extension of Illumina Next Generation Sequencing (NGS) data. Both steps aim to improve the results of downstream data analysis. To the best of our knowledge, CARE and CAREx are the first modern GPU-accelerated solutions for the respective problems.

A key component of our algorithms is the identification of similar DNA sequences within a dataset. For this purpose, we developed a minhashing-based index data structure for large-scale read datasets. In conjunction with our fast bit-parallel shifted Hamming distance computations, this allows for the efficient identification of similar reads. The resulting set of similar sequences is subsequently arranged into a gap-free multiple-sequence alignment to solve the problem at hand.

Sequencing machines introduce both systematic and random errors. CARE (Context-Aware Read Error corrector) accurately removes errors introduced by NGS machines during the initial sequencing of a biological sample. With the help of a pre-trained Random Forest, CARE generates two orders of magnitude fewer false positives than its competitors while achieving similar numbers of true positives.

Read extension describes the process of elongating DNA sequences. Longer sequences improve the resolution of more and larger structures within a genome. CAREx (Context-Aware Read Extender) produces longer sequences, so-called pseudo-long reads, by connecting the two reads of read pairs that were sequenced in close proximity. Evaluation shows that CAREx produces significantly more highly accurate pseudo-long reads than the state of the art.

With algorithms tailored towards high-performance GPU computations, both CARE and CAREx run significantly faster than their CPU-based competitors while producing more accurate results. Processing a large human dataset with 30x coverage with CARE requires less than 30 minutes on a single A100 GPU. This time can be further reduced to 10 minutes on multi-GPU systems. In contrast, the CPU-based tools Musket and BFC take 3 hours and 1.5 hours, respectively. Read extension of a human dataset with CAREx takes 3.3 hours on a single GPU, whereas Konnector2 requires over a day.

This shows that large-scale sequence processing can greatly benefit from the use of GPUs, and that multiple-sequence alignment-based algorithms should be considered despite their increased complexity because they provide high accuracy. While our general building blocks have been tailored towards our needs for error correction and read extension, they could also prove useful in other GPU-accelerated applications that process sequence data.
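
The similar-read search outlined in the abstract (a minhashing-based index to propose candidates, followed by a bit-parallel shifted Hamming distance to check them) can be illustrated with a small CPU-only sketch. This is not the CARE implementation: every function name, the choice of `blake2b` as hash function, the 2-bit base encoding, the parameters `k`, `num_hashes` and `min_overlap`, and the rule for picking the best shift are assumptions made for illustration, and Python integers stand in for the machine words a GPU kernel would use.

```python
import hashlib
from collections import defaultdict

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def minhash_signature(seq, k=4, num_hashes=4):
    """For each seeded hash function, keep the smallest hash over all k-mers."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8, salt=salt).digest()
            for i in range(len(seq) - k + 1)
        ))
    return signature


def build_index(reads, k=4, num_hashes=4):
    """Map (hash function number, minhash value) to the ids of reads that produced it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for table, value in enumerate(minhash_signature(seq, k, num_hashes)):
            index[(table, value)].add(read_id)
    return index


def find_candidates(index, seq, k=4, num_hashes=4):
    """Return ids of indexed reads sharing at least one minhash value with seq."""
    found = set()
    for table, value in enumerate(minhash_signature(seq, k, num_hashes)):
        found |= index.get((table, value), set())
    return found


def encode_2bit(seq):
    """Pack a DNA string into one integer, two bits per base, first base in the lowest bits."""
    value = 0
    for i, base in enumerate(seq):
        value |= BASE_CODE[base] << (2 * i)
    return value


def hamming_2bit(a, b, length):
    """Count mismatching bases among the first `length` bases of two packed sequences."""
    mask = (1 << (2 * length)) - 1
    diff = (a ^ b) & mask
    low_bits = int("01" * length, 2) if length else 0
    # A base differs if either of its two bits differs: fold the high bit onto the low bit.
    folded = (diff | (diff >> 1)) & low_bits
    return bin(folded).count("1")  # population count


def best_shifted_hamming(a, b, min_overlap=5):
    """Try every shift of b against a; return (shift, mismatches, overlap) with the lowest
    mismatch rate, preferring longer overlaps on ties, or None if no shift qualifies."""
    ea, eb = encode_2bit(a), encode_2bit(b)
    best, best_key = None, None
    for shift in range(-(len(b) - min_overlap), len(a) - min_overlap + 1):
        if shift >= 0:  # b starts at position `shift` of a
            overlap = min(len(a) - shift, len(b))
            dist = hamming_2bit(ea >> (2 * shift), eb, overlap)
        else:  # a starts at position `-shift` of b
            overlap = min(len(a), len(b) + shift)
            dist = hamming_2bit(ea, eb >> (-2 * shift), overlap)
        if overlap < max(min_overlap, 1):
            continue
        key = (dist / overlap, -overlap)
        if best_key is None or key < best_key:
            best, best_key = (shift, dist, overlap), key
    return best
```

For example, `best_shifted_hamming("AAACGTTGCA", "CGTTGCATTT", min_overlap=5)` returns `(3, 0, 7)`: the second read matches the first without mismatches when placed three bases to the right, over an overlap of seven bases. In a pipeline of this shape, `find_candidates` would narrow the dataset to a few reads per query, and only those would be passed to `best_shifted_hamming` before being placed into a gap-free alignment at the reported shift.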


Encyclopedia of Bioinformatics and Computational Biology
Language: en
Pages: 3421
Authors:
Categories: Medical
Type: BOOK - Published: 2018-08-21 - Publisher: Elsevier

Encyclopedia of Bioinformatics and Computational Biology: ABC of Bioinformatics, Three Volume Set combines elements of computer science, information technology,
CUDA for Engineers
Language: en
Pages: 739
Authors: Duane Storti
Categories: Computers
Type: BOOK - Published: 2015-11-02 - Publisher: Addison-Wesley Professional

CUDA for Engineers gives you direct, hands-on engagement with personal, high-performance parallel computing, enabling you to do computations on a gaming-level P
Accelerating Bioinformatics Applications on CUDA-enabled Multi-GPU Systems
Language: en
Pages: 0
Authors: Robin Kobus
Categories:
Type: BOOK - Published: 2023 - Publisher:

A wide range of bioinformatics applications have to deal with a continuously growing amount of data generated by high-throughput sequencing techniques. Exclusiv
Computational Methods for Next Generation Sequencing Data Analysis
Language: en
Pages: 462
Authors: Ion Mandoiu
Categories: Computers
Type: BOOK - Published: 2016-09-12 - Publisher: John Wiley & Sons

Introduces readers to core algorithmic techniques for next-generation sequencing (NGS) data analysis and discusses a wide range of computational techniques and