Interpretable Machine Learning for Scientific Discovery in Regulatory Genomics

Author: Avanti Shrikumar
Release: 2020


All cells in our body have approximately the same DNA sequence, yet different cell types have distinct behavior due to differential expression of genes. This cell-type-specific control of gene expression is governed by regulatory proteins that bind to DNA. Over 90% of disease-associated mutations do not disrupt the DNA sequences of genes, but rather disrupt functions involved in the regulation of gene expression. Unfortunately, conventional computational models can fail to distinguish between mutations that are benign and mutations that are likely to affect regulatory activity. Machine learning offers a solution to this dilemma: by training complex models, including deep learning models, to predict regulatory activity from DNA sequence, we implicitly force the models to learn which sequence features are relevant for regulation. However, our difficulty in interpreting and trusting these models limits our ability to extract novel scientific insights from them. In this thesis, I will present techniques I have developed to address some of these limitations.

I will begin by discussing DeepLIFT, a fast algorithm for calculating example-specific importance scores to explain the predictions of a deep learning model, as well as GkmExplain, an algorithm for efficiently computing importance scores for gapped k-mer support vector machines. I will then describe TF-MoDISco, an algorithm that leverages importance scores produced by an algorithm such as DeepLIFT or GkmExplain to discover recurring patterns learned by the model.

Next, I will discuss two projects on leveraging domain-specific knowledge to improve the performance and interpretability of deep learning models trained on regulatory genomic data. The first project, on reverse-complement parameter sharing, introduces architectures that can account for symmetries inherent in the double-stranded nature of regulatory DNA. The second project, on separable fully-connected layers, introduces a novel parameterization to exploit the fact that positional patterns in DNA binding sites are often shared across different regulatory proteins.

Finally, I will discuss three projects centered on improving the reliability of predictions derived from these models. The first project deals with the situation where a deep learning model trained on regulatory genomic data is leveraged to identify pairs of proteins that have non-additive interaction effects; we demonstrate that looking at the change in the model's prediction loss, rather than simply looking at the change in the predictions, is a far more robust indicator of whether the model's learned interaction effect is likely to be an artifact. The second project presents a state-of-the-art algorithm for improving the model predictions under a type of data distribution shift known as "label shift", where the class proportions in the held-out testing set differ from the class proportions that the model was trained on (this can occur, for example, if a model that is trained to predict diseases given symptoms is deployed in a situation where the prevalence of the disease is far higher than in the data distribution it was trained on). The third project explores the scenario where a model can abstain from making predictions on a subset of examples that it is uncertain of, in order to improve user trust in the predictions on the remaining examples; in this project, we devise a novel and flexible strategy for choosing which examples to abstain on when the goal is to optimize metrics other than simple prediction accuracy, such as the area under the ROC curve or the sensitivity at a target specificity level (such metrics are commonly used in genomics and medicine). Taken together, I hope these methods help pave the way for successful application of advanced machine learning techniques to derive novel scientific insights from regulatory genomic data.
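
To make the "label shift" setting from the abstract concrete, here is a minimal Python sketch of the standard prior-ratio correction. It is a textbook re-weighting, not necessarily the algorithm developed in the thesis. It assumes the classifier's probabilities are calibrated and that the class proportions of the deployment data are already known. The function name `adjust_for_label_shift` and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def adjust_for_label_shift(probs, train_priors, target_priors):
    """Re-weight predicted class probabilities for new class proportions.

    Assumption: `probs` are calibrated posteriors under the training
    distribution, and only the class proportions change at test time.
    """
    probs = np.asarray(probs, dtype=float)
    # Ratio of deployment-time to training-time class proportions.
    weights = np.asarray(target_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    reweighted = probs * weights  # broadcast over rows (one row per example)
    # Renormalize so each row sums to 1 again.
    return reweighted / reweighted.sum(axis=1, keepdims=True)


# Hypothetical scenario: the disease (class 1) made up 10% of the training
# data but 50% of the population where the model is deployed.
probs = np.array([[0.8, 0.2],
                  [0.5, 0.5]])
adjusted = adjust_for_label_shift(probs,
                                  train_priors=[0.9, 0.1],
                                  target_priors=[0.5, 0.5])
print(adjusted)
```

In this toy case the second example moves from an even 0.5/0.5 split to roughly 0.1/0.9 in favor of the disease class, reflecting the higher prevalence at deployment. In practice the deployment proportions are usually unknown and have to be estimated from unlabeled data, which this sketch does not attempt.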


Improving and Leveraging the Interpretability of Deep Neural Networks for Genomics
Language: en
Authors: Alex Michael Tseng
Type: BOOK - Published: 2022 - Publisher:

In recent years, the field of genomics has been characterized by an extraordinary influx of novel high-throughput technologies and techniques. These methods pro
Interpretable Machine Learning Methods for Regulatory and Disease Genomics
Language: en
Authors: Peyton Greis Greenside
Type: BOOK - Published: 2018 - Publisher:

It is an incredible feat of nature that the same genome contains the code to every cell in each living organism. From this same genome, each unique cell type ga
Handbook of Machine Learning Applications for Genomics
Language: en
Pages: 222
Authors: Sanjiban Sekhar Roy
Categories: Technology & Engineering
Type: BOOK - Published: 2022-06-23 - Publisher: Springer Nature

Currently, machine learning is playing a pivotal role in the progress of genomics. The applications of machine learning are helping all to understand the emergi
Machine Learning and Deep Learning in Computational Toxicology
Language: en
Pages: 654
Authors: Huixiao Hong
Categories: Medical
Type: BOOK - Published: 2023-03-11 - Publisher: Springer Nature

This book is a collection of machine learning and deep learning algorithms, methods, architectures, and software tools that have been developed and widely appli