Biomedical data is rapidly increasing in quantity, scope, and generality, expanding opportunities to discover novel biological mechanisms and clinically translatable outcomes. The ENCODE project is generating a vast number of sequencing datasets that measure varied activities across the human genome in many different cell types (Section 3 below). A growing number of disease studies have produced multiple types of high-throughput molecular and imaging data (Section 2). Medical records, now routinely digitized, provide new possibilities to follow and predict patients’ progress in real time (Section 4). Integrating this biomedical big data into a joint probabilistic model could provide a comprehensive understanding of biological mechanisms, novel therapeutic targets, and informative predictive models. However, a greatly enlarged hypothesis space and complex dependencies among variables pose difficult open challenges, and the use of complex statistical models often hurts the interpretability of the results. The development of new computational methods to address these challenges has become an integral part of biomedical research.
Prof. Lee (photographed by Dennis Wise)

Our research seeks to develop machine learning (ML) techniques to extract relevant information from large, heterogeneous biological data, in a diverse range of topics from basic science to disease biology to bedside applications (Sections 1 through 5). We also identify limitations of existing machine learning methods and address them by fundamentally advancing the conventional methods (Section 6). As a result, our methods do not rely on specific datasets and generalize to different biological problems and to applications outside of biology.

From Big Data to Precision Oncology

While targeting key drivers of tumor progression (e.g., BCR/ABL, HER2, and BRAFV600E) has had a major impact in oncology, most patients with advanced cancer continue to receive drugs that do not work in concert with their specific biology.  In fact, resistance to anticancer drugs is the leading cause of death among cancer patients. This is exemplified by acute myeloid leukemia (AML), a disease for which treatments and cure rates (in the range of 20%) have remained stagnant.  Effectively deploying an ever-expanding array of cancer therapeutics holds great promise for improving these rates but requires methods to identify how drugs will affect specific patients.  Cancers that appear pathologically similar often respond differently to the same drug regimens. 

We seek to build an AI system that takes available molecular information, “reasons” about the best possible treatment strategy, and explains its reasoning. This AI system will predict the response to a large number of possible drugs and drug combinations by leveraging existing high-throughput molecular data from many other patients with the same disease, both to improve prediction accuracy and to identify novel disease drivers and therapeutic targets more precisely. To this end, we will develop or apply statistically reliable and theoretically well-founded ML algorithms to efficiently learn complex biological interactions from big data and to produce interpretable prediction results for each patient. See Lee and Celik et al. (in revision, Nature Comm), Lee et al. (ASH/Blood 2014), and Lawrence et al. (Cell Reports 2015).

To build the framework for this vision, we focus on acute myeloid leukemia (AML) primary patient samples and 160 anticancer drugs. The University of Washington (UW) and Fred Hutchinson Cancer Research Center (FHCRC) have a strong research and clinical partnership in AML research, and we have already initiated collaborations across nine of their labs to support this effort, including the labs of Dr. Pamela Becker and Dr. Tony Blau of UW Hematology and UW’s Center for Cancer Innovation, and Dr. Judit Villen of UW Genome Sciences. 

people: Safiye Celik, Scott Lundberg, Benjamin Logsdon, Vivian Oehler, Chris Miller, Judit Villen, Tony Blau, Pamela Becker
collaboration: UW Center for Cancer Innovation, Fred Hutchinson Cancer Research Center, UW Medicine/Hematology
funding: American Cancer Society (PI: Lee), Pfizer (PI: Dr. Vivian Oehler)


Systems Biology of Human Diseases

Inferring a network from high-dimensional ‘omic’ data – such as transcriptomic, proteomic, and epigenomic data – has become a key analysis tool in computational biology. A network inference algorithm takes a matrix of variables (e.g., genes) and samples (patients) as input and infers a network of variables based on their conditional dependencies across samples. A critical limitation is that it often produces too many false positive network edges: the sample size in a single dataset does not provide enough statistical power to identify the true network out of many possible networks. Simply appending datasets from multiple studies to increase the sample size is unlikely to be successful. Because of the heterogeneity of samples (e.g., different disease stages) and the different sets of variables (e.g., different platforms, RNA-seq), the datasets cannot be pooled directly into a single statistical model.

To resolve this challenge, we are building a unique network inference framework to integrate datasets from: (1) heterogeneous samples, (2) different sets of variables, and (3) substantially different sample populations. We do so by jointly learning one or more networks represented by multiple datasets. Below, we briefly describe examples of new biological insights we gained from using our recently developed methods in each of these three categories:
  1. We developed general ML algorithms to identify perturbed variables (nodes) whose network connection with other nodes significantly differs across conditions from high-dimensional data. Our recently developed method, called DISCERN (DIfferential SparsE Regulatory Network), introduces a novel perturbation model, which enables us to incorporate all (thousands of) genes, which was previously impossible and is crucial in genomics applications. DISCERN takes genome-wide expression datasets as input (e.g., cancer vs. normal tissues) and estimates the extent to which each gene is perturbed. Our results provide new insights into a differential network in cancer: Genes identified by DISCERN reveal known driver mutations and are associated with patient prognosis in 3 different types of human cancers – acute myeloid leukemia (AML), breast cancer and lung cancer. DISCERN’s integrative analysis with epigenomic data from the ENCODE and Epigenomic Roadmap projects uncovered both known and previously unknown regulatory mechanisms underlying network perturbation.  See Grechkin et al. PLOS Computational Biology (2016).  
  2. We defined the INSPIRE (INferring Shared modules from multiPle gene expREssion datasets) method to extract a low-dimensional feature representation from multiple expression datasets generated on different microarray or RNA-sequencing platforms. This method enables us to make use of all samples (patients) across datasets from different studies to robustly identify disease drivers. We demonstrated this by applying INSPIRE to 9 ovarian cancer datasets from 1,500 patients, which led to the identification of HOPX as a potential driver for tumor-associated stroma and a novel molecular basis for tumor resectability. This means that HOPX can be a therapeutic target in neoadjuvant therapy for shrinking the tumor before surgery. This work was recently published in Celik et al. Genome Medicine (2016) and featured in Science Translational Medicine.
  3. We defined a new category of tumor drivers in cancer genome evolution: ‘selected expression regulators’ (SERs) – genes driving dysregulated transcriptional programs in cancer evolution. The SERs are identified from genome-wide tumor expression data with our novel ML method, called SPARROW (SPARse selected expRessiOn regulators identified With penalized regression). The identified SERs reveal driver mutations in multiple human cancers, known cancer-associated processes, and survival-associated genes better than popular network inference methods. Our results indicate that SPARROW offers a powerful complementary approach to identifying genes containing cancer driver events that are difficult to detect from sequence data due to both a large number of passenger mutations and the lack of a sufficiently large sample size. We identified a new molecular marker, PYCARD, for obatoclax in AML, which was published in Logsdon et al. Nucleic Acids Research (2015).
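The idea in item 1 of scoring how strongly each gene's network connections differ between two conditions can be illustrated with a deliberately simplified sketch. This is not the DISCERN algorithm: it compares plain pairwise correlations instead of fitting a sparse perturbation model, and the gene names (`HUB`, `G1`, `G2`, `G3`), the simulated data, and the function `perturbation_scores` are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def perturbation_scores(cond_a, cond_b):
    """Score each gene by how much its correlations with all other genes
    change between two conditions (dicts mapping gene -> expression list)."""
    genes = list(cond_a)
    scores = {}
    for g in genes:
        scores[g] = sum(
            abs(pearson(cond_a[g], cond_a[h]) - pearson(cond_b[g], cond_b[h]))
            for h in genes
            if h != g
        )
    return scores


rng = random.Random(1)
n_samples = 2000


def noise():
    return [rng.gauss(0, 1) for _ in range(n_samples)]


# Simulated "normal" condition: HUB drives G1 and G2; G3 is unrelated.
hub = noise()
normal = {
    "HUB": hub,
    "G1": [h + e for h, e in zip(hub, noise())],
    "G2": [h + e for h, e in zip(hub, noise())],
    "G3": noise(),
}
# Simulated "tumor" condition: HUB no longer drives any gene.
tumor = {"HUB": noise(), "G1": noise(), "G2": noise(), "G3": noise()}

scores = perturbation_scores(normal, tumor)
most_perturbed = max(scores, key=scores.get)
print(most_perturbed, {g: round(s, 2) for g, s in scores.items()})
```

In this toy setting the gene that loses its regulatory role receives the largest score, while the unrelated gene scores near zero. A genome-wide method has to handle thousands of genes and far fewer samples, which is where sparsity-inducing models become necessary.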
people: Safiye Celik, Benjamin Logsdon, Stephanie Battle, Charles Drescher, Mara Rendi, David Hawkins, Andrew Gentles.
funding: NSF ABI (PI: Lee), American Cancer Society (PI: Lee), STTR Transformative Research Grant (PI: Lee).

A machine learning approach to identify Alzheimer’s disease therapeutic targets
Currently, there is no known cure for Alzheimer’s disease (AD) and no treatment to reverse or halt its progression. The growing availability of expression data from AD patients holds great promise for identifying genes whose expression drives the disease. A traditional approach is to perform a genome-wide association analysis to identify genes whose expression levels are predictive of each phenotype. However, the high dimensionality of the data makes it difficult to avoid false positive associations, and the identified gene-phenotype associations alone do not provide systems-level insights into how the phenotype is modulated. With Dr. Paul Crane (Internal Medicine) and Dr. Dirk Keene (Neuropathology) at the UW AD Research Center, we are working on a unique method that reduces the data dimensionality to a set of genes representing important molecular events in AD progression, inferred from large amounts of existing data from AD patients. We focus on ‘resilience’, aiming to understand why AD neuropathology is found at autopsy in about 30% of cognitively normal older individuals. The mechanisms we identify as protective against AD neuropathology can lead to new therapeutic targets.

collaboration: UW CSE, UW Genome Sciences, UW Medicine, Group Health, Sage Bionetworks, Allen Institute for Brain Science
funding: NIH NIA (PI: Lee)


Understanding of Genome Regulation

Regulatory factors – such as transcription factors (TFs) and histone modifications – co-localize in the genome and interact with each other to regulate gene expression, the physical structure of the genome, and many other cellular processes. Identifying these interactions, which we refer to as the chromatin network, is crucial to the understanding of genome regulation. To infer the chromatin network, we can use ChIP-seq datasets to find the factors that co-localize. Co-localization may indicate that two factors interact directly by forming a complex or functionally by regulating similar DNA targets. However, identifying co-localization alone fails to distinguish direct interactions from indirect links. We instead focus on conditional dependence, which measures co-localization after accounting for information provided by other factors. Inferring a conditional dependence network removes indirect edges from the network. Since incorporating more ChIP-seq datasets can remove more indirect edges, we propose to incorporate all 1,451 currently available ENCODE ChIP-seq datasets.
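The difference between co-localization and conditional dependence can be made concrete with a small simulation. The sketch below assumes three hypothetical factors whose binding signals form a chain (A influences B, B influences C) and uses the standard three-variable partial correlation formula; the signals, the number of genomic bins, and the function names are illustrative assumptions, not part of our pipeline.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def partial_corr(x, z, given):
    """Correlation between x and z after accounting for a third variable."""
    r_xz = pearson(x, z)
    r_xg = pearson(x, given)
    r_zg = pearson(z, given)
    return (r_xz - r_xg * r_zg) / math.sqrt((1 - r_xg ** 2) * (1 - r_zg ** 2))


# Simulated binding signals over 5,000 genomic bins: A -> B -> C.
rng = random.Random(0)
factor_a = [rng.gauss(0, 1) for _ in range(5000)]
factor_b = [a + rng.gauss(0, 1) for a in factor_a]
factor_c = [b + rng.gauss(0, 1) for b in factor_b]

marginal_ac = pearson(factor_a, factor_c)
conditional_ac = partial_corr(factor_a, factor_c, factor_b)
print(f"co-localization of A and C: {marginal_ac:.2f}")
print(f"after conditioning on B:    {conditional_ac:.2f}")
```

A and C co-localize strongly even though they never interact directly. Once B is accounted for, the dependence between A and C essentially vanishes, so a conditional dependence network would keep the A-B and B-C edges and drop the indirect A-C edge.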

Learning a conditional dependence network from large collections of ChIP-seq data involves two key challenges. 1) It is computationally very intensive. 2) Some ChIP-seq datasets are highly correlated with each other (e.g., TFs forming a complex; the same TF measured in different conditions). Standard methods often learn edges only among these highly correlated variables and weakly connect them to the rest of the network, obscuring dependence. Incorporating more ChIP-seq datasets would exacerbate this problem; however, arbitrarily removing or merging datasets can eliminate important information.

We are developing a series of new ML methods and algorithms with three goals: 1) to estimate the chromatin network based on all available ENCODE ChIP-seq data; 2) to  jointly infer context-specific chromatin networks and the associated genomic regions; and 3) to learn a conserved chromatin network across species (human, mouse, fly, worm) and predict factor interactions even when the factors are not measured in the species of study.

Recently, we introduced the group graphical model, which learns a conditional dependence network among groups of variables and individual variables to resolve challenges with highly correlated variables. We applied the group graphical model to all 1,451 available ENCODE ChIP-seq datasets and showed that it reveals previously known protein interactions better than alternative methods for inferring the chromatin network, such as pairwise correlation, partial correlation, and the inverse covariance matrix. We showed that jointly learning a network across all cell types greatly increases the scope of possible interactions, which leads to a significantly higher fold enrichment for known protein interactions compared to learning cell type-specific networks separately. In collaboration with Dr. Linda Penn and Dr. Brian Raught at the University of Toronto, we experimentally validated the novel MYC-HCFC1 interaction. Our network can be navigated using an interactive visualization tool developed by our lab.

people: Scott Lundberg, Nao Hiranuma, William Tu, Brian Raught, Bill Noble, Linda Penn, Michael Hoffman
collaboration: University of Toronto
funding: NSF CAREER (PI: Lee)


Bringing Machine Learning to the Operating Room

About 50 million surgeries are performed in the U.S. every year, with an annual cost in the hundreds of billions of dollars. Postoperative complications from surgery and anesthesia (e.g., wound infection, respiratory failure) collectively occur in up to 40% of patients. With Dr. Jerry Kim and Dr. Bala Nair of UW Anesthesiology, we developed the Prescience system, based on an ML technique, to predict whether patients will experience a respiratory crisis such as oxygen desaturation (SaO2 < 92%) during surgery, based on pre- and intra-operatively collected data.

In a clinical application, understanding why a model made a certain prediction is crucial, because interpretable predictions engender appropriate trust and provide insight into what actions need to be taken. However, accuracy and interpretability are two important goals that are often in opposition: complex models tend to be more accurate than simple ones (e.g., a linear model) but harder to interpret. Prescience provides not only the chance of oxygen desaturation but also explanations, i.e., which features contributed to the prediction, which makes the prediction more interpretable and more actionable. To do so, we used a novel machine learning approach that we recently developed (Lundberg et al. (NIPS workshop 2016)).
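To illustrate what "which features contributed to the prediction" can mean in practice, the sketch below computes exact Shapley values, one common way to split a prediction into per-feature contributions. The text above does not specify the attribution method used in Prescience, so treating it as Shapley-style is an assumption made for illustration only; the risk function `toy_risk`, its feature names, and its coefficients are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, sample, baseline):
    """Exact Shapley attributions for one sample.

    Features outside a coalition are set to their baseline value.
    The cost grows as 2^n, so this only suits a handful of features.
    """
    n = len(sample)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without_i = [
                    sample[j] if j in coalition else baseline[j] for j in range(n)
                ]
                with_i = list(without_i)
                with_i[i] = sample[i]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


# Hypothetical risk score with an interaction term (NOT the Prescience model).
def toy_risk(features):
    spo2_drop, bmi_excess, age_excess = features
    return 0.5 * spo2_drop + 0.2 * bmi_excess + 0.1 * spo2_drop * age_excess


sample = [2.0, 1.0, 3.0]    # the patient being explained
baseline = [0.0, 0.0, 0.0]  # reference values for "feature absent"
attributions = shapley_values(toy_risk, sample, baseline)
for name, value in zip(["spo2_drop", "bmi_excess", "age_excess"], attributions):
    print(f"{name}: {value:+.2f}")
```

The attributions sum to the difference between the prediction for the patient and the prediction at the baseline, so a clinician can read them as a breakdown of the predicted risk. The interaction term is shared equally between the two features involved.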

We use real-time features that are collected in a typical operating room setting. We also extract fixed values (e.g., age, BMI) and word occurrences in physicians’ notes from pre-operative medical charts. We used gradient boosted trees and ~2,000,000 training samples to predict desaturation up to 5 minutes before the event. The Prescience system uses a new ML method to make the prediction more interpretable to help doctors make informed decisions. We have shown that Prescience has substantially better prediction accuracy than four anesthesiologists who were given the exact same set of pre- and intra-operative data. It achieved a 35% increase in area under the precision-recall curve compared to the best performance of any doctor (Lundberg et al.).
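For readers unfamiliar with the metric, the area under the precision-recall curve can be estimated as average precision: the mean of the precision values at each rank where a true event is retrieved. The sketch below is a minimal pure-Python version with made-up labels and scores; it is not the evaluation code used for Prescience, and reading the reported gain as a relative increase is an assumption.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    labels: 1 for an event (e.g., desaturation), 0 otherwise.
    scores: predicted risk, higher means more likely.
    Tied scores keep their input order.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / total_pos


# Made-up example: five time windows, two of which precede a desaturation.
labels = [1, 0, 1, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.2, 0.1]
baseline_scores = [0.3, 0.9, 0.2, 0.8, 0.1]

ap_model = average_precision(labels, model_scores)
ap_baseline = average_precision(labels, baseline_scores)
relative_gain = (ap_model - ap_baseline) / ap_baseline
print(f"model AP={ap_model:.3f}, baseline AP={ap_baseline:.3f}, "
      f"relative gain={relative_gain:.0%}")
```

Precision-recall analysis is a natural choice when events are rare, as desaturations are, because it is not inflated by the large number of uneventful time points.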

collaboration: UW Medicine/Anesthesiology, Harborview Medical Center
funding: eScience/ITHS Seed Grant (PIs: Kim, Lee)


Understanding Yeast Morphology Phenotypes

To survive in stressful environments, yeast selectively bind to each other and to substrates in cohesive groups called yeast biofilms. The ability of yeast to form biofilms affects myriad processes relevant to human lives. In brewing, yeast biofilms act as natural filters. However, they can complicate biofuel production and can help pathogenic yeast remain virulent on hospital surfaces. In the laboratory, yeast biofilms create significant challenges for classic experiments. Although biofilm-forming yeasts exhibit incredible phenotypic diversity attributable to genetic variation, the current understanding remains incomplete. The Yeast Research Center (YRC) at the UW recently generated data measuring over 14,000 transcript, protein, metabolite, and morphological traits and biofilm phenotypes in 22 sequenced strains of S. cerevisiae (Scer). We collaborate with Prof. Dunham's lab to better understand the biofilm phenotypes by developing novel ML techniques to analyze these data.
collaboration: UW Genome Sciences
funding: NSF ABI (PI: Lee)

Advancing Machine Learning Techniques

In each of the problems described above, we try to identify limitations of existing ML methods and address them by fundamentally advancing existing methods. This approach has the following advantages:

First, this approach makes our methods generalizable to different biological problems in various diseases, so that they do not rely on specific problems or datasets.
More and more biomedical research problems involve high-dimensional data, large amounts of data, and multiple heterogeneous datasets. The methods we developed to address these challenges, such as new structured sparsity priors, an efficient optimization method (Grechkin et al. (AAAI 2015)), a new network model with a learning algorithm (Hosseini et al. (NIPS 2016), Celik et al. (ICML 2014), Celik et al. Genome Medicine (2016)), and a method to make predictions more interpretable (Lundberg et al. (NIPS workshop 2016)), are general ML techniques that can be applied to problems even beyond biomedical applications.

Second, developing new ML methods allows us to address new biological questions and problems. For example, the new method for identifying conditional dependencies even when many variables are highly correlated (Section 3) allows us to identify a new type of interaction between regulatory factors on the genome. The new ML method for identifying perturbed nodes (Section 2.1) allows us to address a novel biological question: “Which genes generate large network topology changes over the course of cancer progression?” This can lead to the discovery of a new type of tumor driver. Making the prediction outcome for each sample more interpretable (Section 4) opens new ways for doctors to support patients’ progress.

people: Safiye Celik, Scott Lundberg, Javad Hosseini, Benjamin Logsdon, Daniela Witten, Maryam Fazel, Kean Ming Tan, Karthik Mohan, Palma London

collaboration: UW CSE, UW Genome Sciences, UW EE, UW Biostatistics, UW Statistics
funding: NSF ABI (PI: Lee), UW's Royalty Research Fund (PIs: Witten, Fazel, Lee)

Past Research