Opening the Black Box of Machine Learning Models

Biomedical data are increasing rapidly in quantity, scope, and generality, expanding opportunities to discover novel biological processes and clinically translatable outcomes. The ENCODE (Encyclopedia of DNA Elements) project is generating myriad sequencing datasets that measure diverse activities across the human genome in many different cell types. Growing numbers of disease studies are producing multiple types of high-throughput molecular and imaging data. Medical records, now routinely digitized, make it possible to track and predict patients’ progress in real time.

Machine learning (ML), a key technology for addressing these changing dynamics in modern biology, aims to infer meaningful interactions among variables by learning their statistical relationships from data consisting of measurements on variables across samples. Accurate inference of such interactions from big biological data would provide a tremendous opportunity to generate novel biological discoveries, therapeutic targets, and predictive models of patient outcomes. However, a greatly enlarged hypothesis space and complex dependencies among variables pose open challenges. To meet these challenges, we have developed rigorous, principled ML techniques that infer reliable, accurate statistical relationships from data in a variety of network inference problems, pushing the boundaries of both ML1–8 and biology.7–11
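As a concrete illustration of inferring statistical relationships from sample-by-variable data, the sketch below estimates a partial-correlation network from simulated data with a known chain structure. All variable names, coefficients, and sample sizes are hypothetical; real network inference methods such as the graphical lasso additionally impose sparsity-inducing regularization, which this toy version omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 samples x 4 variables with a known chain
# structure A -> B -> C, plus an independent variable D (all names and
# parameters are illustrative, not from any real dataset).
n = 500
a = rng.normal(size=n)
b = 0.8 * a + rng.normal(scale=0.5, size=n)
c = 0.8 * b + rng.normal(scale=0.5, size=n)
d = rng.normal(size=n)
X = np.column_stack([a, b, c, d])

# Invert the empirical covariance to obtain the precision matrix; its
# rescaled off-diagonal entries are partial correlations, which are
# nonzero only for directly interacting variable pairs.
prec = np.linalg.inv(np.cov(X, rowvar=False))
dpc = np.sqrt(np.diag(prec))
partial_corr = -prec / np.outer(dpc, dpc)

# A and C are marginally correlated (through B), but conditionally
# near-independent given B: their partial correlation is close to 0.
marginal_ac = np.corrcoef(X[:, 0], X[:, 2])[0, 1]
print(f"marginal corr(A, C) = {marginal_ac:.2f}")
print(f"partial corr(A, C)  = {partial_corr[0, 2]:.2f}")
print(f"partial corr(A, B)  = {partial_corr[0, 1]:.2f}")
```

The distinction between marginal and partial correlation is exactly what separates mere association from a direct statistical interaction in a network model.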

Fundamental limitations of current ML techniques leave many future opportunities to translate inferred statistical relationships into biological knowledge, as exemplified in a standard biomarker discovery problem. Assume that we seek meaningful links between gene expression levels and phenotypes of interest (Fig 1). The current approach attempts to find a set of features that best predict a phenotype.12–15 Unfortunately, false positive markers are very common, as evidenced by the low rates of replication in independent data and of translation into clinical practice (far less than 1%). Current ML techniques face three fundamental limitations:16–18

  1. The high dimensionality of data (i.e., the number of features exceeds the number of samples), hidden variables, and feature correlations create a discrepancy between predictability (observed associations) and true biological interactions (Fig 1). Assume an unobserved disease subtype D is a true marker for a phenotype Y1. D is correlated with XA, XB, and XC in the training data only because the subtype is prevalent in a specific population. This yields false positive associations that often cannot be replicated (Fig 1C). Similarly, suppose XE, a disease driver, is a true marker for Y3 and regulates many other genes (XA, XD, and XB) that will be indirectly associated with Y3. We need new feature selection criteria that make the model explain, rather than simply predict, the phenotype. Example paper: Lee and Celik et al. (Nature Communications, 2018)
  2. Due to disease heterogeneity, markers that work in one population may not work in another, which creates complex relationships between Xs and Y. Complex models (e.g., deep learning or ensemble models) describe relationships among genes Xs and a phenotype Y more accurately than simpler, linear models, but they lack interpretability and are considered black boxes (Fig 2). This applies to general prediction problems in precision medicine, where the goal is to predict a patient outcome Y from the patient's features Xs. When a model makes a prediction for a particular patient, we want to know why, yet the more accurate model is often not interpretable (Fig 2B). We need a way to make interpretable predictions from these models by estimating each feature's contribution to an individual patient's prediction (Fig 2C). Example papers: Lundberg and Lee (NIPS 2017), Lundberg et al. (In Revision Nature BME)
  3. Fundamentally, marker identification is observational research, which cannot establish causal relations. We need a systematic active learning approach for incorporating the results of interventional experiments.
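The spurious-association failure mode in limitation 1 can be reproduced in a few lines of simulation. All names, effect sizes, and cohorts below are hypothetical: a gene correlated with a hidden subtype looks like a strong marker in the training population, but the association vanishes in an independent cohort where that subtype-gene correlation is absent.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scenario from limitation 1: an unobserved disease subtype D
# drives the phenotype Y1, and in the training population D happens to be
# correlated with the expression of a gene XA (all names illustrative).
n = 300
d_train = rng.binomial(1, 0.5, size=n)             # hidden subtype
xa = d_train + rng.normal(scale=0.7, size=n)       # tracks D in this cohort
y1 = 2.0 * d_train + rng.normal(scale=0.5, size=n)

# XA looks like a strong marker in the training data...
r_train = np.corrcoef(xa, y1)[0, 1]

# ...but in an independent cohort XA no longer tracks the subtype, so the
# observed association disappears and the "marker" fails to replicate.
d_test = rng.binomial(1, 0.5, size=n)
xa_test = rng.normal(scale=0.7, size=n)            # independent of D here
y1_test = 2.0 * d_test + rng.normal(scale=0.5, size=n)
r_test = np.corrcoef(xa_test, y1_test)[0, 1]

print(f"train corr(XA, Y1) = {r_train:.2f}")   # strong
print(f"test  corr(XA, Y1) = {r_test:.2f}")    # near zero
```

Note that no amount of additional training data fixes this: the confounding is a property of the training population, not of the sample size.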

Figure 1. (A) The conventional way to identify molecular markers. (B) Discrepancy between true interactions and observed associations. (C) Validation of significant associations in independent data.

Figure 2. (A) Linear models are easy to interpret. (B) Complex models are often considered to be black boxes. (C) Individualized explanations for a particular prediction.
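The individualized explanations sketched in Fig 2C can be made concrete with Shapley values, the game-theoretic attribution principle underlying methods such as SHAP (Lundberg and Lee, NIPS 2017). The brute-force implementation below is a minimal sketch on a hypothetical three-feature model; practical tools use fast approximations, and the treatment of "missing" features by substituting a background value is one simple convention among several.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley attribution of f(x) relative to a background sample.

    Features absent from a coalition are replaced by their background
    values (a simple convention for handling 'missing' features).
    """
    p = len(x)
    phi = np.zeros(p)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        for size in range(p):
            for s in combinations(others, size):
                # Classic Shapley coalition weight |S|! (p - |S| - 1)! / p!
                weight = factorial(size) * factorial(p - size - 1) / factorial(p)
                with_i = background.copy()
                with_i[list(s) + [i]] = x[list(s) + [i]]
                without_i = background.copy()
                without_i[list(s)] = x[list(s)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

# Toy "black box": a nonlinear model over three features (illustrative only).
def model(z):
    return 2.0 * z[0] + z[1] * z[2]

x = np.array([1.0, 2.0, 3.0])        # the patient to explain
background = np.zeros(3)             # reference input

phi = shapley_values(model, x, background)

# Attributions sum to f(x) - f(background) (the efficiency property).
print("attributions:", phi)
print("f(x) - f(bg) =", model(x) - model(background))
```

The interaction term z1*z2 is split evenly between features 1 and 2 by symmetry, while feature 0 receives exactly its additive effect, which is the behavior that makes Shapley-based attributions attractive for per-patient explanations.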

To address these problems, we develop machine learning techniques for learning interpretable models from data by:

  • (1) learning interpretable feature representation using prior knowledge,
  • (2) making interpretable predictions with explanations, and
  • (3) validating and refining predictions through interventional experiments.

The ML approach we will develop for each challenge constitutes an independent framework that focuses on a different aspect of model interpretability and can therefore be applied independently to various problems. We will demonstrate their effectiveness across a wide range of topics, from basic science to bedside applications.

Ongoing Projects

  • Enabling precision cancer medicine and drug development (with Hematology and the Center for Cancer Innovation)
  • Seeking a cure for Alzheimer's disease (with Pathology, Neuropathology, and Internal Medicine)
  • Bringing ML to operating rooms (with Anesthesiology & Pain Medicine)
  • Enabling efficient pre-hospital prediction for trauma patients (with Emergency Medicine)
  • Predicting kidney diseases (with the Kidney Research Institute)
  • Making medical examinations more efficient (with the School of Dentistry and Global Health)
  • Developing interpretable deep learning techniques for genomic, multi-omic, and protein structure data
  • Developing interpretable ML principles, techniques, and theories


1. Tan, K. M., London, P., Mohan, K., Lee, S.-I., Fazel, M. & Witten, D. Learning Graphical Models With Hubs. J. Mach. Learn. Res. 15, 3297–3331 (2014).

2. Mohan, K., Chung, M., Han, S., Witten, D., Lee, S.-I. & Fazel, M. in NIPS Neural Inf. Process. Syst. 25 629–637 (2012).

3. Mohan, K., London, P., Fazel, M., Witten, D. & Lee, S.-I. Node-based learning of multiple Gaussian graphical models. J. Mach. Learn. Res. 15, 445–488 (2014).

4. Grechkin, M., Fazel, M., Witten, D. & Lee, S.-I. Pathway Graphical Lasso. in AAAI Conf. Artif. Intell. (2015).

5. Celik, S., Logsdon, B. A. & Lee, S.-I. Efficient Dimensionality Reduction for High-Dimensional Network Estimation. in ICML Int. Conf. Mach. Learn. (2014).

6. Celik, S., Logsdon, B. A. & Lee, S.-I. Sparse Estimation of Module Gaussian Graphical Models with Applications to Cancer Systems Biology. in NIPS Work. ML Comput. Biol. (2013). Acceptance rate: 20.45%.

7. Lundberg, S. M., Tu, W. B., Raught, B., Penn, L. Z., Hoffman, M. M. & Lee, S.-I. ChromNet: Learning the human chromatin network from all ENCODE ChIP-seq data. Genome Biol. 17, 82 (2016).

8. Celik, S., Logsdon, B. A., Battle, S., Drescher, C., Rendi, M., Hawkins, R. D. & Lee, S.-I. Extracting a low-dimensional description of multiple gene expression datasets reveals a potential driver for tumor-associated stroma in ovarian cancer. Genome Med. 8, 66 (2016).

9. Logsdon, B. A., Gentles, A. J., Miller, C. P., Blau, C. A., Becker, P. S. & Lee, S.-I. Sparse expression bases in cancer reveal tumor drivers. Nucleic Acids Res. (2015). doi:10.1093/nar/gku1290

10. Lee, S.-I., Celik, S., Logsdon, B. A., Lundberg, S. M., Martins, T. J., Oehler, V., Estey, E. H., Miller, C. P., Chien, S., Saxena, A., Blau, C. A. & Becker, P. S. An integrative framework for prioritizing candidate molecular markers reveals a novel driver for sensitivity to topoisomerase inhibitors. Nat. Commun. In Minor Revision

11. Grechkin, M., Logsdon, B. A., Gentles, A. J. & Lee, S.-I. Identifying Network Perturbation in Cancer. PLoS Comput. Biol. 12, e1004888 (2016).

12. Garnett, M. J. et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 483, 570–5 (2012).

13. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–7 (2012).

14. Heiser, L. M., et al. Subtype and pathway specific responses to anticancer compounds in breast cancer. Proc. Natl. Acad. Sci. U. S. A. 109, 2724–9 (2012).

15. Lawrence, R. T., Perez, E. M., Hernández, D., Miller, C. P., Haas, K. M., Irie, H. Y., Lee, S.-I., Blau, C. A. & Villén, J. The Proteomic Landscape of Triple-Negative Breast Cancer. Cell Rep. 11, 630–44 (2015).

16. Kern, S. E. Why your new cancer biomarker may never work: recurrent patterns and remarkable diversity in biomarker failures. Cancer Res. 72, 6097–101 (2012).

17. Hanash, S. M. Why have protein biomarkers not reached the clinic? Genome Med. 3, 66 (2011).

18. Ransohoff, D. F. Opinion: Bias as a threat to the validity of cancer molecular-marker research. Nat. Rev. Cancer 5, 142–149 (2005).