EMBARKER: A hierarchical Bayesian approach empowering big data with prior knowledge for expression marker discovery and its application to Alzheimer’s disease

Safiye Celik, Josh C. Russell, Cezar R. Pestana, Ting-I Lee, Shubhabrata Mukherjee, Paul K. Crane, C. Dirk Keene, Jennifer F. Bobb, Matt Kaeberlein*, Su-In Lee*

* Corresponding authors (suinlee@cs.washington.edu and kaeber@uw.edu)

Demonstration of how EMBARKER works on a single dataset:

Demonstration of how EMBARKER works in a meta-analysis setting:

Identifying meaningful associations between gene expression levels and disease phenotypes is a fundamental task in biomedical research. However, the high dimensionality of expression data (i.e., the number of genes being much larger than the number of samples) increases false positive findings that cannot be reproduced in other studies. Using samples from different studies focusing on the same disease can potentially reduce the dimensionality, but it is not obvious how to use them together because different studies are usually not “synchronized”. Moreover, the results that most existing tools produce (i.e., p-values for tens of thousands of genes) are not biologically interpretable. To resolve these challenges, we propose a general computational framework, EMBARKER, which introduces two computational innovations: (1) incorporating gene-pathway membership information to alleviate the high dimensionality of the data and improve the interpretability of results, and (2) applying this approach to multiple datasets in an intuitive way that increases statistical power while accounting for the heterogeneity across the datasets. We compare EMBARKER to 15 state-of-the-art approaches using a total of 43 genome-wide gene expression datasets and 55 disease phenotypes in a wide variety of association problems ranging from cancer to Alzheimer’s disease (AD). We demonstrate that EMBARKER leads to a dramatic improvement in the statistical robustness of the identified expression markers. Most notably, we apply EMBARKER to 1,742 human brain tissue samples from 9 brain regions from three studies, which is, to our knowledge, the largest expression meta-analysis for AD. This application leads to a successful in vivo validation of identified markers of Aβ toxicity tolerance in a transgenic Caenorhabditis elegans model expressing AD-associated Aβ. This finding suggests mitochondrial Complex I as a critical mediator of proteostasis and a promising pharmacological avenue toward treating AD.
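To make the false-positive problem described above concrete, here is a minimal Python sketch. It is not part of EMBARKER and is unrelated to the R implementation; it only simulates the setting of many genes and few samples. The function names (`pearson_r`, `count_null_hits`), the data sizes, and the use of a simple per-gene correlation test are all assumptions chosen for illustration. Every simulated gene is pure noise, yet about 5% of them are expected to pass a nominal p < 0.05 threshold by chance.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_null_hits(n_genes=2000, n_samples=20, t_crit=2.101, seed=0):
    """Count noise-only genes that look 'significant' at p < 0.05.

    Hypothetical setup: expression values and the phenotype are independent
    standard normal draws, so every hit is a false positive.
    t_crit is the two-sided 5% critical value of Student's t with
    n_samples - 2 = 18 degrees of freedom; change it if n_samples changes.
    """
    rng = random.Random(seed)
    phenotype = [rng.gauss(0, 1) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_genes):
        expression = [rng.gauss(0, 1) for _ in range(n_samples)]
        r = pearson_r(expression, phenotype)
        # Convert the correlation to a t-statistic with n_samples - 2 df.
        t = r * math.sqrt((n_samples - 2) / (1 - r * r))
        if abs(t) > t_crit:
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_null_hits()
    print(f"{hits} of 2000 unrelated genes pass p < 0.05 by chance")
```

With tens of thousands of genes, the same 5% rate yields on the order of a thousand spurious markers per study, which is the reproducibility problem that the pathway prior and the meta-analysis setting are meant to address.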

Application of EMBARKER to 1,742 brain tissue samples from 9 brain regions from three AD studies:

An R implementation of the EMBARKER algorithm can be found here:


An example run with randomly generated expression and phenotype data: ExampleRun.R