AIControl

AIControl: Replacing matched control experiments with machine learning improves ChIP-seq peak identification

Naozumi Hiranuma, Scott Lundberg, Su-In Lee

Abstract

Motivation: Identifying the binding sites of regulatory proteins has been one of the central problems in molecular biology. The most commonly used experimental technique to determine binding locations of transcription factors is chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq). Because ChIP-seq is highly susceptible to background noise, the current practice is to obtain one matched “control” ChIP-seq dataset and estimate position-wise background distributions using ChIP-seq signals from nearby positions (e.g., within 10,000 bp). This introduces four problems: 1) Incorporating such a large window of nearby positions may result in inaccurate estimation of position-specific background distributions. 2) There are multiple ways to obtain control ChIP-seq datasets, and current peak calling methods require users to use only one control dataset. 3) One matched control dataset may not capture all sources of noise signals. 4) Generating a matched control dataset incurs additional time and cost.

Methods: We introduce the AIControl framework, which replaces matched control experiments by automatically learning weighted contributions of a large number of publicly available control ChIP-seq datasets to generate position-specific background noise distributions. This helps us avoid the cost of running control experiments while simultaneously increasing accuracy. Specifically, AIControl can 1) obtain a precise position-specific background distribution (i.e., no need to use a window), 2) use machine learning to systematically select the most appropriate set of control datasets in a data-driven way, 3) capture noise sources that may be missed by one matched control, and 4) remove the need for matched control experiments.

Results: We applied AIControl to 410 ChIP-seq datasets from the ENCODE project whose transcription factors have motif information and that are from tier 1 or 2 cell lines. AIControl used 445 control ChIP-seq datasets across 107 cell lines and 9 laboratories to estimate background ChIP-seq noise signals. The peaks identified by AIControl without using matched control datasets were more enriched for putative binding sites than the peaks identified by other popular peak callers that use a matched control dataset. Additionally, AIControl significantly reduced the reproducibility between two ChIP-seq datasets whose associated transcription factors have no documented interactions, which suggests that AIControl better removes confounding effects. Finally, we demonstrated that AIControl improves the quality of downstream analysis by showing that binding sites identified by AIControl recover documented protein interactions more accurately.

Conclusion: AIControl removes the need to generate additional matched control data and provides more accurate prediction of the binding events even when testing on cell lines that do not exist in the ENCODE control datasets.
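
The Methods paragraph above describes learning weighted contributions of many control datasets to form a position-specific background. A minimal sketch of that idea follows, assuming the weights are fit by ridge-regularized least squares on binned read counts; the page does not state the estimator, although the XtX matrices listed under Data are consistent with a least-squares fit. The function names, the `ridge` parameter, and the synthetic data are all hypothetical and are not taken from the AIControl code. The sketch is written for Python 3 with NumPy, not the Python 2.7 used by the repository's visualization notebooks.

```python
import numpy as np


def fit_control_weights(controls, target, ridge=1e-3):
    """Fit one weight per control dataset by ridge-regularized least squares.

    controls: array of shape (n_bins, n_controls), binned control signals.
    target:   array of shape (n_bins,), binned signal of the IP dataset.
    ridge:    hypothetical regularization strength (an assumption).
    """
    X = np.asarray(controls, dtype=float)
    y = np.asarray(target, dtype=float)
    xtx = X.T @ X  # depends only on the controls, so it can be precomputed
    xty = X.T @ y  # depends on the IP dataset being analyzed
    return np.linalg.solve(xtx + ridge * np.eye(X.shape[1]), xty)


def estimate_background(controls, weights):
    """Weighted sum of control signals per bin, clipped at zero."""
    return np.clip(np.asarray(controls, dtype=float) @ weights, 0.0, None)


# Synthetic demonstration: 4 made-up control tracks over 2000 bins.
rng = np.random.default_rng(0)
controls = rng.poisson(5.0, size=(2000, 4)).astype(float)
true_weights = np.array([0.6, 0.0, 0.3, 0.1])
target = controls @ true_weights  # noise-free target built from the controls

weights = fit_control_weights(controls, target, ridge=1e-6)
background = estimate_background(controls, weights)
```

Because `xtx` involves only the control datasets, it can be computed once and reused for every new IP dataset, which may be why precomputed XtX matrices are distributed with the project.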


Implementation

An implementation of the AIControl algorithm in Julia 0.5 is available at https://github.com/suinleelab/AIControl (Academic Use Only). The accompanying evaluation pipeline (in Julia 0.5) and visualization code (in Python 2.7) are in the same repository as Jupyter notebooks. The Anaconda Python Distribution includes all modules required by the Python code. The Julia code requires PureSeq.jl, DataFrames.jl, JLD.jl, and Distributions.jl.


Data

  • Binned Control dataset: AIControl uses a pool of publicly available control datasets to identify peak locations. 455 control datasets were obtained from the ENCODE database and preprocessed by binning into 100 bp windows (.fbin100 and .rbin100 files). Users may use the AIControl preprocessing code to create finer-scaled signals.
  • Encode ID List: The list of ENCODE IDs of the IP and control datasets used in this project.
  • xtxs.zip: A zip file containing XtX matrices necessary for AIControl.
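
To illustrate the 100 bp binning mentioned in the first item above, here is a minimal sketch that counts read start positions per fixed-width window. It assumes 0-based positions and one count track per strand (the .fbin100 and .rbin100 names suggest forward and reverse strands, but the page does not describe the file format). The function `bin_reads` is hypothetical, and the sketch does not read or write the actual AIControl files.

```python
BIN_SIZE = 100  # window width in base pairs, as stated for the binned datasets


def bin_reads(positions, chrom_length, bin_size=BIN_SIZE):
    """Count reads per fixed-width bin along one chromosome.

    positions:    iterable of 0-based read start positions (an assumption).
    chrom_length: chromosome length in base pairs.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in positions:
        if 0 <= pos < chrom_length:  # ignore positions off the chromosome
            counts[pos // bin_size] += 1
    return counts


# Toy example on a 1,000 bp chromosome, one track per strand.
forward = bin_reads([5, 99, 100, 250, 251], chrom_length=1000)
reverse = bin_reads([130, 999], chrom_length=1000)
```

Passing a smaller `bin_size` gives a finer-scaled signal at the cost of proportionally more bins.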


People

Naozumi Hiranuma

Paul G. Allen School of Computer Science and Engineering, University of Washington


Scott Lundberg

Paul G. Allen School of Computer Science and Engineering, University of Washington


Su-In Lee

Paul G. Allen School of Computer Science and Engineering, University of Washington

Department of Genome Sciences, University of Washington