AIControl: Replacing matched control experiments with machine learning improves ChIP-seq peak identification

Naozumi Hiranuma, Scott Lundberg, and Su-In Lee

Project description

Motivation: Identifying the binding sites of regulatory proteins has been one of the central problems in molecular biology. The most commonly used experimental technique to determine binding locations of transcription factors is chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq). Because ChIP-seq is highly susceptible to background signals, the current practice is to obtain one matched “control” ChIP-seq dataset and estimate position-wise background distributions using ChIP-seq signals from nearby positions (e.g., within 10,000 bp). This introduces four problems: 1) Incorporating such a large window of nearby positions may result in inaccurate estimation of position-specific distributions of background signals. 2) There are multiple ways to obtain control ChIP-seq datasets, but current peak calling methods require users to use only one control dataset. 3) One matched control dataset may not capture all sources of bias. 4) Generating a matched control dataset incurs additional time and cost.

Methods: We introduce the AIControl framework, which replaces matched control experiments by automatically learning weighted contributions of a large number of publicly available control ChIP-seq datasets to generate position-specific distributions of background signals. This helps us avoid the cost of running control experiments while simultaneously increasing accuracy. Specifically, AIControl can 1) obtain a precise position-specific background distribution (i.e., no need to use a window), 2) use machine learning to systematically select the most appropriate set of control datasets in a data-driven way, 3) capture noise sources that may be missed by one matched control, and 4) remove the need for matched control experiments.
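To make the idea of "learning weighted contributions" of control datasets concrete, the following is a minimal sketch, not the authors' Julia implementation. It assumes the weights are fit by ridge-regularized least squares on binned read counts; the description above says only that weighted contributions are learned. All function names, the toy counts, the ridge value, and the clipping of negative predictions at zero are hypothetical choices made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_control_weights(controls, target, ridge=1e-3):
    """Fit one weight per control track (ASSUMED ridge regression).

    controls: list of k tracks, each a list of per-bin read counts.
    target:   per-bin read counts of the ChIP-seq experiment of interest.
    Solves the normal equations (C C^T + ridge * I) w = C t.
    """
    k = len(controls)
    gram = [
        [
            sum(x * y for x, y in zip(controls[i], controls[j]))
            + (ridge if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]
    rhs = [sum(x * t for x, t in zip(controls[i], target)) for i in range(k)]
    return solve_linear_system(gram, rhs)


def predict_background(controls, weights):
    """Position-specific background: weighted sum of controls in each bin.

    Negative predictions are clipped at zero (a choice made for this sketch).
    """
    n_bins = len(controls[0])
    return [
        max(0.0, sum(w * track[b] for w, track in zip(weights, controls)))
        for b in range(n_bins)
    ]


# Toy data (hypothetical): two control tracks over eight bins.
control_a = [4, 6, 5, 9, 3, 7, 5, 6]
control_b = [1, 2, 8, 2, 1, 3, 2, 1]

# Target = background mixture of the controls plus one simulated binding event.
target = [0.5 * a + 2.0 * b for a, b in zip(control_a, control_b)]
target[4] += 30.0

weights = fit_control_weights([control_a, control_b], target)
background = predict_background([control_a, control_b], weights)

# Signal in excess of the predicted background, bin by bin.
enrichment = [t - bg for t, bg in zip(target, background)]
```

Because the background in each bin is predicted from the control tracks at that same bin, no window of neighbouring positions is needed. In this toy example the simulated binding event distorts the fitted weights noticeably because there are only eight bins; a genome-wide fit has vastly more bins than peaks, so that effect should be far smaller.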

Results: We applied AIControl to 410 ChIP-seq datasets from the ENCODE project whose transcription factors have motif information and that are from tier 1 or 2 cell types. AIControl used 440 control ChIP-seq datasets across 107 cell types and 9 laboratories to estimate background ChIP-seq signals. The peaks identified by AIControl without using matched control datasets were more enriched for putative binding sites than the peaks identified by other popular peak callers that use a matched control dataset. Additionally, we demonstrated that AIControl improves the quality of downstream analysis by showing that binding sites identified by AIControl recover documented protein interactions more accurately.

Conclusion: AIControl removes the need to generate additional matched control data and predicts binding events more accurately, even when tested on cell lines that are not represented in the ENCODE control datasets.


An implementation of the AIControl algorithm in Julia 0.7 can be found under (Academic Use Only).


All accompanying data files can be found in our Google Drive folder. The folder includes the following files:

  • Binned control files (e.g. forward.data100.nodup.tar.bz2): AIControl uses a pool of control datasets from the ENCODE website. We have done our best to compress and represent them in a compact format.

  • Peak files used for our evaluation, including those from other peak callers

  • Bowtie2 reference files for remapping to the UCSC hg38 genome

  • Supplementary Data 1–5 for our paper


Naozumi Hiranuma, Scott Lundberg, and Su-In Lee

Paul G. Allen School of Computer Science and Engineering, University of Washington