aicontrol

AIControl: Replacing matched control experiments with machine learning improves ChIP-seq peak identification

Naozumi Hiranuma, Scott Lundberg, Su-In Lee


Abstract

Motivation: Identifying the binding sites of regulatory proteins has been one of the central problems in molecular biology. The most commonly used experimental technique to determine binding locations of transcription factors is chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq). Because ChIP-seq is highly susceptible to background noise, the current practice is to obtain one matched “control” ChIP-seq dataset and estimate position-wise background distributions using ChIP-seq signals from nearby positions (e.g. within 10,000 bp). This introduces four problems: 1) Incorporating such a large window of nearby positions may result in inaccurate estimation of position-specific background distributions. 2) There are multiple ways to obtain control ChIP-seq datasets, and current peak calling methods require users to use only one control dataset. 3) One matched control dataset may not capture all sources of noise signals. 4) Generating a matched control dataset incurs additional time and cost.

Methods: We introduce the AIControl framework, which replaces matched control experiments by automatically learning weighted contributions of a large number of publicly available control ChIP-seq datasets to generate position-specific background noise distributions. This helps us avoid the cost of running control experiments while simultaneously increasing accuracy. Specifically, AIControl can 1) obtain a precise position-specific background distribution (i.e., no need to use a window), 2) use machine learning to systematically select the most appropriate set of control datasets in a data-driven way, 3) capture noise sources that may be missed by one matched control, and 4) remove the need for matched control experiments.

Results: We applied AIControl to 410 ChIP-seq datasets from the ENCODE project whose transcription factors have motif information and that are from tier 1 or 2 cell lines. AIControl used 445 control ChIP-seq datasets across 107 cell lines and 9 laboratories to estimate background ChIP-seq noise signals. The peaks identified by AIControl without using matched control datasets were more enriched for putative binding sites than the peaks identified by other popular peak callers that use a matched control dataset. Additionally, AIControl significantly reduced the reproducibility between two ChIP-seq datasets whose associated transcription factors have no documented interactions, which suggests that AIControl better removes confounding effects. Finally, we demonstrated that AIControl improves the quality of downstream analysis by showing that binding sites identified by AIControl recover documented protein interactions more accurately.

Conclusion: AIControl removes the need to generate additional matched control data and provides more accurate prediction of the binding events even when testing on cell lines that do not exist in the ENCODE control datasets.



Implementation

An implementation of the AIControl algorithm in Julia 0.5 is available at https://github.com/suinleelab/AIControl. The accompanying evaluation pipeline (in Julia 0.5) and visualization code (in Python 2.7) are in the same repository as Jupyter notebooks. The Anaconda Python Distribution includes all modules required by the Python code. The Julia code requires PureSeq.jl, DataFrames.jl, JLD.jl, and Distributions.jl.
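
The idea of "learning weighted contributions" of many control datasets, as described in the Methods paragraph, can be illustrated with a small Python sketch. It is not the repository's code: it assumes the weights come from a ridge-regularized least-squares fit of the IP signal on the control tracks, and the function names (`fit_control_weights`, `estimate_background`) and the toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_control_weights(controls, target, ridge=1e-6):
    """Return one weight per control track (assumed model: ridge regression).

    controls: list of control tracks, each a list of per-bin read counts.
    target:   per-bin read counts of the IP experiment.
    """
    k = len(controls)
    # Normal equations: (X^T X + ridge * I) w = X^T y
    xtx = [[sum(p * q for p, q in zip(controls[i], controls[j]))
            for j in range(k)] for i in range(k)]
    for i in range(k):
        xtx[i][i] += ridge
    xty = [sum(p * t for p, t in zip(controls[i], target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def estimate_background(controls, weights):
    """Position-specific background: weighted sum of the control tracks."""
    n_bins = len(controls[0])
    return [sum(w * c[b] for w, c in zip(weights, controls))
            for b in range(n_bins)]


# Toy example: two control tracks over five bins (made-up numbers).
controls = [
    [1.0, 2.0, 0.0, 3.0, 1.0],
    [0.0, 1.0, 4.0, 1.0, 2.0],
]
# A target built as an exact mixture, so the fit should recover the weights.
target = [2.0 * a + 0.5 * b for a, b in zip(*controls)]

weights = fit_control_weights(controls, target, ridge=0.0)
background = estimate_background(controls, weights)
```

Because the background is computed per bin from the control tracks themselves, no window of neighboring positions is needed. In this toy case the target is pure background; a real IP track also contains true binding signal, which the fit is not meant to explain.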


Data
  • Binned Control dataset: AIControl uses a pool of publicly available control datasets to identify peak locations. 455 control datasets were obtained from the ENCODE database and preprocessed by binning into 100 bp windows (.fbin100 and .rbin100 files). Users may use the AIControl preprocessing code to create finer-scale signals.
  • Encode ID List: The list of ENCODE IDs of the IP and control datasets used in this project.
  • xtxs.zip: A zip file containing XtX matrices necessary for AIControl.
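
As a rough illustration of the two preprocessing ideas mentioned above (binning reads into 100 bp windows, and precomputing XtX matrices), the following Python sketch works on in-memory lists only. It does not read or write the .fbin100/.rbin100 or xtxs.zip formats, whose layouts are not described here. The function names, the 0-based read positions, and the toy reads are assumptions made for this example.

```python
def bin_positions(positions, genome_length, bin_size=100):
    """Count reads per fixed-size bin (positions assumed 0-based)."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in positions:
        if 0 <= pos < genome_length:
            counts[pos // bin_size] += 1
    return counts


def gram_matrix(tracks):
    """Compute X^T X, where each track is one column of X (bins x tracks)."""
    k = len(tracks)
    return [[sum(p * q for p, q in zip(tracks[i], tracks[j]))
             for j in range(k)] for i in range(k)]


# Toy read start positions for two hypothetical control datasets.
reads_a = [5, 42, 99, 100, 250, 251]
reads_b = [10, 150, 160, 299]

track_a = bin_positions(reads_a, genome_length=300)
track_b = bin_positions(reads_b, genome_length=300)

# X^T X depends only on the control tracks, so it can be computed once
# and reused for every new IP dataset.
xtx = gram_matrix([track_a, track_b])
```

Passing a smaller `bin_size` gives a finer-scale signal at the cost of more bins per track.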


People

Naozumi Hiranuma
Paul G. Allen School of Computer Science and Engineering, University of Washington

Scott Lundberg
Paul G. Allen School of Computer Science and Engineering, University of Washington

Su-In Lee
Paul G. Allen School of Computer Science and Engineering, University of Washington
Department of Genome Sciences, School of Medicine, University of Washington

Attachments
  • ENCODE_ids.csv (10k), Nao Hiranuma, Feb 15, 2018, 6:46 PM
  • xtxs.zip (2944k), Nao Hiranuma, Feb 15, 2018, 6:46 PM