CITRUS: Interpretable deep learning for chromatin-informed inference of transcriptional programs driven by somatic alterations across cancers

Yifeng Tao*, Xiaojun Ma*, Drake Palmer, Russell Schwartz, Xinghua Lu, Hatice Ulku Osmanbeyoglu

University of Pittsburgh, Carnegie Mellon University, UPMC Hillman Cancer Center
https://academic.oup.com/nar/article/50/19/10869/6761738


ABSTRACT

Cancer is a disease of gene dysregulation, where cells acquire somatic and epigenetic alterations that drive aberrant cellular signaling. These alterations adversely impact transcriptional programs and cause profound changes in gene expression. Interpreting somatic alterations within context-specific transcriptional programs will facilitate personalized therapeutic decisions but is a monumental task. Toward this goal, we develop a partially interpretable neural network model called Chromatin-informed Inference of Transcriptional Regulators Using Self-attention mechanism (CITRUS). CITRUS models the impact of somatic alterations on transcription factors and downstream transcriptional programs. Our approach employs a self-attention mechanism to model the contextual impact of somatic alterations. Furthermore, CITRUS uses a layer of hidden nodes to explicitly represent the state of transcription factors (TFs) to learn the relationships between TFs and their target genes based on TF binding motifs in the open chromatin regions of tumor samples. We apply CITRUS to genomic, transcriptomic, and epigenomic data from 17 cancer types profiled by The Cancer Genome Atlas. CITRUS predicts patient-specific TF activities and reveals transcriptional program variations between and within tumor types. We show that CITRUS yields biological insights into delineating TFs associated with somatic alterations in individual tumors. Thus, CITRUS is a promising tool for precision oncology.


CODE

Code and documentation are available on GitHub: github.com/osmanbeyoglulab/CITRUS


Input data

dataset_CITRUS.pkl : CITRUS input. All the data needed to train the CITRUS model are packaged into this single pickle file.
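
The internal layout of the pickle file is not described here, so a reasonable first step is to load it and list what it contains. The sketch below is a minimal, hedged example: the helper names load_pickle and summarize are hypothetical (not part of the CITRUS code base), the file is assumed to sit in the working directory, and unpickling may require the libraries that were used to create the file (for example NumPy or pandas) to be installed. Only unpickle files from a source you trust.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickle file and return whatever object it stores."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def summarize(obj):
    """Map each top-level key to a short type/size description.

    Non-dict objects are summarized under the single key "<root>".
    """
    def describe(value):
        shape = getattr(value, "shape", None)
        if shape is not None:  # array-like objects such as NumPy arrays
            return f"{type(value).__name__} shape={tuple(shape)}"
        if hasattr(value, "__len__"):  # lists, dicts, strings, ...
            return f"{type(value).__name__} len={len(value)}"
        return type(value).__name__

    if isinstance(obj, dict):
        return {key: describe(value) for key, value in obj.items()}
    return {"<root>": describe(obj)}


# Assumed location; adjust to wherever the file was downloaded.
path = Path("dataset_CITRUS.pkl")
if path.exists():
    for key, desc in summarize(load_pickle(path)).items():
        print(f"{key}: {desc}")
```

The same helpers should work for any other pickle file, since they make no assumption about the stored keys.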


Outputs

output_dataset_CITRUS.pkl : Outputs of CITRUS for a single run. As with the input data, all training outputs are packaged and saved in one pickle file.

TF_activity_ensemble_10.csv : Ensembled TF activities of all samples over 10 runs. This is a tumor-by-TF matrix (5803 x 321) with tumor barcodes as row names and TF names as column names; the last column holds the cancer type rather than a TF.

TF_activity_ensemble_10_noHoldout.csv : Ensembled TF activities over 10 runs, excluding the holdout samples. This is a tumor-by-TF matrix (4642 x 321) with tumor barcodes as row names and TF names as column names; the last column holds the cancer type rather than a TF.
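
Because the two CSV files mix numeric TF activities with a trailing cancer-type label, it can help to split them on load. The following is a minimal sketch using only the Python standard library. It assumes the layout described above (first column = tumor barcode, last column = cancer type, everything in between = TF activities stored as numbers) and that the first row is a header; the function name read_tf_activity is hypothetical, and whether the stated 321 columns include the cancer-type column is not specified here, so check the printed counts against your copy of the file.

```python
import csv
from collections import Counter
from pathlib import Path


def read_tf_activity(path):
    """Split a TF-activity CSV into barcodes, TF names, activities, cancer types."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        # Drop the row-name column (first) and the cancer-type column (last).
        tf_names = header[1:-1]
        barcodes, activity, cancer_types = [], [], []
        for row in reader:
            if not row:  # skip blank lines
                continue
            barcodes.append(row[0])
            activity.append([float(x) for x in row[1:-1]])
            cancer_types.append(row[-1])
    return barcodes, tf_names, activity, cancer_types


# Assumed location; adjust to wherever the file was downloaded.
path = Path("TF_activity_ensemble_10.csv")
if path.exists():
    barcodes, tf_names, activity, cancer_types = read_tf_activity(path)
    print(f"{len(barcodes)} tumors x {len(tf_names)} TFs")
    # Number of tumors per cancer type, most frequent first.
    for cancer_type, n in Counter(cancer_types).most_common():
        print(f"{cancer_type}: {n}")
```

Keeping the cancer-type labels in a separate list leaves the activity matrix purely numeric, which makes it straightforward to hand to downstream analysis code.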