cepip: Context-dependent epigenomic weighting for regulatory variant prioritization

Introduction

The majority of trait/disease-associated variants identified by genome-wide association studies (GWASs) are located in regulatory regions. Since gene regulation is highly context-specific, it remains challenging to fine-map and prioritize functional regulatory variants in a particular cell/tissue type and to apply them to the detection of disease-associated genes. By connecting large-scale epigenome profiles to expression quantitative trait loci (eQTLs) in a wide range of human tissues/cell types, we identify a combination of several critical chromatin features that predict variant regulatory potential. We develop a joint likelihood framework to measure the regulatory probability of genetic variants in a context-dependent manner. We show that our model is superior to existing cell type-specific methods and exhibits significant GWAS signal enrichment. Using phenotypically relevant epigenomes to weight GWAS SNPs, we discover more disease-associated genes owing to regulatory changes and improve the statistical power of gene-based association tests.

References

1. Mulin Jun Li, Miaoxin Li, Zipeng Liu, Yan Bin, Zhicheng Pan, Dingge Ying, Jean-Pierre A. Kocher, Zhengyuan Xia, Pak Chung Sham, Jun S. Liu, Junwen Wang. cepip: context-dependent epigenomic weighting for prioritization of regulatory variants and disease-associated genes. Genome Biology (2017) 18:52.

Install and run

System requirement

Java Runtime Environment (JRE) version 6.0 or above is required for cepip. It can be downloaded from the Java web site. Installing the JRE is straightforward on Windows and Mac OS X.

On Linux, installation takes more steps. Details can be found at http://www.java.com/en/download/help/linux_install.xml.

On Ubuntu, if you see an error message like "Exception in thread "AWT-EventQueue-0" java.awt.HeadlessException ...", please install the Sun Java Runtime Environment (JRE) first.

To install the Sun JRE on Ubuntu (10.04), use the following commands:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts
A detailed explanation of the above commands can be found at http://www.ubuntugeek.com/how-install-sun-java-runtime-environment-jre-in-ubuntu-10-04-lucid-lynx.html.

For Mac OS, JRE 1.6 has been available at http://developer.apple.com/java/download/ since April 2008. Mac OS users may need to update Java to run cepip. A potential problem is that this update does not replace the existing installation of J2SE 5.0 or change the default version of Java. As on Linux, the JAVA_HOME environment variable has to be configured before cepip can be launched.

Download: latest Java version (fast, now integrated into KGGSeq) | latest Perl version (slow, but no large annotation download is needed; for <10K variants)

20170318 update: We have integrated the Java version of cepip into the KGGSeq package. It now supports whole-genome scoring in an efficient manner (this requires manually downloading the dbNCFP whole-genome annotation file). Please refer to the manual for package configuration and running parameters (--regulatory-causing-predict and --cell). The Perl version of cepip is still supported for easy execution on small lists of SNVs (< 10K) and for researchers who do not want to download the large program-dependent annotation (please refer to the Perl version manual).

20161213 update: We introduce a Perl version of cepip for remote random access to allele-specific composite score annotations and cell type-specific epigenomic signal annotations. The user can quickly execute cepip without downloading the huge annotation file (Tabix is required). However, this Perl version can run slowly, so please use it with small inputs. First-time users, please download the Perl package from cepip_PL_v0.12.zip (Mac OS X / Linux).

20161024 update: We allow users to customize their epigenomic annotations. Please download cepip.jar and replace the previous jar file with it. First-time users, please download the full package from cepip.zip. We also provide a demo of custom annotations, custom_annotations.zip; please refer to the instructions on this site for how to use custom annotations.

Installing cepip

Simply decompress the archive and run the following command.

java -Xms256m -Xmx1300m -jar ./cepip.jar <arguments>

The arguments -Xms256m and -Xmx1300m set the initial and maximum Java heap sizes for cepip to 256 megabytes and 1.3 gigabytes, respectively. Specifying a larger maximum heap size can speed up the analysis. A higher setting such as -Xmx2g or even -Xmx5g is required when there is a large number of variants, say 5 million. The value, however, should be less than the physical memory of the machine.
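
As a rough illustration of the heap-size rule, the maximum heap can be derived from the amount of memory in the machine. The sketch below defines a hypothetical helper, suggest_xmx, which is not part of cepip. It assumes that three quarters of the RAM (given in megabytes) is a reasonable ceiling; that fraction is an assumption for illustration, not a cepip recommendation.

```shell
# Hypothetical helper, not part of cepip: suggest a -Xmx flag that stays
# below physical memory. The 3/4 fraction is an assumption for illustration.
suggest_xmx() {
  ram_mb=$1                            # physical memory in megabytes
  echo "-Xmx$(( ram_mb * 3 / 4 ))m"    # integer arithmetic, result in megabytes
}

suggest_xmx 8192    # prints -Xmx6144m
```

The printed flag would replace -Xmx1300m in the java command line shown earlier.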

Note: <arguments> can be saved in a flat text file.

Quick example

A variants file is needed (example.vcf). This example uses the VCF variant format without genotypes; other formats are also supported.

Note: All files are included in the examples folder of cepip.

Run the command below:

java -Xms256m -Xmx1300m -jar cepip.jar examples/params.example.txt

We now walk through the parameter file "params.composite.example.txt" before going into the results. Lines starting with a hash sign (#) are comments. A detailed interpretation of each argument in the parameter file is given in the 'Options' part.

#one argument per line
#I.Environmental setting
#--no-resource-check \

#II. Specify the input files
--vcf-file examples/example.vcf \

#III. Output setting
--out example_result \

#IV. Filtering and Prioritization
--db-score dbncfp \
--regulatory-causing-predict all \
--cell GM12878 \

Part (I): Specify general environmental setting

Arguments in this part set the general cepip running environment, including the resource path and program updates.

Part (II): Specify the input files

Arguments in this part specify the input files, which can be in various data formats; they are compulsory for running cepip.

Part (III): Output setting

Arguments in this part set the output file name and type; several output types can be produced simultaneously by cepip.

Part (IV): Prioritization

Arguments in this part apply a set of models (selected regulatory variant scores, cell type definition, etc.) to enable better prioritization of regulatory variants.

Notes: Most of the above arguments are optional, so the user can comment out some lines with # or delete them. In this way, the user can get a systematic view of the impact of each level or even each step. The parameter file is also easier to produce with the cepip command generator we provide.
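
The parameter-file layout described above can be sketched as follows. This is a minimal example, not an official template: the file name my_params.txt and the output prefix my_result are hypothetical, and the arguments are the ones shown in the walkthrough. One --cell line is commented out with # to show how an argument can be masked.

```shell
# Write a parameter file with one argument per line, mirroring the walkthrough.
# "my_params.txt" and "my_result" are hypothetical names.
cat > my_params.txt <<'EOF'
#II. Specify the input files
--vcf-file examples/example.vcf \

#III. Output setting
--out my_result \

#IV. Filtering and Prioritization
--db-score dbncfp \
--regulatory-causing-predict all \
#--cell GM12878 \
--cell HepG2 \
EOF

# Show only the active (non-comment, non-empty) arguments.
grep -v -e '^#' -e '^$' my_params.txt

# With cepip installed, the file would be passed as the single argument:
# java -Xms256m -Xmx1300m -jar cepip.jar my_params.txt
```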

Input formats

Variants file

NOTE: Currently, cepip has compiled an integrative database of eight recent algorithms for functional prediction of noncoding variants, including CADD, DANN, fathmm-MKL, FunSeq, FunSeq2, GWAS3D, GWAVA and SuRFR. To keep a universal format across all collected algorithms, cepip currently supports only the biallelic variants of the 1000 Genomes Project phase 1.

VCF format WITHOUT sample genotypes
cepip can accept VCF data without genotype information, but fewer functions are available for this type of data.
java -jar cepip.jar --vcf-file path/to/file1

In addition, cepip supports other variant and/or genotype formats that are popular in next-generation sequencing studies.

ANNOVAR format
java -jar cepip.jar --annovar-file path/to/file
NOTE: cepip can recognize an extended ANNOVAR format in which a header row and multiple columns for comments are allowed.

Example:
chr startpos endpos ref alt comment1 comment2 comment3
1 69428 69428 T G T 92 129
1 69476 69476 T C T 1 0
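
For readers who start from a genotype-free VCF, the ANNOVAR-style layout shown in the example can be approximated with a short awk filter. This is only a sketch under stated assumptions: it handles single-nucleotide, biallelic records (where start and end positions coincide), skips everything else, and writes no comment columns. The file names demo.vcf and demo.annovar.txt are hypothetical.

```shell
# Build a tiny demo VCF (two SNVs and one deletion); tabs are written by printf.
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > demo.vcf
printf '1\t69428\t.\tT\tG\t.\t.\t.\n1\t69476\t.\tT\tC\t.\t.\t.\n1\t69500\t.\tTA\tT\t.\t.\t.\n' >> demo.vcf

{
  # Header row, as allowed by the extended ANNOVAR format described above.
  printf 'chr\tstartpos\tendpos\tref\talt\n'
  # Keep only records whose REF and ALT are single bases (assumption: SNVs only).
  awk -F'\t' 'BEGIN { OFS = "\t" }
       !/^#/ && length($4) == 1 && length($5) == 1 { print $1, $2, $2, $4, $5 }' demo.vcf
} > demo.annovar.txt

cat demo.annovar.txt
```

The deletion at position 69500 is dropped because converting indel coordinates needs extra rules that this sketch does not attempt.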

Outputs

cepip can flexibly output different formats of the prioritization and annotation results for either final validation or further analysis by third-party tools.

Output file path and file name prefix: --out
java -jar cepip.jar --vcf-file path/to/file1 --out path/to/prefixname
Specify the path and prefix name of the outputs. It is "./cepip" by default.

Ordinary outputs

Text format: this is the default
java -jar cepip.jar --vcf-file path/to/file1
By default, cepip outputs results in TEXT format in a file named cepip.flt.txt, in which the fields (or columns) are delimited by tabs.

Produce SeattleSeq input: --o-seattleseq
java -jar cepip.jar --vcf-file path/to/file1 --o-seattleseq
Generate an extra copy of the output for the prioritized variants in SeattleSeq input format, which can be further annotated by SeattleSeq.

Produce ANNOVAR input: --o-annovar
java -jar cepip.jar --vcf-file path/to/file1 --o-annovar
Generate an extra copy of the output for the prioritized variants in ANNOVAR input format, which can be further annotated by ANNOVAR.

Produce VCF input: --o-vcf
java -jar cepip.jar --vcf-file path/to/file1 --o-vcf
Generate an extra copy of the output for the prioritized variants in VCF format, which can be further analyzed by other tools.

Prediction and Prioritization

One attractive feature of cepip is that it incorporates our previous composite model (Mulin Jun Li et al. 2016. Bioinformatics. PRVCS), which combines functional impact scores from multiple algorithms (e.g., CADD and GWAVA) to predict whether variants are regulatory or not. In addition, cepip summarizes a set of consolidated chromatin features, using cell type-specific chromatin marks and a uniformly processed expression quantitative trait loci (eQTL) dataset, and can measure the probability of regulatory causality for given variants in a selected condition.

Predicting noncoding regulatory variants based on specific condition

Note: cepip also supports context-dependent prioritization using cell type-specific epigenomic signatures.

Predict regulatory variants: --db-score dbncfp --regulatory-causing-predict all --cell
java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict all --cell GM12878
Given a cell type, cepip uses a logit model to measure the probability of regulatory causality for the given variants in the selected condition. It then combines the above composite model and the context-dependent model into a unified model to estimate the posterior probability of regulatory potential.

cepip can support 16 ENCODE cell types as follows:

Coding	Tissue	Description
A549	Epithelium	epithelial cell line derived from a lung carcinoma tissue. (PMID: 175022), "This line was initiated in 1972 by D.J. Giard, et al. through explant culture of lung carcinomatous tissue from a 58-year-old caucasian male." - ATCC, newly promoted to tier 2: not in 2011 analysis.
CD14	Monocytes	Monocytes-CD14+ are CD14-positive cells from human leukapheresis production, from donor RO 01746 (draw 1 ID is RO 01746, draw 2 ID is RO 01826), newly promoted to tier 2: not in 2011 analysis.
GM12878	Blood	B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasian, Epstein-Barr Virus.
H1-hESC	Embryonic stem cell	embryonic stem cells.
HMEC	Breast	mammary epithelial cells.
HSMM	Muscle	skeletal muscle myoblasts.
HSMMT	Muscle	HSMM cell derived skeletal muscle myotubes cell line.
HUVEC	Vessel	umbilical vein endothelial cells.
HeLa-S3	Cervix	cervical carcinoma.
HepG2	Liver	hepatocellular carcinoma.
IMR90	Lung	fetal lung fibroblasts, newly promoted to tier 2: not in 2011 analysis.
K562	Blood	leukemia, "The continuous cell line K-562 was established by Lozzio and Lozzio from the pleural effusion of a 53-year-old female with chronic myelogenous leukemia in terminal blast crises." - ATCC.
NH-A	Brain	astrocytes (also called Astrocy).
NHDF	Skin	dermal fibroblasts from temple / breast
NHEK	Skin	epidermal keratinocytes.
NHLF	Lung	lung fibroblasts.

Note: By default, cepip will use GM12878 for context-dependent prioritization.

cepip can also support 127 RoadMap human reference epigenomes:

Note: Epigenomes of some specific marks are missing in some tissues/cells, which may affect the prediction. We will soon use imputed epigenomes to handle this missing data problem.
java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict all --cell E116

Coding	Lineage Group	Epigenome Mnemonic	Epigenome Name	Anatomy	Type
E001	ESC	ESC.I3	ES-I3 Cell Line	ESC	CellLine
E002	ESC	ESC.WA7	ES-WA7 Cell Line	ESC	CellLine
E003	ESC	ESC.H1	H1 Cell Line	ESC	CellLine
E004	ES-deriv	ESDR.H1.BMP4.MESO	H1 BMP4 Derived Mesendoderm Cultured Cells	ESC_DERIVED	CellLineDerived
E005	ES-deriv	ESDR.H1.BMP4.TROP	H1 BMP4 Derived Trophoblast Cultured Cells	ESC_DERIVED	CellLineDerived
E006	ES-deriv	ESDR.H1.MSC	H1 Derived Mesenchymal Stem Cells	ESC_DERIVED	CellLineDerived
E007	ES-deriv	ESDR.H1.NEUR.PROG	H1 Derived Neuronal Progenitor Cultured Cells	ESC_DERIVED	CellLineDerived
E008	ESC	ESC.H9	H9 Cell Line	ESC	CellLine
E009	ES-deriv	ESDR.H9.NEUR.PROG	H9 Derived Neuronal Progenitor Cultured Cells	ESC_DERIVED	CellLineDerived
E010	ES-deriv	ESDR.H9.NEUR	H9 Derived Neuron Cultured Cells	ESC_DERIVED	CellLineDerived
E011	ES-deriv	ESDR.CD184.ENDO	hESC Derived CD184+ Endoderm Cultured Cells	ESC_DERIVED	CellLineDerived
E012	ES-deriv	ESDR.CD56.ECTO	hESC Derived CD56+ Ectoderm Cultured Cells	ESC_DERIVED	CellLineDerived
E013	ES-deriv	ESDR.CD56.MESO	hESC Derived CD56+ Mesoderm Cultured Cells	ESC_DERIVED	CellLineDerived
E014	ESC	ESC.HUES48	HUES48 Cell Line	ESC	CellLine
E015	ESC	ESC.HUES6	HUES6 Cell Line	ESC	CellLine
E016	ESC	ESC.HUES64	HUES64 Cell Line	ESC	CellLine
E017	IMR90	LNG.IMR90	IMR90 fetal lung fibroblasts Cell Line	LUNG	CellLine
E018	iPSC	IPSC.15b	iPS-15b Cell Line	IPSC	CellLine
E019	iPSC	IPSC.18	iPS-18 Cell Line	IPSC	CellLine
E020	iPSC	IPSC.20B	iPS-20b Cell Line	IPSC	CellLine
E021	iPSC	IPSC.DF.6.9	iPS DF 6.9 Cell Line	IPSC	CellLine
E022	iPSC	IPSC.DF.19.11	iPS DF 19.11 Cell Line	IPSC	CellLine
E023	Mesench	FAT.MSC.DR.ADIP	Mesenchymal Stem Cell Derived Adipocyte Cultured Cells	FAT	CellLineDerived
E024	ESC	ESC.4STAR	ES-UCSF4 Cell Line	ESC	CellLine
E025	Mesench	FAT.ADIP.DR.MSC	Adipose Derived Mesenchymal Stem Cell Cultured Cells	FAT	PrimaryCell
E026	Mesench	STRM.MRW.MSC	Bone Marrow Derived Cultured Mesenchymal Stem Cells	STROMAL_CONNECTIVE	PrimaryCell
E027	Epithelial	BRST.MYO	Breast Myoepithelial Primary Cells	BREAST	PrimaryCell
E028	Epithelial	BRST.HMEC.35	Breast variant Human Mammary Epithelial Cells (vHMEC)	BREAST	PrimaryCell
E029	HSC & B-cell	BLD.CD14.PC	Primary monocytes from peripheral blood	BLOOD	PrimaryCell
E030	HSC & B-cell	BLD.CD15.PC	Primary neutrophils from peripheral blood	BLOOD	PrimaryCell
E031	HSC & B-cell	BLD.CD19.CPC	Primary B cells from cord blood	BLOOD	PrimaryCell
E032	HSC & B-cell	BLD.CD19.PPC	Primary B cells from peripheral blood	BLOOD	PrimaryCell
E033	Blood & T-cell	BLD.CD3.CPC	Primary T cells from cord blood	BLOOD	PrimaryCell
E034	Blood & T-cell	BLD.CD3.PPC	Primary T cells from peripheral blood	BLOOD	PrimaryCell
E035	HSC & B-cell	BLD.CD34.PC	Primary hematopoietic stem cells	BLOOD	PrimaryCell
E036	HSC & B-cell	BLD.CD34.CC	Primary hematopoietic stem cells short term culture	BLOOD	PrimaryCell
E037	Blood & T-cell	BLD.CD4.MPC	Primary T helper memory cells from peripheral blood 2	BLOOD	PrimaryCell
E038	Blood & T-cell	BLD.CD4.NPC	Primary T helper naive cells from peripheral blood	BLOOD	PrimaryCell
E039	Blood & T-cell	BLD.CD4.CD25M.CD45RA.NPC	Primary T helper naive cells from peripheral blood	BLOOD	PrimaryCell
E040	Blood & T-cell	BLD.CD4.CD25M.CD45RO.MPC	Primary T helper memory cells from peripheral blood 1	BLOOD	PrimaryCell
E041	Blood & T-cell	BLD.CD4.CD25M.IL17M.PL.TPC	Primary T helper cells PMA-I stimulated	BLOOD	PrimaryCell
E042	Blood & T-cell	BLD.CD4.CD25M.IL17P.PL.TPC	Primary T helper 17 cells PMA-I stimulated	BLOOD	PrimaryCell
E043	Blood & T-cell	BLD.CD4.CD25M.TPC	Primary T helper cells from peripheral blood	BLOOD	PrimaryCell
E044	Blood & T-cell	BLD.CD4.CD25.CD127M.TREGPC	Primary T regulatory cells from peripheral blood	BLOOD	PrimaryCell
E045	Blood & T-cell	BLD.CD4.CD25I.CD127.TMEMPC	Primary T cells effector/memory enriched from peripheral blood	BLOOD	PrimaryCell
E046	HSC & B-cell	BLD.CD56.PC	Primary Natural Killer cells from peripheral blood	BLOOD	PrimaryCell
E047	Blood & T-cell	BLD.CD8.NPC	Primary T killer naive cells from peripheral blood	BLOOD	PrimaryCell
E048	Blood & T-cell	BLD.CD8.MPC	Primary T killer memory cells from peripheral blood	BLOOD	PrimaryCell
E049	Mesench	STRM.CHON.MRW.DR.MSC	Mesenchymal Stem Cell Derived Chondrocyte Cultured Cells	STROMAL_CONNECTIVE	PrimaryCell
E050	HSC & B-cell	BLD.MOB.CD34.PC.F	Primary hematopoietic stem cells G-CSF-mobilized Female	BLOOD	PrimaryCell
E051	HSC & B-cell	BLD.MOB.CD34.PC.M	Primary hematopoietic stem cells G-CSF-mobilized Male	BLOOD	PrimaryCell
E052	Myosat	MUS.SAT	Muscle Satellite Cultured Cells	MUSCLE	PrimaryCell
E053	Neurosph	BRN.CRTX.DR.NRSPHR	Cortex derived primary cultured neurospheres	BRAIN	PrimaryCell
E054	Neurosph	BRN.GANGEM.DR.NRSPHR	Ganglion Eminence derived primary cultured neurospheres	BRAIN	PrimaryCell
E055	Epithelial	SKIN.PEN.FRSK.FIB.01	Foreskin Fibroblast Primary Cells skin01	SKIN	PrimaryCell
E056	Epithelial	SKIN.PEN.FRSK.FIB.02	Foreskin Fibroblast Primary Cells skin02	SKIN	PrimaryCell
E057	Epithelial	SKIN.PEN.FRSK.KER.02	Foreskin Keratinocyte Primary Cells skin02	SKIN	PrimaryCell
E058	Epithelial	SKIN.PEN.FRSK.KER.03	Foreskin Keratinocyte Primary Cells skin03	SKIN	PrimaryCell
E059	Epithelial	SKIN.PEN.FRSK.MEL.01	Foreskin Melanocyte Primary Cells skin01	SKIN	PrimaryCell
E061	Epithelial	SKIN.PEN.FRSK.MEL.03	Foreskin Melanocyte Primary Cells skin03	SKIN	PrimaryCell
E062	Blood & T-cell	BLD.PER.MONUC.PC	Primary mononuclear cells from peripheral blood	BLOOD	PrimaryCell
E063	Adipose	FAT.ADIP.NUC	Adipose Nuclei	FAT	PrimaryTissue
E065	Heart	VAS.AOR	Aorta	VASCULAR	PrimaryTissue
E066	Other	LIV.ADLT	Liver	LIVER	PrimaryTissue
E067	Brain	BRN.ANG.GYR	Brain Angular Gyrus	BRAIN	PrimaryTissue
E068	Brain	BRN.ANT.CAUD	Brain Anterior Caudate	BRAIN	PrimaryTissue
E069	Brain	BRN.CING.GYR	Brain Cingulate Gyrus	BRAIN	PrimaryTissue
E070	Brain	BRN.GRM.MTRX	Brain Germinal Matrix	BRAIN	PrimaryTissue
E071	Brain	BRN.HIPP.MID	Brain Hippocampus Middle	BRAIN	PrimaryTissue
E072	Brain	BRN.INF.TMP	Brain Inferior Temporal Lobe	BRAIN	PrimaryTissue
E073	Brain	BRN.DL.PRFRNTL.CRTX	Brain Dorsolateral Prefrontal Cortex	BRAIN	PrimaryTissue
E074	Brain	BRN.SUB.NIG	Brain Substantia Nigra	BRAIN	PrimaryTissue
E075	Digestive	GI.CLN.MUC	Colonic Mucosa	GI_COLON	PrimaryTissue
E076	Sm. Muscle	GI.CLN.SM.MUS	Colon Smooth Muscle	GI_COLON	PrimaryTissue
E077	Digestive	GI.DUO.MUC	Duodenum Mucosa	GI_DUODENUM	PrimaryTissue
E078	Sm. Muscle	GI.DUO.SM.MUS	Duodenum Smooth Muscle	GI_DUODENUM	PrimaryTissue
E079	Digestive	GI.ESO	Esophagus	GI_ESOPHAGUS	PrimaryTissue
E080	Other	ADRL.GLND.FET	Fetal Adrenal Gland	ADRENAL	PrimaryTissue
E081	Brain	BRN.FET.M	Fetal Brain Male	BRAIN	PrimaryTissue
E082	Brain	BRN.FET.F	Fetal Brain Female	BRAIN	PrimaryTissue
E083	Heart	HRT.FET	Fetal Heart	HEART	PrimaryTissue
E084	Digestive	GI.L.INT.FET	Fetal Intestine Large	GI_INTESTINE	PrimaryTissue
E085	Digestive	GI.S.INT.FET	Fetal Intestine Small	GI_INTESTINE	PrimaryTissue
E086	Other	KID.FET	Fetal Kidney	KIDNEY	PrimaryTissue
E087	Other	PANC.ISLT	Pancreatic Islets	PANCREAS	PrimaryTissue
E088	Other	LNG.FET	Fetal Lung	LUNG	PrimaryTissue
E089	Muscle	MUS.TRNK.FET	Fetal Muscle Trunk	MUSCLE	PrimaryTissue
E090	Muscle	MUS.LEG.FET	Fetal Muscle Leg	MUSCLE_LEG	PrimaryTissue
E091	Other	PLCNT.FET	Placenta	PLACENTA	PrimaryTissue
E092	Digestive	GI.STMC.FET	Fetal Stomach	GI_STOMACH	PrimaryTissue
E093	Thymus	THYM.FET	Fetal Thymus	THYMUS	PrimaryTissue
E094	Digestive	GI.STMC.GAST	Gastric	GI_STOMACH	PrimaryTissue
E095	Heart	HRT.VENT.L	Left Ventricle	HEART	PrimaryTissue
E096	Other	LNG	Lung	LUNG	PrimaryTissue
E097	Other	OVRY	Ovary	OVARY	PrimaryTissue
E098	Other	PANC	Pancreas	PANCREAS	PrimaryTissue
E099	Other	PLCNT.AMN	Placenta Amnion	PLACENTA	PrimaryTissue
E100	Muscle	MUS.PSOAS	Psoas Muscle	MUSCLE	PrimaryTissue
E101	Digestive	GI.RECT.MUC.29	Rectal Mucosa Donor 29	GI_RECTUM	PrimaryTissue
E102	Digestive	GI.RECT.MUC.31	Rectal Mucosa Donor 31	GI_RECTUM	PrimaryTissue
E103	Sm. Muscle	GI.RECT.SM.MUS	Rectal Smooth Muscle	GI_RECTUM	PrimaryTissue
E104	Heart	HRT.ATR.R	Right Atrium	HEART	PrimaryTissue
E105	Heart	HRT.VNT.R	Right Ventricle	HEART	PrimaryTissue
E106	Digestive	GI.CLN.SIG	Sigmoid Colon	GI_COLON	PrimaryTissue
E107	Muscle	MUS.SKLT.M	Skeletal Muscle Male	MUSCLE	PrimaryTissue
E108	Muscle	MUS.SKLT.F	Skeletal Muscle Female	MUSCLE	PrimaryTissue
E109	Digestive	GI.S.INT	Small Intestine	GI_INTESTINE	PrimaryTissue
E110	Digestive	GI.STMC.MUC	Stomach Mucosa	GI_STOMACH	PrimaryTissue
E111	Sm. Muscle	GI.STMC.MUS	Stomach Smooth Muscle	GI_STOMACH	PrimaryTissue
E112	Thymus	THYM	Thymus	THYMUS	PrimaryTissue
E113	Other	SPLN	Spleen	SPLEEN	PrimaryTissue
E114	ENCODE	LNG.A549.ETOH002.CNCR	A549 EtOH 0.02pct Lung Carcinoma Cell Line	LUNG	CellLine_Cancer
E115	ENCODE	BLD.DND41.CNCR	Dnd41 TCell Leukemia Cell Line	BLOOD	CellLine_Cancer
E116	ENCODE	BLD.GM12878	GM12878 Lymphoblastoid Cell Line	BLOOD	CellLine
E117	ENCODE	CRVX.HELAS3.CNCR	HeLa-S3 Cervical Carcinoma Cell Line	CERVIX	CellLine_Cancer
E118	ENCODE	LIV.HEPG2.CNCR	HepG2 Hepatocellular Carcinoma Cell Line	LIVER	CellLine_Cancer
E119	ENCODE	BRST.HMEC	HMEC Mammary Epithelial Primary Cells	BREAST	CellLine
E120	ENCODE	MUS.HSMM	HSMM Skeletal Muscle Myoblasts Cell Line	MUSCLE	CellLine
E121	ENCODE	MUS.HSMMT	HSMM cell derived Skeletal Muscle Myotubes Cell Line	MUSCLE	CellLine
E122	ENCODE	VAS.HUVEC	HUVEC Umbilical Vein Endothelial Cells Cell Line	VASCULAR	CellLine
E123	ENCODE	BLD.K562.CNCR	K562 Leukemia Cell Line	BLOOD	CellLine
E124	ENCODE	BLD.CD14.MONO	Monocytes-CD14+ RO01746 Cell Line	BLOOD	CellLine
E125	ENCODE	BRN.NHA	NH-A Astrocytes Cell Line	BRAIN	CellLine
E126	ENCODE	SKIN.NHDFAD	NHDF-Ad Adult Dermal Fibroblast Primary Cells	SKIN	CellLine
E127	ENCODE	SKIN.NHEK	NHEK-Epidermal Keratinocyte Primary Cells	SKIN	CellLine
E128	ENCODE	LNG.NHLF	NHLF Lung Fibroblast Primary Cells	LUNG	CellLine
E129	ENCODE	BONE.OSTEO	Osteoblast Primary Cells	BONE	CellLine
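
Because the --cell value is the only thing that changes between contexts, scoring the same variants against several cell types can be scripted. The sketch below is a dry run that only prints the commands; the choice of cell types and the output prefixes result_<cell> are hypothetical, and the remaining arguments are the ones documented in this section.

```shell
# Dry run: write one cepip command per cell type instead of executing it.
# The output prefixes "result_<cell>" are hypothetical names.
for cell in GM12878 HepG2 E073; do
  echo "java -Xms256m -Xmx1300m -jar cepip.jar --vcf-file examples/example.vcf" \
       "--db-score dbncfp --regulatory-causing-predict all" \
       "--cell $cell --out result_$cell"
done > cepip_commands.txt

cat cepip_commands.txt
# Once reviewed, the commands could be executed with: sh cepip_commands.txt
```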

The option appends the cell type-specific regulatory potential (Cell_P) and the combined probability (Combine_P) to the output file.

Adjust prediction scores in composite model

Note: cepip allows adjusting which prediction scores enter the composite model, including CADD, DANN, fathmm-MKL, FunSeq, FunSeq2, GWAS3D, GWAVA and SuRFR. This collection is named the "dbncfp" database here.

Predict regulatory variants: --db-score dbncfp --regulatory-causing-predict
java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict 1,3,4,5,6,8,10,11
Use at most 8 existing algorithms, covering 11 available functional impact scores (listed below), to RE-predict whether a single nucleotide variant (SNV) or indel is potentially regulatory. By default, cepip uses all algorithms (8 scores) for a combinatorial prediction. However, an iterative search shows that the best combination of 4 scores (CADD_cscore, GWAS3D, SuRFR, GWAVA_TSS) achieves top performance.
Figure: Receiver operating characteristic (ROC) curves and area under the curve (AUC) of individual scores and of the combined score by the composite model. Note: Figure from our paper.

On the other hand, one can FIX the prediction to a specified subset or the full set of these 4 impact scores with an option like --regulatory-causing-predict 1,6,8,10
The coding of the functional impact scores used in the '--regulatory-causing-predict' option is listed below:

Coding	Method	Description
1	CADD_CScore	"Raw" CADD scores come straight from the CADD model, and are interpretable as the extent to which the annotation profile for a given variant suggests that that variant is likely to be "observed" (negative values) vs "simulated" (positive values). These values have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or model parameters. However, raw values do have relative meaning, with higher values indicating that a variant is more likely to be simulated (or "not observed") and therefore more likely to have deleterious effects.
2	CADD_PHRED	Since the CScores do have relative meaning, one can take a specific group of variants, define the rank for each variant within that group, and then use that value as a "normalized" and now externally comparable unit of analysis. CADD scored and ranked all ~8.6 billion SNVs of the GRCh37/hg19 reference and then "PHRED-scaled" those values by expressing the rank in order of magnitude terms rather than the precise rank itself.
3	DANN	DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture nonlinear relationships among features and are better suited than SVMs for problems with a large number of samples and features.
4	FunSeq	FunSeq filters mutations overlapping 1000 Genomes variants and then prioritizes those in regions under strong selection (sensitive and ultrasensitive), breaking TF motifs, and those associated with hubs. It can score the deleterious potential of variants in single or multiple genomes. The scores for each noncoding variant vary from 0 to 6, with 6 corresponding to maximum deleterious effect. When multiple tumor genomes are given as input, FunSeq also identifies recurrent mutations in the same element.
5	FunSeq2	FunSeq2 was originally designed to annotate and prioritize somatic alterations by integrating various resources from genomic and cancer studies. The framework consists of two components: (1) data context from uniformly processing large-scale datasets; and (2) a high-throughput variant prioritization pipeline. FunSeq2 can also be used to prioritize noncoding genetic variants.
6	GWAS3D	GWAS3D systematically assesses the genetic variants that could affect regulatory elements, by integrating annotations from cell type-specific chromatin states, epigenetic modifications, sequence motifs and cross-species conservation. It combines the original GWAS signal, risk haplotype, binding affinity significance and conservation information to prioritize the leading variants, and infers the putative causal variant in LD with the leading variant.
7	GWAVA_Region	GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. For this score, the control set was composed of all 1KG variants in the 1 kb surrounding each of the HGMD variants.
8	GWAVA_TSS	GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. For this score, the control set was matched for distance to the nearest TSS genome-wide.
9	GWAVA_Unmatched	GWAVA uses the random forest algorithm to build three classifiers using all available annotations to discriminate between the disease variants and variants from each of the three control sets. For this score, the control set was constructed from a random selection of SNVs from across the genome in order to sample the overall background.
10	SuRFR	SuRFR integrates functional annotation and prior biological knowledge to prioritise candidate functional variants by regression model. It introduces novel training and validation datasets that i) capture the regional heterogeneity of genomic annotation better than previously applied approaches, and ii) facilitate understanding of which annotations are most important for discriminating different classes of functionally relevant variants from background variants.
11	FATHMM-MKL	FATHMM-MKL uses MKL classifier to predict the functional consequences of both coding and non-coding sequence variants from various genomic annotations and weights the significance of each component annotation source.

Note: For variants with missing scores from a specified method, predictions will use the population mean of the corresponding method.
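
The substitution described in this note can be pictured with a toy table. The sketch below is not cepip's internal code: the variant IDs, the score column and the population mean of 0.5 are all made up for illustration, and "." is assumed to mark a missing score.

```shell
# Toy score table: variant ID and one score column; "." marks a missing score.
printf 'rs1\t0.8\nrs2\t.\nrs3\t0.4\n' > scores_demo.tsv

# Replace missing values with a population mean supplied from outside.
# The value 0.5 is made up for this demo.
awk -F'\t' -v mean=0.5 'BEGIN { OFS = "\t" } $2 == "." { $2 = mean } { print }' \
  scores_demo.tsv > scores_filled.tsv

cat scores_filled.tsv
```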

The option will append the corresponding scores of the selected prediction methods to the output file, as well as the Bayes factor (BF) and composite probability (Composite_P).

Custom annotations

cepip also supports custom annotations defined by the user. We suggest that the user prepare all chromatin features that were used in our prediction model, including DNase, H3K4me1, H3K4me2, H3K4me3, H3K36me3, H3K9me3, H3K79me2 and H3K27me3. If the user cannot provide certain marks, cepip still requires empty files that follow the fixed naming scheme. To use the custom annotations function, please follow these instructions:

1. The annotation should be in sorted ENCODE narrowPeak format (https://genome.ucsc.edu/FAQ/FAQformat#format12);
2. The annotation file name should start with the tissue/cell type name (e.g. cellA) followed by "-[MarkName].narrowPeak.sorted", such as "cellA-DNase.narrowPeak.sorted";
3. Use gzip to compress the above narrowPeak file. The final annotation file for a certain mark in a specific cell is "cellA-DNase.narrowPeak.sorted.gz";
4. All required mark gz files should be prepared for each custom tissue/cell type, including DNase, H3K4me1, H3K4me2, H3K4me3, H3K36me3, H3K9me3, H3K79me2 and H3K27me3, even if some marks are not currently available (compress an empty narrowPeak file for unavailable marks);
5. Please put all mark gz files into the cepip annotation resource folder, which is located at "[cepip path]/resources/hg19/all_cell_signal/";
6. We currently only support hg19.

java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict all --cell customCellName
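
The file preparation steps above can be sketched as a short script. Several details are assumptions rather than documented behavior: the raw input naming ("cellA-<Mark>.narrowPeak"), the staging folder custom_annotations_demo, and the sort key (chromosome, then start position), since the instructions only say "sorted". Marks without a raw file get an empty compressed placeholder, as step 4 requires.

```shell
cell="cellA"                                   # custom tissue/cell type name
marks="DNase H3K4me1 H3K4me2 H3K4me3 H3K36me3 H3K9me3 H3K79me2 H3K27me3"
outdir="custom_annotations_demo"               # hypothetical staging folder
mkdir -p "$outdir"

# Demo input for one mark only (two narrowPeak records, deliberately unsorted).
printf 'chr1\t2000\t2500\t.\t0\t.\t5.1\t-1\t-1\t100\n' >  "$cell-DNase.narrowPeak"
printf 'chr1\t100\t600\t.\t0\t.\t8.2\t-1\t-1\t250\n'   >> "$cell-DNase.narrowPeak"

for mark in $marks; do
  raw="$cell-$mark.narrowPeak"                 # assumed name of the raw peak file
  sorted="$outdir/$cell-$mark.narrowPeak.sorted"
  if [ -s "$raw" ]; then
    # Assumption: "sorted" means by chromosome, then by start position.
    LC_ALL=C sort -k1,1 -k2,2n "$raw" > "$sorted"
  else
    : > "$sorted"                              # empty placeholder for a missing mark
  fi
  gzip -f "$sorted"                            # produces <cell>-<mark>.narrowPeak.sorted.gz
done

ls "$outdir"
# The .gz files would then be copied to [cepip path]/resources/hg19/all_cell_signal/
```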

cepip_PL: cell type-specific scoring of all possible whole-genome SNVs without downloading huge annotations

Command line:
perl cepip_PL.pl -i examples/example.vcf -f vcf -t /usr/bin/tabix -r ftp://147.8.193.36/PRVCS/v1.1/dbNCFP_whole_genome_SNVs.bgz -s 1,3,4,5,6,8,10,11 -a ftp://147.8.193.36/cepip/cell_signal/ -c HepG2 -o cepip.out
By default, cepip uses Tabix for random access to the remote compiled dataset. You have to install Tabix and pass the program path to the script with "-t".

Options:
cepip_PL.pl -- Perl version of cepip for predicting cell type-specific regulatory variants
-h output help information to screen
-i the input variants file
-f the format of input file, supporting VCF and ANNOVAR format; default: vcf
-t the path of executable tabix program; default: /user/bin/tabix
-r the path of dbNCFP reference database; default: ftp://147.8.193.36/PRVCS/v1.1/dbNCFP_whole_genome_SNVs.bgz
-s the selected tool scores; default: 1,3,4,5,6,8,10,11
   1 CADD_cscore
   2 CADD_PHRED
   3 DANN_score
   4 FunSeq_score
   5 FunSeq2_score
   6 GWAS3D_score
   7 GWAVA_region_score
   8 GWAVA_TSS_score
   9 GWAVA_unmatched_score
   10 SuRFR_score
   11 Fathmm_MKL_score
-p the causal distribution folder; default: resources/all_causal_distribution
-n the neutral distribution folder; default: resources/all_neutral_distribution
-a the path of tissue/cell type reference epigenome; default: ftp://147.8.193.36/cepip/cell_signal/
-c the dependent tissue/cell type; default: E116
-o the path of output file; default: cepip1.flt.txt

FAQ

1) Why does cepip not read my VCF file?

If you use standard VCF output from the GATK pipeline, it usually contains variants on mitochondrial DNA. However, mitochondrial DNA is not annotated by the gene feature database. Therefore, cepip currently only accepts VCF files that exclude variants on ChrM.
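
One way to prepare such a file is to drop the mitochondrial records before running cepip. The sketch below assumes the mitochondrial contig is named chrM, chrMT, MT or M (naming varies between references); the file names with_mito.vcf and no_mito.vcf are hypothetical.

```shell
# Demo VCF with one autosomal record and two mitochondrial records.
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > with_mito.vcf
printf 'chr1\t69428\t.\tT\tG\t.\t.\t.\nchrM\t150\t.\tT\tC\t.\t.\t.\nMT\t200\t.\tA\tG\t.\t.\t.\n' >> with_mito.vcf

# Keep header lines and every record whose contig is not mitochondrial.
# The contig names checked here are an assumption about common conventions.
awk -F'\t' '/^#/ || ($1 != "chrM" && $1 != "chrMT" && $1 != "MT" && $1 != "M")' \
  with_mito.vcf > no_mito.vcf

grep -v '^#' no_mito.vcf    # only the chr1 record remains
```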

2) Does cepip support rare variants or somatic variants?

Our composite model only supports genetic variants from the 1000 Genomes Project phase 1, since we currently prefer to make all eight collected methods work well. We will support all possible SNVs in the human genome very soon. The context-dependent model can be applied to any variant in the human genome.

3) Can I use ANNOVAR and cepip together on my dataset?

cepip is quite flexible in interacting with other sequence-oriented analytical programs/software (including SamTools, ANNOVAR, etc.). It can accept various input formats and output different kinds of sequence data. In the case of ANNOVAR, cepip can read ANNOVAR-formatted sequence variants and write the final remaining variants in ANNOVAR format.

4) Can I run cepip on my laptop? Is it time consuming to run a complete cepip process?

Normally, cepip runs well and fast with >=1 GB of RAM, so current laptops are certainly capable of running it. The whole process needs <10 minutes, excluding the time for the first-time download.

5) How do I report errors or bugs in cepip?

You are welcome to write an email to mulin0424.li@gmail.com or limx54@yahoo.com.