cepip: Context-dependent epigenomic weighting for regulatory variant prioritization


Introduction


The majority of trait- and disease-associated variants identified by genome-wide association studies (GWASs) are located in regulatory regions. Because gene regulation is highly context-specific, it remains challenging to fine-map and prioritize functional regulatory variants in a particular cell or tissue type and to apply them to the detection of disease-associated genes. By connecting large-scale epigenome profiles to expression quantitative trait loci (eQTLs) in a wide range of human tissues and cell types, we identify a combination of several critical chromatin features that predicts variant regulatory potential. We develop a joint likelihood framework to measure the regulatory probability of genetic variants in a context-dependent manner. We show that our model is superior to existing cell type-specific methods and exhibits significant GWAS signal enrichment. Using phenotypically relevant epigenomes to weight GWAS SNPs, we discover more disease-associated genes attributable to regulatory changes and improve the statistical power of gene-based association tests.

References

1.   Mulin Jun Li, Miaoxin Li, Zipeng Liu, Yan Bin, Zhicheng Pan, Dingge Ying, Jean-Pierre A. Kocher, Zhengyuan Xia, Pak Chung Sham, Jun S. Liu, Junwen Wang. cepip: context-dependent epigenomic weighting for prioritization of regulatory variants and disease-associated genes. Genome Biology (2017) 18:52.

Install and run

System requirement

Java Runtime Environment (JRE) version 6.0 or above is required for cepip. It can be downloaded from the Java web site. Installing the JRE is very easy in Windows OS and Mac OS X.

On Linux, installation takes more steps. Details can be found at http://www.java.com/en/download/help/linux_install.xml.

On Ubuntu, if you see an error message such as "Exception in thread "AWT-EventQueue-0" java.awt.HeadlessException ...", please install the Sun Java Runtime Environment (JRE) first.

To install the Sun JRE on Ubuntu 10.04, use the following commands:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts
A detailed explanation of the above commands can be found at http://www.ubuntugeek.com/how-install-sun-java-runtime-environment-jre-in-ubuntu-10-04-lucid-lynx.html.


For Mac OS, JRE 1.6 has been available at http://developer.apple.com/java/download/ since April 2008. Mac OS users may need to update Java to run cepip. A potential problem is that this update does not replace an existing installation of J2SE 5.0 or change the default version of Java. As on Linux, the JAVA_HOME environment variable has to be configured to start cepip.

Download: latest Java version (fast, now integrated into KGGSeq) | latest Perl version (slow, but no large annotation download is needed; for <10K variants)

20170318 update: We have integrated the Java version of cepip into the KGGSeq package. It now supports whole-genome scoring in an efficient manner (this requires manually downloading the dbNCFP whole-genome annotation file). Please refer to the manual for package configuration and running parameters (--regulatory-causing-predict and --cell). The Perl version of cepip is still supported for easy execution on a small list of SNVs (< 10K) and for researchers who do not want to download the large program-dependent annotation (please refer to the Perl version manual).

20161213 update: We introduced a Perl version of cepip for remote random access to allele-specific composite score annotations and cell type-specific epigenomic signal annotations. Users can run cepip quickly without downloading the huge annotation file (Tabix is required). However, this Perl version can be slow, so please use it only on small inputs. First-time users should download the Perl package cepip_PL_v0.12.zip (Mac OS X / Linux).

20161024 update: Users can now customize their epigenomic annotations. Please download cepip.jar and replace the previous jar file. First-time users should download the full package cepip.zip. We also provide a demo of custom annotations, custom_annotations.zip; please refer to the instructions on this site for how to use custom annotations.

Installing cepip

Simply decompress the archive and run the following command.

  java -Xms256m -Xmx1300m -jar ./cepip.jar <arguments>

The arguments -Xms256m and -Xmx1300m set the initial and maximum Java heap sizes for cepip to 256 megabytes and 1.3 gigabytes, respectively. Specifying a larger maximum heap size can speed up the analysis. A higher setting such as -Xmx2g or even -Xmx5g is required when there is a large number of variants, say 5 million. The value, however, should be less than the physical memory of the machine.
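The heap-size advice above can be turned into a small helper. The following is only a sketch: the functions count_variants and pick_heap are hypothetical, and the 1-million cut-off for -Xmx2g is an illustrative assumption (the text only states that about 5 million variants call for -Xmx2g to -Xmx5g).

```shell
#!/bin/sh
# Hypothetical helpers, not part of cepip.

# Count variant records in a VCF (every line that is not a "#" header line).
count_variants() {
    grep -vc '^#' "$1" || true
}

# Suggest a maximum heap size from a variant count.
# The thresholds are illustrative assumptions; keep the value below physical memory.
pick_heap() {
    n=$1
    if [ "$n" -ge 5000000 ]; then
        echo 5g
    elif [ "$n" -ge 1000000 ]; then
        echo 2g
    else
        echo 1300m
    fi
}

# Example: print the command line one might use for 5 million variants.
heap=$(pick_heap 5000000)
echo "java -Xms256m -Xmx$heap -jar ./cepip.jar <arguments>"
```

With a real input, the count could come from count_variants examples/example.vcf instead of a literal number.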

Note  <arguments> can be saved in a flat text file.


Quick example

A variants file is needed (example.vcf). This example uses the VCF format without genotypes; other formats are also supported.

Note  All files are included in the examples folder of cepip.

Run the command below:

java -Xms256m -Xmx1300m -jar cepip.jar examples/params.example.txt

We now walk through the parameter file "params.composite.example.txt" before going into the results. Lines starting with a hash sign (#) are comments. A detailed interpretation of each argument in the parameter file is given in the 'Options' part.

#one argument per line
#I.Environmental setting
#--no-resource-check \

#II. Specify the input files
--vcf-file examples/example.vcf \

#III. Output setting
--out example_result \

#IV. Filtering and Prioritization
--db-score dbncfp \
--regulatory-causing-predict all \
--cell GM12878 \

Part (I): Specify general environmental setting

Arguments in this part set the general cepip running environment, including the resource path and program updates.

Part (II): Specify the input files

Arguments in this part specify the input files, which may be in various data formats. They are compulsory for running cepip.

Part (III): Output setting

Arguments in this part set the output file name and type; cepip can produce several output types simultaneously.

Part (IV): Prioritization

Arguments in this part apply a set of models (selected regulatory variant scores, cell type definition, etc.) to enable better prioritization of regulatory variants.

Notes Most of the above arguments are optional, so users can comment out lines with # or delete them. In this way, users can systematically examine the impact of each level or step. The parameter file can also be produced more easily with the cepip command generator we provide.
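As a minimal sketch of the workflow above, the script below writes a parameter file with the same arguments as the example and runs cepip only when Java and cepip.jar are found. The file name params.my_run.txt is a hypothetical choice, and the guard around the java call is an addition for safe dry runs, not something cepip requires.

```shell
#!/bin/sh
# Hypothetical file name; the arguments are copied from the example parameter file.
params=params.my_run.txt

# The quoted 'EOF' keeps the trailing backslashes exactly as written.
cat > "$params" <<'EOF'
#one argument per line
#II. Specify the input files
--vcf-file examples/example.vcf \
#III. Output setting
--out example_result \
#IV. Filtering and Prioritization
--db-score dbncfp \
--regulatory-causing-predict all \
--cell GM12878 \
EOF

# Run cepip only if both Java and the jar are available in the current folder.
if command -v java >/dev/null 2>&1 && [ -f cepip.jar ]; then
    java -Xms256m -Xmx1300m -jar cepip.jar "$params"
else
    echo "java or cepip.jar not found; wrote $params only"
fi
```

Commenting out a line in the here-document with # is a quick way to compare runs with and without a given argument.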

Input formats

Variants file


NOTE  Currently, cepip provides an integrative database compiled from eight recent algorithms for functional prediction of noncoding variants: CADD, DANN, fathmm-MKL, FunSeq, FunSeq2, GWAS3D, GWAVA and SuRFR. To keep a universal format across all collected algorithms, cepip currently supports only the biallelic variants of the 1000 Genomes Project phase 1.

VCF format WITHOUT sample genotypes
cepip accepts VCF data without genotype information, although fewer functions are available for this type of data.
java -jar cepip.jar --vcf-file path/to/file1

In addition, cepip supports other variant and/or genotype formats that are popular in next-generation sequencing studies.

ANNOVAR format
java -jar cepip.jar --annovar-file path/to/file
NOTE cepip can recognize an extended ANNOVAR format in which a header row and multiple comment columns are allowed.

Example:
chr startpos endpos ref alt comment1 comment2 comment3
1 69428 69428 T G T 92 129
1 69476 69476 T C T 1 0
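For readers who start from a VCF, the sketch below shows one way to derive the five leading columns of the layout above for simple SNVs, where the start and end positions both equal the VCF POS field. The function name vcf_to_annovar is hypothetical, the tab-delimited output is an assumption about what cepip accepts, and indels and multi-allelic records are skipped rather than converted.

```shell
#!/bin/sh
# Hypothetical helper: read a VCF on stdin, write ANNOVAR-style rows on stdout.
vcf_to_annovar() {
    awk -F '\t' '
        BEGIN { OFS = "\t"; print "chr", "startpos", "endpos", "ref", "alt" }
        /^#/ { next }                              # skip VCF meta and header lines
        $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ {       # keep single-base REF and ALT only
            print $1, $2, $2, $4, $5               # an SNV starts and ends at POS
        }
    '
}

# Usage (file names are placeholders):
#   vcf_to_annovar < examples/example.vcf > example.annovar.txt
```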

Outputs

cepip can flexibly output different formats of the prioritization and annotation results for either final validation or further analysis by third-party tools.

Output file path and file name prefix: --out
java -jar cepip.jar --vcf-file path/to/file1 --out path/to/prefixname
Specify the path and prefix name of the outputs. It is "./cepip" by default.

Ordinary outputs

Text format: this is the default
java -jar cepip.jar --vcf-file path/to/file1
By default, cepip outputs results in TEXT format in a file named cepip.flt.txt, in which the fields (columns) are delimited by tabs.

Produce SeattleSeq input: --o-seattleseq
java -jar cepip.jar --vcf-file path/to/file1 --o-seattleseq
Generate an extra copy of the output for the prioritized variants in SeattleSeq input format, which can be further annotated by SeattleSeq.

Produce ANNOVAR input: --o-annovar
java -jar cepip.jar --vcf-file path/to/file1 --o-annovar
Generate an extra copy of the output for the prioritized variants in ANNOVAR input format, which can be further annotated by ANNOVAR.

Produce VCF input: --o-vcf
java -jar cepip.jar --vcf-file path/to/file1 --o-vcf
Generate an extra copy of the output for the prioritized variants in VCF format, which can be further analyzed by other tools.

Prediction and Prioritization

One attractive feature of cepip is that it incorporates our previous composite model (Mulin Jun Li et al. 2016. Bioinformatics. PRVCS), which combines functional impact scores from multiple algorithms (e.g., CADD and GWAVA) to predict whether variants are regulatory. In addition, cepip summarizes a set of consolidated chromatin features, using cell type-specific chromatin marks and a uniformly processed expression quantitative trait loci (eQTL) dataset, and can measure the probability of regulatory causality for given variants in a selected condition.

Predicting noncoding regulatory variants based on specific condition


Note cepip also supports context-dependent prioritization using cell type-specific epigenomic signatures.

Predict regulatory variants: --db-score dbncfp --regulatory-causing-predict all --cell
java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict all --cell GM12878
Given a cell type, cepip uses a logit model to measure the probability of regulatory causality for the given variants in the selected condition. It then combines the composite model and the context-dependent model into a unified model to estimate the posterior probability of regulatory potential.

cepip can support 16 ENCODE cell types as follows:

Coding Tissue Description
A549 Epithelium epithelial cell line derived from a lung carcinoma tissue. (PMID: 175022), "This line was initiated in 1972 by D.J. Giard, et al. through explant culture of lung carcinomatous tissue from a 58-year-old caucasian male." - ATCC, newly promoted to tier 2: not in 2011 analysis.
CD14 Monocytes Monocytes-CD14+ are CD14-positive cells from human leukapheresis production, from donor RO 01746 (draw 1 ID is RO 01746, draw 2 ID is RO 01826), newly promoted to tier 2: not in 2011 analysis.
GM12878 Blood B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasian, Epstein-Barr Virus.
H1-hESC Embryonic stem cell embryonic stem cells.
HMEC Breast mammary epithelial cells.
HSMM Muscle skeletal muscle myoblasts.
HSMMT Muscle HSMM cell derived skeletal muscle myotubes cell line.
HUVEC Vessel umbilical vein endothelial cells.
HeLa-S3 Cervix cervical carcinoma.
HepG2 Liver hepatocellular carcinoma.
IMR90 Lung fetal lung fibroblasts, newly promoted to tier 2: not in 2011 analysis.
K562 Blood leukemia, "The continuous cell line K-562 was established by Lozzio and Lozzio from the pleural effusion of a 53-year-old female with chronic myelogenous leukemia in terminal blast crises." - ATCC.
NH-A Brain astrocytes (also called Astrocy).
NHDF Skin dermal fibroblasts from temple / breast
NHEK Skin epidermal keratinocytes.
NHLF Lung lung fibroblasts.
Note By default, cepip uses GM12878 for context-dependent prioritization.

cepip also supports 127 Roadmap human reference epigenomes:

Note Epigenomes for some marks are missing in some tissues/cells, which may affect the prediction. We will soon use imputed epigenomes to handle this missing-data problem.
java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict all --cell E116
Coding Lineage Group Epigenome Mnemonic Epigenome Name Anatomy Type
E001 ESC ESC.I3 ES-I3 Cell Line ESC CellLine
E002 ESC ESC.WA7 ES-WA7 Cell Line ESC CellLine
E003 ESC ESC.H1 H1 Cell Line ESC CellLine
E004 ES-deriv ESDR.H1.BMP4.MESO H1 BMP4 Derived Mesendoderm Cultured Cells ESC_DERIVED CellLineDerived
E005 ES-deriv ESDR.H1.BMP4.TROP H1 BMP4 Derived Trophoblast Cultured Cells ESC_DERIVED CellLineDerived
E006 ES-deriv ESDR.H1.MSC H1 Derived Mesenchymal Stem Cells ESC_DERIVED CellLineDerived
E007 ES-deriv ESDR.H1.NEUR.PROG H1 Derived Neuronal Progenitor Cultured Cells ESC_DERIVED CellLineDerived
E008 ESC ESC.H9 H9 Cell Line ESC CellLine
E009 ES-deriv ESDR.H9.NEUR.PROG H9 Derived Neuronal Progenitor Cultured Cells ESC_DERIVED CellLineDerived
E010 ES-deriv ESDR.H9.NEUR H9 Derived Neuron Cultured Cells ESC_DERIVED CellLineDerived
E011 ES-deriv ESDR.CD184.ENDO hESC Derived CD184+ Endoderm Cultured Cells ESC_DERIVED CellLineDerived
E012 ES-deriv ESDR.CD56.ECTO hESC Derived CD56+ Ectoderm Cultured Cells ESC_DERIVED CellLineDerived
E013 ES-deriv ESDR.CD56.MESO hESC Derived CD56+ Mesoderm Cultured Cells ESC_DERIVED CellLineDerived
E014 ESC ESC.HUES48 HUES48 Cell Line ESC CellLine
E015 ESC ESC.HUES6 HUES6 Cell Line ESC CellLine
E016 ESC ESC.HUES64 HUES64 Cell Line ESC CellLine
E017 IMR90 LNG.IMR90 IMR90 fetal lung fibroblasts Cell Line LUNG CellLine
E018 iPSC IPSC.15b iPS-15b Cell Line IPSC CellLine
E019 iPSC IPSC.18 iPS-18 Cell Line IPSC CellLine
E020 iPSC IPSC.20B iPS-20b Cell Line IPSC CellLine
E021 iPSC IPSC.DF.6.9 iPS DF 6.9 Cell Line IPSC CellLine
E022 iPSC IPSC.DF.19.11 iPS DF 19.11 Cell Line IPSC CellLine
E023 Mesench FAT.MSC.DR.ADIP Mesenchymal Stem Cell Derived Adipocyte Cultured Cells FAT CellLineDerived
E024 ESC ESC.4STAR   Cell Line ESC CellLine
E025 Mesench FAT.ADIP.DR.MSC Adipose Derived Mesenchymal Stem Cell Cultured Cells FAT PrimaryCell
E026 Mesench STRM.MRW.MSC Bone Marrow Derived Cultured Mesenchymal Stem Cells NECTIVE PrimaryCell
E027 Epithelial BRST.MYO Breast Myoepithelial Primary Cells BREAST PrimaryCell
E028 Epithelial BRST.HMEC.35 Breast variant Human Mammary Epithelial Cells (vHMEC) BREAST PrimaryCell
E029 HSC & B-cell BLD.CD14.PC Primary monocytes from peripheral blood BLOOD PrimaryCell
E030 HSC & B-cell BLD.CD15.PC Primary neutrophils from peripheral blood BLOOD PrimaryCell
E031 HSC & B-cell BLD.CD19.CPC Primary B cells from cord blood BLOOD PrimaryCell
E032 HSC & B-cell BLD.CD19.PPC Primary B cells from peripheral blood BLOOD PrimaryCell
E033 Blood & T-cell BLD.CD3.CPC Primary T cells from cord blood BLOOD PrimaryCell
E034 Blood & T-cell BLD.CD3.PPC Primary T cells from peripheral blood BLOOD PrimaryCell
E035 HSC & B-cell BLD.CD34.PC Primary hematopoietic stem cells BLOOD PrimaryCell
E036 HSC & B-cell BLD.CD34.CC Primary hematopoietic stem cells short term culture BLOOD PrimaryCell
E037 Blood & T-cell BLD.CD4.MPC Primary T helper memory cells from peripheral blood 2 BLOOD PrimaryCell
E038 Blood & T-cell BLD.CD4.NPC Primary T helper naive cells from peripheral blood BLOOD PrimaryCell
E039 Blood & T-cell BLD.CD4.CD25M.CD45RA.NPC Primary T helper naive cells from peripheral blood BLOOD PrimaryCell
E040 Blood & T-cell BLD.CD4.CD25M.CD45RO.MPC Primary T helper memory cells from peripheral blood 1 BLOOD PrimaryCell
E041 Blood & T-cell BLD.CD4.CD25M.IL17M.PL.TPC Primary T helper cells PMA-I stimulated BLOOD PrimaryCell
E042 Blood & T-cell BLD.CD4.CD25M.IL17P.PL.TPC Primary T helper 17 cells PMA-I stimulated BLOOD PrimaryCell
E043 Blood & T-cell BLD.CD4.CD25M.TPC Primary T helper cells from peripheral blood BLOOD PrimaryCell
E044 Blood & T-cell BLD.CD4.CD25.CD127M.TREGPC Primary T regulatory cells from peripheral blood BLOOD PrimaryCell
E045 Blood & T-cell BLD.CD4.CD25I.CD127.TMEMPC Primary T cells effector/memory enriched from peripheral blood BLOOD PrimaryCell
E046 HSC & B-cell BLD.CD56.PC Primary Natural Killer cells from peripheral blood BLOOD PrimaryCell
E047 Blood & T-cell BLD.CD8.NPC Primary T killer naive cells from peripheral blood BLOOD PrimaryCell
E048 Blood & T-cell BLD.CD8.MPC Primary T killer memory cells from peripheral blood BLOOD PrimaryCell
E049 Mesench STRM.CHON.MRW.DR.MSC Mesenchymal Stem Cell Derived Chondrocyte Cultured Cells NECTIVE PrimaryCell
E050 HSC & B-cell BLD.MOB.CD34.PC.F Primary hematopoietic stem cells G-CSF-mobilized Female BLOOD PrimaryCell
E051 HSC & B-cell BLD.MOB.CD34.PC.M Primary hematopoietic stem cells G-CSF-mobilized Male BLOOD PrimaryCell
E052 Myosat MUS.SAT Muscle Satellite Cultured Cells MUSCLE PrimaryCell
E053 Neurosph BRN.CRTX.DR.NRSPHR Cortex derived primary cultured neurospheres BRAIN PrimaryCell
E054 Neurosph BRN.GANGEM.DR.NRSPHR Ganglion Eminence derived primary cultured neurospheres BRAIN PrimaryCell
E055 Epithelial SKIN.PEN.FRSK.FIB.01 Foreskin Fibroblast Primary Cells skin01 SKIN PrimaryCell
E056 Epithelial SKIN.PEN.FRSK.FIB.02 Foreskin Fibroblast Primary Cells skin02 SKIN PrimaryCell
E057 Epithelial SKIN.PEN.FRSK.KER.02 Foreskin Keratinocyte Primary Cells skin02 SKIN PrimaryCell
E058 Epithelial SKIN.PEN.FRSK.KER.03 Foreskin Keratinocyte Primary Cells skin03 SKIN PrimaryCell
E059 Epithelial SKIN.PEN.FRSK.MEL.01 Foreskin Melanocyte Primary Cells skin01 SKIN PrimaryCell
E061 Epithelial SKIN.PEN.FRSK.MEL.03 Foreskin Melanocyte Primary Cells skin03 SKIN PrimaryCell
E062 Blood & T-cell BLD.PER.MONUC.PC Primary mononuclear cells from peripheral blood BLOOD PrimaryCell
E063 Adipose FAT.ADIP.NUC Adipose Nuclei FAT PrimaryTissue
E065 Heart VAS.AOR Aorta VASCULAR PrimaryTissue
E066 Other LIV.ADLT Liver LIVER PrimaryTissue
E067 Brain BRN.ANG.GYR Brain Angular Gyrus BRAIN PrimaryTissue
E068 Brain BRN.ANT.CAUD Brain Anterior Caudate BRAIN PrimaryTissue
E069 Brain BRN.CING.GYR Brain Cingulate Gyrus BRAIN PrimaryTissue
E070 Brain BRN.GRM.MTRX Brain Germinal Matrix BRAIN PrimaryTissue
E071 Brain BRN.HIPP.MID Brain Hippocampus Middle BRAIN PrimaryTissue
E072 Brain BRN.INF.TMP Brain Inferior Temporal Lobe BRAIN PrimaryTissue
E073 Brain BRN.DL.PRFRNTL.CRTX Brain Dorsolateral Prefrontal Cortex BRAIN PrimaryTissue
E074 Brain BRN.SUB.NIG Brain Substantia Nigra BRAIN PrimaryTissue
E075 Digestive GI.CLN.MUC Colonic Mucosa GI_COLON PrimaryTissue
E076 Sm. Muscle GI.CLN.SM.MUS Colon Smooth Muscle GI_COLON PrimaryTissue
E077 Digestive GI.DUO.MUC Duodenum Mucosa GI_DUODENUM PrimaryTissue
E078 Sm. Muscle GI.DUO.SM.MUS Duodenum Smooth Muscle GI_DUODENUM PrimaryTissue
E079 Digestive GI.ESO Esophagus S PrimaryTissue
E080 Other ADRL.GLND.FET Fetal Adrenal Gland ADRENAL PrimaryTissue
E081 Brain BRN.FET.M Fetal Brain Male BRAIN PrimaryTissue
E082 Brain BRN.FET.F Fetal Brain Female BRAIN PrimaryTissue
E083 Heart HRT.FET Fetal Heart HEART PrimaryTissue
E084 Digestive GI.L.INT.FET Fetal Intestine Large GI_INTESTINE PrimaryTissue
E085 Digestive GI.S.INT.FET Fetal Intestine Small GI_INTESTINE PrimaryTissue
E086 Other KID.FET Fetal Kidney KIDNEY PrimaryTissue
E087 Other PANC.ISLT Pancreatic Islets PANCREAS PrimaryTissue
E088 Other LNG.FET Fetal Lung LUNG PrimaryTissue
E089 Muscle MUS.TRNK.FET Fetal Muscle Trunk MUSCLE PrimaryTissue
E090 Muscle MUS.LEG.FET Fetal Muscle Leg MUSCLE_LEG PrimaryTissue
E091 Other PLCNT.FET Placenta PLACENTA PrimaryTissue
E092 Digestive GI.STMC.FET Fetal Stomach GI_STOMACH PrimaryTissue
E093 Thymus THYM.FET Fetal Thymus THYMUS PrimaryTissue
E094 Digestive GI.STMC.GAST Gastric GI_STOMACH PrimaryTissue
E095 Heart HRT.VENT.L Left Ventricle HEART PrimaryTissue
E096 Other LNG Lung LUNG PrimaryTissue
E097 Other OVRY Ovary OVARY PrimaryTissue
E098 Other PANC Pancreas PANCREAS PrimaryTissue
E099 Other PLCNT.AMN Placenta Amnion PLACENTA PrimaryTissue
E100 Muscle MUS.PSOAS Psoas Muscle MUSCLE PrimaryTissue
E101 Digestive GI.RECT.MUC.29 Rectal Mucosa Donor 29 GI_RECTUM PrimaryTissue
E102 Digestive GI.RECT.MUC.31 Rectal Mucosa Donor 31 GI_RECTUM PrimaryTissue
E103 Sm. Muscle GI.RECT.SM.MUS Rectal Smooth Muscle GI_RECTUM PrimaryTissue
E104 Heart HRT.ATR.R Right Atrium HEART PrimaryTissue
E105 Heart HRT.VNT.R Right Ventricle HEART PrimaryTissue
E106 Digestive GI.CLN.SIG Sigmoid Colon GI_COLON PrimaryTissue
E107 Muscle MUS.SKLT.M Skeletal Muscle Male MUSCLE PrimaryTissue
E108 Muscle MUS.SKLT.F Skeletal Muscle Female MUSCLE PrimaryTissue
E109 Digestive GI.S.INT Small Intestine GI_INTESTINE PrimaryTissue
E110 Digestive GI.STMC.MUC Stomach Mucosa GI_STOMACH PrimaryTissue
E111 Sm. Muscle GI.STMC.MUS Stomach Smooth Muscle GI_STOMACH PrimaryTissue
E112 Thymus THYM Thymus THYMUS PrimaryTissue
E113 Other SPLN Spleen SPLEEN PrimaryTissue
E114 ENCODE LNG.A549.ETOH002.CNCR A549 EtOH 0.02pct Lung Carcinoma Cell Line LUNG CellLine_Cancer
E115 ENCODE BLD.DND41.CNCR Dnd41 TCell Leukemia Cell Line BLOOD CellLine_Cancer
E116 ENCODE BLD.GM12878 GM12878 Lymphoblastoid Cell Line BLOOD CellLine
E117 ENCODE CRVX.HELAS3.CNCR HeLa-S3 Cervical Carcinoma Cell Line CERVIX CellLine_Cancer
E118 ENCODE LIV.HEPG2.CNCR HepG2 Hepatocellular Carcinoma Cell Line LIVER CellLine_Cancer
E119 ENCODE BRST.HMEC HMEC Mammary Epithelial Primary Cells BREAST CellLine
E120 ENCODE MUS.HSMM HSMM Skeletal Muscle Myoblasts Cell Line MUSCLE CellLine
E121 ENCODE MUS.HSMMT HSMM cell derived Skeletal Muscle Myotubes Cell Line MUSCLE CellLine
E122 ENCODE VAS.HUVEC HUVEC Umbilical Vein Endothelial Cells Cell Line VASCULAR CellLine
E123 ENCODE BLD.K562.CNCR K562 Leukemia Cell Line BLOOD CellLine
E124 ENCODE BLD.CD14.MONO Monocytes-CD14+ RO01746 Cell Line BLOOD CellLine
E125 ENCODE BRN.NHA NH-A Astrocytes Cell Line BRAIN CellLine
E126 ENCODE SKIN.NHDFAD NHDF-Ad Adult Dermal Fibroblast Primary Cells SKIN CellLine
E127 ENCODE SKIN.NHEK NHEK-Epidermal Keratinocyte Primary Cells SKIN CellLine
E128 ENCODE LNG.NHLF NHLF Lung Fibroblast Primary Cells LUNG CellLine
E129 ENCODE BONE.OSTEO Osteoblast Primary Cells BONE CellLine

This option appends the cell type-specific regulatory potential (Cell_P) and the combined probability (Combine_P) to the output file.
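Once these columns are in the tab-delimited output, variants can be ranked by them. The sketch below is a hypothetical helper, not part of cepip: it assumes the output file has a single header row containing the column name (for example Combine_P) and relies on the -g (general numeric) sort key available in GNU and BSD sort so that values in scientific notation are ordered correctly.

```shell
#!/bin/sh
# rank_by_column FILE COLUMN
# Print the header row, then all other rows sorted by COLUMN in descending order.
rank_by_column() {
    file=$1
    col=$2
    tab=$(printf '\t')
    # Find the 1-based index of the named column in the header row.
    idx=$(head -n 1 "$file" | tr '\t' '\n' | grep -n -x -F -- "$col" | head -n 1 | cut -d: -f1)
    if [ -z "$idx" ]; then
        echo "column $col not found in $file" >&2
        return 1
    fi
    head -n 1 "$file"
    tail -n +2 "$file" | sort -t "$tab" -k "${idx},${idx}gr"
}

# Usage (assumes the default output file name and a Combine_P header):
#   rank_by_column cepip.flt.txt Combine_P | head
```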

Adjust prediction scores in composite model


Note cepip allows the prediction scores used in the composite model to be adjusted. The available scores come from CADD, DANN, fathmm-MKL, FunSeq, FunSeq2, GWAS3D, GWAVA and SuRFR, collected in a database named "dbncfp".

Predict regulatory variants: --db-score dbncfp --regulatory-causing-predict
java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict 1,3,4,5,6,8,10,11
Use up to 8 existing algorithms, covering 11 available functional impact scores (listed below), to re-predict whether a single nucleotide variant (SNV) or indel is potentially regulatory. By default, cepip uses all algorithms (8 scores) for a combinatorial prediction. However, an iterative search shows that the best combination of 4 scores achieves top performance (CADD_cscore, GWAS3D, SuRFR, GWAVA_TSS).

Figure: Receiver operating characteristic (ROC) curves and area under the curve (AUC) of individual scores and of the combined score from the composite model.
Note: Figures are from our paper.

Alternatively, one can fix the prediction to a specified subset or the full set of these 4 impact scores with an option such as --regulatory-causing-predict 1,6,8,10
The coding of the functional impact scores used in the '--regulatory-causing-predict' option is listed below:

Coding Method Description
1 CADD_CScore "Raw" CADD scores come straight from the CADD model, and are interpretable as the extent to which the annotation profile for a given variant suggests that that variant is likely to be "observed" (negative values) vs "simulated" (positive values). These values have no absolute unit of meaning and are incomparable across distinct annotation combinations, training sets, or model parameters. However, raw values do have relative meaning, with higher values indicating that a variant is more likely to be simulated (or "not observed") and therefore more likely to have deleterious effects.
2 CADD_PHRED Since the CScores do have relative meaning, one can take a specific group of variants, define the rank for each variant within that group, and then use that value as a "normalized" and now externally comparable unit of analysis. CADD scored and ranked all ~8.6 billion SNVs of the GRCh37/hg19 reference and then "PHRED-scaled" those values by expressing the rank in order of magnitude terms rather than the precise rank itself.
3 DANN DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture nonlinear relationships among features and are better suited than SVMs for problems with a large number of samples and features.
4 FunSeq FunSeq filters mutations overlapping 1000 Genomes variants and then prioritizes those in regions under strong selection (sensitive and ultrasensitive), breaking TF motifs, and those associated with hubs. It can score the deleterious potential of variants in single or multiple genomes. The scores for each noncoding variant vary from 0 to 6, with 6 corresponding to maximum deleterious effect. When multiple tumor genomes are given as input, FunSeq also identifies recurrent mutations in the same element.
5 FunSeq2 FunSeq2 was originally designed to annotate and prioritize somatic alterations by integrating various resources from genomic and cancer studies. The framework consists of two components: (1) data context from uniformly processing large-scale datasets; and (2) a high-throughput variant prioritization pipeline. FunSeq2 can also be used to prioritize noncoding genetic variants.
6 GWAS3D GWAS3D systematically assesses the genetic variants that could affect regulatory elements, by integrating annotations from cell type-specific chromatin states, epigenetic modifications, sequence motifs and cross-species conservation. It combines the original GWAS signal, risk haplotype, binding affinity significance and conservation information to prioritize the leading variants, and infer the putative causal variant in the LD of leading variant.
7 GWAVA_Region GWAVA uses the random forest algorithm to build three classifiers, using all available annotations to discriminate between disease variants and variants from each of three control sets. The control set for this score was composed of all 1KG variants in the 1 kb surrounding each of the HGMD variants.
8 GWAVA_TSS GWAVA uses the random forest algorithm to build three classifiers, using all available annotations to discriminate between disease variants and variants from each of three control sets. The control set for this score was matched for distance to the nearest TSS genome-wide.
9 GWAVA_Unmatched GWAVA uses the random forest algorithm to build three classifiers, using all available annotations to discriminate between disease variants and variants from each of three control sets. The control set for this score was constructed from a random selection of SNVs from across the genome in order to sample the overall background.
10 SuRFR SuRFR integrates functional annotation and prior biological knowledge to prioritise candidate functional variants by regression model. It introduces novel training and validation datasets that i) capture the regional heterogeneity of genomic annotation better than previously applied approaches, and ii) facilitate understanding of which annotations are most important for discriminating different classes of functionally relevant variants from background variants.
11 FATHMM-MKL FATHMM-MKL uses an MKL (multiple kernel learning) classifier to predict the functional consequences of both coding and non-coding sequence variants from various genomic annotations, and weights the significance of each component annotation source.
Note For variants with missing scores for a specified method, the prediction uses the population mean of that method.
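Because the option takes numeric codes, a small lookup can help avoid transcription mistakes. The function below is a hypothetical convenience, not part of cepip; it only encodes the table above and joins the codes with commas in the order the names are given.

```shell
#!/bin/sh
# score_codes NAME...
# Translate score names from the table above into a comma-separated code list.
score_codes() {
    codes=""
    for name in "$@"; do
        case $name in
            CADD_CScore)     c=1 ;;
            CADD_PHRED)      c=2 ;;
            DANN)            c=3 ;;
            FunSeq)          c=4 ;;
            FunSeq2)         c=5 ;;
            GWAS3D)          c=6 ;;
            GWAVA_Region)    c=7 ;;
            GWAVA_TSS)       c=8 ;;
            GWAVA_Unmatched) c=9 ;;
            SuRFR)           c=10 ;;
            FATHMM-MKL)      c=11 ;;
            *) echo "unknown score: $name" >&2; return 1 ;;
        esac
        codes="${codes:+$codes,}$c"     # append with a comma unless this is the first code
    done
    echo "$codes"
}

# Example: the four-score combination mentioned in the text.
echo "--regulatory-causing-predict $(score_codes CADD_CScore GWAS3D GWAVA_TSS SuRFR)"
```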

This option appends the scores of the selected prediction methods to the output file, along with the Bayes factor (BF) and the composite probability (Composite_P).


Custom annotations

cepip also supports user-defined custom annotations. We suggest that users prepare all chromatin features used in our prediction model: DNase, H3K4me1, H3K4me2, H3K4me3, H3K36me3, H3K9me3, H3K79me2 and H3K27me3. If certain marks cannot be provided, cepip still requires empty files with the fixed naming scheme. To use custom annotations, please follow these instructions:

1. The annotation must be in sorted ENCODE narrowPeak format (https://genome.ucsc.edu/FAQ/FAQformat#format12);
2. The annotation file name must start with the tissue/cell type name (e.g. cellA) followed by "-[MarkName].narrowPeak.sorted", such as "cellA-DNase.narrowPeak.sorted";
3. Compress the above narrowPeak file with gzip. The final annotation file for a given mark in a specific cell type is "cellA-DNase.narrowPeak.sorted.gz";
4. All required mark gz files must be prepared for each custom tissue/cell type, including DNase, H3K4me1, H3K4me2, H3K4me3, H3K36me3, H3K9me3, H3K79me2 and H3K27me3, even if some marks are not currently available (compress an empty narrowPeak file for each unavailable mark);
5. Put all mark gz files into the cepip annotation resource folder, located at "[cepip path]/resources/hg19/all_cell_signal/";
6. Only hg19 is currently supported.

java -jar cepip.jar --vcf-file path/to/file1 --db-score dbncfp --regulatory-causing-predict all --cell customCellName
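The preparation steps for custom annotations can be sketched as a short script. Several details here are assumptions rather than documented behaviour: the function name prepare_marks, the input layout SRC_DIR/MARK.narrowPeak, and the sort order (chromosome, then start position), since the instructions only say the files must be sorted.

```shell
#!/bin/sh
# prepare_marks CELL SRC_DIR DEST_DIR
# For every required mark, write DEST_DIR/CELL-MARK.narrowPeak.sorted.gz.
# A mark without an input file gets a gzipped empty file, as the instructions require.
prepare_marks() {
    cell=$1
    src=$2
    dest=$3
    mkdir -p "$dest"
    for mark in DNase H3K4me1 H3K4me2 H3K4me3 H3K36me3 H3K9me3 H3K79me2 H3K27me3; do
        out="$dest/${cell}-${mark}.narrowPeak.sorted"
        if [ -f "$src/$mark.narrowPeak" ]; then
            # Assumed order: chromosome name, then numeric start coordinate.
            sort -k1,1 -k2,2n "$src/$mark.narrowPeak" > "$out"
        else
            : > "$out"                  # empty placeholder for an unavailable mark
        fi
        gzip -f "$out"                  # produces "$out.gz" and removes "$out"
    done
}

# Usage (paths are placeholders; the destination follows step 5 of the instructions):
#   prepare_marks cellA ./my_peaks "/path/to/cepip/resources/hg19/all_cell_signal"
```

After this, the custom cell name (cellA in the usage line) is what would be passed to --cell.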

cepip_PL: cell type-specific scoring of all possible whole-genome SNVs without downloading huge annotations

Command line:
perl cepip_PL.pl -i examples/example.vcf -f vcf -t /usr/bin/tabix -r ftp://147.8.193.36/PRVCS/v1.1/dbNCFP_whole_genome_SNVs.bgz -s 1,3,4,5,6,8,10,11 -a ftp://147.8.193.36/cepip/cell_signal/ -c HepG2 -o cepip.out
By default, cepip uses Tabix for random access to the remote compiled dataset. You have to install Tabix and pass the program path to the script with "-t".

Options:
cepip_PL.pl -- Perl version of cepip for predicting cell type-specific regulatory variants
-h output help information to screen
-i the input variants file
-f the format of the input file, supporting VCF and ANNOVAR formats; default: vcf
-t the path of the executable tabix program; default: /user/bin/tabix
-r the path of the dbNCFP reference database; default: ftp://147.8.193.36/PRVCS/v1.1/dbNCFP_whole_genome_SNVs.bgz
-s the selected tool scores; default: 1,3,4,5,6,8,10,11
   1 CADD_cscore
   2 CADD_PHRED
   3 DANN_score
   4 FunSeq_score
   5 FunSeq2_score
   6 GWAS3D_score
   7 GWAVA_region_score
   8 GWAVA_TSS_score
   9 GWAVA_unmatched_score
   10 SuRFR_score
   11 Fathmm_MKL_score
-p the causal distribution folder; default: resources/all_causal_distribution
-n the neutral distribution folder; default: resources/all_neutral_distribution
-a the path of the tissue/cell type reference epigenome; default: ftp://147.8.193.36/cepip/cell_signal/
-c the dependent tissue/cell type; default: E116
-o the path of the output file; default: cepip1.flt.txt
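Since the script needs the tabix path via -t, it can help to locate tabix before launching. The sketch below only assembles and prints a command line using the -i, -f, -t, -c and -o options listed above, leaving -r, -s and -a at their defaults. The function name build_cepip_pl_cmd and the TABIX override variable are hypothetical.

```shell
#!/bin/sh
# build_cepip_pl_cmd INPUT_VCF CELL OUTPUT
# Print a cepip_PL.pl command line, taking the tabix path from $TABIX or from PATH.
build_cepip_pl_cmd() {
    input=$1
    cell=$2
    out=$3
    # TABIX is a hypothetical override; otherwise look tabix up on PATH.
    tabix_path=${TABIX:-$(command -v tabix 2>/dev/null || true)}
    if [ -z "$tabix_path" ]; then
        echo "tabix not found; install it or set TABIX to its path" >&2
        return 1
    fi
    printf 'perl cepip_PL.pl -i %s -f vcf -t %s -c %s -o %s\n' \
        "$input" "$tabix_path" "$cell" "$out"
}

# Usage: review the printed command, then run it from the cepip_PL folder.
#   build_cepip_pl_cmd examples/example.vcf HepG2 cepip.out
```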


FAQ

1) Why does cepip not read my VCF file?

If you use standard VCF output from the GATK pipeline, it usually contains variants on mitochondrial DNA. However, mitochondrial DNA is not annotated in the gene feature database. Therefore, cepip currently accepts only VCF files that exclude variants on ChrM.
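One way to remove mitochondrial records before running cepip is a small filter such as the sketch below. The function name drop_chrM is hypothetical, and the list of mitochondrial contig names (chrM, chrMT, M, MT) is an assumption covering common naming conventions; adjust it to match your reference.

```shell
#!/bin/sh
# drop_chrM: copy a VCF from stdin to stdout, omitting mitochondrial records.
drop_chrM() {
    awk -F '\t' '
        /^#/ { print; next }        # keep every meta and header line unchanged
        $1 != "chrM" && $1 != "chrMT" && $1 != "M" && $1 != "MT" { print }
    '
}

# Usage (file names are placeholders):
#   drop_chrM < gatk_output.vcf > gatk_output.noM.vcf
```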

2) Does cepip support rare variants or somatic variants?

Our composite model supports only genetic variants from the 1000 Genomes Project phase 1, since we currently prefer to make all eight collected methods work well. We will support all possible SNVs in the human genome very soon. The context-dependent model can be applied to any variant in the human genome.

3) Can I use ANNOVAR and cepip together on my dataset?

cepip is quite flexible in interacting with other sequence-oriented analysis programs (including SAMtools, ANNOVAR, etc.). It accepts various input formats and outputs different kinds of sequence data. In the case of ANNOVAR, cepip can read ANNOVAR-formatted sequence variants and write the final remaining variants in ANNOVAR format.

4) Can I run cepip on my laptop? Is it time-consuming to run a complete cepip process?

Normally, cepip runs well and fast with >=1 GB of RAM, so current laptops are certainly capable of running cepip. The whole process takes less than 10 minutes, excluding the first-time download.

5) How do I report an error or bug in cepip?

You are welcome to send an email to mulin0424.li@gmail.com or limx54@yahoo.com.