DiffCircaPipeline is a workflow for a systematic detection of multifaceted differential circadian characteristics (\(\Delta A\), \(\Delta \phi\), \(\Delta M\), \(\Delta R^2\)) with accurate false positive control. The pipeline allows interactive exploration of the data and the analysis outputs are accompanied by informative visualization. This tutorial demonstrates the pipeline using an public data set from Gene Expression Omnibus (GEO) with the accession number GSE54650.
GSE54650 contains data from 12 tissues. In this tutorial we will be comparing rhythmicity between two tissues: hypothalamus and skeletal muscle.
GSE.ID = "GSE54650"
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
## Bioconductor version '3.13' is out-of-date; the current release version '3.15'
## is available with R version '4.2'; see https://bioconductor.org/install
if(!require("GEOquery", quietly = TRUE))
BiocManager::install("GEOquery")
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
if(!require("Biobase", quietly = TRUE))
BiocManager::install("Biobase")
Sys.setenv(VROOM_CONNECTION_SIZE=131072*5)
gset <- GEOquery::getGEO(GSE.ID)
## Found 1 file(s)
## GSE54650_series_matrix.txt.gz
## Rows: 35556 Columns: 289
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (289): ID_REF, GSM1320990, GSM1320991, GSM1320992, GSM1320993, GSM132099...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## File stored at:
## /var/folders/sw/rwyyf2h547ddmv59k_c5g08c0000gn/T//RtmpBojbWE/GPL6246.soft
data.pdata = Biobase::pData(gset[[1]])
data.fdata = Biobase::fData(gset[[1]])
data.exprs = Biobase::exprs(gset[[1]])
table(data.pdata$source_name_ch1)
##
## adrenal gland aorta brainstem brown adipose cerebellum
## 24 24 24 24 24
## heart hypothalamus kidney liver lung
## 24 24 24 24 24
## skeletal muscle white adipose
## 24 24
gI = "hypothalamus"; gII = "skeletal muscle"
data.pdata.sub = data.pdata %>% filter(source_name_ch1 %in% c(gI, gII))
data.exprs.sub = data.exprs[, data.pdata.sub$geo_accession]
#check if need log2 transformation
ex = data.exprs.sub
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
(qx[6]-qx[1] > 50 && qx[2] > 0)
if (LogC) {
ex[ex == 0] <- NaN
ex <- log2(ex) }
# box-and-whisker plot
par(mar=c(7,4,2,1))
title <- paste0(GSE.ID, ": box-and-whisker plot")
boxplot(ex, boxwex=0.7, notch=T, main=title, outline=FALSE, las=2,)#distribution looks good
# expression value distribution plot
par(mar=c(4,4,2,1))
title <- paste0(GSE.ID, ": density distribution")
limma::plotDensities(ex, main=title, legend=F)
There is an apparant distribution difference between the two tissues, so we perform quantile normalization to the data.
ex.norm = limma::normalizeQuantiles(ex)
limma::plotDensities(ex.norm, main=title, legend=F)
This section performs the differential rhythmicity (DR) tests using DiffCircaPipeline.
Format the data for DiffCircaPipeline. The data for each group is a list of three components: (1) the expression data, which must be formatted into dataframe; (2) the time, which should have the same order as the columns in the expression, and (3) gname, which is a vector of feature identifier, the order must be the same as the rows in the expression.
data.pdata.x1 = data.pdata.sub %>% filter(source_name_ch1==gI) %>% mutate(time = as.numeric(gsub(".*_CT([0-9]+)$", "\\1", title)))
data.pdata.x2 = data.pdata.sub %>% filter(source_name_ch1==gII) %>% mutate(time = as.numeric(gsub(".*_CT([0-9]+)$", "\\1", title)))
x1 = list(data = as.data.frame(ex.norm[, data.pdata.x1$geo_accession]),
time = data.pdata.x1$time,
gname = rownames(ex.norm))
x2 = list(data = as.data.frame(ex.norm[, data.pdata.x2$geo_accession]),
time = data.pdata.x2$time,
gname = rownames(ex.norm))
In this section we run all the analyses using the default arguments, for example, the period is 24 h, the amplitude cutoff for rhythmic gene is 0, etc. Details about changing the arguments can be found in the function documents by “?DCP_Rhythmicity”.
library(DiffCircaPipeline)
DCP_rhythm = DCP_Rhythmicity(x1, x2)
## Warning: One very small variance detected, has been offset away from zero
Parameter estimates for each group are in “DCP_rhythm$x1$rhythm”
head(DCP_rhythm$x1$rhythm)
## gname M A phase peak pvalue qvalue
## 10338001 10338001 11.698275 0.106567532 2.251517 15.399839 0.1041008 0.4474593
## 10338002 10338002 6.050822 0.023407676 3.698849 9.871436 0.6594015 0.8510622
## 10338003 10338003 9.907403 0.119066212 2.285528 15.269926 0.2366674 0.5911106
## 10338004 10338004 8.581350 0.101562258 2.388286 14.877418 0.3831228 0.6938206
## 10338005 10338005 3.905233 0.005056579 3.035410 12.405586 0.7292322 0.8845179
## 10338006 10338006 4.035082 0.004422785 4.238822 7.808893 0.8496744 0.9396513
## sigma R2
## 10338001 0.173321221 0.17764992
## 10338002 0.088124253 0.03875448
## 10338003 0.248693560 0.11581224
## 10338004 0.262838040 0.07861265
## 10338005 0.007451227 0.20833470
## 10338006 0.016898674 0.03766805
head(DCP_rhythm$x2$rhythm )
## gname M A phase peak pvalue qvalue
## 10338001 10338001 11.871364 0.032571804 2.264964 15.348474 0.5364214 0.8602667
## 10338002 10338002 6.406782 0.065812076 4.297505 7.584741 0.3832761 0.7960966
## 10338003 10338003 10.240580 0.038659149 2.668252 13.808030 0.5545686 0.8684537
## 10338004 10338004 9.356769 0.042091463 3.626426 10.148074 0.5395708 0.8611368
## 10338005 10338005 3.867527 0.002321991 5.137218 4.377274 0.9504393 0.9891442
## 10338006 10338006 4.006544 0.004536261 4.879352 5.362247 0.8903137 0.9732819
## sigma R2
## 10338001 0.10399801 0.05307744
## 10338002 0.16362562 0.08461980
## 10338003 0.12646623 0.05069035
## 10338004 0.13366257 0.05362808
## 10338005 0.01321188 0.01734423
## 10338006 0.02427578 0.01956282
Types of rhythmicity for the two groups are in “DCP_rhythm$rhythm.joint”
head(DCP_rhythm$rhythm.joint)
## gname action1 action2 pG1 pG2 TOJR TOJR.FDR
## 1 10338001 1 2 0.1041008 0.5364214 arrhy arrhy
## 2 10338002 2 1 0.6594015 0.3832761 arrhy arrhy
## 3 10338003 1 2 0.2366674 0.5545686 arrhy arrhy
## 4 10338004 1 2 0.3831228 0.5395708 arrhy arrhy
## 5 10338005 1 2 0.7292322 0.9504393 arrhy arrhy
## 6 10338006 1 2 0.8496744 0.8903137 arrhy arrhy
Summarize number of genes identified for each TOJR:
table(DCP_rhythm$rhythm.joint$TOJR)
##
## arrhy both rhyI rhyII
## 29479 735 2728 2614
Summarize number of genes identified for each TOJR after genome-wide FDR correction:
table(DCP_rhythm$rhythm.joint$TOJR.FDR)
##
## arrhy both rhyI rhyII
## 34671 38 146 701
DCP_dR2 = DCP_DiffR2(DCP_rhythm)
head(DCP_dR2[order(DCP_dR2$p.R2), ])
## gname R2.1 R2.2 delta.R2 p.R2 q.R2
## 2728 10443527 0.015023297 0.9104134 0.8953901 4.954334e-09 3.010749e-05
## 1078 10363224 0.923506602 0.1321794 -0.7913272 3.714506e-08 1.128653e-04
## 1891 10402325 0.014711202 0.8797053 0.8649941 8.015765e-08 1.623727e-04
## 5867 10595856 0.008308555 0.8600023 0.8516938 2.206424e-07 3.303082e-04
## 2351 10425822 0.199990565 0.9182301 0.7182395 2.717692e-07 3.303082e-04
## 684 10345675 0.033601787 0.8714594 0.8378576 3.401598e-07 3.445252e-04
By default, the genes that are rhyI/rhyII/both in “DCP_rhythm$rhythm.joint$TOJR” will be used. If users want to use other TOJR results, e.g., “DCP_rhythm$rhythm.joint$TOJR.FDR”, please refer to next section.
DCP_dparam = DCP_DiffPar(DCP_rhythm, Par = "A&phase")
head(DCP_dparam[order(DCP_dparam$p.overall), 1:7])
## gname p.overall q.overall post.hoc.A.By.p post.hoc.peak.By.p
## 10556463 10556463 8.004971e-16 5.883654e-13 TRUE FALSE
## 10556487 10556487 6.149367e-13 1.712021e-10 TRUE FALSE
## 10518781 10518781 6.987839e-13 1.712021e-10 TRUE TRUE
## 10356601 10356601 1.945671e-12 3.575170e-10 TRUE FALSE
## 10598638 10598638 7.895862e-12 1.160692e-09 TRUE TRUE
## 10409278 10409278 1.025601e-11 1.256362e-09 TRUE FALSE
## post.hoc.A.By.q post.hoc.peak.By.q
## 10556463 TRUE FALSE
## 10556487 TRUE FALSE
## 10518781 TRUE TRUE
## 10356601 TRUE FALSE
## 10598638 TRUE FALSE
## 10409278 TRUE FALSE
The rest of the columns contains parameter estimates from each group and the p-value for parameter comparision without multiple testing adjustment.
By default the function plots the genes that are most rhythmic in group I (hypothalamus).
p.scatter = DCP_ScatterPlot(DCP_rhythm, Info1 = gI, Info2 = gII, filename = NULL)
#
#if given file name the function will automatically save the plots to pdf files in the given directory.
DCP_PlotDisplay(p.scatter) #DCP_ScatterPlot only stores the plots in a list, DCP_PlotDisplay will display plots side-by-side.
## [[1]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[2]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[3]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[4]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[5]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[6]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[7]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[8]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[9]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[10]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
Plot the top 2 genes with \(Delta R^2\)
g.topR2 = DCP_dR2$gname[order(DCP_dR2$p.R2)][1:2]
p.scatter = DCP_ScatterPlot(DCP_rhythm, g.topR2)
DCP_PlotDisplay(p.scatter)
## [[1]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[2]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
Plot the top 2 genes with \(Delta A\)
g.top.dA= DCP_dparam$gname[order(DCP_dparam$p.delta.A)][1:2]
p.scatter = DCP_ScatterPlot(DCP_rhythm, g.top.dA)
DCP_PlotDisplay(p.scatter)
## [[1]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[2]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
Plot the top 2 genes with \(Delta \phi\). Since both groups are control samples, the phase difference is not large.
g.top.dphase= DCP_dparam$gname[order(DCP_dparam$p.delta.peak)][1:2]
p.scatter = DCP_ScatterPlot(DCP_rhythm, g.top.dphase)
DCP_PlotDisplay(p.scatter)
## [[1]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
##
## [[2]]
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
The default is plot the top 100 rhythmic genes in group I. Users are able to plot selected genes by specifying the genes.plot argument.
DCP_PlotHeatmap(DCP_rhythm)
## Warning in DCP_PlotHeatmap(DCP_rhythm):
DCP_PlotPeakRadar(DCP_rhythm)
## Warning in DCP_PlotPeakRadar(DCP_rhythm): There is no TOJR input,
## x$rhythm.joint$TOJR will be used for joint rhythmicity category.
The peak histogram is essentially the same as the peak radar plot with only the presentation form changed.
DCP_PlotPeakHist(DCP_rhythm)
## Warning in DCP_PlotPeakHist(DCP_rhythm): There is no TOJR input,
## x$rhythm.joint$TOJR will be used for joint rhythmicity category.
## Warning: Removed 4 rows containing missing values (geom_bar).
The default peak difference plot:
DCP_PlotPeakDiff(DCP_rhythm, Info1 = gI, Info2 = gII)
## Warning in DCP_PlotPeakDiff(DCP_rhythm, Info1 = gI, Info2 = gII): color.cut or
## color.df input is ignored. color.cut or color.df is only used when dPhase is
## given.
## Warning in DCP_PlotPeakDiff(DCP_rhythm, Info1 = gI, Info2 = gII): There is
## no input for TOJR or dPhase, x$rhythm.joint$TOJR will be used for extracting
## RhyBoth genes.
## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'
## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'
If users have performed differential rhythm parameter test that contains differential phase, the corresponding result can be displayed. The “dPhase” should be the result from differential rhythm parameter, and “color.cut” specifies the color regime according to the “dPhase”.
DCP_PlotPeakDiff(DCP_rhythm,
dPhase = DCP_dparam,
color.cut = list(param = "p.delta.peak", fun = "<", val = 0.05, color.sig = "#b33515", color.none = "dark grey"),
Info1 = gI, Info2 = gII)
## Warning in DCP_PlotPeakDiff(DCP_rhythm, dPhase = DCP_dparam, color.cut =
## list(param = "p.delta.peak", : dPhase is given. Genes contained in dPhase will
## be used as RhyBoth in regardless of any TOJR input.
## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'
## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'
The functions DCP_dR2 and DCP_DiffPar are only performed to a subset of genes for biologically meaningful tests (see the manuscript for more details). By default, the functions use the gene categories in “DCP_rhythm$rhythm.joint$TOJR”. However, user are able to specify a different (types of joint rhythmicity) TOJR assignment. For example, we can use the genome-wide FDR corrected TOJR.
DCP_dR2 = DCP_DiffR2(DCP_rhythm, TOJR = DCP_rhythm$rhythm.joint$TOJR.FDR)
DCP_dparam = DCP_DiffPar(DCP_rhythm, Par = "A&phase", TOJR = DCP_rhythm$rhythm.joint$TOJR.FDR)
Separate TOJR result can be generated with the toTOJR functoin. For example, we can perform an intuitive (but statistically criticized) venn diagram.
new.TOJR = toTOJR(DCP_rhythm, "VDA", alpha = 0.05, adjustP = TRUE) #adjustP = TRUE gives genome-wide FDR adjusted results
DCP_dR2 = DCP_DiffR2(DCP_rhythm, TOJR = new.TOJR)
DCP_dparam = DCP_DiffPar(DCP_rhythm, Par = "A&phase", TOJR = new.TOJR)
If the study involves human subjects, usually researchers do not have control of when and where the data will be collected. The function DCP_getZT converts a recorded clock time to Zeitgeber time according to the location and date of the collection.
t = as.POSIXlt("2022-09-17 13:43:00 EST", tz="America/New_York",usetz=TRUE)
lat = 40.4406; long = -79.9959 #This is the Pittsburgh coordinates
t.correct = DCP_getZT(t, lat, long, ZT.min = -6)
## Please make sure that the input time has the correct time zone.
print(t.correct)
## [1] 6.655715
Summarize the DR result to a table:
SummarizeDR(list(DCP_dR2$p.R2, DCP_dR2$q.R2), test = c("DRF", "DRF"), type = c("p-value", "q-value"), val = c(0.05, 0.05), out = "long")
## nTRUE test cutoff
## 1 756(/1376) DRF p-value<0.05
## 2 463(/1376) DRF q-value<0.05
SummarizeDR(list(DCP_dR2$p.R2, DCP_dR2$q.R2), test = c("DRF", "DRF"), type = c("p-value", "q-value"), val = c(0.05, 0.05), out = "wide")
## DRF p-value<0.05 DRF q-value<0.05
## 1 756(/1376) 463(/1376)