DiffCircaPipeline is a workflow for a systematic detection of multifaceted differential circadian characteristics (\(\Delta A\), \(\Delta \phi\), \(\Delta M\), \(\Delta R^2\)) with accurate false positive control. The pipeline allows interactive exploration of the data and the analysis outputs are accompanied by informative visualization. This tutorial demonstrates the pipeline using an public data set from Gene Expression Omnibus (GEO) with the accession number GSE54650.

Prepare data

Download data from GEO

GSE54650 contains data from 12 tissues. In this tutorial we will be comparing rhythmicity between two tissues: hypothalamus and skeletal muscle.

GSE.ID = "GSE54650"
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
## Bioconductor version '3.13' is out-of-date; the current release version '3.15'
##   is available with R version '4.2'; see https://bioconductor.org/install
if(!require("GEOquery", quietly = TRUE))
  BiocManager::install("GEOquery")
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
if(!require("Biobase", quietly = TRUE))
  BiocManager::install("Biobase")
Sys.setenv(VROOM_CONNECTION_SIZE=131072*5)
gset <- GEOquery::getGEO(GSE.ID)
## Found 1 file(s)
## GSE54650_series_matrix.txt.gz
## Rows: 35556 Columns: 289
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (289): ID_REF, GSM1320990, GSM1320991, GSM1320992, GSM1320993, GSM132099...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## File stored at:
## /var/folders/sw/rwyyf2h547ddmv59k_c5g08c0000gn/T//RtmpBojbWE/GPL6246.soft
data.pdata =   Biobase::pData(gset[[1]])
data.fdata = Biobase::fData(gset[[1]])
data.exprs = Biobase::exprs(gset[[1]])
table(data.pdata$source_name_ch1)
## 
##   adrenal gland           aorta       brainstem   brown adipose      cerebellum 
##              24              24              24              24              24 
##           heart    hypothalamus          kidney           liver            lung 
##              24              24              24              24              24 
## skeletal muscle   white adipose 
##              24              24

Preprocessing

gI = "hypothalamus"; gII = "skeletal muscle"
data.pdata.sub = data.pdata %>% filter(source_name_ch1 %in% c(gI, gII))
data.exprs.sub = data.exprs[, data.pdata.sub$geo_accession]

#check if need log2 transformation
ex = data.exprs.sub
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
  (qx[6]-qx[1] > 50 && qx[2] > 0)
if (LogC) {
  ex[ex == 0] <- NaN
  ex <- log2(ex) }

# box-and-whisker plot
par(mar=c(7,4,2,1))
title <- paste0(GSE.ID, ": box-and-whisker plot")
boxplot(ex, boxwex=0.7, notch=T, main=title, outline=FALSE, las=2,)#distribution looks good

# expression value distribution plot
par(mar=c(4,4,2,1))
title <- paste0(GSE.ID, ": density distribution")
limma::plotDensities(ex, main=title, legend=F)

There is an apparant distribution difference between the two tissues, so we perform quantile normalization to the data.

ex.norm = limma::normalizeQuantiles(ex)
limma::plotDensities(ex.norm, main=title, legend=F)

DiffCircaPipeline (quick start)

This section performs the differential rhythmicity (DR) tests using DiffCircaPipeline.

Format the input data

Format the data for DiffCircaPipeline. The data for each group is a list of three components: (1) the expression data, which must be formatted into dataframe; (2) the time, which should have the same order as the columns in the expression, and (3) gname, which is a vector of feature identifier, the order must be the same as the rows in the expression.

data.pdata.x1 = data.pdata.sub %>% filter(source_name_ch1==gI) %>% mutate(time = as.numeric(gsub(".*_CT([0-9]+)$", "\\1", title)))
data.pdata.x2 = data.pdata.sub %>% filter(source_name_ch1==gII) %>% mutate(time = as.numeric(gsub(".*_CT([0-9]+)$", "\\1", title)))
x1 = list(data = as.data.frame(ex.norm[, data.pdata.x1$geo_accession]), 
          time = data.pdata.x1$time, 
          gname = rownames(ex.norm))
x2 = list(data = as.data.frame(ex.norm[, data.pdata.x2$geo_accession]), 
          time = data.pdata.x2$time, 
          gname = rownames(ex.norm))

Rhythm analysis

In this section we run all the analyses using the default arguments, for example, the period is 24 h, the amplitude cutoff for rhythmic gene is 0, etc. Details about changing the arguments can be found in the function documents by “?DCP_Rhythmicity”.

library(DiffCircaPipeline)
DCP_rhythm = DCP_Rhythmicity(x1, x2)
## Warning: One very small variance detected, has been offset away from zero

Parameter estimates for each group are in “DCP_rhythm$x1$rhythm”

head(DCP_rhythm$x1$rhythm)
##             gname         M           A    phase      peak    pvalue    qvalue
## 10338001 10338001 11.698275 0.106567532 2.251517 15.399839 0.1041008 0.4474593
## 10338002 10338002  6.050822 0.023407676 3.698849  9.871436 0.6594015 0.8510622
## 10338003 10338003  9.907403 0.119066212 2.285528 15.269926 0.2366674 0.5911106
## 10338004 10338004  8.581350 0.101562258 2.388286 14.877418 0.3831228 0.6938206
## 10338005 10338005  3.905233 0.005056579 3.035410 12.405586 0.7292322 0.8845179
## 10338006 10338006  4.035082 0.004422785 4.238822  7.808893 0.8496744 0.9396513
##                sigma         R2
## 10338001 0.173321221 0.17764992
## 10338002 0.088124253 0.03875448
## 10338003 0.248693560 0.11581224
## 10338004 0.262838040 0.07861265
## 10338005 0.007451227 0.20833470
## 10338006 0.016898674 0.03766805
head(DCP_rhythm$x2$rhythm )
##             gname         M           A    phase      peak    pvalue    qvalue
## 10338001 10338001 11.871364 0.032571804 2.264964 15.348474 0.5364214 0.8602667
## 10338002 10338002  6.406782 0.065812076 4.297505  7.584741 0.3832761 0.7960966
## 10338003 10338003 10.240580 0.038659149 2.668252 13.808030 0.5545686 0.8684537
## 10338004 10338004  9.356769 0.042091463 3.626426 10.148074 0.5395708 0.8611368
## 10338005 10338005  3.867527 0.002321991 5.137218  4.377274 0.9504393 0.9891442
## 10338006 10338006  4.006544 0.004536261 4.879352  5.362247 0.8903137 0.9732819
##               sigma         R2
## 10338001 0.10399801 0.05307744
## 10338002 0.16362562 0.08461980
## 10338003 0.12646623 0.05069035
## 10338004 0.13366257 0.05362808
## 10338005 0.01321188 0.01734423
## 10338006 0.02427578 0.01956282

Types of rhythmicity for the two groups are in “DCP_rhythm$rhythm.joint”

head(DCP_rhythm$rhythm.joint)
##      gname action1 action2       pG1       pG2  TOJR TOJR.FDR
## 1 10338001       1       2 0.1041008 0.5364214 arrhy    arrhy
## 2 10338002       2       1 0.6594015 0.3832761 arrhy    arrhy
## 3 10338003       1       2 0.2366674 0.5545686 arrhy    arrhy
## 4 10338004       1       2 0.3831228 0.5395708 arrhy    arrhy
## 5 10338005       1       2 0.7292322 0.9504393 arrhy    arrhy
## 6 10338006       1       2 0.8496744 0.8903137 arrhy    arrhy

Summarize number of genes identified for each TOJR:

table(DCP_rhythm$rhythm.joint$TOJR)
## 
## arrhy  both  rhyI rhyII 
## 29479   735  2728  2614

Summarize number of genes identified for each TOJR after genome-wide FDR correction:

table(DCP_rhythm$rhythm.joint$TOJR.FDR)
## 
## arrhy  both  rhyI rhyII 
## 34671    38   146   701

Compare rhythm fitness

DCP_dR2 = DCP_DiffR2(DCP_rhythm)
head(DCP_dR2[order(DCP_dR2$p.R2), ])
##         gname        R2.1      R2.2   delta.R2         p.R2         q.R2
## 2728 10443527 0.015023297 0.9104134  0.8953901 4.954334e-09 3.010749e-05
## 1078 10363224 0.923506602 0.1321794 -0.7913272 3.714506e-08 1.128653e-04
## 1891 10402325 0.014711202 0.8797053  0.8649941 8.015765e-08 1.623727e-04
## 5867 10595856 0.008308555 0.8600023  0.8516938 2.206424e-07 3.303082e-04
## 2351 10425822 0.199990565 0.9182301  0.7182395 2.717692e-07 3.303082e-04
## 684  10345675 0.033601787 0.8714594  0.8378576 3.401598e-07 3.445252e-04

By default, the genes that are rhyI/rhyII/both in “DCP_rhythm$rhythm.joint$TOJR” will be used. If users want to use other TOJR results, e.g., “DCP_rhythm$rhythm.joint$TOJR.FDR”, please refer to next section.

Compare rhythm parameters

DCP_dparam = DCP_DiffPar(DCP_rhythm, Par = "A&phase")      
head(DCP_dparam[order(DCP_dparam$p.overall), 1:7])
##             gname    p.overall    q.overall post.hoc.A.By.p post.hoc.peak.By.p
## 10556463 10556463 8.004971e-16 5.883654e-13            TRUE              FALSE
## 10556487 10556487 6.149367e-13 1.712021e-10            TRUE              FALSE
## 10518781 10518781 6.987839e-13 1.712021e-10            TRUE               TRUE
## 10356601 10356601 1.945671e-12 3.575170e-10            TRUE              FALSE
## 10598638 10598638 7.895862e-12 1.160692e-09            TRUE               TRUE
## 10409278 10409278 1.025601e-11 1.256362e-09            TRUE              FALSE
##          post.hoc.A.By.q post.hoc.peak.By.q
## 10556463            TRUE              FALSE
## 10556487            TRUE              FALSE
## 10518781            TRUE               TRUE
## 10356601            TRUE              FALSE
## 10598638            TRUE              FALSE
## 10409278            TRUE              FALSE

The rest of the columns contains parameter estimates from each group and the p-value for parameter comparision without multiple testing adjustment.

Visualization

Scatter plots

By default the function plots the genes that are most rhythmic in group I (hypothalamus).

p.scatter = DCP_ScatterPlot(DCP_rhythm, Info1 = gI, Info2 = gII, filename = NULL) 
#
#if given file name the function will automatically save the plots to pdf files in the given directory.
DCP_PlotDisplay(p.scatter) #DCP_ScatterPlot only stores the plots in a list, DCP_PlotDisplay will display plots side-by-side. 

## [[1]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[2]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[3]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[4]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[5]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[6]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[7]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[8]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[9]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[10]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

Plot the top 2 genes with \(Delta R^2\)

g.topR2 = DCP_dR2$gname[order(DCP_dR2$p.R2)][1:2]
p.scatter = DCP_ScatterPlot(DCP_rhythm, g.topR2)
DCP_PlotDisplay(p.scatter)

## [[1]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[2]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

Plot the top 2 genes with \(Delta A\)

g.top.dA= DCP_dparam$gname[order(DCP_dparam$p.delta.A)][1:2]
p.scatter = DCP_ScatterPlot(DCP_rhythm, g.top.dA)
DCP_PlotDisplay(p.scatter)

## [[1]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[2]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

Plot the top 2 genes with \(Delta \phi\). Since both groups are control samples, the phase difference is not large.

g.top.dphase= DCP_dparam$gname[order(DCP_dparam$p.delta.peak)][1:2]
p.scatter = DCP_ScatterPlot(DCP_rhythm, g.top.dphase)
DCP_PlotDisplay(p.scatter)

## [[1]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 
## [[2]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

Heatmap

The default is plot the top 100 rhythmic genes in group I. Users are able to plot selected genes by specifying the genes.plot argument.

DCP_PlotHeatmap(DCP_rhythm)
## Warning in DCP_PlotHeatmap(DCP_rhythm):

Peak radar plot

DCP_PlotPeakRadar(DCP_rhythm)
## Warning in DCP_PlotPeakRadar(DCP_rhythm): There is no TOJR input,
## x$rhythm.joint$TOJR will be used for joint rhythmicity category.

Peak histogram

The peak histogram is essentially the same as the peak radar plot with only the presentation form changed.

DCP_PlotPeakHist(DCP_rhythm)
## Warning in DCP_PlotPeakHist(DCP_rhythm): There is no TOJR input,
## x$rhythm.joint$TOJR will be used for joint rhythmicity category.
## Warning: Removed 4 rows containing missing values (geom_bar).

Peak difference plot

The default peak difference plot:

DCP_PlotPeakDiff(DCP_rhythm, Info1 = gI, Info2 = gII) 
## Warning in DCP_PlotPeakDiff(DCP_rhythm, Info1 = gI, Info2 = gII): color.cut or
## color.df input is ignored. color.cut or color.df is only used when dPhase is
## given.
## Warning in DCP_PlotPeakDiff(DCP_rhythm, Info1 = gI, Info2 = gII): There is
## no input for TOJR or dPhase, x$rhythm.joint$TOJR will be used for extracting
## RhyBoth genes.
## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'

## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'

If users have performed differential rhythm parameter test that contains differential phase, the corresponding result can be displayed. The “dPhase” should be the result from differential rhythm parameter, and “color.cut” specifies the color regime according to the “dPhase”.

DCP_PlotPeakDiff(DCP_rhythm, 
                              dPhase = DCP_dparam, 
                              color.cut = list(param = "p.delta.peak", fun = "<", val = 0.05, color.sig = "#b33515", color.none = "dark grey"), 
                              Info1 = gI, Info2 = gII)
## Warning in DCP_PlotPeakDiff(DCP_rhythm, dPhase = DCP_dparam, color.cut =
## list(param = "p.delta.peak", : dPhase is given. Genes contained in dPhase will
## be used as RhyBoth in regardless of any TOJR input.
## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'

## Warning in is.na(x): is.na() applied to non-(list or vector) of type
## 'expression'

Perform DR test with other TOJR result

The functions DCP_dR2 and DCP_DiffPar are only performed to a subset of genes for biologically meaningful tests (see the manuscript for more details). By default, the functions use the gene categories in “DCP_rhythm$rhythm.joint$TOJR”. However, user are able to specify a different (types of joint rhythmicity) TOJR assignment. For example, we can use the genome-wide FDR corrected TOJR.

DCP_dR2 = DCP_DiffR2(DCP_rhythm, TOJR = DCP_rhythm$rhythm.joint$TOJR.FDR)
DCP_dparam = DCP_DiffPar(DCP_rhythm, Par = "A&phase", TOJR = DCP_rhythm$rhythm.joint$TOJR.FDR)

Separate TOJR result can be generated with the toTOJR functoin. For example, we can perform an intuitive (but statistically criticized) venn diagram.

new.TOJR = toTOJR(DCP_rhythm, "VDA", alpha = 0.05, adjustP = TRUE) #adjustP = TRUE gives genome-wide FDR adjusted results
DCP_dR2 = DCP_DiffR2(DCP_rhythm, TOJR = new.TOJR)
DCP_dparam = DCP_DiffPar(DCP_rhythm, Par = "A&phase", TOJR = new.TOJR)

Other functions

Calculate ZT from local clock time

If the study involves human subjects, usually researchers do not have control of when and where the data will be collected. The function DCP_getZT converts a recorded clock time to Zeitgeber time according to the location and date of the collection.

t = as.POSIXlt("2022-09-17 13:43:00 EST", tz="America/New_York",usetz=TRUE)
lat = 40.4406; long = -79.9959 #This is the Pittsburgh coordinates
t.correct = DCP_getZT(t, lat, long, ZT.min = -6)
## Please make sure that the input time has the correct time zone.
print(t.correct)
## [1] 6.655715

Summarize results by p-value cutoff

Summarize the DR result to a table:

SummarizeDR(list(DCP_dR2$p.R2, DCP_dR2$q.R2), test = c("DRF", "DRF"),  type = c("p-value", "q-value"), val = c(0.05, 0.05), out = "long")
##        nTRUE test       cutoff
## 1 756(/1376)  DRF p-value<0.05
## 2 463(/1376)  DRF q-value<0.05
SummarizeDR(list(DCP_dR2$p.R2, DCP_dR2$q.R2), test = c("DRF", "DRF"),  type = c("p-value", "q-value"), val = c(0.05, 0.05), out = "wide")
##   DRF p-value<0.05 DRF q-value<0.05
## 1       756(/1376)       463(/1376)