Title: | Differential Pattern Analysis via Linear Modeling |
---|---|
Description: | Individual gene expression patterns are encoded into a series of eigenvector patterns ('WGCNA' package). Using the framework of linear model-based differential expression comparisons ('limma' package), time-course expression patterns for genes in different conditions are compared and analyzed for significant pattern changes. For reference, see: Greenham K, Sartor RC, Zorich S, Lou P, Mockler TC and McClung CR. eLife. 2020 Sep 30;9(4). <doi:10.7554/eLife.58993>. |
Authors: | Ryan C. Sartor <[email protected]>, Kathleen Greenham <[email protected]> |
Maintainer: | Ryan C. Sartor <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.2 |
Built: | 2025-02-11 04:43:56 UTC |
Source: | https://github.com/cran/DiPALM |
Individual gene expression patterns are encoded into a series of eigenvector patterns (WGCNA package). Using the framework of linear model-based differential expression comparisons (limma package), time-course expression patterns for genes in different conditions are compared and analyzed for significant pattern changes.
Ryan C. Sartor & Kathleen Greenham
Maintainer: Ryan C. Sartor <[email protected]>
limma package
blockwiseModules
from the WGCNA package
Used to calculate the false discovery rate associated with a specified cutoff. It uses a vector of actual test scores and a vector of permuted scores. The result can be considered a corrected p-value or q-value.
AdjustPvalue(tVal, tVec, pVec)
AdjustPvalue(tVal, tVec, pVec)
tVal |
The threshold or cutoff value to use (given this value, the function returns the FDR associated with using any score above this one as a positive result). |
tVec |
A numeric vector of actual test result values (used to determine true positives). |
pVec |
A numeric vector of permuted test result values (used to determine false positives). |
This function assumes all test result values above the given threshold are true positives (TP) and all permuted result values above the threshold are false positives (FP). It calculates FDR as (FP)/(TP). When applied to individual test results, this works to estimate an adjusted p-value (or q-value) for that result.
A numeric value representing the FDR associated with the given cutoff (tVal).
Ryan C. Sartor
# Returns a vector of adjusted p-values from a vector of test results data("testData") AdjPval_kMEs<-sapply(testData$testResults[1:100],function(x) AdjustPvalue(tVal = x, tVec = testData$testResults, pVec = testData$permutedResults))
# Returns a vector of adjusted p-values from a vector of test results data("testData") AdjPval_kMEs<-sapply(testData$testResults[1:100],function(x) AdjustPvalue(tVal = x, tVec = testData$testResults, pVec = testData$permutedResults))
Creates a blank plotting window of desired size.
BlankPlot(xrng = c(0, 1), yrng = c(0, 1), Main = "", xlab = "", ylab = "", ...)
BlankPlot(xrng = c(0, 1), yrng = c(0, 1), Main = "", xlab = "", ylab = "", ...)
xrng |
The range of the x-axis. |
yrng |
The range of the y-axis. |
Main |
The main plot title. |
xlab |
X-axis label. |
ylab |
Y-axis label. |
... |
Additional arguments passed to the |
This function is used to initiate a blank plotting window in R. xrng and yrng are used to initiate the size of the plotting region.
Generates a blank plot
Ryan C. Sartor
BlankPlot(xrng=c(0,10),yrng=c(0,10),Main="Test",xlab="xlab",ylab="ylab")
BlankPlot(xrng=c(0,10),yrng=c(0,10),Main="Test",xlab="xlab",ylab="ylab")
This function fits a linear model (using lmFit
form the limma package) to each set of kME values returned from BuildModMembership
.
BuildLimmaLM(dataMat, designMat, contrastStr)
BuildLimmaLM(dataMat, designMat, contrastStr)
dataMat |
a matrix containing module membership values (a single element of the list returned by |
designMat |
a linear model design matrix containing binary values (0 and 1) with rows as samples and columns as treatments. Use the |
contrastStr |
a linear model contrast specifying the desired differential comparison to be made using the linear model (see the |
This function uses the limma package functionality to carry out an analysis of differential patterning. Asking if a gene displays a significantly different expression pattern across time when looking at one condition compared to another.
The concept of differential patterning is similar to differential expression. However, the expression level of most genes fluctuates significantly throughout the day. Most of this is due to diurnal or circadian rhythms. These regular changes in expression severely complicate differential expression analysis and give rise to the need for examining differential patterning.
By "encoding" the expression pattern of each gene into a vector of module membership (kME) values using the BuildModMembership
function, the DiPALM analysis provides a way to use the limma method to examine differential patterning.
This function returns an MArrayLM-class
object. This is a custom class from the limma
package. It is an object that contains all information and results of individual linear models built for each gene in the input matrix.
Ryan C. Sartor
BuildLimmaLM
, lmFit
, makeContrasts
, contrasts.fit
, eBayes
data("testData") # First calculate the module memberships kMEsList<-BuildModMembership(MeMat=testData$moduleEigengenes, TCsLst=testData$timeCourseList) # Then build the limma models LimmaMods<-lapply(kMEsList, function(x) BuildLimmaLM(dataMat = x, designMat = testData$modelDesign, contrastStr = testData$modelContrast))
data("testData") # First calculate the module memberships kMEsList<-BuildModMembership(MeMat=testData$moduleEigengenes, TCsLst=testData$timeCourseList) # Then build the limma models LimmaMods<-lapply(kMEsList, function(x) BuildLimmaLM(dataMat = x, designMat = testData$modelDesign, contrastStr = testData$modelContrast))
This function calculates module membership (kME) between each gene's expression vector throughout a time-course and the module eigengenes of the data set (from the WGCNA package, using blockwiseModules
). Module membership is defined as the Pearson correlation between the expression of a single gene and the expression of a single eigengene.
BuildModMembership(MeMat, TCsLst)
BuildModMembership(MeMat, TCsLst)
MeMat |
A data frame of eigengenes with each column being a different eigengene vector. |
TCsLst |
A named list of time-course matrices. Each element of the list is a time-course expression matrix (genes as rows and time-points as columns). |
The matrix of eigengene vectors, MeMat
(see blockwiseModules
from the WGCNA package) is used as a convenient way to summarize all the unique expression patterns (across time) that are present in a time-course data set.
Any large gene expression data set from a complex organism will contain tens of thousands of genes. However, extensive coordinated regulation is present in all biological systems. Because of this co-regulation, usually only dozens of distinct expression patterns exist and most genes can be categorized by one (or a few) of them.
By using a relatively small set of descriptive expression vectors (eigengenes), we can "encode" the pattern of expression for any single gene into a length N vector where N is the number of eigengenes and the values of the vector are the Pearson correlation of the gene to each eigengene (kME).
These kME values can then be compared using the limma
package to determine differential patterning similar to how expression values are compared to determine differential expression.
This function is the first step to calculate all kMEs for each gene in a list of time-course matrices relative to the set of eigengenes for the whole data set.
This function returns a list of kME matrices. The list contains a separate element for each eigengene. Each element is a matrix with a row for each gene and a column for each member of the input TCsLst
.
Ryan C. Sartor
data("testData") kMEsList<-BuildModMembership(MeMat=testData$moduleEigengenes, TCsLst=testData$timeCourseList)
data("testData") kMEsList<-BuildModMembership(MeMat=testData$moduleEigengenes, TCsLst=testData$timeCourseList)
A list of data objects that are used to run through the DiPALM example vignette.
data("exampleData")
data("exampleData")
A List of 3 objects:
$rawCounts - A data frame [42383 rows X 48 columns] containing raw counts for mRNA expression data on Brassica rapa R500.
rows : Each row represents the expression of one gene across many different samples.
columns: Each column represents a different sample. Both drought-treated and properly watered samples are present, with two replicates of each time-course. Tissue was sampled at 4-hour intervals and column names are encoded as follows: S[replicate number] [D-Drought / W-Watered] [Time - sample] Example: S2D10 is part of the second replcate of the drought data and is the 10th time-point (ZT13, 13th hour after dawn on day two).
$moduleEigengenes - A dataframe [12 rows and 90 columns] representing 90 different expression vectors that describe descrete expression patterns found in the whole dataset ($timeCourseList).
Rows: each row represents a single time-point.
Columns: each column represents a distinct expression pattern seen repeatedly in the full data.
$geneAnnotations - A data frame [42383 rows and 5 columns] giving the locations within the Brassica rapa R500 genome where each gene is found. This is in SAF format which is used by the Linux Subread software package.
rows: Each row represents a single gene.
$GeneID: (character string) Unique gene accession.
$Chr: (character string) Chromosome number.
$Start: (integer) The position where the gene starts (first basepair).
$End: (integer) The position where the gene ends (last basepair).
$Strand: (character) is the gene on the positive [+] or negative [-] strand.
Raw count data was summarized by mapping RNA-seq counts to the R500 Brassica rapa genome using the hisat2 aligner. Mapped reads were counted using the featureCounts function from the Linux Subread software package.
Data is from: Greenham et al., eLife 2017;6:e29655
This function generates one or more smoothed density distributions using ggplot2
functionality
ggPlotMultiDensities(denslist, main = "", xlab = "", ylab = "Normalized Frequency", scale = T, cols = c("red", "blue", "grey50", "black", "skyblue", "orange"), ledgx = "topright", ltype = NULL, lwidth = NULL, xrng = NULL, yrng = NULL, Ledge = T)
ggPlotMultiDensities(denslist, main = "", xlab = "", ylab = "Normalized Frequency", scale = T, cols = c("red", "blue", "grey50", "black", "skyblue", "orange"), ledgx = "topright", ltype = NULL, lwidth = NULL, xrng = NULL, yrng = NULL, Ledge = T)
denslist |
A named list of populations. Each element is a vector of numbers. A separate density plot will be generated for each population. |
main |
The main plot title. |
xlab |
The x-axis label. |
ylab |
The y-axis label. |
scale |
When multiple densities are plotted, should they be scaled to represent the relative number of samples in each population. |
cols |
A vector of colors (one for each population in |
ledgx |
The x value from |
ltype |
A vector of integers specifying the line types to be used for each population (repeated if necessary for multiple populations). |
lwidth |
The line width to be used in the plot. |
xrng |
A numeric vector of 2, the x-axis range to be plotted. |
yrng |
A numeric vector of 2, the y-axis range to be plotted. |
Ledge |
A logical: should the legend be plotted? |
This function takes in a list of one or more numeric vectors. Each vector is considered a separate population and a smoothed density plot (similar to a histogram) will be generated for each and plotted together. The actual numeric y-axis values are meaningless. A density plot is normalized so that the total area under the curve is equal to 1. When scale = T, the total area under the density plots may not equal one, but instead the total area is scaled to the relative proportions of the total population sizes of each population.
Generates multiple density plots on the same graph.
Ryan C. Sartor
data(testData) require(ggplot2) # Two populations of the same size, unscaled ggPlotMultiDensities(denslist = list(Test=testData$testResults,Permuted=testData$permutedResults), main = "Pattern Change Scores", xlab = "Differential Pattern Score",lwidth = 1) # Two populations of different sizes, scaled ggPlotMultiDensities(denslist = list(Test=testData$testResults[1:3000], Permuted=testData$permutedResults), scale=TRUE , main = "Pattern Change Scores (Scaled)", xlab = "Differential Pattern Score",lwidth = 1)
data(testData) require(ggplot2) # Two populations of the same size, unscaled ggPlotMultiDensities(denslist = list(Test=testData$testResults,Permuted=testData$permutedResults), main = "Pattern Change Scores", xlab = "Differential Pattern Score",lwidth = 1) # Two populations of different sizes, scaled ggPlotMultiDensities(denslist = list(Test=testData$testResults[1:3000], Permuted=testData$permutedResults), scale=TRUE , main = "Pattern Change Scores (Scaled)", xlab = "Differential Pattern Score",lwidth = 1)
A plotting function that generates a line plot for a single gene in multiple time-course expression sets. Each time-course is overlaid on the same plot.
PlotTCs(TClst, tgene, main = "", scale = TRUE, xlab = "Time", ylab = "", xAxsLabs = colnames(TClst[[1]]), ledgeX = "top", colAdj = 0.4, tcols = c("red", "red", "blue", "blue"), tltys = c(1, 1, 2, 2))
PlotTCs(TClst, tgene, main = "", scale = TRUE, xlab = "Time", ylab = "", xAxsLabs = colnames(TClst[[1]]), ledgeX = "top", colAdj = 0.4, tcols = c("red", "red", "blue", "blue"), tltys = c(1, 1, 2, 2))
TClst |
A named list of time-course matrices. Each element of the list is a time-course expression matrix (Genes as rows and time points as columns). |
tgene |
A string of a single gene accession (found in the row names of one element of TClst). |
main |
Main figure title. |
scale |
Logical: should the expression values of each time-course be scaled and centered? |
xlab |
The X-axis label. |
ylab |
The Y-axis label. |
xAxsLabs |
A vector of names to be associated with each x-axis point. |
ledgeX |
The x value from |
colAdj |
Controls the alpha channel for the line colors using |
tcols |
A vector of colors controlling the colors of the lines. Must list one color for each time-course in TClst. |
tltys |
A vector of integers controlling the line type (lty) of the lines. Must list one lty for each time-course in TClst. |
This function is used to view the expression pattern of a single gene across multiple time-course samples.
The function generates a plot with multiple lines.
Ryan C. Sartor
data(testData) PlotTCs(TClst = testData$timeCourseList, tgene = "BraA05g36370R" , scale = TRUE,tcols = c("red","red","blue","blue"), tltys = c(1,2,1,2), ledgeX = "topleft")
data(testData) PlotTCs(TClst = testData$timeCourseList, tgene = "BraA05g36370R" , scale = TRUE,tcols = c("red","red","blue","blue"), tltys = c(1,2,1,2), ledgeX = "topleft")
A plotting function that generates a ribbon plot representing the summarized expression of a set of genes. A separate ribbon is generated for each time-course. The ribbon represents + & - 1 standard deviation from the mean at each point.
PlotTCsRibbon(TClst, tgenes, main = "", xlab = "Time", ylab = "", xAxsLabs = colnames(TClst[[1]]),scale = TRUE, alpha = 0.1, tcols, tltys, splits = rep(1, length(TClst)))
PlotTCsRibbon(TClst, tgenes, main = "", xlab = "Time", ylab = "", xAxsLabs = colnames(TClst[[1]]),scale = TRUE, alpha = 0.1, tcols, tltys, splits = rep(1, length(TClst)))
TClst |
A named list of time-course matrices. Each element of the list is a time-course expression matrix (Genes as rows and time points as columns). |
tgenes |
A vector of gene accessions (found in the row names of one element of TClst). |
main |
The main title of the plot, if splits are defined, multiple plots will be generated and a vector of titles may be defined here. |
xlab |
The X-axis label. |
ylab |
The Y-axis label. |
xAxsLabs |
The time point labels for the x-axis, by default they will be the column names of the first time-course in TClst. |
scale |
Logical: should the expression values of each time-course be scaled and centered? |
alpha |
The alpha channel to be used for the ribbons (number from 0 - 1 defining transparency). See |
tcols |
A vector of colors controlling the colors of the lines. Must list one color for each time-course in TClst. |
tltys |
A vector of integers controlling the line type (lty) of the lines. Must list one lty for each time-course in TClst. |
splits |
A vector of integers defining groups to be plotted on separate plots if desired (see examples). |
This can be used to plot the summarized expression pattern of an entire cluster of genes and compare between multiple time courses. Line plots are generated representing the mean expression value at each time point for the group. Ribbons are plotted along the lines in semi-transparent colors. The ribbons represent + and - one standard deviation from the mean expression level of the group. #
This function generates multiple line plots representing the average expression of a group of genes with ribbons representing standard deviation.
Ryan C. Sartor
data("testData") # Calculate the module memberships kMEsList<-BuildModMembership(MeMat=testData$moduleEigengenes, TCsLst=testData$timeCourseList) # Build the limma models LimmaMods<-lapply(kMEsList, function(x) BuildLimmaLM(dataMat = x, designMat = testData$modelDesign, contrastStr = testData$modelContrast)) # Get the adjusted Pvalues for each gene AdjPval_kMEs<-sapply(testData$testResults[rownames(LimmaMods[[1]])],function(x) AdjustPvalue(tVal = x, tVec = testData$testResults, pVec = testData$permutedResults)) # Cluster the genes base on the limma results LimmaSums<-do.call(cbind,lapply(LimmaMods,function(x) x$t)) LimmaModsSig<-LimmaSums[names(AdjPval_kMEs)[which(AdjPval_kMEs<0.01)],] patternCor<-cor(t(LimmaModsSig)) patternTree<-hclust(as.dist(1-patternCor),method = "complete") patternClusters<-cutree(tree = patternTree, k = 25) patternClusters<-tapply(rownames(patternCor),INDEX = patternClusters,function(x) x) #Replicates are grouped together (same color + lty values) PlotTCsRibbon(TClst = testData$timeCourseList, main="Drought Time-Course", xAxsLabs = c(seq(1,23,4),seq(1,23,4)), xlab="ZT Time (hours)", tgenes = patternClusters[[15]], scale = TRUE, tcols = c("red","red","blue","blue"), tltys = c(1,1,1,1)) #Replicates are in different groups (same color but different lty values) PlotTCsRibbon(TClst = testData$timeCourseList, main="Drought Time-Course", xAxsLabs = c(seq(1,23,4),seq(1,23,4)), xlab="ZT Time (hours)", tgenes = patternClusters[[15]], scale = TRUE, tcols = c("red","red","blue","blue"), tltys = c(1,2,1,2)) #Replicates are represented on differnt plots (using the splits argument) PlotTCsRibbon(TClst = testData$timeCourseList, main=c("Rep1","Rep2"), xAxsLabs = c(seq(1,23,4),seq(1,23,4)), xlab="ZT Time (hours)", tgenes = patternClusters[[15]], scale = TRUE, tcols = c("red","red","blue","blue"), tltys = c(1,2,1,2), splits = c(1,2,1,2))
data("testData") # Calculate the module memberships kMEsList<-BuildModMembership(MeMat=testData$moduleEigengenes, TCsLst=testData$timeCourseList) # Build the limma models LimmaMods<-lapply(kMEsList, function(x) BuildLimmaLM(dataMat = x, designMat = testData$modelDesign, contrastStr = testData$modelContrast)) # Get the adjusted Pvalues for each gene AdjPval_kMEs<-sapply(testData$testResults[rownames(LimmaMods[[1]])],function(x) AdjustPvalue(tVal = x, tVec = testData$testResults, pVec = testData$permutedResults)) # Cluster the genes base on the limma results LimmaSums<-do.call(cbind,lapply(LimmaMods,function(x) x$t)) LimmaModsSig<-LimmaSums[names(AdjPval_kMEs)[which(AdjPval_kMEs<0.01)],] patternCor<-cor(t(LimmaModsSig)) patternTree<-hclust(as.dist(1-patternCor),method = "complete") patternClusters<-cutree(tree = patternTree, k = 25) patternClusters<-tapply(rownames(patternCor),INDEX = patternClusters,function(x) x) #Replicates are grouped together (same color + lty values) PlotTCsRibbon(TClst = testData$timeCourseList, main="Drought Time-Course", xAxsLabs = c(seq(1,23,4),seq(1,23,4)), xlab="ZT Time (hours)", tgenes = patternClusters[[15]], scale = TRUE, tcols = c("red","red","blue","blue"), tltys = c(1,1,1,1)) #Replicates are in different groups (same color but different lty values) PlotTCsRibbon(TClst = testData$timeCourseList, main="Drought Time-Course", xAxsLabs = c(seq(1,23,4),seq(1,23,4)), xlab="ZT Time (hours)", tgenes = patternClusters[[15]], scale = TRUE, tcols = c("red","red","blue","blue"), tltys = c(1,2,1,2)) #Replicates are represented on differnt plots (using the splits argument) PlotTCsRibbon(TClst = testData$timeCourseList, main=c("Rep1","Rep2"), xAxsLabs = c(seq(1,23,4),seq(1,23,4)), xlab="ZT Time (hours)", tgenes = patternClusters[[15]], scale = TRUE, tcols = c("red","red","blue","blue"), tltys = c(1,2,1,2), splits = c(1,2,1,2))
A list of objects that can be used to test individual DiPALM functions.
data("testData")
data("testData")
A List of 6 objects:
$timeCourseList - A List of 4 matrices containing mRNA expression data on Brassica rapa R500.
$Drought.R1 (Plants exposed to drought, Replicate 1) A matrix of time-course gene expression data [18428 Genes X 12 Time-points]
$Drought.R2 (Plants exposed to drought, Replicate 2) A matrix of time-course gene expression data [18428 Genes X 12 Time-points]
$Watered.R1 (Properly watered plants, Replicate 1) A matrix of time-course gene expression data [18428 Genes X 12 Time-points]
$Watered.R2 (Properly watered plants, Replicate 2) A matrix of time-course gene expression data [18428 Genes X 12 Time-points]
$moduleEigengenes - A dataframe [12 rows and 90 columns] representing 90 different expression vectors that describe descrete expression patterns found in the whole dataset ($timeCourseList).
Rows: each row represents a single time-point
Columns: each column represents a distinct expression pattern seen repeatedly in the full data
$modelDesign - A matrix [4 rows and 2 columns] used as a design matrix for a simple linear model. This matrix provides a mapping between the input instances (timeCourseList) and the two fixed effects fitted in the mode (Drought and Watered). This matrix was created with model.matrix
.
$modelContrast - A character string specifying the contrast to be evaluated using the fitted model for each gene. This example is used to compare watered to drought conditions.
$testResults - A named numeric vector with 18428 values from DiPALM test results on actual data.
names: Gene accessions
values: The summed absolute values of t-values generated from limma model contrasts related to each eigengene pattern. This can be thought of as a DiPALM score for differential patterning.
$permutedResults- A named numeric vector with 18428 values from DiPALM test results on permuted data. This vector acts as a null distribution in order to estimate significance of actual test results.
names: Gene accessions
values: The summed absolute values of t-values generated from limma model contrasts related to each eigengene pattern using permuted expression data. This can be thought of as a null distribution of DiPALM scores for differential patterning.
Data is from: Greenham et al., eLife 2017;6:e29655