[BioC] confused about J-G statistic of GSEAlm

Wed Nov 19 06:32:32 CET 2008

Hi Mark,
   Good questions, I am not sure if either Zhen or Assaf will be reading this,
and if so they might have some different views/ideas.

Mark Kimpel wrote:
> I'm confused about how the J-G statistic of Jiang and Gentleman (2007) is
> calculated. To check the results I was getting, I calculated the statistic
> with my own code and came up with a very different answer to that obtained
> using gsealmPerm (which uses GSNormalize). Before proceeding to my code, I

  I don't think that there is an absolute prescription here, but rather an
algorithm, that probably needs a bit of tweaking in any given situation.  That
said, we also learned a bit between the two papers, and adjusting for the mean
of the t-statistics seems like a good idea (in the parametric model).

> also note that it looks like the statistic is calculated slightly
> differently in the paper of Oron, et al (2008) and in the GSEAlm vignette.
> The former does not subtract t statistics from the mean t stat of the entire
> dataset (t.ref) but the latter does. Am I interpreting that correctly and,
> if so, why is it presented differently in the 2 sources?

  I don't follow here - what do you mean "presented so differently", certainly
the papers are different, and some details are different, as I said we learned
some things in the intervening years.  Probably most important, for parametric
fits, is to remove the effect that arises if the t-statistics don't sum to zero.
 In those situations, what you will see in some cases is that larger gene sets
are preferred (and the size/nature of this effect depends on how different the
mean of the t-stats is from zero).  It is of less interest in a permutational
approach, as in that case all permutations have the same difference and it
essentially cancels out.

  An easier way to the end, using your data is as follows:

library(genefilter)
rtt = rowttests(exprsSet, factor(exprsSet$y))  ##y should have been a factor

> sum(rtt$statistic[amat[1,]])/sqrt(sum(amat))
[1] 7.031002

(the p-value is pretty close to that below)

About what you get the non-para below. This one when you think of a null model
where the individual t-stats are centered about zero. When the sample sizes are
large-ish then the number of df is large, and the t distribution is similar to
the Normal (0,1) and hence sum(t's)/sqrt(number) is approx N(0,1) under the
null.  But this assumption essentially breaks down when the sum of the t-stats
is not zero.  So we developed an approach that accounts for that (and allows for
the variance to change as well).  This second approach involves fitting a linear
model with an intercept

> lm(rtt$statistic~amat[1,])

Call:
lm(formula = rtt$statistic ~ amat[1, ])

Coefficients:
(Intercept)    amat[1, ]
  -0.002797     1.087386

about what you get for the para below, but deciding if the gene set is
predictive is not quite as simple as just looking at the coefficient:

> summary(lm(rtt$statistic~amat[1,]))

Call:
lm(formula = rtt$statistic ~ amat[1, ])

Residuals:
      Min        1Q    Median        3Q       Max
-4.818258 -0.711718 -0.002836  0.695385  6.999615

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.002797   0.010133  -0.276    0.783
amat[1, ]    1.087386   0.161908   6.716 1.95e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.108 on 11998 degrees of freedom
Multiple R-squared: 0.003745,   Adjusted R-squared: 0.003662
F-statistic: 45.11 on 1 and 11998 DF,  p-value: 1.951e-11

so the coefficient is really significant (and divided by its standard error
comes close to the other stat, above)

You can't really compare the observed statistics (as you seemed to), it is their
standard errors (and the model) that tells you whether they are important.

best wishes
   Robert

> 
> In the code below, which is a self-contained example, I construct an
> artificial ExpressionSet with a fold change difference between groups A and
> B of phenotype y in probesets corresponding to a GeneSet. I then calculate
> the statistic using gsealmPerm and then using parametric assumptions (which
> should hold considering how the dataset was constructed) using my own code.
> As can be seen, very different answers are obtained.
> 

> What am I misunderstanding?
> 
> Here is the code followed by my output and sessionInfo(). First come a few
> necessary functions with a hack of mine to output the observedStatistic from
> gsealmPerm:
> 
> ###################################################
> gsealmPerm.MWK <- function (eSet, formula = "", mat, nperm, na.rm = TRUE,
> ...)
> {
>     nSamp = ncol(eSet)
>     if (formula == "") {
>         nvar <- 0
>     } else {
>       xvarnames <- all.vars(formula)
>       nvar <- length(xvarnames)
>     }
>     obsRaw <- lmPerGene(eSet = eSet, formula = formula, na.rm = na.rm)
>     if (nvar > 0) {
>         observedStats = GSNormalize(obsRaw$tstat[2, ], incidence = mat,
>             ...)
>     }
>     else {
>         observedStats = GSNormalize(t(obsRaw$tstat), incidence = mat,
>             fun2 = identity, ...)
>     }
>     perm.eset = eSet
>     i <- 1L
>     if (nvar > 0) {
>         permMat <- matrix(0, nrow = nrow(eSet), ncol = nperm)
>         while (i < (nperm + 1)) {
>             if (nvar >= 2) {
>                 splitter = pData(eSet)[, xvarnames[2]]
>                 if (nvar > 2)
>                   splitter = as.list(pData(eSet)[, xvarnames[2:nvar]])
>                 label.perm = unsplit(lapply(split(1:nSamp, splitter),
>                   sample), splitter)
>                 pData(perm.eset)[, xvarnames[1]] <- pData(eSet)[label.perm,
>                   xvarnames[1]]
>             }
>             else if (nvar == 1) {
>                 pData(perm.eset)[, xvarnames[1]] <-
> pData(eSet)[sample(1:nSamp),
>                   xvarnames[1]]
>             }
>             temp.results <- lmPerGene(eSet = perm.eset, formula = formula,
>                 na.rm = na.rm)
>             permMat[, i] <- temp.results$tstat[2, ]
>             i <- i + 1L
>         }
>         permMat <- GSNormalize(permMat, incidence = mat, ...)
>         rownames(permMat) = rownames(mat)
>     }
>     else if (nvar == 0) {
>         permMat <- matrix(0, nrow = nrow(mat), ncol = nperm)
>         rownames(permMat) = rownames(mat)
>         for (i in 1:nperm) permMat[, i] = GSNormalize(t(obsRaw$tstat),
>             incidence = mat[, sample(1:ncol(mat))], fun2 = identity,
>             ...)
>     }
>     output <- data.frame(pvalFromPermMat(observedStats, permMat),
> observedStats)
> }
> #####################################3
> pvalFromPermMat <- function(obs, perms) { #needed because this is an
> internal function of the GSEAlm package
>     N <- ncol(perms)
>     pvals <- matrix(as.double(NA), nr=nrow(perms), ncol=2)
>     dimnames(pvals) <- list(rownames(perms), c("Lower", "Upper"))
> 
>     tempObs <- rep(obs, ncol(perms))
>     dim(tempObs) <- dim(perms)
>     pvals[ , 1] <- rowSums(perms <= tempObs)/N
>     pvals[ , 2] <- rowSums(perms >= tempObs)/N
>     pvals
> }
> #########################
> ## mat: A 0/1 incidence matrix with each row representing a gene set
> ##           and each column representing a gene.  A 1 indicates
> ##           membership of a gene in a gene set.
> makeGSmat <- function(eset, geneSet.lst){
>   amat <- matrix(0, nrow= length(geneSet.lst), ncol =
> length(featureNames(eset)))
>   for (i in 1:nrow(amat)){
>     amat[i,] <- featureNames(eset) %in% geneIds(geneSet.lst[[i]])
>   }
>   rownames(amat) <- names(geneSet.lst)
>   amat
> }
> ####################################################
> ###########
> # simulate non-parametric GSEA
> require(GSEABase); require(GSEAlm)
> set.seed(123)
> eset.mat <- matrix(rnorm(n = 12*12000, mean = 10, sd = 1), ncol = 12)
> rownames(eset.mat) <- paste("A", 1:12000, sep = "_")
> gs <- GeneSet(paste("A", unique(floor(runif(n = 50, min = 1, max = 300))),
> sep = "_"), setIdentifier = "test.gs")
> eset.mat[geneIds(gs),1:6] <- eset.mat[geneIds(gs),1:6] + 0.5
> df <- data.frame(x = 1:12, y = c(rep("A", 6), rep("B", 6)))
> adf <- as(df, "AnnotatedDataFrame")
> exprsSet <- new("ExpressionSet", exprs = eset.mat, phenoData = adf)
> #
> gs.lst <- list(test.gs = gs)
> #####################################
> amat <- makeGSmat(exprsSet, gs.lst)
> #################################################3
> GSEAResults.nonpara <- gsealmPerm.MWK(eSet = exprsSet, formula = ~y, mat =
> amat, nperm = 1000)
> GSEAResults.nonpara
> ###################################################
> # simulate parametric GSEA
> gene.ts <- (lmPerGene(eSet = exprsSet, formula = ~y)$tstat)[2,]
> pos <- match(geneIds(gs), names(gene.ts))
> gene.ts[pos]
> t.ref <- mean(gene.ts)
> 
> summation <- 0
> GSEAResults.para <- {
>   for (i in 1:length(gene.ts[geneIds(gs)])){
>     summation <- summation + (gene.ts[geneIds(gs)][i] - t.ref)
>   }
>   GSEAResults.para <- summation/length(gene.ts[geneIds(gs)])
>   GSEAResults.para
> }
> GSEAResults.para
> GSEAResults.nonpara
> GSEAResults.nonpara$observedStats/length(gene.ts[geneIds(gs)])
> ####################################################
>> GSEAResults.para
>     A_121
> -1.083127
>> GSEAResults.nonpara
>         Lower Upper observedStats
> test.gs     0     1     -7.435567
>> GSEAResults.nonpara$observedStats/length(gene.ts[geneIds(gs)])
> [1] -0.1582036
>> sessionInfo()
> R version 2.9.0 Under development (unstable) (2008-10-28 r46793)
> x86_64-unknown-linux-gnu
> 
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] splines   tools     stats     graphics  grDevices utils     datasets
> [8] methods   base
> 
> other attached packages:
>  [1] GSEAlm_1.3.0         GSEABase_1.5.0       affycoretools_1.15.0
>  [4] annaffy_1.15.0       KEGG.db_2.2.5        gcrma_2.15.1
>  [7] matchprobes_1.15.0   biomaRt_1.17.0       GOstats_2.9.0
> [10] Category_2.9.7       genefilter_1.23.0    survival_2.34-1
> [13] RBGL_1.19.0          annotate_1.21.1      xtable_1.5-4
> [16] GO.db_2.2.5          RSQLite_0.7-1        DBI_0.2-4
> [19] AnnotationDbi_1.5.5  limma_2.17.3         affy_1.21.0
> [22] Biobase_2.3.3        graph_1.21.0
> 
> loaded via a namespace (and not attached):
> [1] affyio_1.11.2        cluster_1.11.11      preprocessCore_1.5.1
> [4] RCurl_0.92-0         XML_1.98-1
> ------------------------------------------------------------
> Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
> Indiana University School of Medicine
> 
> 15032 Hunter Court, Westfield, IN  46074
> 
> (317) 490-5129 Work, & Mobile & VoiceMail
> (317) 399-1219  Home
> Skype:  mkimpel
> 
> "The real problem is not whether machines think but whether men do." -- B.
> F. Skinner
> ******************************************************************
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org