Robert,
Thank you for the explanation, it gives me a better understanding of what is
going on. After reading you email, I realized that the big discrepancies in
the observed t-stats that I noted were in fact do to a computational error
on my part. If you look at my code, you will see that I forgot to take the
sqrt of the size of the GeneSet. I had stared at this for a couple of days
and didn't see it.
As for the differences in the algorithms, what I noted was not between the
two published papers but between your latest and the current GSEAlm
vignette. Although I understand the distinction you make between a
parametric vs. permutation based approach, this doesn't come across clearly
in the vignette, which appears to closely follow you latest paper. Perhaps a
bit of explanation on this part could help future readers.
As a non-statistician, clearly I have some studying to do to completely
understand both your papers. Questions may follow. If they involved more
theory than application (as in how to use packages), is it still appropriate
to post to the BioC help list? This would seem to benefit your target
audience, but may be a bit outside the posting guidelines for the list.
Mark
------------------------------------------------------------
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Home
Skype: mkimpel
"The real problem is not whether machines think but whether men do." -- B.
F. Skinner
******************************************************************
On Wed, Nov 19, 2008 at 12:32 AM, Robert Gentleman wrote:
> Hi Mark,
> Good questions, I am not sure if either Zhen or Assaf will be reading
> this,
> and if so they might have some different views/ideas.
>
> Mark Kimpel wrote:
> > I'm confused about how the J-G statistic of Jiang and Gentleman (2007) is
> > calculated. To check the results I was getting, I calculated the
> statistic
> > with my own code and came up with a very different answer to that
> obtained
> > using gsealmPerm (which uses GSNormalize). Before proceeding to my code,
> I
>
> I don't think that there is an absolute prescription here, but rather an
> algorithm, that probably needs a bit of tweaking in any given situation.
> That
> said, we also learned a bit between the two papers, and adjusting for the
> mean
> of the t-statistics seems like a good idea (in the parametric model).
>
> > also note that it looks like the statistic is calculated slightly
> > differently in the paper of Oron, et al (2008) and in the GSEAlm
> vignette.
> > The former does not subtract t statistics from the mean t stat of the
> entire
> > dataset (t.ref) but the latter does. Am I interpreting that correctly
> and,
> > if so, why is it presented differently in the 2 sources?
>
> I don't follow here - what do you mean "presented so differently",
> certainly
> the papers are different, and some details are different, as I said we
> learned
> some things in the intervening years. Probably most important, for
> parametric
> fits, is to remove the effect that arises if the t-statistics don't sum to
> zero.
> In those situations, what you will see in some cases is that larger gene
> sets
> are preferred (and the size/nature of this effect depends on how different
> the
> mean of the t-stats is from zero). It is of less interest in a
> permutational
> approach, as in that case all permutations have the same difference and it
> essentially cancels out.
>
> An easier way to the end, using your data is as follows:
>
> library(genefilter)
> rtt = rowttests(exprsSet, factor(exprsSet$y)) ##y should have been a
> factor
>
> > sum(rtt$statistic[amat[1,]])/sqrt(sum(amat))
> [1] 7.031002
>
> (the p-value is pretty close to that below)
>
> About what you get the non-para below. This one when you think of a null
> model
> where the individual t-stats are centered about zero. When the sample sizes
> are
> large-ish then the number of df is large, and the t distribution is similar
> to
> the Normal (0,1) and hence sum(t's)/sqrt(number) is approx N(0,1) under the
> null. But this assumption essentially breaks down when the sum of the
> t-stats
> is not zero. So we developed an approach that accounts for that (and
> allows for
> the variance to change as well). This second approach involves fitting a
> linear
> model with an intercept
>
> > lm(rtt$statistic~amat[1,])
>
> Call:
> lm(formula = rtt$statistic ~ amat[1, ])
>
> Coefficients:
> (Intercept) amat[1, ]
> -0.002797 1.087386
>
> about what you get for the para below, but deciding if the gene set is
> predictive is not quite as simple as just looking at the coefficient:
>
> > summary(lm(rtt$statistic~amat[1,]))
>
> Call:
> lm(formula = rtt$statistic ~ amat[1, ])
>
> Residuals:
> Min 1Q Median 3Q Max
> -4.818258 -0.711718 -0.002836 0.695385 6.999615
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) -0.002797 0.010133 -0.276 0.783
> amat[1, ] 1.087386 0.161908 6.716 1.95e-11 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 1.108 on 11998 degrees of freedom
> Multiple R-squared: 0.003745, Adjusted R-squared: 0.003662
> F-statistic: 45.11 on 1 and 11998 DF, p-value: 1.951e-11
>
> so the coefficient is really significant (and divided by its standard error
> comes close to the other stat, above)
>
> You can't really compare the observed statistics (as you seemed to), it is
> their
> standard errors (and the model) that tells you whether they are important.
>
> best wishes
> Robert
>
> >
> > In the code below, which is a self-contained example, I construct an
> > artificial ExpressionSet with a fold change difference between groups A
> and
> > B of phenotype y in probesets corresponding to a GeneSet. I then
> calculate
> > the statistic using gsealmPerm and then using parametric assumptions
> (which
> > should hold considering how the dataset was constructed) using my own
> code.
> > As can be seen, very different answers are obtained.
> >
>
>
> > What am I misunderstanding?
> >
> > Here is the code followed by my output and sessionInfo(). First come a
> few
> > necessary functions with a hack of mine to output the observedStatistic
> from
> > gsealmPerm:
> >
> > ###################################################
> > gsealmPerm.MWK <- function (eSet, formula = "", mat, nperm, na.rm = TRUE,
> > ...)
> > {
> > nSamp = ncol(eSet)
> > if (formula == "") {
> > nvar <- 0
> > } else {
> > xvarnames <- all.vars(formula)
> > nvar <- length(xvarnames)
> > }
> > obsRaw <- lmPerGene(eSet = eSet, formula = formula, na.rm = na.rm)
> > if (nvar > 0) {
> > observedStats = GSNormalize(obsRaw$tstat[2, ], incidence = mat,
> > ...)
> > }
> > else {
> > observedStats = GSNormalize(t(obsRaw$tstat), incidence = mat,
> > fun2 = identity, ...)
> > }
> > perm.eset = eSet
> > i <- 1L
> > if (nvar > 0) {
> > permMat <- matrix(0, nrow = nrow(eSet), ncol = nperm)
> > while (i < (nperm + 1)) {
> > if (nvar >= 2) {
> > splitter = pData(eSet)[, xvarnames[2]]
> > if (nvar > 2)
> > splitter = as.list(pData(eSet)[, xvarnames[2:nvar]])
> > label.perm = unsplit(lapply(split(1:nSamp, splitter),
> > sample), splitter)
> > pData(perm.eset)[, xvarnames[1]] <-
> pData(eSet)[label.perm,
> > xvarnames[1]]
> > }
> > else if (nvar == 1) {
> > pData(perm.eset)[, xvarnames[1]] <-
> > pData(eSet)[sample(1:nSamp),
> > xvarnames[1]]
> > }
> > temp.results <- lmPerGene(eSet = perm.eset, formula =
> formula,
> > na.rm = na.rm)
> > permMat[, i] <- temp.results$tstat[2, ]
> > i <- i + 1L
> > }
> > permMat <- GSNormalize(permMat, incidence = mat, ...)
> > rownames(permMat) = rownames(mat)
> > }
> > else if (nvar == 0) {
> > permMat <- matrix(0, nrow = nrow(mat), ncol = nperm)
> > rownames(permMat) = rownames(mat)
> > for (i in 1:nperm) permMat[, i] = GSNormalize(t(obsRaw$tstat),
> > incidence = mat[, sample(1:ncol(mat))], fun2 = identity,
> > ...)
> > }
> > output <- data.frame(pvalFromPermMat(observedStats, permMat),
> > observedStats)
> > }
> > #####################################3
> > pvalFromPermMat <- function(obs, perms) { #needed because this is an
> > internal function of the GSEAlm package
> > N <- ncol(perms)
> > pvals <- matrix(as.double(NA), nr=nrow(perms), ncol=2)
> > dimnames(pvals) <- list(rownames(perms), c("Lower", "Upper"))
> >
> > tempObs <- rep(obs, ncol(perms))
> > dim(tempObs) <- dim(perms)
> > pvals[ , 1] <- rowSums(perms <= tempObs)/N
> > pvals[ , 2] <- rowSums(perms >= tempObs)/N
> > pvals
> > }
> > #########################
> > ## mat: A 0/1 incidence matrix with each row representing a gene set
> > ## and each column representing a gene. A 1 indicates
> > ## membership of a gene in a gene set.
> > makeGSmat <- function(eset, geneSet.lst){
> > amat <- matrix(0, nrow= length(geneSet.lst), ncol =
> > length(featureNames(eset)))
> > for (i in 1:nrow(amat)){
> > amat[i,] <- featureNames(eset) %in% geneIds(geneSet.lst[[i]])
> > }
> > rownames(amat) <- names(geneSet.lst)
> > amat
> > }
> > ####################################################
> > ###########
> > # simulate non-parametric GSEA
> > require(GSEABase); require(GSEAlm)
> > set.seed(123)
> > eset.mat <- matrix(rnorm(n = 12*12000, mean = 10, sd = 1), ncol = 12)
> > rownames(eset.mat) <- paste("A", 1:12000, sep = "_")
> > gs <- GeneSet(paste("A", unique(floor(runif(n = 50, min = 1, max =
> 300))),
> > sep = "_"), setIdentifier = "test.gs")
> > eset.mat[geneIds(gs),1:6] <- eset.mat[geneIds(gs),1:6] + 0.5
> > df <- data.frame(x = 1:12, y = c(rep("A", 6), rep("B", 6)))
> > adf <- as(df, "AnnotatedDataFrame")
> > exprsSet <- new("ExpressionSet", exprs = eset.mat, phenoData = adf)
> > #
> > gs.lst <- list(test.gs = gs)
> > #####################################
> > amat <- makeGSmat(exprsSet, gs.lst)
> > #################################################3
> > GSEAResults.nonpara <- gsealmPerm.MWK(eSet = exprsSet, formula = ~y, mat
> =
> > amat, nperm = 1000)
> > GSEAResults.nonpara
> > ###################################################
> > # simulate parametric GSEA
> > gene.ts <- (lmPerGene(eSet = exprsSet, formula = ~y)$tstat)[2,]
> > pos <- match(geneIds(gs), names(gene.ts))
> > gene.ts[pos]
> > t.ref <- mean(gene.ts)
> >
> > summation <- 0
> > GSEAResults.para <- {
> > for (i in 1:length(gene.ts[geneIds(gs)])){
> > summation <- summation + (gene.ts[geneIds(gs)][i] - t.ref)
> > }
> > GSEAResults.para <- summation/length(gene.ts[geneIds(gs)])
> > GSEAResults.para
> > }
> > GSEAResults.para
> > GSEAResults.nonpara
> > GSEAResults.nonpara$observedStats/length(gene.ts[geneIds(gs)])
> > ####################################################
> >> GSEAResults.para
> > A_121
> > -1.083127
> >> GSEAResults.nonpara
> > Lower Upper observedStats
> > test.gs 0 1 -7.435567
> >> GSEAResults.nonpara$observedStats/length(gene.ts[geneIds(gs)])
> > [1] -0.1582036
> >> sessionInfo()
> > R version 2.9.0 Under development (unstable) (2008-10-28 r46793)
> > x86_64-unknown-linux-gnu
> >
> > locale:
> >
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] splines tools stats graphics grDevices utils datasets
> > [8] methods base
> >
> > other attached packages:
> > [1] GSEAlm_1.3.0 GSEABase_1.5.0 affycoretools_1.15.0
> > [4] annaffy_1.15.0 KEGG.db_2.2.5 gcrma_2.15.1
> > [7] matchprobes_1.15.0 biomaRt_1.17.0 GOstats_2.9.0
> > [10] Category_2.9.7 genefilter_1.23.0 survival_2.34-1
> > [13] RBGL_1.19.0 annotate_1.21.1 xtable_1.5-4
> > [16] GO.db_2.2.5 RSQLite_0.7-1 DBI_0.2-4
> > [19] AnnotationDbi_1.5.5 limma_2.17.3 affy_1.21.0
> > [22] Biobase_2.3.3 graph_1.21.0
> >
> > loaded via a namespace (and not attached):
> > [1] affyio_1.11.2 cluster_1.11.11 preprocessCore_1.5.1
> > [4] RCurl_0.92-0 XML_1.98-1
> > ------------------------------------------------------------
> > Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
> > Indiana University School of Medicine
> >
> > 15032 Hunter Court, Westfield, IN 46074
> >
> > (317) 490-5129 Work, & Mobile & VoiceMail
> > (317) 399-1219 Home
> > Skype: mkimpel
> >
> > "The real problem is not whether machines think but whether men do." --
> B.
> > F. Skinner
> > ******************************************************************
> >
> > [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor@stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
>
> --
> Robert Gentleman, PhD
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> PO Box 19024
> Seattle, Washington 98109-1024
> 206-667-7700
> rgentlem@fhcrc.org
>
[[alternative HTML version deleted]]