[BioC] Bug in GSVA with methods other than default?
Robert Castelo
robert.castelo at upf.edu
Mon Feb 24 11:47:28 CET 2014
hi Jonathan,
i'm cc'ing your email to Justin Guinney, who's the actual maintainer of
GSVA and who developed that bootstrapping part, he should be able to
clarify this.
cheers,
robert.
On 02/24/2014 11:36 AM, Jonathan Manning wrote:
> Hi Robert,
>
> I confirm the fix worked (though I don't think the new version is in the
> repo yet so I installed from source).
>
> Could there maybe be some more documentation in the gsva() help page on
> bootstrapping and how it doesn't apply to the other methods?
>
> Thanks,
>
> Jon
>
>
> On 21/02/2014 15:08, Robert Castelo wrote:
>> hi Jonathan cc Sonja,
>>
>> you're right, this is a bug occuring when the input is an
>> 'ExpressionSet' object and the argument 'method' as a value different
>> than 'gsva'.
>>
>> i've just submitted a fix to both, the devel and the release versions
>> of GSVA, they should become available via biocLite() in the following
>> 24/48 hrs as versions 1.11.4 and 1.10.3, respectively.
>>
>> in the meantime, you can always check out directly from the
>> corresponding svn repository (current release or devel) the updated
>> source and build it yourself; see
>>
>> http://master.bioconductor.org/developers/how-to/source-control
>>
>>
>>
>> sorry for the inconvenience,
>>
>> robert.
>>
>> On 02/21/2014 03:42 PM, Jonathan Manning wrote:
>>> Hi Sonja,
>>>
>>> No, that's not it.
>>>
>>> The probe IDs map perfectly fine through the annotation package and
>>> GSEABase::mapIdentifiers() etc, and everything works fine if I hack
>>> around the bug I described previously. Again, what you have is some code
>>> (e.g. /'Biobase::exprs(eScoEset) <- eSco$es.obs) /), which assumes that
>>> a list is returned by /GSVA:::.gsva()/. In the case of "
>>> /method='plage'/ " et al, it is not. If you tweak things with that in
>>> mind, there's no problem.
>>>
>>> Here is the code for gsva() for the ExpressionSet signature (retrieved
>>> with /getMethod(gsva, signature=c(expr="ExpressionSet",
>>> gset.idx.list="GeneSetCollection", annotation="missing"))/ ), with my
>>> hacks included in bold:
>>>
>>>
>>> function (expr, gset.idx.list, annotation = NULL, ...)
>>> {
>>> .local <- function (expr, gset.idx.list, annotation = NULL,
>>> method = c("gsva", "ssgsea", "zscore", "plage"), rnaseq = FALSE,
>>> abs.ranking = FALSE, min.sz = 1, max.sz = Inf, no.bootstraps = 0,
>>> bootstrap.percent = 0.632, parallel.sz = 0, parallel.type = "SOCK",
>>> mx.diff = TRUE, tau = switch(method, gsva = 1, ssgsea = 0.25,
>>> NA), kernel = TRUE, verbose = TRUE)
>>> {
>>> sdGenes <- Biobase::esApply(expr, 1, sd)
>>> if (any(sdGenes == 0)) {
>>> if (verbose)
>>> cat("Filtering out", sum(sdGenes == 0), "genes with constant expression
>>> values throuhgout the samples\n")
>>> expr <- expr[sdGenes > 0, ]
>>> }
>>> if (nrow(expr) < 2)
>>> stop("Less than two genes in the input ExpressionSet object\n")
>>> if (verbose)
>>> cat("Mapping identifiers between gene sets and feature names\n")
>>> mapped.gset.idx.list <- GSEABase::mapIdentifiers(gset.idx.list,
>>> GSEABase::AnnoOrEntrezIdentifier(Biobase::annotation(expr)))
>>> tmp <- lapply(geneIds(mapped.gset.idx.list), function(x,
>>> y) na.omit(match(x, y)), featureNames(expr))
>>> names(tmp) <- names(mapped.gset.idx.list)
>>> mapped.gset.idx.list <- filterGeneSets(tmp, min.sz = max(1, min.sz),
>>> max.sz = max.sz)
>>> eSco <- GSVA:::.gsva(Biobase::exprs(expr), mapped.gset.idx.list,
>>> method, rnaseq, abs.ranking, no.bootstraps, bootstrap.percent,
>>> parallel.sz, parallel.type, mx.diff, tau, kernel,
>>> verbose)
>>> eScoEset <- expr
>>> #Biobase::exprs(eScoEset) <- eSco$es.obs
>>> * if (method=='gsva'){
>>> Biobase::exprs(eScoEset) <- eSco$es.obs
>>> }else{
>>> Biobase::exprs(eScoEset) <- eSco
>>> }*
>>>
>>> Biobase::annotation(eScoEset) <- ""
>>> #return(list(es.obs = eScoEset, bootstrap = eSco$bootstrap,
>>> # p.vals.sign = eSco$p.vals.sign))
>>> *return (eScoEset)*
>>> }
>>> .local(expr, gset.idx.list, annotation, ...)
>>> }
>>>
>>> Since you say the bootstrapping isn't much use anyway, I've just set it
>>> to return the matrix in all cases.
>>>
>>> Jon
>>>
>>>
>>>
>>>
>>> On 21/02/2014 14:05, Sonja Haenzelmann wrote:
>>>> I think the problem lies in your expression set.
>>>> The gene names have to be in ENTREZ gene Id format, as the c2BroadSets
>>>> are in this format. gsva maps the genes in your expression set with
>>>> the genes in the gene set list. If they do not have the same
>>>> identifiers it cannot map the genes, and hence not calculate any
>>>> statistics.
>>>>
>>>>
>>>> library(GSVAdata)
>>>> data(c2BroadSets)
>>>> library(GSVA)
>>>>
>>>>
>>>> example<- as.matrix(read.table("example.txt"))
>>>> eset_example<- ExpressionSet(example, annotation='illuminaHumanv4')
>>>> pdata_example<- data.frame(class=c(rep('class1', 5), c(rep('class2',
>>>> 5))))
>>>> rownames(pdata_example)<- letters[1:10]
>>>> pData(eset_example)<- pdata_example
>>>>
>>>>
>>>>> featureNames(eset_example)
>>>> [1] "ILMN_1762337" "ILMN_2055271" "ILMN_1736007" "ILMN_2383229"
>>>> "ILMN_1806310"
>>>> [6] "ILMN_1779670" "ILMN_1653355" "ILMN_1717783" "ILMN_1705025"
>>>> "ILMN_1814316"
>>>>
>>>> you will have to map your ILMN ids to entrez gene ids, then gsva
>>>> should be working.
>>>>
>>>>
>>>> For example in the toy example that shows up, when you do ?gsva, you
>>>> will find that the gene names are the same in the gene set list as
>>>> well as in the gene expression matrix.
>>>>
>>>> p<- 10 ## number of genes
>>>> n<- 30 ## number of samples
>>>> nGrp1<- 15 ## number of samples in group 1
>>>> nGrp2<- n - nGrp1 ## number of samples in group 2
>>>>
>>>> ## consider three disjoint gene sets
>>>> geneSets<- list(set1=paste("g", 1:3, sep=""),
>>>> set2=paste("g", 4:6, sep=""),
>>>> set3=paste("g", 7:10, sep=""))
>>>>
>>>>
>>>>
>>>>> geneSets
>>>> $set1
>>>> [1] "g1" "g2" "g3"
>>>>
>>>> $set2
>>>> [1] "g4" "g5" "g6"
>>>>
>>>> $set3
>>>> [1] "g7" "g8" "g9" "g10"
>>>>
>>>>
>>>> ## sample data from a normal distribution with mean 0 and st.dev. 1
>>>> y<- matrix(rnorm(n*p), nrow=p, ncol=n,
>>>> dimnames=list(paste("g", 1:p, sep="") , paste("s",
>>>> 1:n, sep="")))
>>>>
>>>> rownames(y)
>>>> [1] "g1" "g2" "g3" "g4" "g5" "g6" "g7" "g8" "g9" "g10"
>>>>
>>>> Best
>>>> Sonja
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Feb 21, 2014 at 2:39 PM, Jonathan Manning
>>>> <Jonathan.Manning at ed.ac.uk> wrote:
>>>>> example<- as.matrix(read.table("example.txt"))
>>>>> eset_example<- ExpressionSet(example, annotation='illuminaHumanv4')
>>>>> pdata_example<- data.frame(class=c(rep('class1', 5), c(rep('class2',
>>>>> 5))))
>>>>> rownames(pdata_example)<- letters[1:10]
>>>>> pData(eset_example)<- pdata_example
>>>>> gsva_es<- gsva(eset_example, c2BroadSets , mx.diff=1, method='plage')
>>>
>>>
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550
More information about the Bioconductor
mailing list