[BioC] Surrogate variable analysis fails with “subscript out of bounds”
Vebjorn Ljosa
ljosa at broad.mit.edu
Mon Jul 2 20:12:44 CEST 2012
(This is a crossposting of a question I have asked on
stackoverflow.com, so far with no responses:
http://stackoverflow.com/questions/11252724/surrogate-variable-analysis-fails-with-subscript-out-of-bounds)
I am trying to apply surrogate variable analysis using [Bioconductor's
sva package][1]. The example in [the vignette][2] works fine, but when
I try it with real data, I get a "subscript out of bounds" error in
`irwsva.build`:
$ R
R version 2.15.0 (2012-03-30)
…
> trainData <-
read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainData.txt')
> trainpheno <-
read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainpheno.txt')
> testData <-
read.table('http://www.broadinstitute.org/~ljosa/svaproblem/testData.txt')
> trainData <- as.matrix(trainData)
> testData <- as.matrix(testData)
> library(sva)
> trainMod <- model.matrix(~as.factor(label), trainpheno)
> num.sv(trainData, trainMod)
[1] 8
> trainMod0 <- model.matrix(~1, trainpheno)
> trainSv <- sva(trainData, trainMod, trainMod0)
Number of significant surrogate variables is: 8
Iteration (out of 5 ):1 2 3 4 5 Error in irwsva.build(dat =
dat, mod = mod, mod0 = mod0, n.sv = n.sv, :
subscript out of bounds
An attempt to narrow it down with `debug()` revealed that `fast.svd`
is being called on a 453 x 100 matrix of all zeros. (The dimensions
453 x 100 are the same as my training set.) This results in a `V` that
is 100 x 0; the "subscript out of bounds" error is because
`irwsva.build` attempts to index into `V`. There must be something
about my data that causes this behaviour—but what?
As a possible workaround, I tried calling `sva` with `method="two-step"`:
> trainSv <- sva(trainData, trainMod, trainMod0, method='two-step')
Number of significant surrogate variables is: 8
That worked, but I need to subsequently call `fsva`. That failed
because calling `sva` with `method="two-step"` resulted in
`trainSv$pprob.b` being NULL.
So how do my data differ from those in the vignette? The training and
test data are matrices in both cases. In the vignette the training
matrix is 22283 x 30; in my case, it is 453 x 100. In the vignette,
the variable of interest (_cancer_) is binary; in my case, the
dependent variable can take 12 different values.
The last difference seems to be important, for if I reduce the range
to [0, 7], it works:
> trainMod <- model.matrix(~as.factor(label), trainpheno %% 8)
> trainSv <- sva(trainData, trainMod, trainMod0)
Number of significant surrogate variables is: 9
Iteration (out of 5 ):1 2 3 4 5 >
Thinking that perhaps 100 samples (columns) is just insufficient for
12 classes, I tried a similar dataset with 293 columns. (The data are
from the same experiment, but profile the 293 individual samples
rather than the 100 treatments.) It did not help:
> trainData <-
read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainData3.txt')
> trainpheno <-
read.table('http://www.broadinstitute.org/~ljosa/svaproblem/trainpheno.txt')
> trainData <- as.matrix(trainData)
> trainMod <- model.matrix(~as.factor(label), trainpheno)
> trainMod0 <- model.matrix(~1, trainpheno)
> trainSv <- sva(trainData, trainMod, trainMod0)
Number of significant surrogate variables is: 11
Iteration (out of 5 ):1 2 3 4 5 Error in irwsva.build(dat =
dat, mod = mod, mod0 = mod0, n.sv = n.sv, :
subscript out of bounds
If I limit sva to one iteration, it is able to run to completion, but
I don't know if I can trust the results:
> trainSv <- sva(trainData, trainMod, trainMod0, B=1)
Number of significant surrogate variables is: 11
Iteration (out of 1 ):1 >
Does anyone understand `irwsva` well enough to say why this is
happening? Is there anything I can do to make it work on my data?
Thanks,
Vebjorn
[1]: http://www.bioconductor.org/packages/release/bioc/html/sva.html
[2]: http://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf
More information about the Bioconductor
mailing list