[BioC] About subsampling of VST in lumi
    Pan Du 
    dupan at northwestern.edu
       
    Fri Dec 14 15:16:48 CET 2007
    
    
  
Hi Ligia,
Thanks for your report.
Yes, we use down-sampling to speed up the parameter estimation. If you want
to use all the data points, you can set the "nSupport" parameter of the vst
function to the length of the vector. I will add this to the vignette or
help file. Thanks!
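
To illustrate the effect discussed in this thread, here is a minimal sketch in
base R only. It does not call lumi: the data are simulated, and the names
"fitLine" and "maxRows" are hypothetical, made up for this example. It mirrors
step 4 of the code quoted below (subsample, then lm) and shows that the fitted
coefficients change between runs unless the random seed is fixed or the
subsampling is skipped.

```r
# Synthetic stand-in for the "dd" data frame in the quoted vst code
# (simulated values, not lumi output).
set.seed(1)
n  <- 5500
x1 <- runif(n, 100, 10000)
dd <- data.frame(y = 0.1 * x1 + 5 + rnorm(n, sd = 20), x1 = x1)

# Fit y ~ x1 on at most maxRows rows, mirroring the subsampling step.
# maxRows is a hypothetical argument used only in this sketch.
fitLine <- function(dd, maxRows = 5000) {
  if (nrow(dd) > maxRows) dd <- dd[sample(seq_len(nrow(dd)), maxRows), ]
  coef(lm(y ~ x1, dd))
}

fitA <- fitLine(dd)                        # one random subset
fitB <- fitLine(dd)                        # a different random subset
set.seed(42); fitC <- fitLine(dd)          # fixed seed
set.seed(42); fitD <- fitLine(dd)          # same seed, hence same subset
fitAll <- fitLine(dd, maxRows = nrow(dd))  # no subsampling at all

print(rbind(fitA, fitB, fitC, fitD, fitAll))
```

If vst draws its subsample with sample(), as in the quoted code, then calling
set.seed() immediately before lumiT() should likewise make repeated runs agree;
that is an assumption based on the quoted snippet, not something checked
against every lumi version.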
Pan
On 12/14/07 5:18 AM, "ligia at ebi.ac.uk" <ligia at ebi.ac.uk> wrote:
> Dear Pan Du,
> 
> From what I understand when looking at "vst", the random subsampling that
> affects my data occurs at step 4 below:
> 
> 1       if (c3 != 0) {
> 2            selInd <- selInd & (std^2 > c3)
> 3            dd <- data.frame(y = sqrt(std[selInd]^2 - c3), x1 = u[selInd])
> 4            if (nrow(dd) > 5000) dd <- dd[sample(1:nrow(dd), 5000), ]
> 5            lmm <- lm(y ~ x1, dd)
> 6            c1 <- lmm$coef[2]
> 7            c2 <- lmm$coef[1]
> 8        }
> 
> because my "dd" matrix has around 5500 rows. Maybe it would be nice to
> have the option to turn this off, or add the option to provide the max
> value allowed for nrow(dd)...
> 
> Cheers,
> Lígia
> 
> 
>> Dear Ligia
>> 
>> I believe this is because they randomly subsample the data to "speed
>> processing"; see the man page and the nSupport parameter.
>> 
>> I cc Pan Du with the suggestion to make the explanation of this in the
>> man page more clear. Is there an option to switch off the random
>> subsampling?
>> 
>>   Best wishes
>> Wolfgang
>> 
>> 
>> 
>> ligia at ebi.ac.uk wrote:
>>> Hi Wolfgang,
>>> 
>>> I noticed a peculiar behaviour in the lumi package: when I apply the
>>> variance stabilizing transformation, it gives slightly different
>>> results each time I run the method. See below for a subset of the data:
>>> 
>>> 
>>>> load("dat.rda")
>>>> library("lumi")
>>> 
>>>> x1 <- lumiT(dat, method="vst", ifPlot=!TRUE)
>>> 2007-12-13 10:56:35 , processing array  1
>>> 2007-12-13 10:56:35 , processing array  2
>>> 2007-12-13 10:56:35 , processing array  3
>>> 2007-12-13 10:56:35 , processing array  4
>>> 
>>>> x2 <- lumiT(dat, method="vst", ifPlot=!TRUE)
>>> 2007-12-13 10:56:36 , processing array  1
>>> 2007-12-13 10:56:36 , processing array  2
>>> 2007-12-13 10:56:36 , processing array  3
>>> 2007-12-13 10:56:37 , processing array  4
>>> 
>>> 
>>>> table(exprs(x1)==exprs(x2))
>>> 
>>> FALSE  TRUE
>>> 88705     3
>>> 
>>>> range(exprs(x1)-exprs(x2))
>>> [1] -0.05682931  0.03592777
>>> 
>>>> sessionInfo()
>>> R version 2.7.0 Under development (unstable) (2007-11-29 r43558)
>>> i686-pc-linux-gnu
>>> 
>>> locale:
>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8
>>> ;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAM
>>> E=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION
>>> =C
>>> 
>>> attached base packages:
>>> [1] tools     stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>> 
>>> other attached packages:
>>>  [1] lumi_1.5.10            annotate_1.15.6        AnnotationDbi_1.1.6
>>>  [4] RSQLite_0.6-0          DBI_0.2-3              mgcv_1.3-29
>>>  [7] affy_1.15.7            preprocessCore_0.99.12 affyio_1.5.7
>>> [10] Biobase_1.17.6
>>> 
>>> Cheers,
>>> Ligia
>> 
>> 
>> --
>> 
>> Best wishes
>>    Wolfgang
>> 
>> ------------------------------------------------------------------
>> Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
>> 
> 
> 
---------------------------------------------------
Pan Du, PhD
Research Assistant Professor
Robert H. Lurie Comprehensive Cancer Center
Northwestern University
676 ST Clair St., #1200
Chicago, IL 60611
Office (312)695-4781
dupan at northwestern.edu
    
    