[Bioc-devel] rfc - rowttests in genefilter package

Patrick Aboyoun paboyoun at fhcrc.org
Thu Jul 16 21:45:27 CEST 2009


Wolfgang,
There are numerically robust one-pass algorithms for calculating variances,
if that is what you are interested in. Wikipedia has a nice summary of
the available algorithms. Here is the link to the section on the robust
one-pass (on-line) algorithm:

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
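
To make the difference concrete, here is a minimal sketch in Python (not the
rowttests.c code, and not how R's var() is implemented). It compares the
sum-of-squares shortcut with the running-mean update usually attributed to
Welford, which as far as I can tell is what the linked section describes.
The function names naive_variance and welford_variance are my own, chosen
for illustration only.

```python
def naive_variance(xs):
    """Shortcut formula (ss - s*s/n)/(n-1); suffers from cancellation."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    s = 0.0   # running sum of values
    ss = 0.0  # running sum of squared values
    for x in xs:
        s += x
        ss += x * x
    return (ss - s * s / n) / (n - 1)


def welford_variance(xs):
    """One-pass update of the mean and of the sum of squared deviations."""
    n = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the current mean
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the already updated mean
    if n < 2:
        raise ValueError("need at least two values")
    return m2 / (n - 1)


# Same spread (sample variance 30), once near zero and once shifted by 1e9.
small = [4.0, 7.0, 13.0, 16.0]
shifted = [1e9 + v for v in small]

print(naive_variance(small), welford_variance(small))
print(naive_variance(shifted), welford_variance(shifted))
```

On the unshifted data both functions agree. On the shifted data the shortcut
formula subtracts two numbers of order 1e18 that differ by only 90, so the
result is dominated by rounding error, while the running update still works
with deviations of ordinary size.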


Patrick


Steven McKinney wrote:
> Hi Wolfgang, 
>
> Two issues:
>
> 1) The style of the formula you show below is known
> to have numerical accuracy problems, because it
> subtracts two large, nearly equal quantities.
>
> Why not just use the R var() function
> on the same data from which you calculate
> ss and s (or the C code that
> implements it)?  I believe this issue
> has been adequately handled there, though
> I haven't read the source code.
>
> 2) Is that the correct formula?
> An unbiased variance calculation would be
>
> (ss - s * s / n)/(n-1)
>
> Steven McKinney, Ph.D.
>
> Statistician
> Molecular Oncology and Breast Cancer Program
> British Columbia Cancer Research Centre
>
> email: smckinney at bccrc.ca
> tel: 604-675-8000 x7561
>
> BCCRC
> Molecular Oncology
> 675 West 10th Ave, Floor 4
> Vancouver B.C. 
> V5Z 1L3
>
> Canada
>
>> -----Original Message-----
>> From: bioc-devel-bounces at stat.math.ethz.ch [mailto:bioc-devel-bounces at stat.math.ethz.ch] On Behalf Of Wolfgang Huber
>> Sent: Thursday, July 16, 2009 3:10 AM
>> To: Bioconductor Developers
>> Subject: [Bioc-devel] rfc - rowttests in genefilter package
>>
>> Hi,
>>
>> I noticed that in this function (which I wrote), if the number of
>> samples in each group is large (more than, say, 1000), floating point
>> errors become significant, to the point of invalidating the results.
>> Essentially, the reason is that I compute the within-group variances
>> via
>>
>>     ss - s * s / n
>>
>> where ss is the sum of squared values, s is the sum of values, and n
>> the sample size [1].
>>
>> I've added a warning to the man page asking people only to use the
>> function when the number of samples is dozens to a few hundred. I can
>> think of a few obvious ways to make the code less vulnerable to the
>> finite precision of floating point arithmetic, but I am sure this
>> problem has been solved many times before and would like to ask for
>> pointers or suggestions.
>>
>> Best wishes
>>       Wolfgang
>>
>> [1]
>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/genefilter/src/rowttests.c
>>
>> -------------------------------------------------------
>> Wolfgang Huber
>> EMBL
>> http://www.embl.de/research/units/genome_biology/huber
>>
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel