[Bioc-devel] rfc - rowttests in genefilter package
Wolfgang Huber
whuber at embl.de
Thu Jul 16 23:45:50 CEST 2009
Dear Patrick
thank you! What Wikipedia calls Welford's/Knuth's one-pass algorithm
(and the apparent absence of even cleverer ones) is just what I was
looking for.
Implementation is always a trade-off of choosing between algorithms with
different computational cost, implementation effort and performance
characteristics. Needless to say, I did not have groups with thousands
of samples in mind when I made the "fast & simple" rowttests function
public many years ago. However, Welford's/Knuth's algorithm is uniformly
better on all three of the above criteria.
Thanks again and best wishes
Wolfgang
> Wolfgang,
> There are robust one-pass algorithms for calculating variances, if that
> is what you are interested in. Wikipedia has a nice summary of
> algorithms for calculating variance. Here is the link to the robust
> one-pass algorithm:
>
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
>
>
>
> Patrick
>
>
> Steven McKinney wrote:
>> Hi Wolfgang,
>> Two issues:
>>
>> 1) The style of the formula you show below is known
>> to have numerical accuracy problems.
>>
>> Why not just use the R var() function
>> on the data that you used to calculate
>> ss and s with? (or the C code that implements it?) I believe this issue
>> has been adequately handled there, though
>> I haven't read the source code.
>>
>> 2) Is that the correct formula?
>> An unbiased variance calculation would be
>>
>> (ss - s * s / n)/(n-1)
>>
>>
>>
>>
>> Steven McKinney, Ph.D.
>>
>> Statistician
>> Molecular Oncology and Breast Cancer Program
>> British Columbia Cancer Research Centre
>>
>> email: smckinney at bccrc.ca
>> tel: 604-675-8000 x7561
>>
>> BCCRC
>> Molecular Oncology
>> 675 West 10th Ave, Floor 4
>> Vancouver B.C. V5Z 1L3
>>
>> Canada
>>
>>
>>
>>
>>
>>
>>
>>
>>> -----Original Message-----
>>> From: bioc-devel-bounces at stat.math.ethz.ch [mailto:bioc-devel-
>>> bounces at stat.math.ethz.ch] On Behalf Of Wolfgang Huber
>>> Sent: Thursday, July 16, 2009 3:10 AM
>>> To: Bioconductor Developers
>>> Subject: [Bioc-devel] rfc - rowttests in genefilter package
>>>
>>> Hi,
>>>
>>> I noted in this function (which I wrote) that if the number of samples
>>> in each group is large (more than, say, 1000), floating point errors
>>> become significant, to the point of invalidating the results.
>>> Essentially, the reason is that I compute the within group variances
>>> via
>>>
>>> ss - s * s / n
>>>
>>> where ss is the sum of squared values, s is the sum of values, and n
>>> the sample size [1].
>>>
>>> I've added a warning to the man page asking people only to use the
>>> function when the number of samples is dozens to a few hundred. I can
>>> think of a few obvious ways to make the code less vulnerable to the
>>> finite precision of floating point arithmetic, but I am sure this
>>> problem has been solved many times before and would like to ask for
>>> pointers or suggestions.
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>> [1]
>>> https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/genefilter/
>>> src/rowttests.c
>>>
>>> -------------------------------------------------------
>>> Wolfgang Huber
>>> EMBL
>>> http://www.embl.de/research/units/genome_biology/huber
>>>
>>> _______________________________________________
>>> Bioc-devel at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
--
Best wishes
Wolfgang
-------------------------------------------------------
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
More information about the Bioc-devel
mailing list