[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives

Bert Gunter gunter.berton at gene.com
Wed Jul 13 17:08:41 CEST 2011


Rather strange ...

Why would one convert a numeric to character and then back again to
numeric? Why would one assume that such a conversion would retain full
machine precision?

In fact,
 ?as.character

tells you:

as.character represents real and complex numbers to 15 significant
digits (technically the compiler's setting of the ISO C constant
DBL_DIG, which will be 15 on machines supporting IEC60559 arithmetic
according to the C99 standard). This ensures that all the digits in
the result will be reliable (and not the result of representation
error), but does mean that conversion to character and back to numeric
may change the number. If you want to convert numbers to character
with the maximum possible precision, use format.

... which is exactly what you saw.

-- Bert


On Wed, Jul 13, 2011 at 7:46 AM, Lyndon Estes <lestes at princeton.edu> wrote:
> Hello Eik,
>
> Thanks very much for your response and for directing me to a useful
> explanation.
>
> To make sure I am grasping your answer correctly, the two problems I
> was experiencing are related to the fact that the floating point
> numbers I was calculating and using in subsequent indices were not
> entirely equivalent, even if they were calculated using the same
> function (quantile).
>
> I confirmed this as follows:
>
> j <- 0.055  # bins value for 5.5% threshold
> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
> bin.ct
> #430.29
> bin.ct2 <- quantile(x, j, na.rm = TRUE)
> bin.ct2
> #  5.5%
> #430.29
> bin.ct - bin.ct2
> #        5.5%
> #5.684342e-14
> length(x) - length(x[x <= bin.ct])
> length(x) - length(x[x <= bin.ct]) # 2875
> length(x) - length(x[x <= bin.ct2])  # 3029
>
> Testing the unname() option does not fix the result, I should however note:
> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
> bin.ct2 <- unname(quantile(x, j, na.rm = TRUE))
> bin.ct - bin.ct2  # 5.684342e-14
>
> But rounding, as an alternative approach, works:
> bin.ct2 <- round(quantile(x, j, na.rm = TRUE), 2)
> bin.ct - bin.ct2
> #5.5%
> #   0
> length(x) - length(x[x <= bin.ct]) # 2875
> length(x) - length(x[x <= bin.ct2])  # 2875
>
> As to my code, it is part of a custom ROC function I created a while
> back and started using again recently (the data are rainfall
> values). I can't remember why I did this rather than using one of the
> existing ROC functions, but I thought (probably incorrectly) that I
> had some compelling reason. In
> any case, it is quite unwieldy, so I will explore those other
> packages, or try revise this to be more efficient
>
> (e.g. maybe this is a better approach, although the answers are fairly
> different?).
>
> x2 <- x[order(x)]
> y2 <- y[order(y)]
> bins <- round(seq(min(x2), max(x2), by = diff(range(x2)) / 200), 2)
> threshold <- seq(0, 100, by = 0.5)
> tp <- rep(0, length(bins))
> fp <- rep(0, length(bins))
> for(i in 1:length(threshold)) {
>  tp[i] <- length(x2) - length(x2[x2 <= bins[i]])
>  fp[i] <- length(y2) - length(y2[y2 <= bins[i]])
> }
> ctch <- cbind(threshold, bins, tp, fp)
> ctch[1:20, ]
>
> Thanks again for your help.
>
> Cheers, Lyndon
>
>
> On Tue, Jul 12, 2011 at 5:09 AM, Eik Vettorazzi
> <E.Vettorazzi at uke.uni-hamburg.de> wrote:
>> Hi,
>>
>> Am 11.07.2011 22:57, schrieb Lyndon Estes:
>>> ctch[ctch$threshold == 3.5, ]
>>> # [1] threshold val       tp        fp        tn        fn        tpr
>>>      fpr       tnr       fnr
>>> #<0 rows> (or 0-length row.names)
>>
>> this is the very effective FAQ 7.31 trap.
>> http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
>>
>> Welcome to the first circle of Patrick Burns' R Inferno!
>>
>> Also, unname() is a more intuitive way of removing names.
>>
>> And I think your code is quite inefficient, because you calculate
>> quantiles many times, which involves repeated ordering of x, and you may
>> use a inefficient size of bin (either to small and therefore calculating
>> the same split many times or to large and then missing some splits).
>> I'm a bit puzzled what is x and y in your code, so any further advise is
>> vague but you might have a look at any package that calculates
>> ROC-curves such as ROCR or pROC (and many more).
>>
>> Hth
>>
>> --
>> Eik Vettorazzi
>>
>> Department of Medical Biometry and Epidemiology
>> University Medical Center Hamburg-Eppendorf
>>
>> Martinistr. 52
>> 20246 Hamburg
>>
>> T ++49/40/7410-58243
>> F ++49/40/7410-57790
>>
>
>
>
> --
> Lyndon Estes
> Research Associate
> Woodrow Wilson School
> Princeton University
> lestes at princeton.edu
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
"Men by nature long to get on to the ultimate truths, and will often
be impatient with elementary studies or fight shy of them. If it were
possible to reach the ultimate truths without the elementary studies
usually prefixed to them, these would not be preparatory studies but
superfluous diversions."

-- Maimonides (1135-1204)

Bert Gunter
Genentech Nonclinical Biostatistics



More information about the R-help mailing list