[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives

Wed Jul 13 19:35:09 CEST 2011

Hi Lyndon,
I'm sorry, I just saw that you provided your dataset.
I checked my program and it reproduced the same error.

con <- url("http://sites.google.com/site/ldemisc/r_general/tpfpdat.Rdata")
load(con)
close(con)

bins<-quantile(x,0:200/200)
ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b))))
colnames(ctch)<-c("bin","tp","fp")
head(ctch,20)

so it all comes down to quantile not producing strictly ordered values
bins[11]>bins[12] #TRUE

this is due to how "quantile" handles ties in conjunction with floating
point arithmetic.
There are 9 different way in R to calculate quantiles, see ?quantile
The last 6 produce non-monotone quantiles:
 sapply(1:9,function(i)any(diff(quantile(x,0:200/200,type=i))<0))

while the first three are ok. So

bins<-quantile(x,0:200/200,type=1)

or rounding bins up to 4 digits will suffice.

cheers.

Am 13.07.2011 19:03, schrieb Eik Vettorazzi:
> to cut a long story short, how about that:
> 
> bins<-quantile(x,0:200/200) #maybe replacing x by c(x,y)
> ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b))))
> colnames(ctch)<-c("bin","tp","fp")
> 
> Cheers.
> 
> Am 13.07.2011 18:49, schrieb Eik Vettorazzi:
>> Hello Lyndon,
>> I'm still puzzled what x and y actually are and how they are related.
>> Your "bins" are calculated from x only, so you might missing some range
>> of y and since you use a fixed number of bins of equal size, some bins
>> might be empty (and therefore reproducing the result from the bin
>> before), and some might contain more than one distinct value (and so are
>> ignoring an intermediate step). But this might be intentional and the
>> reason for you to build up your own function.
>> If not, have a look at pROC, there is a clever way to calculate the
>> required bins, in particular look at
>>
>> getAnywhere("roc.utils.thresholds")
>>
>>
>> Some tweaking of your code is possible anyway:
>>
>>> for(i in 1:length(threshold)) {
>>>   tp[i] <- length(x2) - length(x2[x2 <= bins[i]])
>>>   fp[i] <- length(y2) - length(y2[y2 <= bins[i]])
>>> }
>>
>> calculates length(x2) and length(y2) in every single loop, but this is a
>> constant, so it is wasting time. Anyway, you get away even without
>> calculating this once, using
>>
>> tp[i]<-sum(x2>bins[i]) #indexing is not necessary,
>>                        #and 1-x2[j]<=tr == x2[j]>tr
>>
>> and on top one that, the fanciest or at least R'ish way, by common
>> understanding, is avoiding loops and using *apply, but that's a matter
>> of taste, when it comes to that ;)
>>
>> tp<-sapply(bins,function(b)sum(x>b))
>> fp<-sapply(bins,function(b)sum(y>b))
>>
>> #or all at once
>> ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b))))
>>
>> A last minor remark: sorting x and y is not necessary in the new code
>> fragment.
>>
>> hth.
>>
>>
>>
>> Am 13.07.2011 16:46, schrieb Lyndon Estes:
>>> Hello Eik,
>>>
>>> Thanks very much for your response and for directing me to a useful
>>> explanation.
>>>
>>> To make sure I am grasping your answer correctly, the two problems I
>>> was experiencing are related to the fact that the floating point
>>> numbers I was calculating and using in subsequent indices were not
>>> entirely equivalent, even if they were calculated using the same
>>> function (quantile).
>>>
>>> I confirmed this as follows:
>>>
>>> j <- 0.055  # bins value for 5.5% threshold
>>> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
>>> bin.ct
>>> #430.29
>>> bin.ct2 <- quantile(x, j, na.rm = TRUE)
>>> bin.ct2
>>> #  5.5%
>>> #430.29
>>> bin.ct - bin.ct2
>>> #        5.5%
>>> #5.684342e-14
>>> length(x) - length(x[x <= bin.ct])
>>> length(x) - length(x[x <= bin.ct]) # 2875
>>> length(x) - length(x[x <= bin.ct2])  # 3029
>>>
>>> Testing the unname() option does not fix the result, I should however note:
>>> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
>>> bin.ct2 <- unname(quantile(x, j, na.rm = TRUE))
>>> bin.ct - bin.ct2  # 5.684342e-14
>>>
>>> But rounding, as an alternative approach, works:
>>> bin.ct2 <- round(quantile(x, j, na.rm = TRUE), 2)
>>> bin.ct - bin.ct2
>>> #5.5%
>>> #   0
>>> length(x) - length(x[x <= bin.ct]) # 2875
>>> length(x) - length(x[x <= bin.ct2])  # 2875
>>>
>>> As to my code, it is part of a custom ROC function I created a while
>>> back and started using again recently (the data are rainfall
>>> values). I can't remember why I did this rather than using one of the
>>> existing ROC functions, but I thought (probably incorrectly) that I
>>> had some compelling reason. In
>>> any case, it is quite unwieldy, so I will explore those other
>>> packages, or try revise this to be more efficient
>>>
>>> (e.g. maybe this is a better approach, although the answers are fairly
>>> different?).
>>>
>>> x2 <- x[order(x)]
>>> y2 <- y[order(y)]
>>> bins <- round(seq(min(x2), max(x2), by = diff(range(x2)) / 200), 2)
>>> threshold <- seq(0, 100, by = 0.5)
>>> tp <- rep(0, length(bins))
>>> fp <- rep(0, length(bins))
>>> for(i in 1:length(threshold)) {
>>>   tp[i] <- length(x2) - length(x2[x2 <= bins[i]])
>>>   fp[i] <- length(y2) - length(y2[y2 <= bins[i]])
>>> }
>>> ctch <- cbind(threshold, bins, tp, fp)
>>> ctch[1:20, ]
>>>
>>> Thanks again for your help.
>>>
>>> Cheers, Lyndon
>>>
>>>
>>> On Tue, Jul 12, 2011 at 5:09 AM, Eik Vettorazzi
>>> <E.Vettorazzi at uke.uni-hamburg.de> wrote:
>>>> Hi,
>>>>
>>>> Am 11.07.2011 22:57, schrieb Lyndon Estes:
>>>>> ctch[ctch$threshold == 3.5, ]
>>>>> # [1] threshold val       tp        fp        tn        fn        tpr
>>>>>      fpr       tnr       fnr
>>>>> #<0 rows> (or 0-length row.names)
>>>>
>>>> this is the very effective FAQ 7.31 trap.
>>>> http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
>>>>
>>>> Welcome to the first circle of Patrick Burns' R Inferno!
>>>>
>>>> Also, unname() is a more intuitive way of removing names.
>>>>
>>>> And I think your code is quite inefficient, because you calculate
>>>> quantiles many times, which involves repeated ordering of x, and you may
>>>> use a inefficient size of bin (either to small and therefore calculating
>>>> the same split many times or to large and then missing some splits).
>>>> I'm a bit puzzled what is x and y in your code, so any further advise is
>>>> vague but you might have a look at any package that calculates
>>>> ROC-curves such as ROCR or pROC (and many more).
>>>>
>>>> Hth
>>>>
>>>> --
>>>> Eik Vettorazzi
>>>>
>>>> Department of Medical Biometry and Epidemiology
>>>> University Medical Center Hamburg-Eppendorf
>>>>
>>>> Martinistr. 52
>>>> 20246 Hamburg
>>>>
>>>> T ++49/40/7410-58243
>>>> F ++49/40/7410-57790
>>>>
>>>
>>>
>>>
>>
> 

-- 
Eik Vettorazzi

Department of Medical Biometry and Epidemiology
University Medical Center Hamburg-Eppendorf

Martinistr. 52
20246 Hamburg

T ++49/40/7410-58243
F ++49/40/7410-57790