[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives

Wed Jul 13 19:03:16 CEST 2011

to cut a long story short, how about that:

bins<-quantile(x,0:200/200) #maybe replacing x by c(x,y)
ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b))))
colnames(ctch)<-c("bin","tp","fp")

Cheers.

Am 13.07.2011 18:49, schrieb Eik Vettorazzi:
> Hello Lyndon,
> I'm still puzzled what x and y actually are and how they are related.
> Your "bins" are calculated from x only, so you might missing some range
> of y and since you use a fixed number of bins of equal size, some bins
> might be empty (and therefore reproducing the result from the bin
> before), and some might contain more than one distinct value (and so are
> ignoring an intermediate step). But this might be intentional and the
> reason for you to build up your own function.
> If not, have a look at pROC, there is a clever way to calculate the
> required bins, in particular look at
> 
> getAnywhere("roc.utils.thresholds")
> 
> 
> Some tweaking of your code is possible anyway:
> 
>> for(i in 1:length(threshold)) {
>>   tp[i] <- length(x2) - length(x2[x2 <= bins[i]])
>>   fp[i] <- length(y2) - length(y2[y2 <= bins[i]])
>> }
> 
> calculates length(x2) and length(y2) in every single loop, but this is a
> constant, so it is wasting time. Anyway, you get away even without
> calculating this once, using
> 
> tp[i]<-sum(x2>bins[i]) #indexing is not necessary,
>                        #and 1-x2[j]<=tr == x2[j]>tr
> 
> and on top one that, the fanciest or at least R'ish way, by common
> understanding, is avoiding loops and using *apply, but that's a matter
> of taste, when it comes to that ;)
> 
> tp<-sapply(bins,function(b)sum(x>b))
> fp<-sapply(bins,function(b)sum(y>b))
> 
> #or all at once
> ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b))))
> 
> A last minor remark: sorting x and y is not necessary in the new code
> fragment.
> 
> hth.
> 
> 
> 
> Am 13.07.2011 16:46, schrieb Lyndon Estes:
>> Hello Eik,
>>
>> Thanks very much for your response and for directing me to a useful
>> explanation.
>>
>> To make sure I am grasping your answer correctly, the two problems I
>> was experiencing are related to the fact that the floating point
>> numbers I was calculating and using in subsequent indices were not
>> entirely equivalent, even if they were calculated using the same
>> function (quantile).
>>
>> I confirmed this as follows:
>>
>> j <- 0.055  # bins value for 5.5% threshold
>> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
>> bin.ct
>> #430.29
>> bin.ct2 <- quantile(x, j, na.rm = TRUE)
>> bin.ct2
>> #  5.5%
>> #430.29
>> bin.ct - bin.ct2
>> #        5.5%
>> #5.684342e-14
>> length(x) - length(x[x <= bin.ct])
>> length(x) - length(x[x <= bin.ct]) # 2875
>> length(x) - length(x[x <= bin.ct2])  # 3029
>>
>> Testing the unname() option does not fix the result, I should however note:
>> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
>> bin.ct2 <- unname(quantile(x, j, na.rm = TRUE))
>> bin.ct - bin.ct2  # 5.684342e-14
>>
>> But rounding, as an alternative approach, works:
>> bin.ct2 <- round(quantile(x, j, na.rm = TRUE), 2)
>> bin.ct - bin.ct2
>> #5.5%
>> #   0
>> length(x) - length(x[x <= bin.ct]) # 2875
>> length(x) - length(x[x <= bin.ct2])  # 2875
>>
>> As to my code, it is part of a custom ROC function I created a while
>> back and started using again recently (the data are rainfall
>> values). I can't remember why I did this rather than using one of the
>> existing ROC functions, but I thought (probably incorrectly) that I
>> had some compelling reason. In
>> any case, it is quite unwieldy, so I will explore those other
>> packages, or try revise this to be more efficient
>>
>> (e.g. maybe this is a better approach, although the answers are fairly
>> different?).
>>
>> x2 <- x[order(x)]
>> y2 <- y[order(y)]
>> bins <- round(seq(min(x2), max(x2), by = diff(range(x2)) / 200), 2)
>> threshold <- seq(0, 100, by = 0.5)
>> tp <- rep(0, length(bins))
>> fp <- rep(0, length(bins))
>> for(i in 1:length(threshold)) {
>>   tp[i] <- length(x2) - length(x2[x2 <= bins[i]])
>>   fp[i] <- length(y2) - length(y2[y2 <= bins[i]])
>> }
>> ctch <- cbind(threshold, bins, tp, fp)
>> ctch[1:20, ]
>>
>> Thanks again for your help.
>>
>> Cheers, Lyndon
>>
>>
>> On Tue, Jul 12, 2011 at 5:09 AM, Eik Vettorazzi
>> <E.Vettorazzi at uke.uni-hamburg.de> wrote:
>>> Hi,
>>>
>>> Am 11.07.2011 22:57, schrieb Lyndon Estes:
>>>> ctch[ctch$threshold == 3.5, ]
>>>> # [1] threshold val       tp        fp        tn        fn        tpr
>>>>      fpr       tnr       fnr
>>>> #<0 rows> (or 0-length row.names)
>>>
>>> this is the very effective FAQ 7.31 trap.
>>> http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
>>>
>>> Welcome to the first circle of Patrick Burns' R Inferno!
>>>
>>> Also, unname() is a more intuitive way of removing names.
>>>
>>> And I think your code is quite inefficient, because you calculate
>>> quantiles many times, which involves repeated ordering of x, and you may
>>> use a inefficient size of bin (either to small and therefore calculating
>>> the same split many times or to large and then missing some splits).
>>> I'm a bit puzzled what is x and y in your code, so any further advise is
>>> vague but you might have a look at any package that calculates
>>> ROC-curves such as ROCR or pROC (and many more).
>>>
>>> Hth
>>>
>>> --
>>> Eik Vettorazzi
>>>
>>> Department of Medical Biometry and Epidemiology
>>> University Medical Center Hamburg-Eppendorf
>>>
>>> Martinistr. 52
>>> 20246 Hamburg
>>>
>>> T ++49/40/7410-58243
>>> F ++49/40/7410-57790
>>>
>>
>>
>>
> 

-- 
Eik Vettorazzi
Institut für Medizinische Biometrie und Epidemiologie
Universitätsklinikum Hamburg-Eppendorf

Martinistr. 52
20246 Hamburg

T ++49/40/7410-58243
F ++49/40/7410-57790