[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives

Lyndon Estes lestes at princeton.edu
Thu Jul 14 04:29:50 CEST 2011


Hi Eik,

Thanks very much for those helpful coding suggestions, and for
confirming the error.

The apply function is much nicer, and I will incorporate that plus the
correct quantile type. The function I had written actually gives the
option to calculate in the manner you suggest in one of your comments:

# maybe replacing x by c(x,y)

But using the less efficient loops.  Replacing with the apply method
you have suggested will make it much nicer and give me the option to
use both kinds of approaches to extracting thresholds.

Thanks again,

Lyndon


On Wed, Jul 13, 2011 at 1:35 PM, Eik Vettorazzi
<E.Vettorazzi at uke.uni-hamburg.de> wrote:
> Hi Lyndon,
> I'm sorry, I just saw that you provided your dataset.
> I checked my program and it reproduced the same error.
>
> con <- url("http://sites.google.com/site/ldemisc/r_general/tpfpdat.Rdata")
> load(con)
> close(con)
>
> bins<-quantile(x,0:200/200)
> ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b))))
> colnames(ctch)<-c("bin","tp","fp")
> head(ctch,20)
>
> so it all comes down to quantile not producing strictly ordered values
> bins[11]>bins[12] #TRUE
>
> this is due to how "quantile" handles ties in conjunction with floating
> point arithmetic.
> There are 9 different way in R to calculate quantiles, see ?quantile
> The last 6 produce non-monotone quantiles:
>  sapply(1:9,function(i)any(diff(quantile(x,0:200/200,type=i))<0))
>
> while the first three are ok. So
>
> bins<-quantile(x,0:200/200,type=1)
>
> or rounding bins up to 4 digits will suffice.
>
> cheers.
>
>
> Am 13.07.2011 19:03, schrieb Eik Vettorazzi:
>> to cut a long story short, how about that:
>>
>> bins<-quantile(x,0:200/200) #maybe replacing x by c(x,y)
>> ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b))))
>> colnames(ctch)<-c("bin","tp","fp")
>>
>> Cheers.
>>
>> Am 13.07.2011 18:49, schrieb Eik Vettorazzi:
>>> Hello Lyndon,
>>> I'm still puzzled what x and y actually are and how they are related.
>>> Your "bins" are calculated from x only, so you might missing some range
>>> of y and since you use a fixed number of bins of equal size, some bins
>>> might be empty (and therefore reproducing the result from the bin
>>> before), and some might contain more than one distinct value (and so are
>>> ignoring an intermediate step). But this might be intentional and the
>>> reason for you to build up your own function.
>>> If not, have a look at pROC, there is a clever way to calculate the
>>> required bins, in particular look at
>>>
>>> getAnywhere("roc.utils.thresholds")
>>>
>>>
>>> Some tweaking of your code is possible anyway:
>>>
>>>> for(i in 1:length(threshold)) {
>>>>   tp[i] <- length(x2) - length(x2[x2 <= bins[i]])
>>>>   fp[i] <- length(y2) - length(y2[y2 <= bins[i]])
>>>> }
>>>
>>> calculates length(x2) and length(y2) in every single loop, but this is a
>>> constant, so it is wasting time. Anyway, you get away even without
>>> calculating this once, using
>>>
>>> tp[i]<-sum(x2>bins[i]) #indexing is not necessary,
>>>                        #and 1-x2[j]<=tr == x2[j]>tr
>>>
>>> and on top one that, the fanciest or at least R'ish way, by common
>>> understanding, is avoiding loops and using *apply, but that's a matter
>>> of taste, when it comes to that ;)
>>>
>>> tp<-sapply(bins,function(b)sum(x>b))
>>> fp<-sapply(bins,function(b)sum(y>b))
>>>
>>> #or all at once
>>> ctch<-t(sapply(bins,function(b)c(b,sum(x>b),sum(y>b))))
>>>
>>> A last minor remark: sorting x and y is not necessary in the new code
>>> fragment.
>>>
>>> hth.
>>>
>>>
>>>
>>> Am 13.07.2011 16:46, schrieb Lyndon Estes:
>>>> Hello Eik,
>>>>
>>>> Thanks very much for your response and for directing me to a useful
>>>> explanation.
>>>>
>>>> To make sure I am grasping your answer correctly, the two problems I
>>>> was experiencing are related to the fact that the floating point
>>>> numbers I was calculating and using in subsequent indices were not
>>>> entirely equivalent, even if they were calculated using the same
>>>> function (quantile).
>>>>
>>>> I confirmed this as follows:
>>>>
>>>> j <- 0.055  # bins value for 5.5% threshold
>>>> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
>>>> bin.ct
>>>> #430.29
>>>> bin.ct2 <- quantile(x, j, na.rm = TRUE)
>>>> bin.ct2
>>>> #  5.5%
>>>> #430.29
>>>> bin.ct - bin.ct2
>>>> #        5.5%
>>>> #5.684342e-14
>>>> length(x) - length(x[x <= bin.ct])
>>>> length(x) - length(x[x <= bin.ct]) # 2875
>>>> length(x) - length(x[x <= bin.ct2])  # 3029
>>>>
>>>> Testing the unname() option does not fix the result, I should however note:
>>>> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
>>>> bin.ct2 <- unname(quantile(x, j, na.rm = TRUE))
>>>> bin.ct - bin.ct2  # 5.684342e-14
>>>>
>>>> But rounding, as an alternative approach, works:
>>>> bin.ct2 <- round(quantile(x, j, na.rm = TRUE), 2)
>>>> bin.ct - bin.ct2
>>>> #5.5%
>>>> #   0
>>>> length(x) - length(x[x <= bin.ct]) # 2875
>>>> length(x) - length(x[x <= bin.ct2])  # 2875
>>>>
>>>> As to my code, it is part of a custom ROC function I created a while
>>>> back and started using again recently (the data are rainfall
>>>> values). I can't remember why I did this rather than using one of the
>>>> existing ROC functions, but I thought (probably incorrectly) that I
>>>> had some compelling reason. In
>>>> any case, it is quite unwieldy, so I will explore those other
>>>> packages, or try revise this to be more efficient
>>>>
>>>> (e.g. maybe this is a better approach, although the answers are fairly
>>>> different?).
>>>>
>>>> x2 <- x[order(x)]
>>>> y2 <- y[order(y)]
>>>> bins <- round(seq(min(x2), max(x2), by = diff(range(x2)) / 200), 2)
>>>> threshold <- seq(0, 100, by = 0.5)
>>>> tp <- rep(0, length(bins))
>>>> fp <- rep(0, length(bins))
>>>> for(i in 1:length(threshold)) {
>>>>   tp[i] <- length(x2) - length(x2[x2 <= bins[i]])
>>>>   fp[i] <- length(y2) - length(y2[y2 <= bins[i]])
>>>> }
>>>> ctch <- cbind(threshold, bins, tp, fp)
>>>> ctch[1:20, ]
>>>>
>>>> Thanks again for your help.
>>>>
>>>> Cheers, Lyndon
>>>>
>>>>
>>>> On Tue, Jul 12, 2011 at 5:09 AM, Eik Vettorazzi
>>>> <E.Vettorazzi at uke.uni-hamburg.de> wrote:
>>>>> Hi,
>>>>>
>>>>> Am 11.07.2011 22:57, schrieb Lyndon Estes:
>>>>>> ctch[ctch$threshold == 3.5, ]
>>>>>> # [1] threshold val       tp        fp        tn        fn        tpr
>>>>>>      fpr       tnr       fnr
>>>>>> #<0 rows> (or 0-length row.names)
>>>>>
>>>>> this is the very effective FAQ 7.31 trap.
>>>>> http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
>>>>>
>>>>> Welcome to the first circle of Patrick Burns' R Inferno!
>>>>>
>>>>> Also, unname() is a more intuitive way of removing names.
>>>>>
>>>>> And I think your code is quite inefficient, because you calculate
>>>>> quantiles many times, which involves repeated ordering of x, and you may
>>>>> use a inefficient size of bin (either to small and therefore calculating
>>>>> the same split many times or to large and then missing some splits).
>>>>> I'm a bit puzzled what is x and y in your code, so any further advise is
>>>>> vague but you might have a look at any package that calculates
>>>>> ROC-curves such as ROCR or pROC (and many more).
>>>>>
>>>>> Hth
>>>>>
>>>>> --
>>>>> Eik Vettorazzi
>>>>>
>>>>> Department of Medical Biometry and Epidemiology
>>>>> University Medical Center Hamburg-Eppendorf
>>>>>
>>>>> Martinistr. 52
>>>>> 20246 Hamburg
>>>>>
>>>>> T ++49/40/7410-58243
>>>>> F ++49/40/7410-57790
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>
> --
> Eik Vettorazzi
>
> Department of Medical Biometry and Epidemiology
> University Medical Center Hamburg-Eppendorf
>
> Martinistr. 52
> 20246 Hamburg
>
> T ++49/40/7410-58243
> F ++49/40/7410-57790
>



-- 
Lyndon Estes
Research Associate
Woodrow Wilson School
Princeton University
+1-609-258-2392 (o)
+1-609-258-6082 (f)
+1-202-431-0496 (m)
lestes at princeton.edu



More information about the R-help mailing list