[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives

Mon Jul 11 22:57:53 CEST 2011

Dear List,

I have encountered an odd problem that I cannot understand. It stems
from calculating true and false positives from two input vectors, x
and y, at different thresholds of x extracted with the quantile
function. In certain cases I get different numbers of true positives
for the same threshold value when that threshold was found under
different quantiles (e.g. the threshold value Z was returned by both
quantile(x, probs = 0.045) and quantile(x, probs = 0.05)). The
following illustrates the problem:

# Start of code with comments
# Load vectors x and y
con <- url("http://sites.google.com/site/ldemisc/r_general/tpfpdat.Rdata")
load(con)
close(con)

# Data frame to collect TP and FP values based on different thresholds
# of vector x
bins <- 0.005
ctch <- data.frame(matrix(nrow = 100 / (bins * 100) + 1, ncol = 4))
colnames(ctch) <- c("threshold", "val", "tp", "fp")

# Extract different TP and FP values in loop, where thresholds are
# based on 1/2 percent increments in quantiles of x
for(i in 1:(100 / (bins * 100) + 1)) {
    bin.ct <- quantile(x, bins * (i - 1), na.rm = TRUE)
    tp <- length(x) - length(x[x <= bin.ct])  # N true positives
    fp <- length(y) - length(y[y <= bin.ct])  # N false positives
    ctch[i, ] <- c(bins * (i - 1) * 100, bin.ct, tp, fp)
}

# The problem is here
ctch[1:20, ]

#   threshold    val   tp    fp
#1        0.0 330.57 3139 19485
#2        0.5 374.11 3118 17510
#3        1.0 395.38 3029 16883
#4        1.5 395.38 3029 16883
#5        2.0 395.38 3029 16883
#6        2.5 395.38 3029 16883
#7        3.0 395.38 3029 16883
#8        3.5 395.38 3029 16883
#9        4.0 430.29 2875 15346
#10       4.5 430.29 2875 15346
#11       5.0 430.29 2875 15346
#12       5.5 430.29 3029 15346
#13       6.0 430.29 3029 15346
#14       6.5 430.29 2875 15346
#15       7.0 430.29 2875 15346
#16       7.5 430.29 2875 15346
#17       8.0 430.29 2875 15346
#18       8.5 438.20 2872 14791
#19       9.0 441.66 2835 14656
#20       9.5 441.66 2835 14656

# Note that the values (val) are identical (430.29) for thresholds
# ranging between 4.0 and 8.0. However, the tp values for thresholds
# 5.5 and 6.0 are 3029, whereas they are 2875 for thresholds of
# 4.0-5.0 and 6.5-8.0. Given that the threshold value is the same
# throughout, it makes no sense that there is any variation in the
# tp values (also note that fp is the same throughout this range).

# The problem seems to be here. Re-running the loop with the following
# modification:

for(i in 1:(100 / (bins * 100) + 1)) {
    #bin.ct <- quantile(x, bins * (i - 1), na.rm = TRUE)
    # Substitute line above with the following line:
    # (converts bin.ct to a nameless vector)
    bin.ct <- as.numeric(as.character(quantile(x, bins * (i - 1),
                                               na.rm = TRUE)))
    tp <- length(x) - length(x[x <= bin.ct])  # N true positives
    fp <- length(y) - length(y[y <= bin.ct])  # N false positives
    ctch[i, ] <- c(bins * (i - 1) * 100, bin.ct, tp, fp)
}

# Produces more sensible results
ctch[1:20, ]

#   threshold    val   tp    fp
#1        0.0 330.57 3139 19485
#2        0.5 374.11 3118 17510
#3        1.0 395.38 3029 16883
#4        1.5 395.38 3029 16883
#5        2.0 395.38 3029 16883
#6        2.5 395.38 3029 16883
#7        3.0 395.38 3029 16883
#8        3.5 395.38 3029 16883
#9        4.0 430.29 2875 15346
#10       4.5 430.29 2875 15346
#11       5.0 430.29 2875 15346
#12       5.5 430.29 2875 15346  # tp values are consistent now
#13       6.0 430.29 2875 15346  # ""
#14       6.5 430.29 2875 15346
#15       7.0 430.29 2875 15346
#16       7.5 430.29 2875 15346
#17       8.0 430.29 2875 15346
#18       8.5 438.20 2872 14791
#19       9.0 441.66 2835 14656
#20       9.5 441.66 2835 14656

# I am not sure why this is the way it is. The variable bin.ct was a
# vector of class "numeric" in both versions above, but in the former
# case it was named:

# First version
quantile(x, bins * (i - 1), na.rm = TRUE)
# 100%
#771.51
class(quantile(x, bins * (i - 1), na.rm = TRUE))
# "numeric"

# Second version
as.numeric(as.character(quantile(x, bins * (i - 1), na.rm = TRUE)))
#771.51
class(as.numeric(as.character(quantile(x, bins * (i - 1), na.rm = TRUE))))
# "numeric"

# I am therefore not clear why the named vectors returned by the
# quantile function would produce different tp values for the same
# numeric value but with different vector names. It is not necessarily
# the fact that the vector is named, either. For instance, if I do this:

j <- 0.055  # bins value for 5.5% threshold
bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
names(bin.ct) <- "5.5%"
length(x) - length(x[x <= bin.ct])
# The value 2875 is returned

# Whereas this (the original construction) produces 3029, the problematic value
bin.ct <- quantile(x, j, na.rm = TRUE)
length(x) - length(x[x <= bin.ct])

# Very curious. One last thing, which may or may not be related:
# if I try to subset results from ctch using different values of
# ctch$threshold, this happens:

ctch[ctch$threshold == 3.0, ]
#  threshold    val   tp    fp    tn  fn     tpr      fpr      tnr     fnr
#7         3 395.38 3029 16883 16888 111 0.96465 0.499926 0.500074 0.03535

ctch[ctch$threshold == 3.5, ]
# [1] threshold val       tp        fp        tn        fn        tpr       fpr       tnr       fnr
# <0 rows> (or 0-length row.names)

ctch[ctch$threshold == 4.0, ]
#  threshold    val   tp    fp    tn  fn      tpr      fpr      tnr      fnr
#9         4 430.29 2875 15346 18425 265 0.915605 0.454414 0.545586 0.084395

# Why is the indexing failing in the case of
# ctch[ctch$threshold == 3.5, ], when there is a row corresponding
# to this value in the dataframe ctch?

# A post-hoc fix gets rid of this problem, the cause of which I also
# do not understand:
ctch$threshold <- seq(0, 100, by = 0.5)
ctch[ctch$threshold == 3.5, ]
#  threshold    val   tp    fp    tn  fn     tpr      fpr      tnr     fnr
#8       3.5 395.38 3029 16883 16888 111 0.96465 0.499926 0.500074 0.03535

# End code
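
One diagnostic that might help narrow this down, assuming (and this is
only a guess) that the thresholds differ in digits beyond those printed
rather than because of their names: compare two thresholds exactly and
within a tolerance, and print them at full precision. The helper
compare_thresholds() and the tolerance of 1e-8 are hypothetical, made up
for this sketch, and the 0.1 + 0.2 example merely stands in for the real
data:

```r
# Hypothetical helper (not part of base R): compare two thresholds
# exactly and within a tolerance, and show them at 17 significant digits.
compare_thresholds <- function(a, b, tol = 1e-8) {
    a <- unname(a)  # drop names such as "5%" so only the values matter
    b <- unname(b)
    list(printed  = format(c(a, b), digits = 17),
         exact    = a == b,
         tolerant = abs(a - b) <= tol)
}

# Stand-in example: two doubles that print identically at default
# precision but are not exactly equal
a <- 0.1 + 0.2
b <- 0.3
res <- compare_thresholds(a, b)
res$printed   # the two values at full precision
res$exact     # FALSE
res$tolerant  # TRUE

# Round-tripping through as.character() keeps only about 15 significant
# digits, which might be why it appears to "fix" the comparison
as.numeric(as.character(a)) == b

# The same kind of check for the threshold column, built the way the
# loop builds it (bins * (i - 1) * 100 with i = 8)
thr <- 0.005 * 7 * 100
thr == 3.5                   # exact comparison, as used in the subsetting
isTRUE(all.equal(thr, 3.5))  # TRUE: equal within all.equal's tolerance
```

On the real data, the equivalent call would be
compare_thresholds(quantile(x, 0.05, na.rm = TRUE),
quantile(x, 0.055, na.rm = TRUE)).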

I would very much appreciate any insight into the issues detailed
above. Am I doing something wrong with my code, or missing something
obvious? As a last bit of information, I should mention that I have
found the same results with both R 2.13 and 2.13.1 (installed today).

Thanks in advance for your help.

Best, Lyndon

p.s. Here is my sessionInfo()

R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rgdal_0.7-1   raster_1.8-39 sp_0.9-83

loaded via a namespace (and not attached):
[1] grid_2.13.1     lattice_0.19-30 tools_2.13.1

-- 
Lyndon Estes
Research Associate
Woodrow Wilson School
Princeton University
+1-609-258-2392 (o)
+1-609-258-6082 (f)
+1-202-431-0496 (m)
lestes at princeton.edu