[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives
Lyndon Estes
lestes at princeton.edu
Mon Jul 11 22:57:53 CEST 2011
Dear List,
I have encountered an odd problem that I cannot understand. It stems
from the calculation of true and false positives from two input
vectors, x and y, using different thresholds of x extracted with
the quantile function. In certain cases I am getting different values
of true positives for the same threshold value when the threshold was
found under different quantiles (e.g. the threshold value Z was found
with both quantile(x, probs = 0.045) and quantile(x, probs = 0.05)). The
following illustrates the problem:
# Start of code with comments
# Load vectors x and y
con <- url("http://sites.google.com/site/ldemisc/r_general/tpfpdat.Rdata")
load(con)
close(con)
# Data frame to collect TP and FP values based on different thresholds
# of vector x
bins <- 0.005
ctch <- data.frame(matrix(nrow = 100 / (bins * 100) + 1, ncol = 4))
colnames(ctch) <- c("threshold", "val", "tp", "fp")
# Extract different TP and FP values in loop, where thresholds are
# based on 1/2 percent increments in quantiles of x
for(i in 1:(100 / (bins * 100) + 1)) {
  bin.ct <- quantile(x, bins * (i - 1), na.rm = TRUE)
  tp <- length(x) - length(x[x <= bin.ct])  # N true positives
  fp <- length(y) - length(y[y <= bin.ct])  # N false positives
  ctch[i, ] <- c(bins * (i - 1) * 100, bin.ct, tp, fp)
}
# The problem is here
ctch[1:20, ]
# threshold val tp fp
#1 0.0 330.57 3139 19485
#2 0.5 374.11 3118 17510
#3 1.0 395.38 3029 16883
#4 1.5 395.38 3029 16883
#5 2.0 395.38 3029 16883
#6 2.5 395.38 3029 16883
#7 3.0 395.38 3029 16883
#8 3.5 395.38 3029 16883
#9 4.0 430.29 2875 15346
#10 4.5 430.29 2875 15346
#11 5.0 430.29 2875 15346
#12 5.5 430.29 3029 15346
#13 6.0 430.29 3029 15346
#14 6.5 430.29 2875 15346
#15 7.0 430.29 2875 15346
#16 7.5 430.29 2875 15346
#17 8.0 430.29 2875 15346
#18 8.5 438.20 2872 14791
#19 9.0 441.66 2835 14656
#20 9.5 441.66 2835 14656
# Note that the values (val) are identical (430.29) for thresholds
# ranging between 4.0 and 8.0. However, the tp values for thresholds
# 5.5 and 6.0 are 3029, whereas they are 2875 for thresholds of 4.0-5.0
# and 6.5-8.0. Given that the threshold value is the same throughout,
# it makes no sense that there is any variation in the tp counts (also
# note that fp is the same throughout this range).
# The problem seems to be here. Re-running the loop with the following
# modification:
for(i in 1:(100 / (bins * 100) + 1)) {
  #bin.ct <- quantile(x, bins * (i - 1), na.rm = TRUE)
  # Substitute line above with the following line, which converts
  # bin.ct to a nameless vector:
  bin.ct <- as.numeric(as.character(quantile(x, bins * (i - 1),
                                             na.rm = TRUE)))
  tp <- length(x) - length(x[x <= bin.ct])  # N true positives
  fp <- length(y) - length(y[y <= bin.ct])  # N false positives
  ctch[i, ] <- c(bins * (i - 1) * 100, bin.ct, tp, fp)
}
# Produces more sensible results
ctch[1:20, ]
# threshold val tp fp
#1 0.0 330.57 3139 19485
#2 0.5 374.11 3118 17510
#3 1.0 395.38 3029 16883
#4 1.5 395.38 3029 16883
#5 2.0 395.38 3029 16883
#6 2.5 395.38 3029 16883
#7 3.0 395.38 3029 16883
#8 3.5 395.38 3029 16883
#9 4.0 430.29 2875 15346
#10 4.5 430.29 2875 15346
#11 5.0 430.29 2875 15346
#12 5.5 430.29 2875 15346 # tp values are consistent now
#13 6.0 430.29 2875 15346 # ""
#14 6.5 430.29 2875 15346
#15 7.0 430.29 2875 15346
#16 7.5 430.29 2875 15346
#17 8.0 430.29 2875 15346
#18 8.5 438.20 2872 14791
#19 9.0 441.66 2835 14656
#20 9.5 441.66 2835 14656
# I am not sure why this is the way it is. The variable bin.ct
# was a vector of class "numeric" in both versions above, but in the
# former case it was named:
# First version
quantile(x, bins * (i - 1), na.rm = TRUE)
# 100%
#771.51
class(quantile(x, bins * (i - 1), na.rm = TRUE))
# "numeric"
# Second version
as.numeric(as.character(quantile(x, bins * (i - 1), na.rm = TRUE)))
#771.51
class(as.numeric(as.character(quantile(x, bins * (i - 1), na.rm = TRUE))))
# "numeric"
# I am therefore not clear why the named vectors resulting from the
# quantile function might be calculating different tp values for the
# same numeric value but with different vector names. It is not
# necessarily the fact that the vector is named, either. For instance,
# if I do this:
j <- 0.055 # bins value for 5.5% threshold
bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
names(bin.ct) <- "5.5%"
length(x) - length(x[x <= bin.ct])
# The value 2875 is returned
# Whereas this (the original construction) produces 3029, the problematic value
bin.ct <- quantile(x, j, na.rm = TRUE)
length(x) - length(x[x <= bin.ct])
# Very curious. One last thing which may or may not be related:
# If I try and subset results from ctch using different values of
# ctch$threshold, this happens:
ctch[ctch$threshold == 3.0, ]
# threshold val tp fp tn fn tpr fpr tnr fnr
#7 3 395.38 3029 16883 16888 111 0.96465 0.499926 0.500074 0.03535
ctch[ctch$threshold == 3.5, ]
# [1] threshold val tp fp tn fn tpr fpr tnr fnr
#<0 rows> (or 0-length row.names)
ctch[ctch$threshold == 4.0, ]
# threshold val tp fp tn fn tpr fpr tnr fnr
#9 4 430.29 2875 15346 18425 265 0.915605 0.454414 0.545586 0.084395
# Why is the indexing failing in the case of
# ctch[ctch$threshold == 3.5, ], when there is a row corresponding
# to this value in the dataframe ctch?
# A post-hoc fix gets rid of this problem, the cause of which I also
# do not understand:
ctch$threshold <- seq(0, 100, by = 0.5)
ctch[ctch$threshold == 3.5, ]
# threshold val tp fp tn fn tpr fpr tnr fnr
#8 3.5 395.38 3029 16883 16888 111 0.96465 0.499926 0.500074 0.03535
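# A possible explanation, offered as an assumption rather than a
# confirmed diagnosis: the thresholds in the loop were built as
# bins * (i - 1) * 100, and in double precision that product need not be
# bit-identical to the literal 3.5, in which case == fails. A minimal
# sketch of the check; thr.calc, thr.seq and ctch.demo are hypothetical
# names used only for this illustration.
thr.calc <- 0.005 * (0:200) * 100     # thresholds as built in the loop
thr.seq <- seq(0, 100, by = 0.5)      # thresholds as in the post-hoc fix
ctch.demo <- data.frame(threshold = thr.calc)
thr.calc[8] == 3.5                    # exact comparison, loop-style value
thr.seq[8] == 3.5                     # exact comparison, seq-style value
isTRUE(all.equal(thr.calc[8], 3.5))   # comparison within a tolerance
# Subsetting with a tolerance instead of exact equality
ctch.demo[abs(ctch.demo$threshold - 3.5) < 1e-8, , drop = FALSE]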
# End code
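One way to probe the tp discrepancy, assuming (this is not established anywhere in this post) that it comes from floating-point representation rather than from the names: two doubles can print identically at the default 7 significant digits yet differ in their last bits, and `<=` sees that difference. The sketch below uses made-up data. The names `v`, `lower`, `x.demo` and `rounded`, and the 1e-13 offset, are hypothetical and only stand in for a threshold that falls a hair below a tied value.

```r
v <- 430.29                      # a value that occurs several times in the data
lower <- v - 1e-13               # hypothetical threshold just below v
x.demo <- c(395.38, v, v, v, 441.66)

print(c(v, lower))               # default printing uses 7 significant digits
print(c(v, lower), digits = 17)  # more digits expose any difference
v == lower                       # exact comparison of the two thresholds

# Counts computed as in the loop: length(x) - length(x[x <= threshold])
length(x.demo) - length(x.demo[x.demo <= v])
length(x.demo) - length(x.demo[x.demo <= lower])

# Round-tripping through character keeps about 15 significant digits,
# which is what the as.numeric(as.character(...)) workaround does
rounded <- as.numeric(as.character(lower))
rounded == v
length(x.demo) - length(x.demo[x.demo <= rounded])
```

On the real data, the analogous check would be print(bin.ct, digits = 17) for the 5.0% and 5.5% quantiles, to see whether the two thresholds really are the same number.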
I would very much appreciate any insight into the issues detailed
above. Am I doing something wrong with my code, or missing something
obvious? As a last bit of information, I should mention that I have
found the same results with both R 2.13 and 2.13.1 (installed today).
Thanks in advance for your help.
Best, Lyndon
p.s. Here is my sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C/en_US.UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rgdal_0.7-1 raster_1.8-39 sp_0.9-83
loaded via a namespace (and not attached):
[1] grid_2.13.1 lattice_0.19-30 tools_2.13.1
--
Lyndon Estes
Research Associate
Woodrow Wilson School
Princeton University
+1-609-258-2392 (o)
+1-609-258-6082 (f)
+1-202-431-0496 (m)
lestes at princeton.edu