[Rd] Partial matching performance in data frame rownames using [
Hilmar Berger
h||m@r@berger @end|ng |rom gmx@de
Mon Dec 11 21:11:48 CET 2023
Dear all,
I have seen that others have discussed the partial matching behaviour of
data.frame[idx,] in the past, in particular with respect to unexpected
result sets.
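For illustration, a minimal sketch of the behaviour referred to above, on a toy data frame (the names d and q are made up for this example and are not from the original report). A query that is only a unique prefix of a row name still returns that row:

```r
# Toy data frame; "al" is a unique prefix of the row name "alpha"
d <- data.frame(v = 1:3, row.names = c("alpha", "beta", "gamma"))
q <- c("beta", "al", "zeta")

# Character row indices are partially matched by [.data.frame:
# "al" picks up the "alpha" row, "zeta" matches nothing and yields an NA row
r_partial <- d[q, , drop = FALSE]
print(r_partial)

# pmatch() on the row names shows the same index pattern
idx_partial <- pmatch(q, rownames(d), duplicates.ok = TRUE)
print(idx_partial)
```

The "al" hit is the kind of unexpected result that earlier discussions were concerned with.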
I am aware of the fact that one can work around this using either
match() or switching to tibble/data.table or similar altogether.
I have a different issue with the partial matching, in particular its
performance when used on large data frames or more specifically, with
large queries matched against its row names.
I came across a case where I wanted to extract data from a large table
(approx 1M rows) using an index of which only about 50% matched the row
names, i.e. about 50% row name hits and 50% misses.
What was unexpected was that in this case [.data.frame hung for a long
time (I waited about 10 minutes and then restarted R). Also, the call
cannot be interrupted in interactive mode.
ids <- paste0("cg", sprintf("%06d", 0:(1e6 - 1)))
d1 <- data.frame(row.names = ids, v = 1:1e6)
q1 <- sample(ids, 1e6, replace = FALSE)
system.time({r <- d1[q1, , drop = FALSE]})
#  user system elapsed
# 0.464  0.000   0.465
# those will hang a long time, I stopped R after 10 minutes
q2 <- c(q1[1:5e5], gsub("cg", "ct", q1[(5e5 + 1):1e6]))
system.time({r <- d1[q2, , drop = FALSE]})
# same here
q3 <- c(q1[1:5e5], rep("FOO", 5e5))
system.time({r <- d1[q3, , drop = FALSE]})
It seems that the penalty of partial matching the non-hits across the
whole row name vector is not negligible any more with large tables and
queries, compared to small and medium tables.
I checked and pmatch(q2, rownames(d1)) is equally slow.
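A scaled-down sketch of this comparison, assuming a table of 1e4 names instead of 1M so that it finishes quickly (ids_small, q_small and q_half are hypothetical names, and the seed is arbitrary). It times pmatch() against match() on a query with 50% misses and checks that both return the same indices in this setup:

```r
# Scaled-down version of the q2 case: 1e4 names, half of the query misses
ids_small <- paste0("cg", sprintf("%05d", 0:(1e4 - 1)))
set.seed(1)  # arbitrary seed, only for reproducibility
q_small <- sample(ids_small, 1e4, replace = FALSE)
q_half <- c(q_small[1:5e3], sub("cg", "ct", q_small[(5e3 + 1):1e4]))

# pmatch() allows partial matches, match() is exact only
t_pmatch <- system.time(i_pmatch <- pmatch(q_half, ids_small))
t_match <- system.time(i_match <- match(q_half, ids_small))

# Same indices here: all names have equal length, so no query can be a
# proper prefix of a name, and the "ct" queries match nothing either way
same <- identical(i_pmatch, i_match)
print(same)

# Elapsed seconds for each approach (values depend on the machine)
print(c(pmatch = t_pmatch[["elapsed"]], match = t_match[["elapsed"]]))
```

Increasing the sizes step by step should show how the gap between the two grows with the number of misses and the number of row names.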
Would it be possible to either a) document this in the help page ("with
large indexes/tables use match()") or, even better, b) add an exact flag
to [.data.frame?
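Until something like option b) exists, a user-level wrapper can approximate it. The following is only a sketch: rows_by_name() and its exact argument are hypothetical and not part of base R, and it simply routes the lookup through match() when exact matching is requested:

```r
# Hypothetical helper, NOT a base R function: row lookup by name with an
# 'exact' switch, as a stand-in for the proposed flag on [.data.frame
rows_by_name <- function(df, names, exact = TRUE) {
  if (!exact) {
    # fall back to the usual behaviour, including partial matching
    return(df[names, , drop = FALSE])
  }
  # exact lookup; names without an exact hit become NA and give NA rows
  idx <- match(names, rownames(df))
  df[idx, , drop = FALSE]
}

# Usage on a toy data frame (made-up data)
d <- data.frame(v = 1:3, row.names = c("alpha", "beta", "gamma"))
res_exact <- rows_by_name(d, c("al", "beta"))
res_partial <- rows_by_name(d, c("al", "beta"), exact = FALSE)
print(res_exact)
print(res_partial)
```

Misses are kept as NA rows so that the result has one row per query element, as with plain [ indexing; dropping them instead would be a one-line change on idx.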
Thanks a lot!
Best regards
Hilmar