[Rd] match and unique
hpages at fredhutch.org
Thu Mar 17 21:20:57 CET 2016
On 03/16/2016 08:03 AM, Therneau, Terry M., Ph.D. wrote:
> Is the phrase "index <- match(x, sort(unique(x)))" reliable, in the
> sense that it will never return NA?
This is assuming that match() and unique() will never disagree on
equality between 2 floating point values. I believe they share some
code internally (same hashing routine?), so maybe it's reliable.
Anyway, it's always preferable to not rely on this kind of assumption.
A safer thing to do is to use rank():
r <- rank(x, ties.method="min") # could use "max"
Think of 'r' as a unique ID assigned to each value in 'x'. This ID takes
its values in the (1,length(x)) range but we want it to take its values
in the (1,length(unique(x))) range:
ID_remapping <- cumsum(tabulate(r, nbins=length(r)) != 0L)
index <- ID_remapping[r]
'index' will be the same as 'match(x, sort(unique(x))' but doesn't rely
on the assumption that match() and unique() agree on equality between
2 floating point values.
Unfortunately rank() is very slow, much slower than sort(). Here is a
faster solution based on sort.list(x, na.last=NA, method="quick"):
assignID <- function(x)
oo <- sort.list(x, na.last=NA, method="quick")
sorted <- x[oo]
is_unique <- c(TRUE, sorted[-length(sorted)] != sorted[-1L])
sorted_ID <- cumsum(is_unique)
ID <- integer(length(x))
ID[oo] <- sorted_ID
'assignID(x)' is also slightly faster than 'match(x, sort(unique(x)))':
x <- runif(5000000)
system.time(index1 <- match(x, sort(unique(x))))
# user system elapsed
# 2.170 0.552 2.725
system.time(index2 <- assignID(x))
# user system elapsed
# 0.885 0.032 0.917
#  TRUE
> Context: Calculation of survival curves involves the concept of unique
> death times. I've had reported cases in the past where survfit failed,
> and it was due to the fact that two "differ by machine precision" values
> would sometimes match and sometimes not, depending on how I compared
> them. I've dealt with those piecemeal in the past, but am going to do a
> code review and make sure that I do things consistently throughout the
> survival package. The basic plan will be to change time to an integer,
> do all the work, then restore labels at the end. The above line is one
> candidate for the first step.
> An alternative is index <- as.numeric(factor(x)), with
> as.numeric(levels(factor(x))) as the final labeling step. This is a
> more severe rounding, is it not? But perhaps it is preferable? The KM
> branch of the current survfit routine does this, and I've had one user
> report a bug in that
> x <- runif(20)
> fit <- survfit(Surv(x) ~1)
> summary(fit, times=x)
> will produce lines with 0, 1 or 2 events when "they all should be 1".
> The same issue just came up in an rpart example, sent to me. For coxph
> models is may only be a matter of time.
> Suggestions and opinions are welcome.
> Terry Therneau
> R-devel at r-project.org mailing list
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the R-devel