[R] mismatch between match and unique causing ecdf (well, approxfun) to fail

Mon Jun 8 12:51:26 CEST 2015

Erm, following up on this: I incorrectly *assumed*, without testing, that rounding would help; it doesn't:

ecdf(round(test2,0)) 	# a rounding that is way too rough for my application...
#Error in xy.coords(x, y) : 'x' and 'y' lengths differ

Digging deeper: the call to unique() mentioned initially is not very informative, as test2 is a data frame, so I get what I deserve: an unchanged data frame with 1 row. Still, the issue remains and can even be simplified further:

> ecdf(data.frame(a=3, b=4))
Empirical CDF 
Call: ecdf(data.frame(a = 3, b = 4))
 x[1:2] =      3,      4

works ok, but

> ecdf(data.frame(a=3, b=3))
Error in xy.coords(x, y) : 'x' and 'y' lengths differ

doesn't (same for a=b=1 or 2, so likely the same for any a=b). Instead, 

> ecdf(c(a=3, b=3))
Empirical CDF 
Call: ecdf(c(a = 3, b = 3))
 x[1:1] =      3

does the trick. From ?ecdf, I gather that x should be a numeric vector, so apparently I have been misusing the function by applying it to a row of a data frame (i.e. a data frame with one row). That worked fine in all my other (dozens of) cases, though, just not in this particular one. A simple unlist() helps:

> ecdf(unlist(data.frame(a=3, b=3)))
Empirical CDF 
Call: ecdf(unlist(data.frame(a = 3, b = 3)))
 x[1:1] =      3
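
To keep this from breaking a loop over rows, the unlist() step could be wrapped in a small helper. This is only a sketch: row_ecdf is a made-up name, not a base R function, and it assumes every column of the row is numeric.

```r
# Hypothetical helper (not part of base R): flatten a one-row data frame
# to a plain numeric vector before handing it to ecdf().
row_ecdf <- function(row) {
  x <- unlist(row, use.names = FALSE)  # data frame (list) -> atomic vector
  stopifnot(is.numeric(x))             # ?ecdf asks for a numeric vector
  ecdf(x)
}

f <- row_ecdf(data.frame(a = 3, b = 3))
knots(f)  # the distinct values the ECDF was built from
f(3)      # proportion of values <= 3
```

Dropping the names with use.names = FALSE is a matter of taste; ecdf() also accepted the named vector above.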

Yet I'm even more confused than before: in my other data, there were also duplicated values in the vector (one-row data frame), and they never caused any issue. In this particular example, they do. I must be missing something fundamental...
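
For what it's worth, the length mismatch can be taken apart outside of ecdf(). The sketch below assumes, as described in my original message, that ecdf() pairs unique(x) with cumsum(tabulate(match(x, unique(x)))); it leaves aside any sorting ecdf() may do first, and count_lengths is just an illustrative name.

```r
# Compare the length of unique(df) with the length of the cumulative counts
# that would be paired with it (illustrative helper, not ecdf() itself).
count_lengths <- function(df) {
  vals <- unique(df)      # data frame method: drops duplicate ROWS only
  idx  <- match(df, vals) # compares the COLUMNS as list elements
  y    <- cumsum(tabulate(idx))
  c(x = length(vals), y = length(y))
}

count_lengths(data.frame(a = 3, b = 4))         # lengths agree
count_lengths(data.frame(a = 3, b = 3))         # y is one entry short
count_lengths(data.frame(a = 3, b = 3, c = 4))  # duplicate, yet lengths agree
```

tabulate() only returns as many bins as the largest index it sees, so in this sketch the lengths differ only when the last column repeats an earlier one. That might be why duplicates elsewhere in a row went unnoticed, but I have not checked this against the ecdf source.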

Michael

> -----Original Message-----
> From: Meyners, Michael
> Sent: Montag, 8. Juni 2015 12:02
> To: 'r-help at r-project.org'
> Subject: mismatch between match and unique causing ecdf (well,
> approxfun) to fail
> 
> All,
> 
> I encountered the following issue with ecdf which was originally on a vector
> of length 10,000, but I have been able to reduce it to a minimal reproducible
> example (just to avoid questions why I'd want to do this for a vector of
> length 2...):
> 
> test2 = structure(list(X817 = 3.39824670255344, X4789 = 3.39824670255344),
> .Names = c("X817", "X4789"), row.names = 74L, class = "data.frame")
> ecdf(test2)
> 
> # Error in xy.coords(x, y) : 'x' and 'y' lengths differ
> 
> In an attempt to track this down, it turns out that
> 
> unique(test2)
> #       X817    X4789
> #74 3.398247 3.398247
> 
> while
> 
> match(test2, unique(test2))
> #[1] 1 1
> 
> matches both values to the first one. This causes a hiccup in the call to ecdf,
> as this uses (an equivalent to) a call to approxfun with x = test2 and y =
> cumsum(tabulate(match(test2, unique(test2)))), the latter now containing
> one entry less than the former, so xy.coords fails.
> 
> I understand that the issue should be somehow related to FAQ 7.31, but I
> would have hoped that unique and match would use the same precision,
> so that both or neither would consider the two values identical, rather
> than match doing so while unique doesn't.
> 
> Last but not least, it doesn't really cause an issue on my end (other than
> breaking my code, and hence the loop, in the first place...); rounding will help
> w/o noteworthy changes to the outcome, so no need to propose a
> workaround :-) I'd rather like to raise the issue and learn whether there is a
> purpose for this behavior, and/or whether there is a generic fix to this, or
> whether I am completely missing something.
> 
> Version info (under Windows 7):
> R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> 
> Cheers, Michael