[Rd] `merge()` not consistent in how it treats list columns

Antoine Fabri
Sun Jan 3 11:14:57 CET 2021


Hi Gabe,


>  [... ]
> Well, sure but that is because it happens to be a list with each element
> having length one. In which case, it really should not have been a list at
> all, and the fact that it was seems a deeper problem that should likely be
> resolved instead of treating the symptom, in my opinion.
>

I wouldn't mind it explicitly failing on the grounds that you don't join a
list column on a character column, and I wouldn't mind it succeeding
either, because that is consistent with `c("a", "b") == list("a", "b")` and
`c("a", "b") %in% list("a", "b")` both returning `c(TRUE, TRUE)`. But I feel
strongly that it shouldn't behave differently depending on which data frame
is provided first to the function, and I do think that if we do make it an
error, it is worth making the message understandable.
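
For reference, the two comparisons mentioned above can be checked directly. This is only a small sketch; the names `chr` and `lst` are chosen here for illustration.

```r
chr <- c("a", "b")
lst <- list("a", "b")

# Element-wise comparison of a character vector with a list of
# length-one strings
chr == lst

# Membership test against the same list
chr %in% lst
```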


> > df1 <- data.frame(a=1)
> > df2 <- data.frame(b=2)
> > df1$id <- "ID"
> > df2$id <- list(c("ID", "ID2"))
> > merge(df1, df2)
> [1] id a  b
> <0 rows> (or 0-length row.names)
>
> That's probably not what you wanted it to do, right? Or maybe it is, it
> depends, right? And therein lies the rub.
>
> I have to be honest, as a developer, I really wish this, even in your
> example case, threw an error. Anything else just looks to me like a
> debugging nightmare looming in the wings waiting to strike.
>

To provide context, what I did wrong in my real case was compute `df2$id <-
lapply(x, fun)`. That was a mistake, but the result looked fine when printed.
`vapply` solved the issue; `sapply` would still have been problematic because
`df2$id` would be an empty list for an `x` of length 0.
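
To make the difference concrete, here is a minimal sketch. The function `fun` below is a hypothetical stand-in for the one in my real code, and `df_list`, `df_chr` and `empty` are names introduced only for this example.

```r
# Hypothetical stand-in for `fun`: returns one string per input value
fun <- function(v) paste0("ID", v)

x <- 1:2

df_list <- data.frame(b = x)
df_list$id <- lapply(x, fun)               # creates a list column
print(df_list)                             # prints like a character column
is.list(df_list$id)

df_chr <- data.frame(b = x)
df_chr$id <- vapply(x, fun, character(1))  # creates a character column
is.character(df_chr$id)

# With a zero-length input, sapply() has nothing to simplify and returns
# a list, while vapply() still returns a character vector
empty <- integer(0)
is.list(sapply(empty, fun))
is.character(vapply(empty, fun, character(1)))
```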

After correcting my mistake I tried to isolate the error and had trouble
reproducing it with my simple case because I had swapped the two data frame
arguments. This is how the inconsistency plus the cryptic message caused me
more trouble than I think it should have.

Imagine production code, maybe not written by me, that has worked for years
with `merge(df1, df2)`. I change it to `merge(df2, df1)` for some reason and
all hell breaks loose with `Error in sort.list(bx[m$xi]): 'x' must be
atomic for 'sort.list', method "shell" and "quick"`. If I'm not familiar
with list columns, or don't know that they can print just like character
columns, I might have a rough day.
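
A sketch for comparing the two argument orders side by side. The helper `try_merge` is hypothetical and only captures the error message instead of stopping; the data frames are a simplified version of the case discussed in this thread, and the exact outcome of each call may depend on the R version.

```r
df1 <- data.frame(a = 1, id = "ID", stringsAsFactors = FALSE)
df2 <- data.frame(b = 2)
df2$id <- list("ID")   # list column that prints like a character column

# Hypothetical helper: return the merge result, or the error message
# if merge() fails
try_merge <- function(x, y) {
  tryCatch(merge(x, y), error = function(e) conditionMessage(e))
}

res_12 <- try_merge(df1, df2)
res_21 <- try_merge(df2, df1)

print(res_12)
print(res_21)
```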

Here's another oddity that I think is worth fixing:

df1 <- data.frame(a=1, id = "ID")
df3 <- data.frame(c=character(), id = list())
merge(df3, df1)
#> [1] x[FALSE, ] a          id
#> <0 rows> (or 0-length row.names)
merge(df1, df3)
#> [1] a          id         y[FALSE, ]
#> <0 rows> (or 0-length row.names)
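
One thing worth checking here, offered as an observation rather than a diagnosis of `merge()`: `data.frame(c = character(), id = list())` does not create an `id` column, because a zero-length list contributes no columns. `df3` therefore has a single column and shares no column name with `df1`. The sketch below verifies this and shows one possible way to build a zero-row list column; `df3b` is a name introduced only for this example.

```r
df3 <- data.frame(c = character(), id = list())
names(df3)           # only "c": the empty list adds no column

# One way to get an actual zero-row list column (an assumption about
# what the example was meant to construct)
df3b <- data.frame(c = character())
df3b$id <- list()
names(df3b)
is.list(df3b$id)
```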

 [...]
>


> There's no reason (in principle) you wouldn't be able to join by a list
> column, they should just both have to be list columns, in my ideal (but
> admittedly unlikely) world.  I'd rather the atomic-vector/list mismatch case
> throw an error, myself.
>

The doc does say that "This is intended to work with data frames with
vector-like columns" in a note at the bottom, so anything we do is
consistent with the doc, and fine by me if it fails (that's how {dplyr}
joins work), but let the order of the data frames not matter. A warning is
another option.
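
As an illustration of an order-independent failure, here is a minimal sketch of a user-level wrapper. `merge_checked` is a hypothetical name and is not part of base R or a proposal for the actual implementation; it only checks the join columns and then delegates to `merge()`.

```r
# Hypothetical wrapper: refuse to merge when a join column is a list in
# either data frame, whichever order the arguments are given in
merge_checked <- function(x, y, by = intersect(names(x), names(y)), ...) {
  is_list_col <- function(df) vapply(df[by], is.list, logical(1))
  bad <- by[is_list_col(x) | is_list_col(y)]
  if (length(bad) > 0L) {
    stop("cannot merge on list column(s): ", paste(bad, collapse = ", "),
         call. = FALSE)
  }
  merge(x, y, by = by, ...)
}

df1 <- data.frame(a = 1, id = "ID", stringsAsFactors = FALSE)
df2 <- data.frame(b = 2)
df2$id <- list("ID")

# Both argument orders fail with the same message
msg_12 <- tryCatch(merge_checked(df1, df2), error = conditionMessage)
msg_21 <- tryCatch(merge_checked(df2, df1), error = conditionMessage)
identical(msg_12, msg_21)
```

A warning instead of an error would only need `stop()` replaced by `warning()` in the sketch above.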


> Now I kind of doubt we can change the behavior that works now, but as Avi
> points out, I think this is something that is complicated and case specific
> enough that it really ought to be your job as the coder to take care of
> what should happen when you try to merge on columns that are fundamentally
> different types.
>

Well yes, one can always say it's the developer's fault, but we all
appreciate software that guides us toward the light. List columns are not
a rare thing at all anymore, and using an `lapply` call instead of `sapply`
or `vapply` is probably not a rare mistake. And again, the inconsistency is
wrong in any case.

I'll read other answers when I get the digest.

Thanks,

Antoine
