[R] a merge() problem
Sam Steingold
sds at gnu.org
Wed Oct 10 18:32:40 CEST 2012
> * Prof Brian Ripley <evcyrl at fgngf.bk.np.hx> [2012-10-08 06:37:07 +0100]:
>
> On 08/10/2012 02:57, Peter Ehlers wrote:
>> On 2012-10-07 14:44, Sam Steingold wrote:
>>>> * Peter Ehlers <ruyref at hpnytnel.pn> [2012-10-07 10:03:42 -0700]:
>>>>
>>>> On 2012-10-07 08:34, Sam Steingold wrote:
>>>>> I know it does not look very good - using the same column names to mean
>>>>> different things in different data frames, but here you go:
>>>>> --8<---------------cut here---------------start------------->8---
>>>>>> x <- data.frame(a=c(1,2,3),b=c(4,5,6))
>>>>>> y <- data.frame(b=c(1,2),a=c("a","b"))
>>>>>> merge(x,y,by.x="a",by.y="b",all.x=TRUE,suffixes=c("","y"))
>>>>>   a b    a
>>>>> 1 1 4    a
>>>>> 2 2 5    b
>>>>> 3 3 6 <NA>
>>>>> Warning message:
>>>>> In merge.data.frame(x, y, by.x = "a", by.y = "b", all.x = TRUE) :
>>>>> column name 'a' is duplicated in the result
>>>>> --8<---------------cut here---------------end--------------->8---
>>>>> why is the suffixes argument ignored?
>>>>> I mean, I expected the second "a" to be "a.y".
>>>>
>>>> The 'suffixes' argument refers to _non-by_ names only (as per ?merge).
>>>
>>> yes, but "a" in "y" is _not_ a by-name.
>>
>> Yes, it is.
>> The set of by-names is the union of names specified by by.x and by.y,
>> in your case: c("a", "b").
>> I suppose that a case could be made that ?merge does not spell that
>> out sufficiently explicitly.
>
> It does in 'Details' (and where else would there be such a detail?)
> E.g. in R 2.15.1:
>
> If the remaining columns in the data frames have any common names,
> these have ‘suffixes’ (‘".x"’ and ‘".y"’ by default) appended to
> try to make the names of the result unique. If this is not
> possible, an error is thrown.
>
> Note *remaining*, and read what comes before that.
I read the docs and re-read them after seeing your message and, with all
due respect, I fail to interpret them the way you do:
The doc speaks about "columns to merge on", not "column names".
I specify both by.x and by.y, thus I do not specify the column y$a.
Note, however, that I do not want the doc fixed, I want the behavior modified.
I see no advantage in the current behavior (a warning + duplicate column
names) as opposed to the behavior I expected (renaming the column in the
result to "a.y").
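In the meantime, here is a minimal sketch of a workaround: rename the
clashing non-key column of y by hand before merging, so that merge() has
nothing left to disambiguate. The name "a.y" and the helper variables y2
and res are only my choices for illustration, not something merge()
produces on its own.

```r
# Same data as in the example quoted above.
x <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
y <- data.frame(b = c(1, 2), a = c("a", "b"))

# Work on a copy of y and rename its non-key column "a" manually
# (the new name "a.y" is an arbitrary, illustrative choice).
y2 <- y
names(y2)[names(y2) == "a"] <- "a.y"

# Merge on x$a and y2$b; the key column keeps the name from by.x.
res <- merge(x, y2, by.x = "a", by.y = "b", all.x = TRUE)
print(names(res))
```

This avoids the warning because the column names in the result no longer
collide, at the cost of an extra renaming step before each such merge.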
Thanks a lot for your kind replies and insight!
--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://americancensorship.org http://iris.org.il
http://jihadwatch.org http://ffii.org http://truepeace.org
Never argue with the person who is preparing your parachute.