[R] merge gives me too many rows

Denis Chabot chabotd at globetrotter.net
Mon Sep 18 13:02:53 CEST 2006


Hi Phil and Don,
On 2006-09-18 at 05:02, Phil Spector wrote:

> Denis -
>    As long as there is one data frame that has exactly 0 or 1
> observations for each level of the by variables, merge in R will
> operate identically to SAS without the need for any special options.
>    The problem arises when there are multiple observations for some of
> the levels of the by variables in both of the data frames.  In
> relational databases and R, new observations are created to represent
> every combination of observations in the original data sets that share
> the same level of the by variables.  In SAS, the last observation in
> the data set with fewer observations having a particular level of the
> by variable is repeated enough times to provide a match for each
> observation in the other data set.
>     You can see this with a tiny example -- look at the observations
> in the output data set corresponding to id==3:
>
>> a = data.frame(id=c(1,2,3,3,4),x=rnorm(5))
>> b = data.frame(id=c(1,2,3,3,3,4),y=rnorm(6))
>> c = merge(a,b)
>> dim(a)
> [1] 5 2
>> dim(b)
> [1] 6 2
>> dim(c)
> [1] 9 3
>> c
>   id           x            y
> 1  1 -2.15987259 -1.555878070
> 2  2  0.96856842 -0.666941824
> 3  3  1.68080409 -0.009239307
> 4  3  0.06044914 -0.009239307
> 5  3  1.68080409 -0.242538473
> 6  3  0.06044914 -0.242538473
> 7  3  1.68080409 -1.298344557
> 8  3  0.06044914 -1.298344557
> 9  4  1.76424928 -0.420144744
>
> The important thing to remember is that there is no unambiguous way
> to merge two data sets when there are differing numbers of
> observations for a given level of the by variable in the two data
> sets -- SAS and R simply choose different strategies in this case.
>
>                                        - Phil Spector
> 					 Statistical Computing Facility
> 					 Department of Statistics
> 					 UC Berkeley
> 					 spector at stat.berkeley.edu
>
>

I knew this, but it is good to be reminded and forced to verify
one's assumptions.

Originally, the information about both stomach contents and fish size
came from the same, larger database. I made the two smaller data
frames to reduce the memory footprint in R when not all variables were
needed. The method I used, suggested to me on this very helpful list a
long time back, ensured that all rows were unique:

fish <- subset(esto, select=c(predateu, origin, navire, nbpc, no_rel,  
trait, tagno, longueur, masse))
fish <- unique(fish)       # NO NEED TO SORT FIRST

Or at least it would have, if there were NO MISTAKES in my database to
start with. It was this false belief that all rows had to be unique
that made me pull my hair out last night and assume the problem was
with merge.

I just checked and found that some fish have two lengths or masses,
either from mistakes when entering them into the database or from
mistakes when numbering fish (two fish with the same number). I'll go
back to the raw data to fix this! So thanks for making me verify this
assumption.
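In a toy sketch (made-up tagno/longueur values, not my real data),
spotting such duplicate keys with duplicated() looks like this:

```r
## Hypothetical toy data: fish number 3 was entered twice with two
## different lengths (longueur), the kind of mistake described above.
fish <- data.frame(tagno    = c(1, 2, 3, 3),
                   longueur = c(25, 31, 40, 42))

## Flag every row whose key occurs more than once; checking from both
## ends catches all copies, not just the later ones.
dups <- fish[duplicated(fish$tagno) |
             duplicated(fish$tagno, fromLast = TRUE), ]
dups
##   tagno longueur
## 3     3       40
## 4     3       42
```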

On 2006-09-18 at 01:38, Don MacQueen wrote:
> I think you may misunderstand the meaning of all.x = FALSE.
>
> Setting all.x to false ensures that only rows of x that have  
> matches in y will be included. Equivalently, if a row of x is not  
> matched in y, it will not be in the output. However, if a row in x  
> is matched by more than one row in y, then that row will be  
> repeated as many times as there are matching rows in y. That is,  
> you have a 1 to many match (1 in x to many in y). SAS behaves the  
> same way.
>
> Are you sure this is not what is happening?
>
> Also, all.x = FALSE is the default; it is not necessary to specify  
> it. In fact, the default is to output only rows that are found in  
> both x and y (matching on the specified variables, of course).
>
> -Don
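The 1-to-many repetition Don describes can be seen in a tiny sketch
(toy data, made-up ids):

```r
## id 2 in x matches two rows in y, so even with the default
## all.x = FALSE the single x row is repeated once per match,
## while the unmatched id 1 is dropped.
x <- data.frame(id = c(1, 2), x = c(10, 20))
y <- data.frame(id = c(2, 2), y = c(200, 201))
m <- merge(x, y, by = "id", all.x = FALSE)
m
##   id  x   y
## 1  2 20 200
## 2  2 20 201
```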

Thank you Don,

Looking at my old programs using merge, I always used "all.x=T".
Yesterday, because I was getting more rows than expected, I assumed
that switch caused it and started setting it to F (i.e., early on,
when I got extra lines, I assumed that fish measured in the second
data frame, but for which I had no stomach contents, were being merged
in anyway). I never went back to "all.x=T" after verifying it was not
the cause of the problem; instead, some lines in the first data frame
were duplicated (multiple matches, as it turns out). But you are
right. In fact, I ran some tests, and the behaviour I want in this and
most cases is "all.x=T" with "all.y=F": if I ever have a line in the
first data frame with no match in the second, I want to keep that line
in the final data frame.
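That left-join behaviour, sketched on toy data (made-up ids again):

```r
## all.x = TRUE, all.y = FALSE keeps every row of the first data
## frame, matched or not (unmatched rows get NA), and drops rows of
## the second data frame that have no match in the first.
a <- data.frame(id = c(1, 2, 3), x = c(10, 20, 30))
b <- data.frame(id = c(2, 3, 4), y = c(200, 300, 400))
m <- merge(a, b, by = "id", all.x = TRUE, all.y = FALSE)
m
##   id  x   y
## 1  1 10  NA
## 2  2 20 200
## 3  3 30 300
```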

Again, many thanks,

Denis

> At 9:11 PM -0400 9/17/06, Denis Chabot wrote:
>
>> Hi,
>>
>> I am using merge to add some variables to an existing dataframe. I  
>> use the option "all.x=F" so that my final dataframe will only have  
>> as many rows as the first file I name in the call to merge.
>>
>> With a large dataframe using a lot of "by" variables, the number  
>> of rows of the merged dataframe increases from 177325 to 179690:
>>
>>  >dim(test)
>> [1] 177325      9
>>  > test2 <- merge(test, fish, by=c("predateu", "origin", "navire",  
>> "nbpc", "no_rel", "trait", "tagno"), all.x=F)
>>  > dim(test2)
>> [1] 179690     11
>>
>> I tried to make a smaller dataset with R commands that I could  
>> post here so that other people could reproduce, but merge behaved  
>> as expected: final number of rows was the same as the number of  
>> rows in the first file named in the call to merge.
>>
>> I took a subset of my large dataframe and could mail this to  
>> anyone interested in verifying the problem.
>>
>>  > test3 <- test[100001:160000,]
>>  >
>>  > dim(test3)
>> [1] 60000     9
>>  > test4 <- merge(test3, fish, by=c("predateu", "origin",  
>> "navire", "nbpc", "no_rel", "trait", "tagno"), all.x=F)
>>  >
>>  > dim(test4)
>> [1] 60043    11
>>
>> I compared test3 and test4 line by line. The first 11419 lines  
>> were the same (except for added variables, obviously) in both  
>> dataframes, but then lines 11420 to 11423 were repeated in test4.  
>> Then no problem for a lot of rows, until rows 45756-45760 in  
>> test3. These are offset by 4 in test4 because of the first group  
>> of extraneous lines just reported, and are found on lines 45760 to  
>> 45765. But they are also repeated on lines 45765 to 45769. And so  
>> on a few more times.
>>
>> Thus merge added lines (repeated a small number of lines) to the  
>> final dataframe despite my use of all.x=F.
>>
>> Am I doing something wrong? If not, is there a solution? Not being  
>> able to merge is a setback! I was attempting to move the last few  
>> things I was doing with SAS to R...
>>
>> Please let me know if you want the file test3 (2.3 MB as a csv  
>> file, but only 352 KB in R (.rda) format).
>>
>> Sincerely,
>>
>> Denis Chabot
>>
>>  > R.Version()
>> $platform
>> [1] "powerpc-apple-darwin8.6.0"
>>
>> $arch
>> [1] "powerpc"
>>
>> $os
>> [1] "darwin8.6.0"
>>
>> $system
>> [1] "powerpc, darwin8.6.0"
>>
>> $status
>> [1] ""
>>
>> $major
>> [1] "2"
>>
>> $minor
>> [1] "3.1"
>>
>> $year
>> [1] "2006"
>>
>> $month
>> [1] "06"
>>
>> $day
>> [1] "01"
>>
>> $`svn rev`
>> [1] "38247"
>>
>> $language
>> [1] "R"
>>
>> $version.string
>> [1] "Version 2.3.1 (2006-06-01)"


