[R] merge gives me too many rows
Don MacQueen
macq at llnl.gov
Mon Sep 18 07:38:43 CEST 2006
I think you may misunderstand the meaning of all.x = FALSE.
Setting all.x = FALSE means that only rows of x that have a match
in y are included; equivalently, a row of x with no match in y is
dropped from the output. It does not limit the output to one row per
row of x. If a row of x is matched by more than one row of y, that
row is repeated once for each matching row of y. That is, you have a
one-to-many match (one in x to many in y). SAS behaves the same way.
Are you sure this is not what is happening?
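A small made-up illustration of the one-to-many case may help. The data frames x and y, the column names, and the helper variables by_vars and dup_keys are hypothetical and not taken from your data; this is only a sketch of how one might check for repeated keys:

```r
# Toy data (hypothetical): key 2 occurs twice in y
x <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))
y <- data.frame(id = c(1, 2, 2, 3), b = c(10, 20, 21, 30))

by_vars <- "id"                 # could be a vector of several column names
m <- merge(x, y, by = by_vars)  # all.x = FALSE is the default

nrow(x)  # 3
nrow(m)  # 4: the x row with id 2 appears once per matching row of y

# Key combinations that occur more than once in y; these are the ones
# that repeat rows of x in the merged result
dup_keys <- unique(y[duplicated(y[by_vars]), by_vars, drop = FALSE])
dup_keys
```

If the same check on your second data frame, using your seven by variables, returns any rows, that would account for the extra lines.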
Also, all.x = FALSE is the default; it is not necessary to specify
it. In fact, the default is to output only rows that are found in
both x and y (matching on the specified variables, of course).
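To show the default next to all.x = TRUE, here is a sketch with made-up data (x, y, id, inner and left are illustrative names only):

```r
# Toy data (hypothetical): id 3 in x has no match in y
x <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))
y <- data.frame(id = c(1, 2), b = c(10, 20))

inner <- merge(x, y, by = "id")                # default: only matched rows
left  <- merge(x, y, by = "id", all.x = TRUE)  # also keep unmatched rows of x

nrow(inner)           # 2
nrow(left)            # 3
left$b[left$id == 3]  # NA, since y has no row with id 3
```

Neither setting removes repeated matches; all.x only controls whether unmatched rows of x are kept.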
-Don
At 9:11 PM -0400 9/17/06, Denis Chabot wrote:
>Hi,
>
>I am using merge to add some variables to an existing dataframe. I
>use the option "all.x=F" so that my final dataframe will only have as
>many rows as the first file I name in the call to merge.
>
>With a large dataframe using a lot of "by" variables, the number of
>rows of the merged dataframe increases from 177325 to 179690:
>
> >dim(test)
>[1] 177325 9
> > test2 <- merge(test, fish, by=c("predateu", "origin", "navire",
>"nbpc", "no_rel", "trait", "tagno"), all.x=F)
> > dim(test2)
>[1] 179690 11
>
>I tried to make a smaller dataset with R commands that I could post
>here so that other people could reproduce, but merge behaved as
>expected: final number of rows was the same as the number of rows in
>the first file named in the call to merge.
>
>I took a subset of my large dataframe and could mail this to anyone
>interested in verifying the problem.
>
> > test3 <- test[100001:160000,]
> >
> > dim(test3)
>[1] 60000 9
> > test4 <- merge(test3, fish, by=c("predateu", "origin", "navire",
>"nbpc", "no_rel", "trait", "tagno"), all.x=F)
> >
> > dim(test4)
>[1] 60043 11
>
>I compared test3 and test4 line by line. The first 11419 lines were
>the same (except for added variables, obviously) in both dataframes,
>but then lines 11420 to 11423 were repeated in test4. Then no problem
>for a lot of rows, until rows 45756-45760 in test3. These are offset
>by 4 in test4 because of the first group of extraneous lines just
>reported, and are found on lines 45760 to 45765. But they are also
>repeated on lines 45765 to 45769. And so on a few more times.
>
>Thus merge added lines (repeated a small number of lines) to the
>final dataframe despite my use of all.x=F.
>
>Am I doing something wrong? If not, is there a solution? Not being
>able to merge is a setback! I was attempting to move the last few
>things I was doing with SAS to R...
>
>Please let me know if you want the file test3 (2.3 MB as a csv file,
>but only 352 KB in R (.rda) format).
>
>Sincerely,
>
>Denis Chabot
>
> > R.Version()
>$platform
>[1] "powerpc-apple-darwin8.6.0"
>
>$arch
>[1] "powerpc"
>
>$os
>[1] "darwin8.6.0"
>
>$system
>[1] "powerpc, darwin8.6.0"
>
>$status
>[1] ""
>
>$major
>[1] "2"
>
>$minor
>[1] "3.1"
>
>$year
>[1] "2006"
>
>$month
>[1] "06"
>
>$day
>[1] "01"
>
>$`svn rev`
>[1] "38247"
>
>$language
>[1] "R"
>
>$version.string
>[1] "Version 2.3.1 (2006-06-01)"
>
>______________________________________________
>R-help at stat.math.ethz.ch mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
--
---------------------------------
Don MacQueen
Lawrence Livermore National Laboratory
Livermore, CA, USA