[R] merge gives me too many rows
Denis Chabot
chabotd at globetrotter.net
Mon Sep 18 03:11:22 CEST 2006
Hi,
I am using merge to add some variables to an existing dataframe. I
use the option "all.x=F" so that my final dataframe will only have as
many rows as the first file I name in the call to merge.
With a large dataframe using a lot of "by" variables, the number of
rows of the merged dataframe increases from 177325 to 179690:
> dim(test)
[1] 177325 9
> test2 <- merge(test, fish, by=c("predateu", "origin", "navire",
+     "nbpc", "no_rel", "trait", "tagno"), all.x=F)
> dim(test2)
[1] 179690 11
I tried to make a smaller dataset with R commands that I could post
here so that other people could reproduce, but merge behaved as
expected: final number of rows was the same as the number of rows in
the first file named in the call to merge.
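For comparison, here is a minimal sketch of one situation in which merge does return more rows than the first data frame even with all.x=FALSE. The data frames x and y are invented for illustration and are not my real test and fish; whether fish actually contains repeated key combinations is an assumption I have not verified. As far as I understand the documentation, all.x=FALSE (the default) only drops rows of the first data frame that have no match; it does not limit the result to one row per row of the first data frame.

```r
# Toy data, invented for illustration (not the real 'test' and 'fish').
x <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))

# In y the key value 2 appears twice.
y <- data.frame(id = c(1, 2, 2, 3), b = c(10, 20, 21, 30))

# merge() returns one row per matching pair of rows, so the row of x
# with id == 2 appears twice in the result.
m <- merge(x, y, by = "id", all.x = FALSE)

nrow(x)  # 3
nrow(m)  # 4
```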
I took a subset of my large dataframe and could mail this to anyone
interested in verifying the problem.
> test3 <- test[100001:160000,]
> dim(test3)
[1] 60000 9
> test4 <- merge(test3, fish, by=c("predateu", "origin", "navire",
+     "nbpc", "no_rel", "trait", "tagno"), all.x=F)
> dim(test4)
[1] 60043 11
I compared test3 and test4 line by line. The first 11419 lines were
the same (except for the added variables, obviously) in both dataframes,
but then lines 11420 to 11423 were repeated in test4. After that there
was no problem for a lot of rows, until rows 45756-45760 in test3. These
are offset by 4 in test4 because of the first group of extraneous lines
just reported, and are found on lines 45760 to 45764. But they are also
repeated on lines 45765 to 45769. And so on a few more times.
Thus merge added lines (repeated a small number of lines) to the
final dataframe despite my use of all.x=F.
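A check that might narrow this down is to look for key combinations that occur more than once in the second data frame. The sketch below uses a small made-up data frame, fish_toy, with only three of the seven key columns; the column len and all the values are hypothetical. With the real data the same two lines would be applied to fish and the full "by" vector.

```r
# Hypothetical stand-in for 'fish', using a subset of the key columns.
keys <- c("origin", "trait", "tagno")
fish_toy <- data.frame(
  origin = c("A", "A", "B", "B"),
  trait  = c(1, 1, 2, 3),
  tagno  = c(7, 7, 8, 9),
  len    = c(30, 31, 40, 50)
)

# Flag every row whose key combination occurs more than once,
# including the first occurrence (hence the fromLast = TRUE pass).
dup <- duplicated(fish_toy[, keys]) |
       duplicated(fish_toy[, keys], fromLast = TRUE)

sum(dup)         # number of rows sharing a key combination with another row
fish_toy[dup, ]  # the rows themselves, for inspection
```

If this count is zero for the real fish, then repeated keys are not the explanation and the cause must lie elsewhere.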
Am I doing something wrong? If not, is there a solution? Not being
able to merge is a setback! I was attempting to move the last few
things I was doing with SAS to R...
Please let me know if you want the file test3 (2.3 MB as a csv file,
but only 352 KB in R (.rda) format).
Sincerely,
Denis Chabot
> R.Version()
$platform
[1] "powerpc-apple-darwin8.6.0"
$arch
[1] "powerpc"
$os
[1] "darwin8.6.0"
$system
[1] "powerpc, darwin8.6.0"
$status
[1] ""
$major
[1] "2"
$minor
[1] "3.1"
$year
[1] "2006"
$month
[1] "06"
$day
[1] "01"
$`svn rev`
[1] "38247"
$language
[1] "R"
$version.string
[1] "Version 2.3.1 (2006-06-01)"