[R] "Best" way to merge 300+ .5MB dataframes?
David Winsemius
dwinsemius at comcast.net
Mon Aug 11 00:24:00 CEST 2014
On Aug 10, 2014, at 11:51 AM, Grant Rettke wrote:
> Good afternoon,
>
> Today I was working on a practice problem. It was simple, and perhaps
> even realistic. It looked like this:
>
> • Get a list of all the data files in a directory
> • Load each file into a dataframe
> • Merge them into a single data frame
Something along these lines:
all <- do.call( rbind,
         lapply( list.files(path=getwd(), pattern="\\.csv$"),
                 read.csv) )
Possibly, if you instead want a list of the individual data frames named by file:

all <- sapply( list.files(path=getwd(), pattern="\\.csv$"),
               read.csv, simplify=FALSE)
Untested since no reproducible example was offered. This skips the task of individually assigning names to the input dataframes. There are quite a few variations on this in the Archives. You should learn to search them. Rseek.org or MarkMail are effective for me.
http://www.rseek.org/
http://markmail.org/search/?q=list%3Aorg.r-project.r-help
>
> Because all of the columns were the same, the simplest solution in my
> mind was to `Reduce' the vector of dataframes with a call to
> `merge'. That worked fine, I got what was expected. That is key
> actually. It is literally a one-liner, and there will never be index
> or scoping errors with it.
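For reference, the Reduce-with-merge one-liner described above can be sketched like this (toy data frames stand in for the loaded files; wrapping merge to pass all = TRUE is my assumption about the argument choice needed):

```r
# Stand-ins for data frames read from the CSV files (toy example)
dfs <- list(
  data.frame(a = 1, b = 2),
  data.frame(a = 3, b = 4),
  data.frame(a = 5, b = 6)
)

# Fold the list together pairwise; all = TRUE performs a full outer
# join at each step so non-matching rows are retained.
combined <- Reduce(function(x, y) merge(x, y, all = TRUE), dfs)
combined
#   a b
# 1 1 2
# 2 3 4
# 3 5 6
```

Note that each pairwise merge() re-sorts and re-compares all accumulated rows, so the cost grows with every step.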
You might have forced `merge` to work with the correct choice of arguments, but it would have silently eliminated duplicate rows. It seems unlikely to me that it would be efficient for the purpose of simply stacking dataframe values.
>
> merge( data.frame(a=1, b=2), data.frame(a=3, b=4) )
[1] a b
<0 rows> (or 0-length row.names)
> merge( data.frame(a=1, b=2), data.frame(a=3, b=4) , all=TRUE)
a b
1 1 2
2 3 4
> merge( data.frame(a=1, b=2), data.frame(a=1, b=2) )
a b
1 1 2
> rbind( data.frame(a=1, b=2), data.frame(a=1, b=2) )
a b
1 1 2
2 1 2
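The duplicate-dropping behavior matters once the same rows appear in more than one file. A small sketch of the contrast over a whole list (toy data frames standing in for the loaded files):

```r
dfs <- list(
  data.frame(a = 1, b = 2),
  data.frame(a = 1, b = 2),   # duplicate of the first
  data.frame(a = 3, b = 4)
)

# merge() collapses identical rows even with all = TRUE
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), dfs)
nrow(merged)   # 2 -- the duplicated row has vanished

# rbind() simply stacks, keeping every row
stacked <- do.call(rbind, dfs)
nrow(stacked)  # 3
```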
> Now with that in mind, what is the idiomatic way? Do people usually do
> something else because it is /faster/ (by some definition)?
>
> Kind regards,
>
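To the question of speed: one concrete reason `do.call(rbind, ...)` is the usual idiom rather than `Reduce(rbind, ...)` is that the former binds the whole list in a single call, while the latter copies the growing result at every step. A base-R sketch (toy frames; relative timings will vary):

```r
# 200 small data frames standing in for files read from disk
dfs <- replicate(200, data.frame(a = 1:10, b = rnorm(10)),
                 simplify = FALSE)

# One rbind call over the whole list: result allocated once
fast <- do.call(rbind, dfs)

# Pairwise folding: the accumulated frame is copied at every step
slow <- Reduce(rbind, dfs)

identical(fast, slow)  # TRUE -- same result, different cost profile
```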
--
David Winsemius
Alameda, CA, USA