[R] merging a large number of large .csvs

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Sat Nov 3 00:31:57 CET 2012


I would first confirm that you actually need the data in wide format... many algorithms are more efficient on long-format data anyway, and rbind is far more efficient than merge.
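As a rough illustration of the long-format route (untested, and assuming your shared columns really are x, y, z, depth as in your code; "value" and "source" are just placeholder names I made up), you can stack each folder's files with rbind and keep track of which file each row came from:

stack_folder <- function(mypath) {
  filenames <- list.files(path = mypath, full.names = TRUE)
  pieces <- lapply(filenames, function(f) {
    d <- read.csv(f, header = TRUE)
    # the fifth column has a different name in every file, so standardise it
    names(d) <- c("x", "y", "z", "depth", "value")
    d$source <- basename(f)  # remember which csv each row came from
    d
  })
  do.call(rbind, pieces)     # one long data frame per folder
}

This avoids the repeated matching work that Reduce(merge, ...) does; a single rbind over all the pieces is roughly proportional to the total number of rows.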

If you feel wide format is not negotiable, you may want to consider the sqldf package. Yes, you need to learn a bit of SQL, but it is very well integrated into R.
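A minimal sketch of that route (again assuming the shared key columns are x, y, z and depth; the file names here are hypothetical) joins two of the data frames with a single SQL statement:

library(sqldf)   # install.packages("sqldf") if needed

df1 <- read.csv("plot1_a.csv")   # hypothetical file names
df2 <- read.csv("plot1_b.csv")
joined <- sqldf("SELECT *
                 FROM df1
                 LEFT JOIN df2 USING (x, y, z, depth)")

sqldf loads the data frames into a temporary SQLite database, runs the join there, and returns an ordinary data frame, which is often easier on memory than repeated merge() calls on large data frames.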

Benjamin Caldwell <btcaldwell at berkeley.edu> wrote:

>Dear R help,
>I'm currently trying to combine a large number (about 30 x 30) of
>large .csvs (each at least 10,000 records). They are organized by
>plots, hence 30 x 30, with each group of csvs in a folder that
>corresponds to the plot. The unmerged csvs all have the same number of
>columns (5). The fifth column has a different name in each csv. The
>number of rows differs between files.
>
>The combined csvs are of course quite large, and the code I'm running
>is quite slow. I'm currently running it on a computer with 10 GB of
>RAM, an SSD, and a quad-core 2.3 GHz processor; it has taken 8 hours
>and is only 75% of the way through (it has been hung up on one of the
>largest data groupings for an hour now, using 3.5 GB of RAM).
>
>I know that R isn't the most efficient way of doing this, but I'm not
>familiar with SQL or C. I wonder if anyone has suggestions for a
>different way to do this in the R environment. For instance, the key
>function now is merge, but I haven't tried join from the plyr package
>or rbind from base. I'm willing to provide a Dropbox link to a couple
>of these files if you'd like to see the data. My code is as follows:
>
>
># multmerge is based on code by Tony Cookson,
># http://www.r-bloggers.com/merging-multiple-data-files-into-one-data-frame/
># The function takes a path, which should be the name of a folder that
># contains all of the files you would like to read and merge together,
># and only those files.
>
>multmerge <- function(mypath) {
>  filenames <- list.files(path = mypath, full.names = TRUE)
>  datalist <- try(lapply(filenames,
>                         function(x) read.csv(file = x, header = TRUE)))
>  try(Reduce(function(x, y) merge(x, y, all = TRUE), datalist))
>}
>
># this function renames the merged columns using a fixed list and writes a .csv
>
>merepk <- function(path, nf.name) {
>  output <- multmerge(mypath = path)
>  name <- c("x", "y", "z", "depth", "amplitude")
>  try(names(output) <- name)
>  write.csv(output, nf.name)
>}
>
>#assumes all folders are in the same directory, with nothing else there
>
>merge.by.folder <- function(folderpath) {
>  foldernames <- list.files(path = folderpath)
>  n <- length(foldernames)
>  setwd(folderpath)
>  for (i in 1:n) {
>    path <- paste(folderpath, foldernames[i], sep = "\\")
>    nf.name <- paste(foldernames[i], ".csv", sep = "")
>    merepk(path, nf.name)
>  }
>}
>
>folderpath <- "yourpath"
>
>merge.by.folder(folderpath)
>
>
>Thanks for looking, and happy Friday!
>
>
>
>*Ben Caldwell*
>
>PhD Candidate
>University of California, Berkeley
>



