[R] merging a large number of large .csvs

Uwe Ligges ligges at statistik.tu-dortmund.de
Thu Nov 8 22:31:13 CET 2012



On 06.11.2012 05:24, Benjamin Caldwell wrote:
> Thanks to you all. Modifying the code to use rbind instead of merge worked
> like a charm - I can only speculate why.

Because merge is not intended for appending. See its help page.
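
To illustrate the difference with toy data (a minimal sketch; the column
names are made up):

a <- data.frame(id = 1:3, v1 = c(10, 20, 30))
b <- data.frame(id = 1:3, v2 = c(40, 50, 60))

merge(a, b)    # joins on the shared 'id' column -> one frame with id, v1, v2
# rbind(a, b)  # would fail here: rbind appends rows and needs identical names

# When all the files share the same columns, stacking is the appropriate
# operation:
a2 <- data.frame(id = 4:5, v1 = c(40, 50))
rbind(a, a2)   # 5 rows, same 2 columns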

Uwe Ligges

> Best
>
> Ben Caldwell
>
> PhD Candidate
> University of California, Berkeley
> 130 Mulford Hall #3114
> Berkeley, CA 94720
> Office 223 Mulford Hall
> (510)859-3358
>
>
>
> On Sat, Nov 3, 2012 at 2:19 PM, jim holtman <jholtman at gmail.com> wrote:
>
>> It's easier than that.  I forgot I can do it entirely within R:
>>
>> setwd("/temp/csv")
>> files <- Sys.glob("daily*csv")
>> output <- file('Rcombined.csv', 'w')
>> for (i in files){
>>      cat(i, '\n')  # show which file is being processed
>>      input <- readLines(i)
>>      input <- input[-1L]  # drop the header line (assumed present)
>>      writeLines(input, output)
>> }
>> close(output)
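>>
>> Note that the loop drops every header line, so Rcombined.csv ends up
>> with no header row.  A minimal way to read it back in (the column names
>> here are made up for illustration):
>>
>> combined <- read.csv('Rcombined.csv', header = FALSE)
>> names(combined) <- c('date', 'open', 'high', 'low', 'close')  # assumed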
>>
>>
>>
>> On Sat, Nov 3, 2012 at 4:56 PM, jim holtman <jholtman at gmail.com> wrote:
>>> These are not commands, but programs you can use.  Here is a
>>> file-copying program in Perl.  It copies all the files that have
>>> "daily" in their names, and skips the first line of each file on the
>>> assumption that it is the header.
>>>
>>> Perl can be found on most systems; www.activestate.com has a version
>>> that runs under Windows, which is what I am using.
>>>
>>>
>>> chdir "/temp/csv";  # my directory with files
>>> @files = glob "daily*csv";  # get files to copy (daily......csv)
>>> open OUTPUT, ">combined.csv"; # output file
>>> # loop for each file
>>> foreach $file (@files) {
>>>      print $file, "\n";  # print file being processed
>>>      open INPUT, "<" . $file;
>>>      # assume that the first line is a header, so skip it
>>>      $header = <INPUT>;
>>>      @all = <INPUT>;  # read rest of the file
>>>      close INPUT;
>>>      print OUTPUT @all;  # append to the output
>>> }
>>> close OUTPUT;
>>>
>>> Here is what was printed on the console:
>>>
>>>
>>> C:\Users\Owner>perl copyFiles.pl
>>> daily.BO.csv
>>> daily.C.csv
>>> daily.CL.csv
>>> daily.CT.csv
>>> daily.GC.csv
>>> daily.HO.csv
>>> daily.KC.csv
>>> daily.LA.csv
>>> daily.LN.csv
>>> daily.LP.csv
>>> daily.LX.csv
>>> daily.NG.csv
>>> daily.S.csv
>>> daily.SB.csv
>>> daily.SI.csv
>>> daily.SM.csv
>>>
>>> That is the list of all the files that were copied.
>>>
>>> On Sat, Nov 3, 2012 at 4:08 PM, Benjamin Caldwell
>>> <btcaldwell at berkeley.edu> wrote:
>>>> Jim,
>>>>
>>>> Where can I find documentation of the commands you mention?
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Nov 3, 2012 at 12:15 PM, jim holtman <jholtman at gmail.com> wrote:
>>>>>
>>>>> A faster way would be to use something like 'perl', 'awk', or 'sed'.
>>>>> You can strip off the header line of each CSV (if it has one) and
>>>>> then concatenate the files together.  This is a very efficient use of
>>>>> memory, since you are just reading one file at a time and then
>>>>> writing it out.  It will probably be a lot faster, since no
>>>>> conversions have to be done.  Once you have the one large file, you
>>>>> can play with it (load it into R if you have enough memory, or load
>>>>> it into a database).
>>>>>
>>>>> On Sat, Nov 3, 2012 at 11:37 AM, Jeff Newmiller
>>>>> <jdnewmil at dcn.davis.ca.us> wrote:
>>>>>> In the absence of any data examples from you per the posting
>>>>>> guidelines, I will refer you to the help files for the melt function
>>>>>> in the reshape2 package.  Note that there can be various mixtures of
>>>>>> wide versus long... such as a wide file with one date column and
>>>>>> columns representing all stock prices and all trade volumes.  The
>>>>>> longest format would be what melt gives (date, column name, and
>>>>>> value), but an in-between format would have one distinct column each
>>>>>> for dollar values and volume values, with a column indicating ticker
>>>>>> label and of course another for date.
>>>>>>
>>>>>> If your csv files can be grouped according to those with similar
>>>>>> column "types", then as you read them in you can use
>>>>>> cbind(csvlabel="somelabel", csvdf) to distinguish each one, and then
>>>>>> rbind those data frames together to create an intermediate-width
>>>>>> data frame (a sketch follows below).  When dealing with large
>>>>>> amounts of data you will want to minimize the amount of reshaping
>>>>>> you do, but it would require knowledge of your data and algorithms
>>>>>> to say any more.
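>>>>>>
>>>>>> A minimal sketch of that read/label/stack pattern (the directory
>>>>>> name and the id columns are made up, and it assumes the frames
>>>>>> share column names):
>>>>>>
>>>>>> files <- list.files("csvdir", full.names = TRUE)
>>>>>> pieces <- lapply(files, function(f) {
>>>>>>   cbind(csvlabel = basename(f), read.csv(f))  # tag rows with source
>>>>>> })
>>>>>> combined <- do.call(rbind, pieces)  # stack, don't merge
>>>>>> # library(reshape2)
>>>>>> # melt(combined, id.vars = c("csvlabel", "date"))  # fully long form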
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------------
>>>>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>>>>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>>>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>>>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>>>>> ---------------------------------------------------------------------------
>>>>>> Sent from my phone. Please excuse my brevity.
>>>>>>
>>>>>> Benjamin Caldwell <btcaldwell at berkeley.edu> wrote:
>>>>>>
>>>>>>> Jeff,
>>>>>>> If you're willing to educate, I'd be happy to learn what wide vs long
>>>>>>> format means. I'll give rbind a shot in the meantime.
>>>>>>> Ben
>>>>>>> On Nov 2, 2012 4:31 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>>>>>>>
>>>>>>>> I would first confirm that you need the data in wide format... many
>>>>>>>> algorithms are more efficient in long format anyway, and rbind is
>>>>>>>> way more efficient than merge.
>>>>>>>>
>>>>>>>> If you feel this is not negotiable, you may want to consider sqldf.
>>>>>>>> Yes, you need to learn a bit of SQL, but it is very well integrated
>>>>>>>> into R (a minimal sketch follows below).
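>>>>>>>>
>>>>>>>> A minimal sqldf sketch (toy data; sqldf looks data frames up by
>>>>>>>> name and by default runs the query in an in-memory SQLite
>>>>>>>> database):
>>>>>>>>
>>>>>>>> library(sqldf)
>>>>>>>> a <- data.frame(id = 1:3, v1 = c(10, 20, 30))
>>>>>>>> b <- data.frame(id = 2:4, v2 = c(40, 50, 60))
>>>>>>>> sqldf("SELECT * FROM a LEFT JOIN b USING (id)")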
>>>>>>>>
>>>>>>>> Benjamin Caldwell <btcaldwell at berkeley.edu> wrote:
>>>>>>>>
>>>>>>>>> Dear R help;
>>>>>>>>>
>>>>>>>>> I'm currently trying to combine a large number (about 30 x 30) of
>>>>>>>>> large .csvs (each at least 10000 records).  They are organized by
>>>>>>>>> plots, hence 30 x 30, with each group of csvs in a folder which
>>>>>>>>> corresponds to the plot.  The unmerged csvs all have the same
>>>>>>>>> number of columns (5).  The fifth column has a different name in
>>>>>>>>> each csv.  The number of rows differs from file to file.
>>>>>>>>>
>>>>>>>>> The combined csvs are of course quite large, and the code I'm
>>>>>>>>> running is quite slow.  I'm currently running it on a computer
>>>>>>>>> with 10 GB of RAM, an SSD, and a quad-core 2.3 GHz processor; it
>>>>>>>>> has taken 8 hours and is only 75% of the way through (it has been
>>>>>>>>> hung up on one of the largest data groupings for an hour now,
>>>>>>>>> using 3.5 GB of RAM).
>>>>>>>>>
>>>>>>>>> I know that R isn't the most efficient way of doing this, but I'm
>>>>>>>>> not familiar with SQL or C.  I wonder if anyone has suggestions
>>>>>>>>> for a different way to do this in the R environment.  For
>>>>>>>>> instance, the key function now is merge, but I haven't tried join
>>>>>>>>> from the plyr package or rbind from base.  I'm willing to provide
>>>>>>>>> a dropbox link to a couple of these files if you'd like to see
>>>>>>>>> the data.  My code is as follows:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> # multmerge is based on code by Tony Cookson,
>>>>>>>>> # http://www.r-bloggers.com/merging-multiple-data-files-into-one-data-frame/
>>>>>>>>>
>>>>>>>>> The function takes a path.  This path should be the name of a
>>>>>>>>> folder that contains all of the files you would like to read and
>>>>>>>>> merge together, and only those files.
>>>>>>>>>
>>>>>>>>> multmerge <- function(mypath){
>>>>>>>>>   filenames <- list.files(path = mypath, full.names = TRUE)
>>>>>>>>>   datalist <- try(lapply(filenames,
>>>>>>>>>                          function(x) read.csv(file = x, header = TRUE)))
>>>>>>>>>   try(Reduce(function(x, y) merge(x, y, all = TRUE), datalist))
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> # this function renames the columns using a fixed list and writes out a .csv
>>>>>>>>>
>>>>>>>>> merepk <- function(path, nf.name) {
>>>>>>>>>   output <- multmerge(mypath = path)
>>>>>>>>>   name <- c("x", "y", "z", "depth", "amplitude")
>>>>>>>>>   try(names(output) <- name)
>>>>>>>>>   write.csv(output, nf.name)
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> # assumes all folders are in the same directory, with nothing else there
>>>>>>>>>
>>>>>>>>> merge.by.folder <- function(folderpath){
>>>>>>>>>   foldernames <- list.files(path = folderpath)
>>>>>>>>>   n <- length(foldernames)
>>>>>>>>>   setwd(folderpath)
>>>>>>>>>
>>>>>>>>>   for (i in 1:n){
>>>>>>>>>     path <- paste(folderpath, foldernames[i], sep = "\\")
>>>>>>>>>     nf.name <- paste(foldernames[i], ".csv", sep = "")
>>>>>>>>>     merepk(path, nf.name)
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> folderpath <- "yourpath"
>>>>>>>>>
>>>>>>>>> merge.by.folder(folderpath)
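>>>>>>>>>
>>>>>>>>> # For reference, the rbind-based variant that the top of this
>>>>>>>>> # thread reports as the eventual fix might look something like
>>>>>>>>> # this (a sketch only; 'multbind' is a made-up name, and it
>>>>>>>>> # assumes every file really has the same five columns in order):
>>>>>>>>> multbind <- function(mypath) {
>>>>>>>>>   filenames <- list.files(path = mypath, full.names = TRUE)
>>>>>>>>>   datalist <- lapply(filenames, function(x) {
>>>>>>>>>     d <- read.csv(x, header = TRUE)
>>>>>>>>>     names(d) <- c("x", "y", "z", "depth", "amplitude")
>>>>>>>>>     d
>>>>>>>>>   })
>>>>>>>>>   do.call(rbind, datalist)  # stack rows instead of merging
>>>>>>>>> }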
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for looking, and happy Friday!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ben Caldwell
>>>>>>>>>
>>>>>>>>> PhD Candidate
>>>>>>>>> University of California, Berkeley
>>>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jim Holtman
>>>>> Data Munger Guru
>>>>>
>>>>> What is the problem that you are trying to solve?
>>>>> Tell me what you want to do, not how you want to do it.
>>
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



