[R] data input strategy - lots of csv files

Liaw, Andy andy_liaw at merck.com
Thu May 11 17:05:33 CEST 2006


If you can show me equivalent Python code, in as few lines, that runs
much faster, I'd very much appreciate it.  I have been looking for an
"excuse" to learn Python, but so far have found what I can do in R quite
adequate.  Also, it's much easier to keep track of the workflow when
everything is done in one place (R in my case).
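
For concreteness, here is a minimal, self-contained sketch of the all-in-R
approach from the message quoted below.  Several things in it are assumptions
made only for illustration and are not from the original thread: the three
demo files and their contents, the temporary directory, the use of the file
names as column names, and the reading of dates such as 01/06/05 as dd/mm/yy.

```r
## Hypothetical demo data: three small two-column CSVs (date, value),
## no header line, written to a temporary directory.
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("01/06/05,23445", "02/06/05,23450"), file.path(demo_dir, "a.csv"))
writeLines(c("02/06/05,17", "03/06/05,18"), file.path(demo_dir, "b.csv"))
writeLines("01/06/05,5", file.path(demo_dir, "c.csv"))

## Read every CSV into a list; keep the dates as character strings.
csvlist <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
bigblob <- lapply(csvlist, read.csv, header = FALSE,
                  colClasses = c("character", "numeric"))

## All dates that appear in any one of the files.
all.dates <- unique(unlist(lapply(bigblob, "[[", 1)))

## One row per date, one column per file; cells with no data stay NA.
## Column names are taken from the file names (an assumption).
bigdata <- matrix(NA_real_, length(all.dates), length(bigblob),
                  dimnames = list(all.dates,
                                  sub("\\.csv$", "", basename(csvlist))))
for (i in seq_along(bigblob)) {
    bigdata[bigblob[[i]][[1]], i] <- bigblob[[i]][[2]]
}

## Convert to a data frame sorted by date (assumes dd/mm/yy).
dates <- as.Date(rownames(bigdata), format = "%d/%m/%y")
result <- data.frame(date = dates, bigdata, row.names = NULL)
result <- result[order(result$date), ]
```

Filling a preallocated matrix by row name avoids the repeated merge() calls
and the eval(parse()) bookkeeping; files with two numbers beside each date
would need one column each, which this sketch does not handle.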

Andy

From: Steve Miller
> 
> Why torture yourself and probably get bad performance in the process? You
> should handle the data consolidation in Python or Ruby, which are much more
> suited to this type of task, piping the results to R.
> 
> Steve Miller
> 
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Liaw, Andy
> Sent: Thursday, May 11, 2006 5:50 AM
> To: seanpor at acm.org; r-help
> Subject: Re: [R] data input strategy - lots of csv files
> 
> This is what I would try:
> 
> csvlist <- list.files(pattern="csv$")
> bigblob <- lapply(csvlist, read.csv, ...)
> ## Get all dates that appear in any one of them.
> all.dates <- unique(unlist(lapply(bigblob, "[[", 1)))
> bigdata <- matrix(NA, length(all.dates), length(bigblob))
> dimnames(bigdata) <- list(all.dates, whatevercolnamesyouwant)
> ## Loop through bigblob and populate the corresponding columns
> ## of bigdata with the values for the matching dates.
> for (i in seq(along=bigblob)) {
>     bigdata[as.character(bigblob[[i]][, 1]), i] <-
>         bigblob[[i]][, columnwithdata]
> }
> 
> This is obviously untested, so hope it's of some help.
> 
> Andy
> 
> From: Sean O'Riordain
> > 
> > Good morning,
> > I currently have 63 .csv files, most of which have lines that look like
> >   01/06/05,23445
> > though some files have two numbers beside each date.  There are missing
> > values, and currently the longest file has 318 rows.
> > 
> > (merge() is losing the head and doing runaway memory allocation, but
> > that's another question; I'm still trying to pin that issue down and
> > make a small repeatable example.)
> > 
> > Currently I'm reading in these files with lines like
> >   a1 <- read.csv("daft_file_name_1.csv",header=F)
> >   ...
> >   a63 <- read.csv("another_silly_filename_63.csv",header=F)
> > 
> > and then I'm naming the columns in these like...
> >   names(a1)[2] <- "silly column name"
> >   ...
> >   names(a63)[2] <- "daft column name"
> > 
> > then trying to merge()...
> >   atot <- merge(a1, a2, all=T)
> > and then using language manipulation to loop
> >   atot <- merge(atot, a3, all=T)
> >   ...
> >   atot <- merge(atot, a63, all=T)
> > etc...
> > 
> > followed by more language manipulation
> > for() {
> >   rm(a1)
> > } etc...
> > 
> > i.e.
> > for (i in 2:63) {
> >     atot <- merge(atot, eval(parse(text=paste("a", i, sep=""))), all=T)
> >     #     eval(parse(text=paste("a",i,"[1] <- NULL",sep="")))
> > 
> >     cat("i is ", i, gc(), "\n")
> > 
> >     # now delete these 63 temporary objects...
> >     # e.g. should look like rm(a33)
> >     eval(parse(text=paste("rm(a",i,")", sep="")))
> > }
> > 
> > eventually getting a data frame with the first column being the date
> > and the subsequent 63 columns being the data, with missing values
> > coded as NA.
> > 
> > So my question is: is there a better strategy for reading in lots of
> > small files (only a few kbytes each) like these, which are time series
> > with missing data, that avoids the above awkwardness (and language
> > manipulation) but still ends up with a nice data.frame with NA values
> > correctly coded?
> > 
> > Many thanks,
> > Sean O'Riordain
> > 
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide! 
> > http://www.R-project.org/posting-guide.html
> > 
> >
> 



