[R] data input strategy - lots of csv files

Liaw, Andy andy_liaw at merck.com
Thu May 11 12:49:41 CEST 2006


This is what I would try:

csvlist <- list.files(pattern="csv$")
## header=FALSE since the files have no header row; as.is=TRUE so
## the dates come in as character (needed for row-name matching below).
bigblob <- lapply(csvlist, read.csv, header=FALSE, as.is=TRUE)
## Get all dates that appear in any one of them.
all.dates <- unique(unlist(lapply(bigblob, "[[", 1)))
bigdata <- matrix(NA, length(all.dates), length(bigblob))
dimnames(bigdata) <- list(all.dates, whatevercolnamesyouwant)
## Loop through bigblob and populate the corresponding columns
## of bigdata with the matching dates.
for (i in seq(along=bigblob)) {
    bigdata[as.character(bigblob[[i]][, 1]), i] <- 
        bigblob[[i]][, columnwithdata]
}

This is obviously untested, so I hope it's of some help.
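For anyone who wants to see the idea run end to end, here is a small self-contained version of the same strategy on synthetic data. The temporary directory, the two made-up files, and the use of the file names as column labels are all assumptions for illustration; substitute your own file list and value-column position.

```r
## Two toy CSV files sharing some, but not all, dates.
dir.create(tmp <- tempfile())
write.csv(data.frame(V1 = c("01/06/05", "02/06/05"), V2 = c(1, 2)),
          file.path(tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(V1 = c("02/06/05", "03/06/05"), V2 = c(3, 4)),
          file.path(tmp, "b.csv"), row.names = FALSE)

csvlist <- list.files(tmp, pattern = "csv$", full.names = TRUE)
## as.is=TRUE so the dates read in as character, not factor.
bigblob <- lapply(csvlist, read.csv, as.is = TRUE)
## All dates seen in any file become the row names.
all.dates <- unique(unlist(lapply(bigblob, "[[", 1)))
bigdata <- matrix(NA, length(all.dates), length(bigblob),
                  dimnames = list(all.dates, basename(csvlist)))
## Fill each column by matching on the date row names.
for (i in seq(along = bigblob)) {
    bigdata[as.character(bigblob[[i]][, 1]), i] <- bigblob[[i]][, 2]
}
bigdata  ## one row per date, one column per file, NA where missing
```

On the merge()-based route in the original question: after renaming each value column so the names don't collide, something like `Reduce(function(x, y) merge(x, y, all = TRUE), aList)` over a list of the data frames would also avoid the eval(parse()) gymnastics.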

Andy

From: Sean O'Riordain
> 
> Good morning,
> I currently have 63 .csv files, most of which have lines that 
> look like
>   01/06/05,23445
> though some files have two numbers beside each date.  There 
> are missing values, and currently the longest file has 318 rows.
> 
> (merge() is losing the head and doing runaway memory 
> allocation - but that's another question - I'm still trying to 
> pin that issue down and make a small repeatable example)
> 
> Currently I'm reading in these files with lines like
>   a1 <- read.csv("daft_file_name_1.csv",header=F)
>   ...
>   a63 <- read.csv("another_silly_filename_63.csv",header=F)
> 
> and then I'm naming the columns in these like...
>   names(a1)[2] <- "silly column name"
>   ...
>   names(a63)[2] <- "daft column name"
> 
> then trying to merge()...
>   atot <- merge(a1, a2, all=T)
> and then using language manipulation to loop
>   atot <- merge(atot, a3, all=T)
>   ...
>   atot <- merge(atot, a63, all=T)
> etc...
> 
> followed by more language manipulation
> for() {
>   rm(a1)
> } etc...
> 
> i.e.
> for (i in 2:63) {
>     atot <- merge(atot, eval(parse(text=paste("a", i, 
> sep=""))), all=T)
>     #     eval(parse(text=paste("a",i,"[1] <- NULL",sep="")))
> 
>     cat("i is ", i, gc(), "\n")
> 
>     # now delete these 63 temporary objects...
>     # e.g. should look like rm(a33)
>     eval(parse(text=paste("rm(a",i,")", sep=""))) }
> 
> eventually getting a dataframe with the first column being 
> the date, and the subsequent 63 columns being the data... 
> with missing values coded as NA...
> 
> so my question is: is there a better strategy for reading 
> in lots of small files (only a few kbytes each) like these, 
> which are time series with missing data, that doesn't go 
> through the above awkwardness (and language manipulation) but 
> still ends up with a nice data.frame with NA values correctly 
> coded, etc.?
> 
> Many thanks,
> Sean O'Riordain
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
> 
>
