[R] alternative to rbind within a loop

Denis Chabot chabot.denis at gmail.com
Thu Jul 23 21:53:32 CEST 2009


Hi,

I often have to do this:

select a folder (directory) containing a few hundred data files in csv  
format (up to 1000 files, in fact)

open each file, transform some character variables into date-time format

make it into a dataframe (this involves getting rid of a few variables
I don't need)

concatenate to the master dataframe that will eventually contain the  
data from all the files in the folder.

I use a loop going from 1 to the number of files. I have added a  
command to print an incrementing number to the R console each time the  
loop completes one iteration, to judge the speed of the process.

At the beginning, 3-4 files are processed each second. After a few  
hundred iterations it slows down to about 1 file per second. Before I  
reach the last file (898 in the case at hand), it has become much  
slower, about 1 file every 2-3 seconds.

This progressive slowing down suggests the problem is linked to the  
size of the growing "master" dataframe that rbind combines with each  
new file.
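
A rough back-of-the-envelope check of that suspicion, as a sketch only: if each rbind call copies the whole master dataframe plus the new file, the total number of rows copied grows with the square of the number of files. The figures (350 rows per file, 850 files) match the test script in this message; the idea that rbind copies every row on every call is a simplifying assumption, not something measured.

```r
rows.per.file <- 350   # rows in each csv file (as in the test script)
n.files <- 850         # number of files

# rows in the master dataframe after each iteration of the loop
rows.after <- rows.per.file * seq_len(n.files)

# assumption: each rbind copies every row of its result
rows.copied.loop <- sum(rows.after)

# for comparison: a single rbind at the end copies each row once
rows.copied.once <- rows.per.file * n.files

ratio <- rows.copied.loop / rows.copied.once
print(ratio)
```

Under that assumption the loop does several hundred times more copying than a single combined rbind would, which is consistent with the steady slowdown.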

In fact, the small script below confirms this as nothing at all  
happens within the loop but rbind. You can cut the size of this  
example so as not to waste too much of your time:


# create a dummy data.frame and copy it in a large number of csv files

test  <- file.path("test")
dir.create(test, showWarnings=FALSE)  # write.csv fails if the folder does not exist

a <- 1:350
b <- rnorm(350,100,10)
c <- runif(350, 0, 100)
d <- sample(month.name, 350, replace=TRUE)  # runif(350,1,12) as an index would never give "December"

the.data <- data.frame(a,b,c,d)

for(i in 1:850){
	write.csv(the.data, file=paste(test, "/file_", i, ".csv", sep=""))
}

# now let's make a single dataframe from all these csv files

all.files <- list.files(path=test, full.names=TRUE, pattern="\\.csv$")

new.data <- NULL

system.time({
	for(i in all.files){
		in.data <- read.csv(i)
		if (is.null(new.data)) {
			new.data <- in.data
		} else {
			new.data <- rbind(new.data, in.data)
		}
		cat(paste(i, ", ", sep=""))
	} # end for
}) # end system.time

   user  system elapsed
156.206  44.859 202.150

This is with

sessionInfo()
R version 2.9.1 Patched (2009-07-16 r48939)
x86_64-apple-darwin9.7.0

locale:
fr_CA.UTF-8/fr_CA.UTF-8/C/C/fr_CA.UTF-8/fr_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] doBy_3.7        chron_2.3-30    timeDate_290.84

loaded via a namespace (and not attached):
[1] cluster_1.12.0  grid_2.9.1      Hmisc_3.5-2     lattice_0.17-25 tools_2.9.1


Would it be better to somehow read each of the 850 files into its own
dataframe, and then rbind them all in a single operation?
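
One way that idea could look, as a minimal sketch rather than a tested solution for the real data: keep each file's dataframe in a pre-allocated list, then call rbind once on the whole list through do.call. The scratch folder, the file names and the names csv.dir and pieces are made up for illustration, and only five tiny files are written so that the example runs on its own.

```r
# hypothetical scratch folder with a few small csv files, for illustration only
csv.dir <- file.path(tempdir(), "rbind_once_demo")
dir.create(csv.dir, showWarnings = FALSE)
for (i in 1:5) {
  write.csv(data.frame(a = 1:10, b = rnorm(10)),
            file = file.path(csv.dir, paste("file_", i, ".csv", sep = "")),
            row.names = FALSE)
}

all.files <- list.files(path = csv.dir, full.names = TRUE, pattern = "\\.csv$")

# pre-allocate a list with one slot per file
pieces <- vector("list", length(all.files))

# the loop only reads; nothing grows by copying inside it
for (i in seq_along(all.files)) {
  pieces[[i]] <- read.csv(all.files[i])
}

# combine all the pieces in a single rbind call
new.data <- do.call(rbind, pieces)
```

The loop itself stays, but the expensive step (rbind on an ever larger dataframe) happens once instead of once per file.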

Can I combine all my files without using a loop? I've never quite
mastered the "apply" family of functions, and I have not seen examples
that use them to read files.
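
For reference, a hedged sketch of what a loop-free version might look like with lapply: a small function handles one file (read, convert a character column to date-time, drop an unneeded variable), lapply applies it to every file name, and do.call(rbind, ...) combines the results once. The column names when, value and unused, the date format, the helper read.one and the scratch folder are all hypothetical and would need to be adapted to the real files.

```r
# hypothetical scratch folder and files, for illustration only
csv.dir <- file.path(tempdir(), "lapply_demo")
dir.create(csv.dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(when = c("2009-07-23 21:53:32", "2009-07-23 21:54:32"),
                       value = c(1.5, 2.5),
                       unused = c("x", "y")),
            file = file.path(csv.dir, paste("file_", i, ".csv", sep = "")),
            row.names = FALSE)
}

# process one file: read it, convert the date-time column, drop a variable
read.one <- function(f) {
  x <- read.csv(f, stringsAsFactors = FALSE)
  x$when <- as.POSIXct(x$when, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  x$unused <- NULL
  x
}

all.files <- list.files(csv.dir, full.names = TRUE, pattern = "\\.csv$")

# lapply returns a list of dataframes; rbind is then called only once
new.data <- do.call(rbind, lapply(all.files, read.one))
```

Here lapply simply replaces the explicit for loop; the speed gain, if any, would come from calling rbind once rather than from lapply itself.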

Thanks in advance,

Denis Chabot



