[R] What is the best way to efficiently construct a data frame from multiple source files?

Mark Connolly mark_connolly at acm.org
Thu Oct 29 19:45:15 CET 2009


I have an arbitrary number of spreadsheets that I want to consolidate 
into a single data frame.  The spreadsheets all have the same structure:

location, depth1Reading, depth2reading, depth3reading, depth4reading, 
depth5reading

The spreadsheets have their reading date in their file name.  This gets 
parsed out and added to the data frame as a factor.  The file name gets 
recorded in the data frame as a reference factor.  The depth readings 
are put into a normal form in the data frame.

The target data frame structure is
str(df)
'data.frame':   23100 obs. of  5 variables:
 $ measurement.date: Factor w/ 77 levels "2005/01/07","2003/01/08",..: 1 
1 1 1 1 1 1 1 1 1 ...
 $ source          : Factor w/ 77 levels "TDRREADINGS010705.xls",..: 1 1 
1 1 1 1 1 1 1 1 ...
 $ location        : int  1 1 1 1 1 2 2 2 2 2 ...
 $ position        : num  1 2 3 4 5 1 2 3 4 5 ...
 $ theta.percent   : num  24.8 23.5 30.7 26.6 NA 20.7 28.2 24.3 20.6 10 ...


I am successfully using the following (nested) looping-and-rbinding 
method, but it is very slow.  I tried allocating the whole data frame 
and replacing rows, but ran into issues with new factors.  I would like 
to know if there is a general R approach for efficiently doing this sort 
of data frame construction (assuming R is generally considered 
appropriate for data cleanup and restructuring).

code (works but slow):

    require("gdata") # for reading spreadsheet
    tdrs <- readLines(pipe("ls TDR*.xls"))
    na.strings=c("n", " n", "n ", " ", "jn", "N", "bent", "bent pin", 
"skip")) # ugly but not important

    # allocate empty data frame
    df <- data.frame(measurement.date=character(0),
                    source=character(0),             
                    location=numeric(0),    
                    position=numeric(0),   
                    theta.percent=numeric(0))

    # iterate over spreadsheets
    for (i in 1:length(tdrs)) {
        source <- tdrs[i]  # slightly optimistic
        tdr <- read.xls(source, na.strings=na.strings)

        # standardize column names
        names(tdr)<-c("probe", "X1", "X2", "X3", "X4", "X5")

        # create a date that is nicely sortable
        measurement.date <- paste(paste(20,substr(source,16,17),sep=""), 
# year
                                  substr(source,12,13), # month
                                  substr(source,14,15), sep="/") # day

        for (j in 1:nrow(tdr)) {    # iterate over each spreadsheet row
            tdrrow <- tdr[j,]
            location <- tdrrow$probe

            for (pos in 2:6) {    # normalize the readings
                if (is.na(tdrrow[,pos])) {
                        theta.percent<-NA }
                else theta.percent<-as.numeric(tdrrow[,pos])
                position <- pos - 1
                df <- rbind(df, 
data.frame(measurement.date=measurement.date,
                                           source=source,
                                           location=location,
                                           position=position,
                                           theta.percent=theta.percent))
            }
        }
    }




More information about the R-help mailing list