[R] What is the best way to efficiently construct a data frame from multiple source files?
Mark Connolly
mark_connolly at acm.org
Thu Oct 29 19:45:15 CET 2009
I have an arbitrary number of spreadsheets that I want to consolidate
into a single data frame. The spreadsheets all have the same structure:
location, depth1Reading, depth2reading, depth3reading, depth4reading,
depth5reading
The spreadsheets have their reading date in their file name. This gets
parsed out and added to the data frame as a factor. The file name gets
recorded in the data frame as a reference factor. The depth readings
are put into a normal form in the data frame.
The target data frame structure is
str(df)
'data.frame': 23100 obs. of 5 variables:
$ measurement.date: Factor w/ 77 levels "2005/01/07","2003/01/08",..: 1
1 1 1 1 1 1 1 1 1 ...
$ source : Factor w/ 77 levels "TDRREADINGS010705.xls",..: 1 1
1 1 1 1 1 1 1 1 ...
$ location : int 1 1 1 1 1 2 2 2 2 2 ...
$ position : num 1 2 3 4 5 1 2 3 4 5 ...
$ theta.percent : num 24.8 23.5 30.7 26.6 NA 20.7 28.2 24.3 20.6 10 ...
I am successfully using the following (nested) looping-and-rbinding
method, but it is very slow. I tried allocating the whole data frame
and replacing rows, but ran into issues with new factors. I would like
to know if there is a general R approach for efficiently doing this sort
of data frame construction (assuming R is generally considered
appropriate for data cleanup and restructuring).
code (works but slow):
require("gdata") # for reading spreadsheet
tdrs <- readLines(pipe("ls TDR*.xls"))
na.strings=c("n", " n", "n ", " ", "jn", "N", "bent", "bent pin",
"skip")) # ugly but not important
# allocate empty data frame
df <- data.frame(measurement.date=character(0),
source=character(0),
location=numeric(0),
position=numeric(0),
theta.percent=numeric(0))
# iterate over spreadsheets
for (i in 1:length(tdrs)) {
source <- tdrs[i] # slightly optimistic
tdr <- read.xls(source, na.strings=na.strings)
# standardize column names
names(tdr)<-c("probe", "X1", "X2", "X3", "X4", "X5")
# create a date that is nicely sortable
measurement.date <- paste(paste(20,substr(source,16,17),sep=""),
# year
substr(source,12,13), # month
substr(source,14,15), sep="/") # day
for (j in 1:nrow(tdr)) { # iterate over each spreadsheet row
tdrrow <- tdr[j,]
location <- tdrrow$probe
for (pos in 2:6) { # normalize the readings
if (is.na(tdrrow[,pos])) {
theta.percent<-NA }
else theta.percent<-as.numeric(tdrrow[,pos])
position <- pos - 1
df <- rbind(df,
data.frame(measurement.date=measurement.date,
source=source,
location=location,
position=position,
theta.percent=theta.percent))
}
}
}
More information about the R-help
mailing list