[R] Reading chunks of data from a file more efficiently

Waichler, Scott R Scott.Waichler at pnnl.gov
Sun Aug 10 00:31:04 CEST 2014


Hi,

I have some very large (~1.1 GB) output files from a groundwater model called STOMP that I want to read as efficiently as possible.  For each variable there are over 1 million values to read.  Variables are not organized in columns; instead they are written out in sections in the file, like this:

X-Direction Node Positions, m
 5.931450000E+05  5.931550000E+05  5.931650000E+05  5.931750000E+05
 5.932450000E+05  5.932550000E+05  5.932650000E+05  5.932750000E+05
. . . 
 5.946950000E+05  5.947050000E+05  5.947150000E+05  5.947250000E+05
 5.947950000E+05  5.948050000E+05  5.948150000E+05  5.948250000E+05

Y-Direction Node Positions, m
 1.148050000E+05  1.148050000E+05  1.148050000E+05  1.148050000E+05
 1.148050000E+05  1.148050000E+05  1.148050000E+05  1.148050000E+05
. . . 
 1.171950000E+05  1.171950000E+05  1.171950000E+05  1.171950000E+05
 1.171950000E+05  1.171950000E+05  1.171950000E+05  1.171950000E+05

Z-Direction Node Positions, m
 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01
 9.550000000E+01  9.550000000E+01  9.550000000E+01  9.550000000E+01
. . .

I want to read and use only a subset of the variables.  I wrote the function below to find the line where each target variable begins and then scan the values, but it still seems rather slow, perhaps because I am opening and closing the file for each variable.  Can anyone suggest a faster way?

# Reads original STOMP plot file (plot.*) directly.  Should be useful when the plot files are
# very large with lots of variables, and you just want to retrieve a few of them.  
# Arguments:  1) plot filename, 2) number of nodes, 
# 3) character vector of names of target variables you want to return.
# Returns a list with the selected plot output.
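READ.PLOT.OUTPUT6 <- function(plt.file, num.nodes, var.names) {
  lines <- readLines(plt.file)   # whole file is read once here, only to locate the headers
  num.vars <- length(var.names)
  tmp <- vector("list", num.vars)
  for(i in seq_len(num.vars)) {  # seq_len() is safe when var.names is empty
    # line number of the section header for this variable
    ind <- grep(var.names[i], lines, fixed=TRUE, useBytes=TRUE)
    if(length(ind) != 1) stop("Not one line in the plot file with matching variable name.\n")
    # scan() reopens the file and skips past the header on every iteration
    tmp[[i]] <- scan(plt.file, skip=ind, nmax=num.nodes, quiet=TRUE)
  }
  return(tmp)
}  # end READ.PLOT.OUTPUT6()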
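
One variation that might avoid re-reading the file for each variable is to parse the values from the lines that readLines() has already loaded, using scan(text=...). This is only a minimal sketch and has not been timed on a real STOMP file. The function name read_plot_sections and the argument vals.per.line are hypothetical, and the code assumes every section is written with a fixed number of values per line (4 in the excerpt above), with only the last line of a section possibly shorter.

```r
# Hypothetical helper: the name and the vals.per.line argument are illustrative only.
# Assumes each section holds num.nodes values written vals.per.line to a line.
read_plot_sections <- function(plt.file, num.nodes, var.names, vals.per.line = 4) {
  lines <- readLines(plt.file)                    # the only read from disk
  n.lines <- ceiling(num.nodes / vals.per.line)   # data lines per section (assumed layout)
  out <- vector("list", length(var.names))
  names(out) <- var.names
  for (i in seq_along(var.names)) {
    # line number of the section header for this variable
    ind <- grep(var.names[i], lines, fixed = TRUE, useBytes = TRUE)
    if (length(ind) != 1) stop("Expected exactly one header line matching: ", var.names[i])
    if (ind + n.lines > length(lines)) stop("Section runs past end of file: ", var.names[i])
    # parse the in-memory lines instead of reopening the file
    vals <- scan(text = lines[(ind + 1):(ind + n.lines)], quiet = TRUE)
    out[[i]] <- vals[seq_len(num.nodes)]
  }
  out
}

# Small self-contained check on a made-up file with 6 nodes per variable
demo.file <- tempfile()
writeLines(c("X-Direction Node Positions, m",
             " 1.0E+00  2.0E+00  3.0E+00  4.0E+00",
             " 5.0E+00  6.0E+00",
             "",
             "Y-Direction Node Positions, m",
             " 7.0E+00  8.0E+00  9.0E+00  1.0E+01",
             " 1.1E+01  1.2E+01"), demo.file)
demo <- read_plot_sections(demo.file, 6, c("Y-Direction", "X-Direction"))
```

The trade-off is that the whole file still has to fit in memory as a character vector, which the original function already requires.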

Regards,
Scott Waichler
Pacific Northwest National Laboratory
Richland, WA, USA
scott.waichler at pnnl.gov


