[R] Reading chunks of data from a file more efficiently
Jeff Newmiller
jdnewmil at dcn.davis.ca.us
Sun Aug 10 03:14:05 CEST 2014
Informally abbreviating data (the ". . ." lines) is not recommended. I faked
some data, but would appreciate it if you would make your example reproducible
next time.
All I really did for performance was reuse the lines you had already read in
with readLines() rather than re-scanning the file for each variable.
# generated by using dput()
lines <- c("X-Direction Node Positions, m",
" 5.931450000E+05 5.931550000E+05 5.931650000E+05 5.931750000E+05",
" 5.932450000E+05 5.932550000E+05 5.932650000E+05 5.932750000E+05",
" 5.946950000E+05 5.947050000E+05 5.947150000E+05 5.947250000E+05",
" 5.947950000E+05 5.948050000E+05 5.948150000E+05 5.948250000E+05",
"",
"Y-Direction Node Positions, m",
" 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05",
" 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05",
" 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05",
" 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05",
"",
"Z-Direction Node Positions, m",
" 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
" 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
" 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
" 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
"",
"X-Direction Node Positions, n",
" 5.931450000E+05 5.931550000E+05 5.931650000E+05 5.931750000E+05",
" 5.932450000E+05 5.932550000E+05 5.932650000E+05 5.932750000E+05",
" 5.946950000E+05 5.947050000E+05 5.947150000E+05 5.947250000E+05",
" 5.947950000E+05 5.948050000E+05 5.948150000E+05 5.948250000E+05",
"",
"Y-Direction Node Positions, n",
" 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05",
" 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05",
" 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05",
" 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05",
"",
"Z-Direction Node Positions, n",
" 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
" 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
" 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
" 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01",
"", "")
getDimVar <- function( lines, Dim, specifiedvar, starts ) {
  # line index of the header for this dimension and variable
  vstart <- grep( paste0( "^", Dim, "-Direction Node Positions, "
                        , specifiedvar, "$" ), lines )
  # position of that header among all section headers
  startv <- match( vstart, starts )
  if ( 0 == length( startv ) ) {
    stop( "Variable ", specifiedvar, " not found" )
  }
  # the section runs to the line before the next header, or to the end
  if ( length( starts ) == startv ) {
    vend <- length( lines )
  } else {
    vend <- starts[ startv + 1 ] - 1
  }
  # parse the numbers from the lines already in memory
  tcon <- textConnection( lines[ seq( vstart + 1, vend ) ] )
  result <- scan( tcon )
  close( tcon )
  result
}
starts <- grep( "^[XYZ]-Direction Node Positions, ", lines )
specifiedvar <- "n"
n <- data.frame( X=getDimVar( lines, "X", specifiedvar, starts )
               , Y=getDimVar( lines, "Y", specifiedvar, starts )
               , Z=getDimVar( lines, "Z", specifiedvar, starts ) )
# test a variable that doesn't exist (getDimVar() stops with an error)
specifiedvar <- "o"
o <- data.frame( X=getDimVar( lines, "X", specifiedvar, starts )
               , Y=getDimVar( lines, "Y", specifiedvar, starts )
               , Z=getDimVar( lines, "Z", specifiedvar, starts ) )
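
The same idea can be applied to a file on disk: read it once, locate the
section headers once, and parse each requested section from memory. The
following is only a sketch. readPlotVars() and the small demo file are
hypothetical names made up for illustration, and the sketch assumes that
every section header in the file matches the "-Direction Node Positions, "
pattern used above; a real STOMP plot file with other kinds of sections
would need a broader header pattern.

```r
# Hypothetical helper (not from the original post): read the file once,
# then pull every requested variable out of the in-memory lines.
readPlotVars <- function(plt.file, var.names) {
  lines  <- readLines(plt.file)
  # Assumption: all section headers match this pattern.
  starts <- grep("^[XYZ]-Direction Node Positions, ", lines)
  # Each section ends just before the next header; the last one
  # runs to the end of the file.
  ends   <- c(starts[-1] - 1, length(lines))
  idx    <- match(var.names, trimws(lines[starts]))
  if (anyNA(idx)) {
    stop("Variable(s) not found: ",
         paste(var.names[is.na(idx)], collapse = "; "))
  }
  out <- lapply(idx, function(i) {
    # seq_len() keeps an empty section from producing a descending index
    scan(text = lines[starts[i] + seq_len(ends[i] - starts[i])],
         quiet = TRUE)
  })
  names(out) <- var.names
  out
}

# Small made-up file in the same layout as the example data
demo.file <- tempfile(fileext = ".txt")
writeLines(c(
  "X-Direction Node Positions, m",
  " 5.931450000E+05 5.931550000E+05",
  " 5.932450000E+05 5.932550000E+05",
  "",
  "Y-Direction Node Positions, m",
  " 1.148050000E+05 1.148050000E+05",
  " 1.171950000E+05 1.171950000E+05",
  "",
  "Z-Direction Node Positions, m",
  " 9.550000000E+01 9.550000000E+01",
  " 9.550000000E+01 9.550000000E+01",
  ""
), demo.file)

vars <- readPlotVars(demo.file,
                     c("X-Direction Node Positions, m",
                       "Z-Direction Node Positions, m"))
str(vars)
unlink(demo.file)
```

Matching on the full header text rather than a fixed-string grep() of the
whole file avoids accidental partial matches, and the file is opened
only once no matter how many variables are requested.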
On Sat, 9 Aug 2014, Waichler, Scott R wrote:
> Hi,
>
> I have some very large (~1.1 GB) output files from a groundwater model called STOMP that I want to read as efficiently as possible. For each variable there are over 1 million values to read. Variables are not organized in columns; instead they are written out in sections in the file, like this:
>
> X-Direction Node Positions, m
> 5.931450000E+05 5.931550000E+05 5.931650000E+05 5.931750000E+05
> 5.932450000E+05 5.932550000E+05 5.932650000E+05 5.932750000E+05
> . . .
> 5.946950000E+05 5.947050000E+05 5.947150000E+05 5.947250000E+05
> 5.947950000E+05 5.948050000E+05 5.948150000E+05 5.948250000E+05
>
> Y-Direction Node Positions, m
> 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05
> 1.148050000E+05 1.148050000E+05 1.148050000E+05 1.148050000E+05
> . . .
> 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05
> 1.171950000E+05 1.171950000E+05 1.171950000E+05 1.171950000E+05
>
> Z-Direction Node Positions, m
> 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01
> 9.550000000E+01 9.550000000E+01 9.550000000E+01 9.550000000E+01
> . . .
>
> I want to read and use only a subset of the variables. I wrote the function below to find the line where each target variable begins and then scan the values, but it still seems rather slow, perhaps because I am opening and closing the file for each variable. Can anyone suggest a faster way?
>
> # Reads original STOMP plot file (plot.*) directly. Should be useful when the plot files are
> # very large with lots of variables, and you just want to retrieve a few of them.
> # Arguments: 1) plot filename, 2) number of nodes,
> # 3) character vector of names of target variables you want to return.
> # Returns a list with the selected plot output.
> READ.PLOT.OUTPUT6 <- function(plt.file, num.nodes, var.names) {
>   lines <- readLines(plt.file)
>   num.vars <- length(var.names)
>   tmp <- list()
>   for(i in 1:num.vars) {
>     ind <- grep(var.names[i], lines, fixed=T, useBytes=T)
>     if(length(ind) != 1) stop("Not one line in the plot file with matching variable name.\n")
>     tmp[[i]] <- scan(plt.file, skip=ind, nmax=num.nodes, quiet=T)
>   }
>   return(tmp)
> } # end READ.PLOT.OUTPUT6()
>
> Regards,
> Scott Waichler
> Pacific Northwest National Laboratory
> Richland, WA, USA
> scott.waichler at pnnl.gov
>
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k