[R-sig-eco] Large datasets, sequential processing of subsets
Glen A Sargeant
gsargeant at usgs.gov
Tue May 20 17:46:50 CEST 2008
Stephen Sefick asked about methods for processing subsets of very large
datasets. His original title was "wavCWT (wmtsa) iterations" but I think
the subject has evolved into a new one of broader interest that deserves
more descriptive keywords.
My general approach:
1) Subset the data by day, week, or whatever grouping desired.
2) Write a function that will perform the desired action on 1 subset.
3) Construct a list from the subsets.
4) Use lapply() to execute the function for each component of the list in
succession. The output should be a list.
5) Combine components of the output list.
In addition to being useful for large datasets, this approach is also very
useful when subsets of data are of interest in their own right.
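The five steps above can be sketched compactly in base R. This is only a minimal illustration, assuming readings at 15-minute intervals (96 per day) grouped by day; the data are simulated, and the names dat, by.day, and summarize.subset are hypothetical placeholders for your own data and analysis.

```r
# Hypothetical data: 7 days of readings at 15-minute intervals (96 per day)
set.seed(1)
n.per.day <- 96
n.days <- 7
dat <- data.frame(
  day = rep(seq_len(n.days), each = n.per.day),
  do  = rnorm(n.per.day * n.days, mean = 8, sd = 1)
)

# Steps 1 and 3: split() subsets the data and builds the list in one call
by.day <- split(dat, dat$day)

# Step 2: a function that acts on one subset (a placeholder summary here;
# substitute whatever analysis you need)
summarize.subset <- function(X) {
  data.frame(day = X$day[1], n = nrow(X), mean.do = mean(X$do))
}

# Step 4: apply the function to each component of the list in succession
out.list <- lapply(by.day, summarize.subset)

# Step 5: combine the components of the output list into one data frame
out <- do.call(rbind, out.list)
out
```

Only one subset passes through the function at a time, so the function's working memory scales with the size of a subset rather than with the whole dataset.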
An example:
#Example that illustrates output to both a
#graphics device (the terminal) and to an
#object; remove readline() to send to another
#graphics device (see e.g., pdf()). Very useful
#for visual inspection of many datasets in
#succession.
#Example data
df <- data.frame(group=factor(rep(c("a","b","c"),rep(10,3))),
                 x=rnorm(30,0,1),
                 y=rnorm(30,0,1))
str(df)
#Groups contained in the data
subsets <- levels(df$group)
n.subsets <- length(subsets)
subsets
n.subsets
#A list to store subsets
lst <- vector(mode="list",length=n.subsets)
names(lst) <- subsets
lst
#Extract the subsets and store in the list
for(i in subsets){
  lst[[i]] <- subset(df,group==i)
}
lst
#A function that returns both a plot and a
#scalar for each subset;
#"X" is the name used for each component
#in succession and has no other significance
foo <- function(X){
  plot(X$x,X$y)
  readline("Enter to continue, escape to exit")
  return(mean(X$x))
}
output.list <- lapply(lst,foo)
###########################################
#Respond to prompts on the console before
#proceeding beyond this point!
###########################################
output.list
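#Combine the components of a list into one object:
#rbind() for matrices or data frames, c() for scalars
`list.to.frame` <-
function(df.list,type="scalar")
{
  df <- df.list[[1]]
  #seq_along()[-1] is empty for a list of length 1,
  #whereas 2:length(df.list) would count down from 2 to 1
  for(i in seq_along(df.list)[-1]){
    if(type=="matrix" || type=="frame"){
      df <- rbind(df,df.list[[i]])}
    if(type=="scalar"){
      df <- c(df,df.list[[i]])}
  }
  df
}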
output.vector <- list.to.frame(output.list,type="scalar")
output.vector
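Base R can also do the combining step without a helper function. The sketch below is an alternative offered for comparison, not a replacement for the function above; scalar.list and frame.list are hypothetical stand-ins for the output of lapply().

```r
# Hypothetical output lists, standing in for the result of lapply()
scalar.list <- list(a = 0.5, b = -1.2, c = 2.0)
frame.list <- list(
  a = data.frame(group = "a", m = 0.5),
  b = data.frame(group = "b", m = -1.2)
)

# Scalars: unlist() returns a vector and keeps the component names
scalar.vector <- unlist(scalar.list)
scalar.vector

# Data frames or matrices: do.call() passes every component to rbind() at once
combined <- do.call(rbind, frame.list)
combined

# sapply() merges steps 4 and 5 when each result is a single value
doubled <- sapply(scalar.list, function(x) x * 2)
doubled
```

Calling rbind() once through do.call() avoids growing the result inside a loop, which can matter when the list has many components.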
*************************************************
Glen A. Sargeant, Ph.D.
Research Wildlife Biologist/Statistician
Northern Prairie Wildlife Research Center
8711 37th Street SE
Jamestown, ND 58401
Phone: (701) 253-5528
E-mail: glen_sargeant at usgs.gov
FAX: (701) 253-5553
*************************************************
"stephen sefick" <ssefick at gmail.com>
05/20/2008 09:09 AM
To: "Glen A Sargeant" <gsargeant at usgs.gov>
cc:
Subject: Re: [R-sig-eco] wavCWT (wmtsa) iterations?
I am working with Dissolved Oxygen data. I am subsetting it into
individual time series for analysis. This is just to ease computation and
to keep me straight on what I am working on. That would be great if you
could send me an example.
thanks
On Tue, May 20, 2008 at 9:59 AM, Glen A Sargeant <gsargeant at usgs.gov>
wrote:
Steve,
What type of data are you working on? I may be able to send you an
example.
Glen
"stephen sefick" <ssefick at gmail.com>
Sent by: r-sig-ecology-bounces at r-project.org
05/19/2008 01:39 PM
To: r-sig-ecology at r-project.org
cc:
Subject: [R-sig-eco] wavCWT (wmtsa) iterations?
I have hit the max memory for my poor little computer. Is there a way to
transform just the sections that I would like to look at instead of doing
the transform on the whole dataset? scale.range only works when I specify
deltat(x.ts). It would be nice to start this at, say, a day to a week. My
data are at 15-min intervals, so this would correspond to 96 to 672
readings.
The other thing I was wondering is whether I could do this on subsets of
the data and then combine them into one big plot for the CWT of the entire
data set: iterate through "chunks" and then combine them at the end?
thanks
Error: cannot allocate vector of size 98.2 Mb
In addition: Warning messages:
1: In wavCWT(x.ts) :
Reached total allocation of 502Mb: see help(memory.size)
2: In wavCWT(x.ts) :
Reached total allocation of 502Mb: see help(memory.size)
3: In wavCWT(x.ts) :
Reached total allocation of 502Mb: see help(memory.size)
4: In wavCWT(x.ts) :
Reached total allocation of 502Mb: see help(memory.size)
--
Let's not spend our time and resources thinking about things that are so
little or so large that all they really do for us is puff us up and make
us feel like gods. We are mammals, and have not exhausted the annoying
little problems of being mammals.
-K. Mullis
_______________________________________________
R-sig-ecology mailing list
R-sig-ecology at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology