[R-sig-eco] Large datasets, sequential processing of subsets

Glen A Sargeant gsargeant at usgs.gov
Tue May 20 17:46:50 CEST 2008


Stephen Sefick asked about methods for processing subsets of very large 
datasets.  His original title was "wavCWT (wmtsa) iterations" but I think 
the subject has evolved into a new one of broader interest that deserves 
more descriptive keywords.

My general approach:

1) Subset the data by day, week, or whatever grouping desired.

2) Write a function that will perform the desired action on 1 subset.

3) Construct a list from the subsets.

4) Use lapply() to execute the function for each component of the list in 
succession.  The output should be a list.

5) Combine components of the output list.

In addition to being useful for large datasets, this approach is also very 
useful when subsets of data are of interest in their own right.
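
As a minimal sketch of steps 1 through 5 for 15-minute readings (96 per day), split() can build the list of subsets in one call. The data frame `oxy`, its column names, and the simulated values are assumptions invented for illustration; they are not from the original data.

```r
# Hypothetical example data: 7 days of 15-minute readings (96 per day).
# Column names and values are invented for illustration.
set.seed(1)
n.days <- 7
oxy <- data.frame(
  day   = rep(seq_len(n.days), each = 96),
  value = rnorm(n.days * 96, mean = 8, sd = 1)
)

# Steps 1 and 3: one list component per day
by.day <- split(oxy, oxy$day)

# Step 2: a function that acts on one subset
daily.summary <- function(X) {
  data.frame(day = X$day[1], n = nrow(X), mean.value = mean(X$value))
}

# Step 4: apply the function to each component in succession
out <- lapply(by.day, daily.summary)

# Step 5: combine the components of the output list
result <- do.call(rbind, out)
result
```

Only one day's subset is handed to the function at a time, so a memory-hungry operation never has to see the whole series at once.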

An example:

#Example that illustrates output to both a
#graphics device (the terminal) and to an
#object; remove readline() to send to another
#graphics device (see e.g., pdf()). Very useful
#for visual inspection of many datasets in
#succession.

#Example data
df <- data.frame(group=factor(rep(c("a","b","c"),
rep(10,3))),x=rnorm(30,0,1),y=rnorm(30,0,1))
str(df)

#Groups contained in the data
subsets <- levels(df$group)
n.subsets <- length(subsets)
subsets
n.subsets

#A list to store subsets
lst <- vector(mode="list",length=n.subsets)
names(lst) <- subsets
lst

#Extract the subsets and store in the list
for(i in subsets){
   lst[[i]] <- subset(df,group==i)
}
lst

#A function that returns both a plot and a
#scalar for each subset;
#"X" is the name used for each component
#in succession and has no other significance

foo <- function(X){
   plot(X$x,X$y)
   readline("Enter to continue, escape to exit")
   return(mean(X$x))
}

output.list <- lapply(lst,foo)

###########################################
#Respond to prompts on the console before
#proceeding beyond this point!
###########################################

output.list
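
For unattended runs, the same pattern can send every plot to one PDF file instead of prompting on the console. This is only a sketch: it rebuilds the example data so that it runs on its own, and the names `foo.pdf` and `plot.file` (a temporary file) are hypothetical.

```r
# Self-contained stand-in for the example data and list of subsets
df <- data.frame(group = factor(rep(c("a", "b", "c"), rep(10, 3))),
                 x = rnorm(30, 0, 1), y = rnorm(30, 0, 1))
lst <- split(df, df$group)

# Same idea as foo(), minus the readline() prompt
foo.pdf <- function(X) {
  plot(X$x, X$y, main = as.character(X$group[1]))
  mean(X$x)
}

plot.file <- tempfile(fileext = ".pdf")  # hypothetical output location
pdf(plot.file)                           # open the device: one page per plot
output.list <- lapply(lst, foo.pdf)
dev.off()                                # close the device to finish the file
```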

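#Combine the components of a list into one object:
#c() for scalars, rbind() for matrices or data frames
`list.to.frame` <-
function(df.list,type="scalar")
{
   df <- df.list[[1]]
   #seq_along()[-1] is empty for a 1-component list;
   #2:length() would count down from 2 to 1 and fail
   for(i in seq_along(df.list)[-1]){
       if(type=="matrix" || type=="frame"){
           df <- rbind(df,df.list[[i]])}
       if(type=="scalar"){
           df <- c(df,df.list[[i]])}
   }
   df
}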

output.vector <- list.to.frame(output.list,type="scalar")
output.vector
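
Base R can also do the combining step directly. The sketch below uses small made-up lists rather than the plotting example above (all names and values are hypothetical): unlist() for scalar results, do.call() with rbind() for data frames or matrices, and vapply() to apply and combine in one call.

```r
# Hypothetical per-subset results, standing in for output.list
scalar.list <- list(a = 0.5, b = -1.2, c = 2.0)
unlist(scalar.list)                      # named numeric vector

# One-row data frames, e.g. several summaries per subset
frame.list <- list(
  a = data.frame(n = 10, mean.x = 0.5),
  b = data.frame(n = 10, mean.x = -1.2),
  c = data.frame(n = 10, mean.x = 2.0)
)
combined <- do.call(rbind, frame.list)   # one row per subset
combined

# vapply() applies a function and combines the results in one step,
# and checks that each result is a single number
means <- vapply(frame.list, function(X) X$mean.x, numeric(1))
means
```

Unlike growing an object inside a loop, these calls allocate the combined result once, which matters when the list has many components.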

*************************************************
Glen A. Sargeant, Ph.D.
Research Wildlife Biologist/Statistician
Northern Prairie Wildlife Research Center
8711 37th Street SE
Jamestown, ND  58401

Phone: (701) 253-5528
E-mail:  glen_sargeant at usgs.gov
FAX:     (701) 253-5553
*************************************************



"stephen sefick" <ssefick at gmail.com> 
05/20/2008 09:09 AM
To: "Glen A Sargeant" <gsargeant at usgs.gov>
Subject: Re: [R-sig-eco] wavCWT (wmtsa) iterations?

I am working with Dissolved Oxygen data.  I am subsetting it into 
individual time series for analysis.  This is just to ease computation and 
to keep me straight on what I am working on.  That would be great if you 
could send me an example. 
thanks

On Tue, May 20, 2008 at 9:59 AM, Glen A Sargeant <gsargeant at usgs.gov> 
wrote:
Steve,

What type of data are you working on? I may be able to send you an
example.

Glen




"stephen sefick" <ssefick at gmail.com>
Sent by: r-sig-ecology-bounces at r-project.org
05/19/2008 01:39 PM
To: r-sig-ecology at r-project.org
Subject: [R-sig-eco] wavCWT (wmtsa) iterations?

I have hit the maximum memory on my poor little computer.  Is there a way to
just select the sections that I would like to look at instead of doing the
transform on the whole dataset?  scale.range only works when I specify
deltat(x.ts).  It would be nice to start this at, say, a day to a week.  My
data are in 15-min. intervals, so this could correspond to 96 to 672 readings.
The other thing I was wondering is whether I could do this on subsets of
the data and then combine them into one big plot for the CWT of the entire
data set: iterate through "chunks" and then combine them at the end?
thanks

Error: cannot allocate vector of size 98.2 Mb
In addition: Warning messages:
1: In wavCWT(x.ts) :
 Reached total allocation of 502Mb: see help(memory.size)
2: In wavCWT(x.ts) :
 Reached total allocation of 502Mb: see help(memory.size)
3: In wavCWT(x.ts) :
 Reached total allocation of 502Mb: see help(memory.size)
4: In wavCWT(x.ts) :
 Reached total allocation of 502Mb: see help(memory.size)

--
Let's not spend our time and resources thinking about things that are so
little or so large that all they really do for us is puff us up and make
us feel like gods. We are mammals, and have not exhausted the annoying
little problems of being mammals.

-K. Mullis


_______________________________________________
R-sig-ecology mailing list
R-sig-ecology at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology




