[R] assigning and saving datasets in a loop, with names changing with "i"

Tony Plate tplate at acm.org
Sat Dec 22 01:55:10 CET 2007


Marie Pierre Sylvestre wrote:
> Dear R users,
> 
> I am analysing a very large data set and I need to perform several data
> manipulations. The dataset is so big that the only way I can play with it
> without having memory problems (e.g. "cannot allocate vector of size...")
> is to write a batch script to:
> 
> 1. cut the data into pieces 
> 2. save the pieces in separate .RData files
> 3. Remove everything from the environment
> 4. load one of the pieces
> 5. perform the manipulations on it
> 6. save it and remove it from the environment
> 7. Redo 4-6 for every piece
> 8. Merge everything together at the end
> 
> It works if coded line by line but since I'll have to perform these tasks
> on other data sets, I am trying to automate this as much as I can. 

The trackObjs package is designed to make it easy to work in approximately 
this manner -- it saves objects automatically to disk but they are still 
accessible as normal.

Here's how you could do the above. This example works with ten 8Mb objects
in an R session limited to 40Mb of vector memory.

# allow R only 40Mb of vector memory
mem.limits(vsize=40e6)
mem.limits()/1e6
library(trackObjs)
# start tracking to store data objects in the directory 'data'
# each object is 8Mb, and we store 10 of them
track.start("data")
n <- 10
m <- 1e6
constructObject <- function(i) i+rnorm(m)
# steps 1, 2 & 3:
for (i in 1:n) {
    xname <- paste("x", i, sep="")
    cat("", xname)
    assign(xname, constructObject(i))
    # store in a file, accessible by name:
    track(list=xname)
}
cat("\n")
gc(TRUE)
# accessing object by name
object.size(x1)/2^20 # In Mb
mean(x1)
mean(x2)
gc(TRUE)
# steps 4:6
# accessing object through a constructed name
result <- sapply(1:n, function(i) mean(get(paste("x", i, sep=""))))
result
# remove the data objects
track.remove(list=paste("x", 1:n, sep=""))
track.stop()
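
If installing a package is not an option, the same eight steps can be
approximated in base R. This is only a minimal sketch under stated
assumptions: the data frame 'big', the function 'my_fun', the grouping column
'g' and the file names are all made up for illustration, and saveRDS() and
readRDS() are used instead of save() and load() so that each piece comes back
as a value rather than under a fixed name.

```r
# Base-R sketch of steps 1-8; every name here is an illustrative assumption.
chunk_dir <- file.path(tempdir(), "chunks")
dir.create(chunk_dir, showWarnings = FALSE)

# Stand-ins for the large data frame and the per-piece manipulation.
big <- data.frame(g = rep(1:4, each = 5), y = seq_len(20))
my_fun <- function(d) transform(d, y2 = y * 2)

# Steps 1-3: cut into pieces, save each to its own file, drop from memory.
pieces <- split(big, big$g)
files <- file.path(chunk_dir, paste("piece", names(pieces), ".rds", sep = ""))
for (k in seq_along(pieces)) saveRDS(pieces[[k]], files[k])
rm(big, pieces)

# Steps 4-7: load one piece at a time, process it, overwrite its file.
for (f in files) {
    d <- readRDS(f)
    saveRDS(my_fun(d), f)
    rm(d)
}

# Step 8: merge the processed pieces.
merged <- do.call(rbind, lapply(files, readRDS))
```

Only one piece is in memory during steps 4-7, but step 8 still needs room
for the full merged result.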

Here's a full transcript of the trackObjs example. Note how, whenever gc()
is called, hardly any vector memory is in use.

 > # allow R only 40Mb of vector memory
 > mem.limits(vsize=40e6)
    nsize    vsize
       NA 40000000
 > mem.limits()/1e6
nsize vsize
    NA    40
 > library(trackObjs)
 > # start tracking to store data objects in the directory 'data'
 > # each object is 8Mb, and we store 10 of them
 > track.start("data")
 > n <- 10
 > m <- 1e6
 > constructObject <- function(i) i+rnorm(m)
 > # steps 1, 2 & 3:
 > for (i in 1:n) {
+    xname <- paste("x", i, sep="")
+    cat("", xname)
+    assign(xname, constructObject(i))
+    # store in a file, accessible by name:
+    track(list=xname)
+ }
  x1 x2 x3 x4 x5 x6 x7 x8 x9 x10> cat("\n")

 > gc(TRUE)
Garbage collection 19 = 6+0+13 (level 2) ...
4.0 Mbytes of cons cells used (42%)
0.7 Mbytes of vectors used (5%)
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 148362  4.0     350000  9.4         NA   350000  9.4
Vcells  89973  0.7    1950935 14.9       38.2  2112735 16.2
 > # accessing object by name
 > object.size(x1)/2^20 # In Mb
[1] 7.629417
 > mean(x1)
[1] 0.998635
 > mean(x2)
[1] 1.999656
 > gc(TRUE)
Garbage collection 22 = 7+1+14 (level 2) ...
4.0 Mbytes of cons cells used (43%)
0.7 Mbytes of vectors used (6%)
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 149264  4.0     350000  9.4         NA   350000  9.4
Vcells  90160  0.7    1560747 12.0       38.2  2112735 16.2
 > # steps 4:6
 > result <- sapply(1:n, function(i) mean(get(paste("x", i, sep=""))))
 > result
  [1]  0.998635  1.999656  2.997368  4.000197  5.000159  6.001216  6.999552
  [8]  7.999743  8.999982 10.001355
 > # remove the data objects
 > track.remove(list=paste("x", 1:n, sep=""))
  [1] "x1"  "x2"  "x3"  "x4"  "x5"  "x6"  "x7"  "x8"  "x9"  "x10"
 > track.stop()
 >
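
To watch vector memory from a script rather than reading the printed gc()
table, the matrix that gc() returns can be indexed directly. A small sketch;
the helper name 'vector_mb' is made up for illustration.

```r
# gc() returns a matrix; row "Vcells", column 2 is vector memory in use (Mb)
vector_mb <- function() gc()["Vcells", 2]

before <- vector_mb()
x <- rnorm(1e6)        # one million doubles, roughly 8Mb
during <- vector_mb()
rm(x)
after <- vector_mb()   # gc() runs inside vector_mb(), so x is collected here

c(before = before, during = during, after = after)
```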



> 
> I am using a loop in which I used 'assign' and 'get' (pseudo code below).
> My problem is when I use 'get', it prints the whole object on the screen.
> I am wondering whether there is a more efficient way to do what I need to
> do. Any help would be appreciated. Please keep in mind that the whole
> process is quite computer-intensive, so I can't keep everything in the
> environment while R performs calculations.
> 
> Say I have 1 big dataframe called data. I use 'split' to divide it into a
> list of 12 dataframes (call this list my.list)
> 
> my.fun is a function that takes a dataframe, performs several
> manipulations on it and returns a dataframe.
> 
> 
> for (i in 1:12){
>   assign( paste( "data", i, sep=""),  my.fun(my.list[i]))   # this works
>   # now I need to save this new object as a RData. 
> 
>   # The following line does not work
>   save(paste("data", i, sep = ""),  file = paste(  paste("data", i, sep = ""), "RData", sep="."))
> 
>   # This works but it is a bit convoluted!!!
>   temp <- get(paste("data", i, sep = ""))
>   save(temp,  file = "lala.RData")
> }
> 
> 
> I am *sure* there is something more clever to do but I can't find it. Any
> help would be appreciated.
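
On the save() line that does not work: save() expects object names, and a
constructed name can be passed as a character vector through its 'list'
argument. A minimal sketch, in which 'my.list', 'my.fun' and 'out_dir' are
stand-ins made up for illustration. Note also that my.list[[i]] extracts the
data frame itself, whereas my.list[i] gives a one-element list.

```r
# Hypothetical stand-ins for the poster's my.list and my.fun.
grp <- rep(1:12, each = 2)
my.list <- split(data.frame(g = grp, y = 1:24), grp)
my.fun <- function(d) d

out_dir <- tempdir()
for (i in 1:12) {
    nm <- paste("data", i, sep = "")
    # [[i]] extracts the data frame; [i] would give a one-element list
    assign(nm, my.fun(my.list[[i]]))
    # pass the constructed name through 'list=' instead of as a bare symbol
    save(list = nm, file = file.path(out_dir, paste(nm, "RData", sep = ".")))
    # drop the object so only one piece is held at a time
    rm(list = nm)
}
```

Loading data3.RData later restores an object called data3, so no temporary
copy made with get() is needed.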
> 
> best regards,
> 
> MP
> 


