[R] assigning and saving datasets in a loop, with names changing with "i"
Tony Plate
tplate at acm.org
Sat Dec 22 01:55:10 CET 2007
Marie Pierre Sylvestre wrote:
> Dear R users,
>
> I am analysing a very large data set and I need to perform several data
> manipulations. The dataset is so big that the only way I can play with it
> without having memory problems (E.g. "cannot allocate vectors of size...")
> is to write a batch script to:
>
> 1. cut the data into pieces
> 2. save the pieces in separate .RData files
> 3. Remove everything from the environment
> 4. load one of the pieces
> 5. perform the manipulations on it
> 6. save it and remove it from the environment
> 7. Redo 4-6 for every piece
> 8. Merge everything together at the end
>
> It works if coded line by line but since I'll have to perform these tasks
> on other data sets, I am trying to automate this as much as I can.

The trackObjs package is designed to make it easy to work in approximately
this manner: it saves objects to disk automatically, but they remain
accessible by name as normal.

Here's how you could do the above. This example works with ten 8 Mb objects
in an R session limited to 40 Mb of vector memory.

# allow R only 40Mb of vector memory
mem.limits(vsize=40e6)
mem.limits()/1e6
library(trackObjs)
# start tracking to store data objects in the directory 'data'
# each object is 8Mb, and we store 10 of them
track.start("data")
n <- 10
m <- 1e6
constructObject <- function(i) i+rnorm(m)
# steps 1, 2 & 3:
for (i in 1:n) {
    xname <- paste("x", i, sep="")
    cat("", xname)
    assign(xname, constructObject(i))
    # store in a file, accessible by name:
    track(list=xname)
}
cat("\n")
gc(TRUE)
# accessing object by name
object.size(x1)/2^20 # In Mb
mean(x1)
mean(x2)
gc(TRUE)
# steps 4:6
# accessing object through a constructed name
result <- sapply(1:n, function(i) mean(get(paste("x", i, sep=""))))
result
# remove the data objects
track.remove(list=paste("x", 1:n, sep=""))
track.stop()

Here's a full transcript of the above. Note that whenever gc() is called,
hardly any vector memory is in use.

> # allow R only 40Mb of vector memory
> mem.limits(vsize=40e6)
   nsize    vsize
      NA 40000000
> mem.limits()/1e6
nsize vsize
   NA    40
> library(trackObjs)
> # start tracking to store data objects in the directory 'data'
> # each object is 8Mb, and we store 10 of them
> track.start("data")
> n <- 10
> m <- 1e6
> constructObject <- function(i) i+rnorm(m)
> # steps 1, 2 & 3:
> for (i in 1:n) {
+     xname <- paste("x", i, sep="")
+     cat("", xname)
+     assign(xname, constructObject(i))
+     # store in a file, accessible by name:
+     track(list=xname)
+ }
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10> cat("\n")
> gc(TRUE)
Garbage collection 19 = 6+0+13 (level 2) ...
4.0 Mbytes of cons cells used (42%)
0.7 Mbytes of vectors used (5%)
         used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 148362  4.0     350000  9.4         NA   350000  9.4
Vcells  89973  0.7    1950935 14.9       38.2  2112735 16.2
> # accessing object by name
> object.size(x1)/2^20 # In Mb
[1] 7.629417
> mean(x1)
[1] 0.998635
> mean(x2)
[1] 1.999656
> gc(TRUE)
Garbage collection 22 = 7+1+14 (level 2) ...
4.0 Mbytes of cons cells used (43%)
0.7 Mbytes of vectors used (6%)
         used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 149264  4.0     350000  9.4         NA   350000  9.4
Vcells  90160  0.7    1560747 12.0       38.2  2112735 16.2
> # steps 4:6
> result <- sapply(1:n, function(i) mean(get(paste("x", i, sep=""))))
> result
[1] 0.998635 1.999656 2.997368 4.000197 5.000159 6.001216 6.999552
[8] 7.999743 8.999982 10.001355
> # remove the data objects
> track.remove(list=paste("x", 1:n, sep=""))
[1] "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9" "x10"
> track.stop()
>
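
For comparison, the eight steps in the question can also be sketched in
base R without trackObjs. This is only an illustration under stated
assumptions: 'big', 'my.fun', the grouping column 'g' and the file names
are hypothetical stand-ins, and saveRDS()/readRDS() (available in current
versions of R) are used in place of .RData files so that each piece can
be read back under whatever name is convenient.

```r
# Hypothetical stand-ins for the real data set and the real my.fun.
big <- data.frame(g = rep(1:4, each = 25), x = seq_len(100))
my.fun <- function(d) transform(d, y = x * 2)

# Scratch directory for the pieces (an assumption; use any directory).
out.dir <- tempfile("pieces")
dir.create(out.dir)

# Steps 1-3: cut the data into pieces, save each piece, clean up.
pieces <- split(big, big$g)
files <- file.path(out.dir, paste("piece", names(pieces), ".rds", sep = ""))
for (k in seq_along(pieces)) saveRDS(pieces[[k]], files[k])
rm(pieces, big)

# Steps 4-7: load one piece, process it, save the result, drop it.
for (f in files) {
    piece <- readRDS(f)
    saveRDS(my.fun(piece), f)
    rm(piece)
}

# Step 8: merge everything together at the end.
merged <- do.call(rbind, lapply(files, readRDS))
```

Only one piece is held in memory during steps 4-7. The final merge still
needs room for the whole result, so it is the step most likely to hit the
memory limit.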
>
> I am using a loop in which I used 'assign' and 'get' (pseudo code below).
> My problem is when I use 'get', it prints the whole object on the screen.
> I am wondering whether there is a more efficient way to do what I need to
> do. Any help would be appreciated. Please keep in mind that the whole
> process is quite computer-intensive, so I can't keep everything in the
> environment while R performs calculations.
>
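
On the printing: I can't tell from the pseudo code exactly where the
output comes from, but one likely cause is R's auto-printing. A value
returned by get() at top level is printed like any other top-level
expression, whereas assigning it (or calling get() inside a for loop)
prints nothing. A small sketch, with 'x1' and 'nm' as made-up names:

```r
x1 <- rnorm(5)
nm <- "x1"

get(nm)              # top-level expression: the value is auto-printed
tmp <- get(nm)       # assigned: nothing is printed
invisible(get(nm))   # evaluated, but printing is suppressed

# Inside a for loop, values are not auto-printed either.
for (k in 1:2) get(nm)
```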
> Say I have 1 big dataframe called data. I use 'split' to divide it into a
> list of 12 dataframes (call this list my.list)
>
> my.fun is a function that takes a dataframe, performs several
> manipulations on it and returns a dataframe.
>
>
> for (i in 1:12){
> assign( paste( "data", i, sep=""), my.fun(my.list[i])) # this works
> # now I need to save this new object as a RData.
>
> # The following line does not work
> save(paste("data", i, sep = ""), file = paste( paste("data", i, sep =
> ""), "RData", sep="."))
> }
>
> # This works but it is a bit convoluted!!!
> temp <- get(paste("data", i, sep = ""))
> save(temp, file = "lala.RData")
> }
>
>
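
The save() line above fails because save() treats its unnamed arguments
as the names of objects, so it looks for an object literally called
paste("data", i, sep = ""). Its 'list' argument accepts a character
vector of names instead, which is the same idiom as track(list=xname).
A minimal sketch, assuming a toy 'my.list' and a placeholder 'my.fun'
(both hypothetical), and writing to a temporary directory:

```r
# Toy stand-ins for the real list of data frames and the real my.fun.
toy <- data.frame(g = rep(1:3, each = 2), x = 1:6)
my.list <- split(toy, toy$g)
my.fun <- function(d) d
out.dir <- tempdir()

for (i in seq_along(my.list)) {
    xname <- paste("data", i, sep = "")
    # [[i]] extracts the data frame itself; [i] gives a one-element list
    assign(xname, my.fun(my.list[[i]]))
    # save the object whose name is stored in xname
    save(list = xname,
         file = file.path(out.dir, paste(xname, "RData", sep = ".")))
    rm(list = xname)
}

# Loading into a separate environment restores the object under its
# original name without cluttering the workspace.
e <- new.env()
loaded <- load(file.path(out.dir, "data2.RData"), envir = e)
```

Here 'loaded' holds the names of the restored objects, so the data can be
reached with get(loaded, envir = e) without hard-coding the name.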
> I am *sure* there is something more clever to do but I can't find it. Any
> help would be appreciated.
>
> best regards,
>
> MP
>