[Rd] Model object, when generated in a function, saves entire environment when saved
Kenny Bell
kmbell56 at gmail.com
Wed Jan 29 20:25:53 CET 2020
Reviving an old thread. I haven't noticed this being a problem for a while
when saving RDS files, which is great. However, I noticed the problem again
when saving `qs` files (https://github.com/traversc/qs); `qs` is an RDS
replacement with a fast serialization / compression system.
I'd like to get an idea of what change was made within R to address this
issue for `saveRDS`. My thought is that this will help the author of the
`qs` package do something similar. I have had a browse through the release
notes for the last few years (Ctrl-F-ing "environment") and couldn't see it.
Many thanks for any help and best wishes to all.
The following code uses R 3.6.2 and requires you to run
install.packages("qs") first:
save_size_qs <- function (object) {
  tf <- tempfile(fileext = ".qs")
  on.exit(unlink(tf))
  qs::qsave(object, file = tf)
  file.size(tf)
}
save_size_rds <- function (object) {
  tf <- tempfile(fileext = ".rds")
  on.exit(unlink(tf))
  saveRDS(object, file = tf)
  file.size(tf)
}
normal_lm <- function(){
  junk <- 1:1e+08
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}
normal_ggplot <- function(){
  junk <- 1:1e+08
  ggplot2::ggplot()
}
clean_lm <- function () {
  junk <- 1:1e+08
  # Run the lm in its own environment
  env <- new.env(parent = globalenv())
  env$subset <- subset
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}
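As a quick check of what each kind of model object carries along, the environment attached to the model's terms can be listed directly. This is only a minimal sketch with a much smaller `junk` vector; `small_lm` and `small_clean_lm` are hypothetical names used for illustration and are not part of the timings below.

```r
# Hypothetical, smaller versions of normal_lm() and clean_lm()
small_lm <- function() {
  junk <- 1:1e4
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}
small_clean_lm <- function() {
  junk <- 1:1e4
  # Fit the model in a fresh environment whose parent is the global env
  env <- new.env(parent = globalenv())
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}

# The terms object keeps a reference to the environment the formula
# was created in; list what lives in that environment
captured_normal <- ls(attr(terms(small_lm()), ".Environment"))
captured_clean  <- ls(attr(terms(small_clean_lm()), ".Environment"))
captured_normal
captured_clean
```

If the function's frame is being captured, `captured_normal` should contain `"junk"` while `captured_clean` should be empty.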
# The qs save size includes the junk but the rds does not
save_size_qs(normal_lm())
#> [1] 848396
save_size_rds(normal_lm())
#> [1] 4163
save_size_qs(normal_ggplot())
#> [1] 857446
save_size_rds(normal_ggplot())
#> [1] 12895
# Both exclude the junk when separating the lm into its own environment
save_size_qs(clean_lm())
#> [1] 6154
save_size_rds(clean_lm())
#> [1] 4255
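One possible explanation for the gap, offered as an assumption rather than something confirmed in this thread: since R 3.5.0, `1:n` is stored as a compact (ALTREP) sequence, and R's own serialization can write it in that compact form. If so, `junk` may still be present in the RDS file and simply cost very little. A minimal sketch comparing a compact sequence, an expanded copy of the same values, and a model whose function environment holds non-sequence junk (`random_junk_lm` is a hypothetical name used only here):

```r
compact  <- 1:1e6         # compact sequence in R >= 3.5.0 (assumed)
expanded <- compact + 0L  # arithmetic returns an ordinary integer vector

# Size in bytes of the serialized form, without compression
size_compact  <- length(serialize(compact, NULL))
size_expanded <- length(serialize(expanded, NULL))

# Same model as normal_lm(), but with junk that is not a sequence
random_junk_lm <- function() {
  junk <- rnorm(1e5)
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}
size_random <- length(serialize(random_junk_lm(), NULL))

c(compact = size_compact, expanded = size_expanded, random_lm = size_random)
```

If this is what is going on, the environment is still being saved by `saveRDS`, and the small file sizes above come from how `1:1e+08` is stored rather than from a change in what gets serialized.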
On Thu, Jul 28, 2016 at 7:31 AM Kenny Bell <kmbell56 at gmail.com> wrote:
> Thanks so much for all this.
>
> The first solution is what I'm going with as I want the terms object to
> come along so that predict still works.
>
> On Wed, Jul 27, 2016 at 12:28 PM, William Dunlap via R-devel
> <r-devel at r-project.org> wrote:
>
>> Another solution is to only save the parts of the model object that
>> interest you. As long as they don't include the formula (which is
>> what drags along the environment it was created in), you will
>> save space. E.g.,
>>
>> tfun2 <- function(subset) {
>>   junk <- 1:1e6
>>   list(subset=subset,
>>        lm(Sepal.Length ~ Sepal.Width, data=iris, subset=subset)$coef)
>> }
>>
>> saveSize(tfun2(1:4))
>> #[1] 152
>>
>>
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Wed, Jul 27, 2016 at 11:19 AM, William Dunlap <wdunlap at tibco.com> wrote:
>>
>> > One way around this problem is to make a new environment whose
>> > parent environment is .GlobalEnv and which contains only what the
>> > call to lm() requires, and to compute lm() in that environment. E.g.,
>> >
>> > tfun1 <- function (subset)
>> > {
>> >   junk <- 1:1e+06
>> >   env <- new.env(parent = globalenv())
>> >   env$subset <- subset
>> >   with(env, lm(Sepal.Length ~ Sepal.Width, data = iris, subset = subset))
>> > }
>> > Then we get
>> > > saveSize(tfun1(1:4)) # see below for def. of saveSize
>> > [1] 910
>> > instead of the 2129743 bytes in the save file when using the naive method.
>> >
>> > saveSize <- function (object) {
>> >   tf <- tempfile(fileext = ".RData")
>> >   on.exit(unlink(tf))
>> >   save(object, file = tf)
>> >   file.size(tf)
>> > }
>> >
>> >
>> >
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com
>> >
>> > On Wed, Jul 27, 2016 at 10:48 AM, Kenny Bell <kmb56 at berkeley.edu> wrote:
>> >
>> >> In the below, I generate a model from an environment that isn't
>> >> .GlobalEnv with a large object that is unrelated to the model
>> >> generation. It seems to save the irrelevant object unnecessarily. In
>> >> my actual use case, I am running and saving many models in a loop that
>> >> each use a single large data.frame (that gets collapsed into a small
>> >> data.frame for estimation), so removing it isn't an option.
>> >>
>> >> In the case where the model exists in .GlobalEnv, everything is
>> >> peachy. So replicating whatever happens when saving the model that was
>> >> generated in .GlobalEnv at the return() stage of the function call
>> >> would fix this problem.
>> >>
>> >> I was referred to this list from r-bugs. First time r-devel poster.
>> >>
>> >> Hope this helps,
>> >>
>> >> Kendon
>> >>
>> >> ```
>> >> tmp_fun <- function(x){
>> >>   iris_big <- lapply(1:10000, function(x) iris)
>> >>   lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >> }
>> >>
>> >> out <- tmp_fun(1)
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 57196752 - way too big
>> >>
>> >> # Works fine when in .GlobalEnv
>> >> iris_big <- lapply(1:10000, function(x) iris)
>> >> out <- lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >>
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 16641 - good size.
>> >> ```
>> >>
>> >> [[alternative HTML version deleted]]
>> >>
>> >> ______________________________________________
>> >> R-devel at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-devel