[R] What is an alternative to expand.grid when creating a long vector?
Avi Gross
avigross at verizon.net
Tue Apr 20 03:46:56 CEST 2021
Just some thoughts on the issue of how to work with giant collections of combinations without making them giant or holding them all in memory at once.
As stupid as this sounds, when things get really big it can mean not only processing your data in smaller amounts but also using techniques other than asking expand.grid to create all possible combinations in advance.
Some languages, like Python, have generators that yield one item at a time and are called until exhausted, which sounds more like your usage: a single function remains resident in memory, and each call uses the retained state to compute and return the next item. That approach may not work well with the way expand.grid works.
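You can approximate a generator in R with a closure, though. A minimal sketch (toy sizes; make_grid_gen is an illustrative name, not an existing function):

make_grid_gen <- function(vals) {
  idx <- rep(1L, length(vals))          # current position in each vector
  done <- FALSE
  function() {
    if (done) return(NULL)              # exhausted
    out <- mapply(function(v, i) v[i], vals, idx)
    for (j in seq_along(idx)) {         # mixed-radix increment:
      idx[j] <<- idx[j] + 1L            # first vector varies fastest,
      if (idx[j] <= length(vals[[j]])) break
      idx[j] <<- 1L                     # roll over and carry
      if (j == length(idx)) done <<- TRUE
    }
    out
  }
}

nxt <- make_grid_gen(list(1:3, c(0.1, 0.2)))
while (!is.null(x <- nxt())) print(x)   # the 6 combinations, one at a time

Like expand.grid, this varies the first vector fastest, so the combinations come out in the same order without the whole grid ever being materialized.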
So a less efficient way would be to write your own deeply nested loop that generates one set of ten or so values each time through the innermost loop, to be used one at a time. Alternatively, such a loop could write one line at a time in something like CSV format; later you can read N lines at a time from the file, or even have multiple programs work in parallel, each taking its own allocation of lines and ignoring the rest, or some other method.
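A sketch of the file-based variant (the file name and the split between outer and inner variables are illustrative): hold a connection open and append one modest block per outer row, so the full grid never exists in memory at once.

con <- file("grid.csv", open = "w")
outer <- expand.grid(a = seq(0.38, 0.42, length.out = 100),
                     b = seq(0.12, 0.18, length.out = 100))
inner <- seq(0.001, 1, length.out = 100)
for (r in seq_len(nrow(outer))) {
  block <- data.frame(a = outer$a[r], b = outer$b[r], c = inner)
  write.table(block, con, sep = ",", row.names = FALSE,
              col.names = (r == 1))     # write the header only once
}
close(con)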
Deeply nested loops in R tend to be slow, as I have found out; that is indeed why I switched to using pmap() on a data.frame made with expand.grid in the first place. But if your needs are exorbitant and you have limited memory, ...
Can you squeeze some memory out of your design? Your data seems highly repetitive. Say you really want to store something like this in a column:
c(seq(0.001, 1, length.out = 100))
The size of that, for comparison, is:
object.size(seq(0.001, 1, length.out = 100))
848 bytes
So it is 8 bytes per number plus some overhead.
Then consider storing something like that another way. First, the c() wrapper above is redundant, albeit harmless. Second, why not store this instead:
1L:100L
object.size(1L:100L)
448 bytes
So, four bytes per number plus some overhead.
That stores integers between 1 and 100; in your case it means you can later rescale each stored code (divide by a suitable constant, or invert the seq() mapping) to get the number you want each time, without storing full double-precision numbers.
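For example, with the first column above (lo and hi are just the endpoints you gave seq()):

codes <- 1:100                               # 4 bytes per value
lo <- 0.001; hi <- 1
x <- lo + (codes - 1) * (hi - lo) / 99       # rebuild the doubles on demand
all.equal(x, seq(lo, hi, length.out = 100))  # TRUE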
And if you use factors, it may take even less space. I note that some of your other columns pick different starting and ending points, but in every case you ask seq() to calculate 100 equally-spaced values. That is fine, but you could instead record a factor with umpteen specific values, stored as either doubles or integers; if expand.grid honors that, the final output would use less space. My experiments (not shown here) suggest you can easily cut sizes in half, and perhaps more with judicious usage.
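A sketch of that kind of comparison (exact sizes vary by R version):

x <- seq(0.001, 1, length.out = 100)
f <- factor(x)                       # values become 4-byte codes plus levels
gd <- expand.grid(a = x, b = x)      # double columns
gf <- expand.grid(a = f, b = f)      # expand.grid keeps factor columns
object.size(gd)                      # roughly twice the size of gf
object.size(gf)
head(x[as.integer(gf$a)])            # recover the doubles when needed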
Perhaps writing the loop in C or C++ would let you step through all possibilities more efficiently, calling a function on each iteration. Depending on your need, that function can do a calculation using local variables and perhaps append a line to an output file, or add another set of values to a vector or other data structure that gets returned at the end of processing.
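One way to do that from R is inline C++ via Rcpp. A sketch (it assumes the Rcpp package and a compiler are available; the running sum of row products stands in for whatever you actually compute per combination):

library(Rcpp)

cppFunction('
double grid_walk(List vals) {
  int k = vals.size();
  std::vector<NumericVector> v;
  for (int j = 0; j < k; ++j) v.push_back(as<NumericVector>(vals[j]));
  std::vector<int> idx(k, 0);           // current position in each vector
  double acc = 0;
  while (true) {
    double prod = 1;                    // stand-in per-row computation
    for (int j = 0; j < k; ++j) prod *= v[j][idx[j]];
    acc += prod;
    int j = 0;                          // mixed-radix increment
    while (j < k && ++idx[j] == (int)v[j].size()) { idx[j] = 0; ++j; }
    if (j == k) break;                  // every position rolled over: done
  }
  return acc;
}')

grid_walk(list(seq(0.001, 1, length.out = 100),
               seq(0.0001, 0.001, length.out = 100)))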
One possibility to consider is an online service, perhaps for a fee, that will run your R program in an environment with more allowed resources, such as memory:
https://rstudio.cloud/
Some of the professional options allow 8 GB of memory and perhaps 4 CPUs. You can, of course, configure your own machine with more memory, or allocate lots more swap space and let your process make heavy use of it.
There are many possible solutions, but also consider whether the sizes and amounts you are working with are realistic. I worked on a project a while ago where I generated a huge number of instances with 500 iterations per instance and was asked to bump that up to 10,000 per instance (20 times as much) just to show that the results were similar and that 500 had been enough. It ran for DAYS, and luckily the rest of the project went back to more manageable numbers.
So, back to your scenario, I wonder if the regularity of your data would allow interesting games to be played. Imagine smaller combinations of, say, 10 coarse levels per variable. For each row of the resulting data.frame, expand again: a row like 2, 3, 4 (using just three variables for illustration) becomes the fine ranges (20:29, 30:39, 40:49), which are handed to expand.grid to make a small, one-use local expansion table. Your original giant problem becomes a modest table in which each row expands to a second modest table that is used, immediately discarded, and replaced by the next. So for ten variables, instead of making 100^10 variations all at once, you might make 10^10 variations, iterate over the rows of that, make another 10^10-row table for each, process each row of that, then remove the table and replace it until done, as in the sketch below. In theory you can apply this in additional stages and cut memory use sharply, albeit perhaps increasing CPU usage substantially.
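A sketch with three variables and 4 coarse x 25 fine levels each (100 levels per variable in total; the names are illustrative):

coarse <- expand.grid(i = 1:4, j = 1:4, k = 1:4)   # only 64 rows
for (r in seq_len(nrow(coarse))) {
  fine <- expand.grid(                             # one-use 25^3-row table
    x = ((coarse$i[r] - 1) * 25 + 1):(coarse$i[r] * 25),
    y = ((coarse$j[r] - 1) * 25 + 1):(coarse$j[r] * 25),
    z = ((coarse$k[r] - 1) * 25 + 1):(coarse$k[r] * 25))
  ## ... rescale the integer codes and process each row of fine ...
  rm(fine)                                         # discard before the next pass
}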
-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Rui Barradas
Sent: Monday, April 19, 2021 12:02 PM
To: Shah Alam <dr.alamsolangi at gmail.com>; r-help mailing list <r-help at r-project.org>
Subject: Re: [R] What is an alternative to expand.grid when creating a long vector?
Hello,
If you want to process the data by rows, then maybe you should consider a custom function that divides the problem into small chunks and processes one chunk at a time.
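One building block for such chunking is a function that decodes a linear row index into the corresponding combination, so each chunk of indices can be handled independently. A sketch (it assumes expand.grid's first-column-varies-fastest order; note that doubles lose exact integer arithmetic above 2^53, so truly huge grids need care with the indices):

vals <- list(seq(0.001, 1, length.out = 100),
             seq(0.0001, 0.001, length.out = 100))
row_at <- function(i, vals) {            # i is 1-based
  sizes <- lengths(vals)
  i <- i - 1
  out <- numeric(length(vals))
  for (j in seq_along(vals)) {           # first variable varies fastest
    out[j] <- vals[[j]][i %% sizes[j] + 1]
    i <- i %/% sizes[j]
  }
  out
}
row_at(101, vals)   # the same values expand.grid would put in row 101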
But even so, at 8 bytes per double, a single column of 100^10 rows is

(100^10 * 8) / (1024^4)  # terabytes
#[1] 727595761

It will take you a very, very long time to process.
Revise the problem?
Hope this helps,
Rui Barradas
On 19/04/21 at 13:35, Shah Alam wrote:
> Dear All,
>
> I would like to know whether there is a problem in the *expand.grid*
> function or whether it is a limitation of this function.
>
> I am trying to create a combination of elements using expand.grid function.
>
> A <- expand.grid(
>   c(seq(0.001, 0.1, length.out = 100)),
>   c(seq(0.0001, 0.001, length.out = 100)),
>   c(seq(0.38, 0.42, length.out = 100)),
>   c(seq(0.12, 0.18, length.out = 100))
> )
>
> Four combinations work fine. However, if I increase the combinations
> up to ten, the following error appears.
>
> A <- expand.grid(
>   c(seq(0.001, 1, length.out = 100)),
>   c(seq(0.0001, 0.001, length.out = 100)),
>   c(seq(0.38, 0.42, length.out = 100)),
>   c(seq(0.12, 0.18, length.out = 100)),
>   c(seq(0.01, 0.04, length.out = 100)),
>   c(seq(0.0001, 0.001, length.out = 100)),
>   c(seq(0.0001, 0.001, length.out = 100)),
>   c(seq(0.001, 0.01, length.out = 100)),
>   c(seq(0.01, 0.3, length.out = 100))
> )
>
> Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
>   invalid 'times' value
>
> After reducing the lengths to 10 or fewer, it produced a different type of error:
>
> A <- expand.grid(
>   c(seq(0.001, 0.005, length.out = 10)),
>   c(seq(0.0001, 0.0005, length.out = 10)),
>   c(seq(0.38, 0.42, length.out = 5)),
>   c(seq(0.12, 0.18, length.out = 7)),
>   c(seq(0.01, 0.04, length.out = 5)),
>   c(seq(0.0001, 0.001, length.out = 10)),
>   c(seq(0.0001, 0.001, length.out = 10)),
>   c(seq(0.001, 0.01, length.out = 10)),
>   c(seq(0.1, 0.8, length.out = 8))
> )
>
> Error: cannot allocate vector of size 1.0 Gb
>
> What is an alternative to expand.grid when creating a long vector
> based on 10 elements?
>
> With kind regards,
> Shah Alam
>
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.