[R] Large data and space use

Avi Gross avigross at verizon.net
Mon Nov 29 07:49:06 CET 2021


Richard,

 

I currently have no problem with running out of memory. I was referring to people who have said they use LARGE structures, and I am pointing out how those structures can temporarily get far larger even when that is not expected. Functions whose memory use temporarily balloons might come with a note saying so. And, yes, some transformations may well be doable outside R or in chunks. What gets me is how often users have no idea what happens when they invoke a package.

 

I am not against transformations and needed duplications. I am more interested in whether some existing code might be evaluated and updated in fairly harmless ways, such as removing intermediate objects as soon as they are definitely not needed. Of course there are tradeoffs. I have seen cases where only one column of a data.frame needed changing, yet the entire data.frame was copied and then returned. That is OK, but clearly it can be more economical to change just the single column in place. People often use a sledgehammer when a thumbtack will do.
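To make that concrete, here is a minimal sketch; the data.frame, the helper convert_all() and the column names are invented for illustration:

## Hypothetical data; only the "when" column actually needs converting.
df <- data.frame(id = 1:3,
                 when = c("2021-01-01", "2021-06-15", "2021-11-28"))

## Sledgehammer: hand the whole data.frame to a helper that rebuilds and
## returns the full object just to convert one column.
convert_all <- function(d) {
  d$when <- as.Date(d$when)
  d
}
df2 <- convert_all(df)

## Thumbtack: touch only the column that needs changing; the other
## columns keep sharing their existing storage.
df$when <- as.Date(df$when)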

 

But as noted, R has features that often delay copying, so a full copy is not made and less memory is ever used. Yet people seem to think that since all “local” memory is generally returned when the function ends, there is no point in micromanaging it while the function runs.
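Base R's tracemem() makes that delayed copying visible (on builds with memory profiling enabled, which the usual CRAN binaries are); a small sketch to run interactively:

x <- runif(1e6)
y <- x           # no copy yet: x and y share the same memory block
tracemem(x)      # ask R to report when that block is actually duplicated
y[1] <- 0        # modifying y forces the copy; tracemem() prints a message
untracemem(x)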

 

Arguably, some R packages change what is kept and for how long. Standard R lets you specify which rows and which columns of a data.frame to keep in a single expression, as in df[rows, columns], while something like dplyr offers multiple smaller steps in a grammar of sorts, so you do a select() followed (often in a pipeline) by a filter(), or the two in the opposite order. Because programmers tend to write each change as its own minimal step, with each verb doing just one thing well, a more efficient combined implementation is harder to achieve. That may also be a plus, especially if objects produced mid-pipeline are released as the pipeline progresses rather than all at the end.
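A small comparison of the two styles, using the built-in mtcars data only because it is handy:

library(dplyr)

## Base R: rows and columns chosen in a single expression.
small_base <- mtcars[mtcars$cyl == 6, c("mpg", "cyl", "wt")]

## dplyr: the same result in two smaller steps; each verb materialises an
## intermediate data frame that exists at least until the next step runs.
small_dplyr <- mtcars %>%
  filter(cyl == 6) %>%
  select(mpg, cyl, wt)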

 

From: Richard O'Keefe <raoknz using gmail.com> 
Sent: Sunday, November 28, 2021 3:54 AM
To: Avi Gross <avigross using verizon.net>
Cc: R-help Mailing List <r-help using r-project.org>
Subject: Re: [R] Large data and space use

 

If you have enough data that running out of memory is a serious problem, then a language like R or Python or Octave or Matlab that offers you NO control over storage may not be the best choice.  You might need to consider Julia or even Rust.

 

However, if you have enough data that running out of memory is a serious problem, your problems may be worse than you think.  In 2021, Linux is *still* having OOM Killer problems.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
Your process hogging memory may cause some other process to be killed. Even if that doesn't happen, your process may be simply thrown off the machine without being warned.

 

It may be one of the biggest problems around in statistical computing: how to make it straightforward to carve up a problem so that it can be run on many machines.  R has the 'Rmpi' and 'snow' packages, amongst others.
https://CRAN.R-project.org/view=HighPerformanceComputing
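On a single machine, the 'parallel' package that ships with R absorbed the snow interface; a minimal sketch of spreading work over local worker processes (the toy task is invented):

library(parallel)

cl <- makeCluster(4)            # four local workers, snow-style API
res <- parLapply(cl, 1:8, function(i) {
  sum(rnorm(1e5))               # stand-in for one chunk of real work
})
stopCluster(cl)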

 

Another approach is to select and transform data outside R.  If you have data in some kind of data base then doing select and transform in the data base may be a good approach.
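For instance, with DBI the filtering and aggregation can be written in SQL so that only the reduced result ever enters R; the SQLite table and column names below are invented for illustration:

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "measurements",
             data.frame(site  = c("a", "b", "a", "b"),
                        value = c(1.2, 3.4, 5.6, 7.8)))

## The data base does the select/transform; R only sees the small summary.
small <- dbGetQuery(con, "
  SELECT site, AVG(value) AS mean_value
    FROM measurements
   GROUP BY site")

dbDisconnect(con)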

 

 

On Sun, 28 Nov 2021 at 06:57, Avi Gross via R-help <r-help using r-project.org> wrote:

Several recent questions and answers have made me look at some code and I
realized that some functions may not be great to use when you are dealing
with very large amounts of data that may already be getting close to the limits
of your memory. Does the function you call to do one thing to your object
perhaps overdo it, making multiple copies and not deleting them as soon as
they are not needed?



An example was a recent post suggesting a nice set of tools you can use to
convert your data.frame so the columns are integers or dates no matter how
they were read in from a CSV file or created.
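Not the tools from that post, just a base-R sketch of the general idea, with invented data:

## Everything arrives as character, as often happens after read.csv().
raw <- data.frame(n    = c("1", "2", "3"),
                  when = c("2021-01-01", "2021-06-15", "2021-11-28"),
                  stringsAsFactors = FALSE)

## Re-type each column in place; the numeric-looking column becomes integer.
raw[] <- lapply(raw, type.convert, as.is = TRUE)

## Dates still need an explicit conversion of just that column.
raw$when <- as.Date(raw$when)
str(raw)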



What I noticed is that often copies of a sort were made by converting the
original to, say, one date format or another and then deciding which, if
any, to keep. Sometimes multiple transformations are tried, and this may be
done repeatedly with intermediates left lying around. Yes, the memory will
all be implicitly returned when the function completes. But often these
functions invoke yet other functions which work on their own copies. You can
end up with your original data temporarily using multiple times as much
actual memory.



R does have features so some things are "shared" unless one copy or another
changes. But in the cases I am looking at, changes are the whole idea.



What I wonder is whether such functions should explicitly call rm() or the
equivalent as soon as something is no longer needed.
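A hedged sketch of what that might look like inside a hypothetical date-cleaning function (the function and its column name are invented):

clean_dates <- function(d) {
  as_ymd <- as.Date(d$when, format = "%Y-%m-%d")   # first parsing attempt
  as_mdy <- as.Date(d$when, format = "%m/%d/%Y")   # second parsing attempt

  picked <- as_ymd
  picked[is.na(picked)] <- as_mdy[is.na(picked)]   # keep whichever parsed
  d$when <- picked

  rm(as_ymd, as_mdy, picked)   # release the trial vectors now, not at exit
  d
}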



The various kinds of pipelines are another case in point, as they involve all
kinds of hidden temporary variables that eventually need to be cleaned up.
When are they removed? I have seen pipelines with 10 or more steps as
perhaps data is read in, has rows or columns removed or re-ordered,
grouping applied, is merged with other data and reports generated. The
intermediates are often of similar size to the data and, if large, can add
up. If writing the code linearly, using temp1 and temp2 style variables to
hold the output of one stage and the input of the next stage, I would be
tempted to add rm(temp1) as soon as it was finished being used, or just
reuse the name temp1 so the previous contents are no longer referenced
and can be taken by the garbage collector at some point.
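In code, that linear style might look roughly like this, with an invented file name, columns and steps:

temp1 <- read.csv("big_file.csv")                 # stage 1: read
temp2 <- subset(temp1, status == "keep")          # stage 2: drop rows
rm(temp1)                                         # temp1 can be collected now

temp2 <- transform(temp2, when = as.Date(when))   # reuse the name: the old
                                                  # temp2 is unreferenced and
                                                  # eligible for collection

result <- aggregate(value ~ group, data = temp2, FUN = sum)
rm(temp2)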



So I wonder if some functions should have a note in their manual pages
specifying what may happen to the volume of data as they run. An example
would be a function that takes a matrix and simply squares it using
matrix multiplication. There are various ways to do this, and one of them
simply makes a copy and invokes the built-in R operator that multiplies two
matrices, then returns the result. So you end up storing basically three
times the size of the matrix right before you return it. Other methods
might do the actual multiplication in loops operating on subsections of the
matrix and, if done carefully, never keep more than, say, 2.1 times as much
data around.
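A toy version of that, where the extra copy in square_wasteful() is deliberate just to show the roughly three-fold peak:

square_wasteful <- function(m) {
  copy <- m * 1      # the arithmetic forces a genuine second copy of m
  result <- copy %*% copy
  result             # just before returning, m, copy and result all exist
}

square_leaner <- function(m) {
  m %*% m            # peak is roughly the input plus the result
}

m <- matrix(rnorm(2000 * 2000), nrow = 2000)   # about 30 MB of doubles
s <- square_leaner(m)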



Or is this not important often enough? All I know is that data may be getting
larger much faster than the memory in our machines.





