[Rd] Shallow copies

Henrik Bengtsson hb at biostat.ucsf.edu
Wed Oct 1 01:55:35 CEST 2014


On Tue, Sep 30, 2014 at 2:20 PM, Matthieu Gomez
<gomez.matthieu at gmail.com> wrote:
> I have a question about shallow copies in R. Since R 3.1.0, subsetting
> a data.frame with respect to its columns no longer results in deep
> copies. This is an amazing change in my opinion. Now, subsetting a
> data.frame by rows (or subsetting a matrix by columns or rows) still
> does deep copies. In particular, it is my understanding that running a
> command on a very large subset of rows (say "sum" or "biglm" on
> non-outlier observations) results in a deep copy of these rows first,
> which can require twice the memory of the original
> data.frame/matrix. If this is correct, I would be very interested to
> know more on whether this behavior can/may change in future versions
> of R.

I'll let the experts comment on this, but subsetting/reshuffling columns
in data frames sounds easy compared with what you're asking for.
Columns of a data frame are basically just elements in a list and they
don't have to be contiguous in memory, whereas the elements in a matrix
(of a basic data type) need to be contiguous in memory.
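
To make the difference concrete, here is a rough sketch (object names
and sizes are made up for illustration). It assumes R >= 3.1.0, and the
tracemem() part only runs on an R build with memory profiling enabled,
hence the capabilities() guard.

```r
# Columns of a data frame are list elements, so a column subset can
# reuse the existing column vectors instead of copying them.
df <- data.frame(a = rnorm(1e5), b = rnorm(1e5), c = rnorm(1e5))
sub <- df[c("a", "c")]

# tracemem() returns the address of an object as a string; comparing
# the two strings tells us whether both data frames point at the same
# column vector.
if (capabilities("profmem")) {
  addr_df  <- tracemem(df$a)
  addr_sub <- tracemem(sub$a)
  untracemem(df$a)
  cat("column 'a' shared:", identical(addr_df, addr_sub), "\n")
}

# A matrix is a single contiguous vector, so a row/column subset has to
# be copied into a newly allocated block of its own.
X <- matrix(rnorm(1000 * 200), nrow = 1000, ncol = 200)
rows <- seq(from = 1, to = nrow(X), by = 2)
cols <- c(1:14, 87:103)
Xs <- X[rows, cols]
print(dim(Xs))
print(object.size(Xs), units = "Kb")
```

The last line reports the memory taken by the temporary submatrix,
which is what gets allocated (and later garbage collected) every time
X[rows, cols] is passed to a function.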

However, somewhat related: Having lots of these temporary copies can
be quite time consuming for the garbage collector, so it's not just
about the memory but also about the overall processing time.  One of
the next improvements in the 'matrixStats' package is to add support
for specifying subsets of rows/columns to operate over - for the
purpose of avoiding the auxiliary copies that you talk about, e.g.

  cols <- c(1:14, 87:103)
  rows <- seq(from=1, to=nrow(X), by=2)
  y <- rowMedians(X, rows=rows, columns=cols)

instead of

  y <- rowMedians(X[rows,cols])

It's a fairly simple task to implement, but I've got lots on my plate
so I don't know when this will happen. (I welcome contributions via
github.com/HenrikBengtsson/matrixStats.) Similar methods in R (e.g.
rowSums()) could gain from this too.
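
As a rough illustration of the idea in plain R (not how matrixStats
does or will do it; a real implementation would index into X at the C
level), the hypothetical helper row_medians_subset() below never
builds the full X[rows, cols] submatrix. It only extracts one short
row slice at a time. The helper name and its 'cols' argument are
made up for this sketch.

```r
# Hypothetical helper, not part of matrixStats: row medians over a
# subset of rows and columns without materializing X[rows, cols].
row_medians_subset <- function(X, rows = seq_len(nrow(X)),
                               cols = seq_len(ncol(X))) {
  # Each iteration extracts a single row restricted to 'cols', so the
  # largest temporary holds length(cols) elements rather than
  # length(rows) * length(cols).
  vapply(rows, function(i) median(X[i, cols]), numeric(1))
}

set.seed(1)
X <- matrix(rnorm(200 * 110), nrow = 200, ncol = 110)
cols <- c(1:14, 87:103)
rows <- seq(from = 1, to = nrow(X), by = 2)

y1 <- row_medians_subset(X, rows = rows, cols = cols)
y2 <- apply(X[rows, cols], 1, median)  # reference: copies the submatrix first
stopifnot(all.equal(y1, y2))
```

This trades speed for memory: the R-level loop is slow, but it shows
that the result does not depend on having the temporary copy around.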

/Henrik
(matrixStats)

PS. Code compilation could translate rowMedians(X[rows,cols]) to
rowMedians(X, rows=rows, columns=cols) but that's far in the future (I
think).

>
> Thanks a lot!,
> Matthew
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


