[Bioc-devel] pointer and big matrix in R

Martin Morgan mtmorgan at fhcrc.org
Sun Apr 14 03:53:26 CEST 2013


On 04/12/2013 12:08 PM, Nicolas Servant wrote:
> Thank you Martin, I will have a look to the SummarizedExperiment class. I'mj
> not sure to really understand what you call "a reference class" ? does it
> mean that there is no easy way to simply creater a pointer between A and B ?

Practically the answer is no, there is no easy way to create a pointer between A 
and B.

## Background

R as copy-on-change semantics so actually

     x <- seq(1, 10)
     y <- x

has associated the symbol `y` with the address pointed to by `x` without any 
memory copy (in this sense `y` is a pointer to `x`), but at the same time marked 
the memory location at `x` such that if either `x` or `y` is changed, a memory 
copy occurs. S4 is implemented in a way that makes it difficult to exploit this 
copy-on-change efficiency.

The 'classic' approach to creating reference semantics (where two symbols 
actually reference the same data) has been to assign a variable to an 
environment, and to pass the environment around. An R environment has 
reference-based semantics.

     e = new.env(parent=emptyenv())
     e[["x"]] = seq(1, 10)
     f = e ## changes to e[["x"]] are reflected in f[["x"]] and vice versa

This approach has been made more formal with so-called reference classes, 
`?ReferenceClasses`

    .R = setRefClass("R", fields=list(x="matrix"))
     r = .R$new(x=matrix(1:10, 5))
     s = r
     r$x[] = log(r$x[])  ## changes to r$x change s$x

The main draw-back to reference classes is that changing `r` surprises the R 
user with its 'action-at-a-distance' -- `s` also changes. The user really does 
not expect the value of one R variable to change when the value of another R 
variable is changed. Also while the `$` accessor and contained methods of 
reference classes are more familiar to non-R programmers, they are not 
necessarily the way R users are expecting to interact with objects (e.g., the S3 
object `m = matrix()` is queried with `dim(m)`, not `m$dim()`).

## Implementing copy-on-change reference classes

A solution is to implement a reference class that nonetheless has copy-on-change 
semantics, so that one gets the benefit of pass-by-reference without the 
surprise of action-at-a-distance. The implementation can be done so that the 
user interacts with R objects in a way that is familiar to R users.

The design pattern is actually quite simple; the following uses the convention 
that `.` prefixes functions that are not meant to be called directly by the 
user. We start with a function that makes a _shallow_ copy of a reference class, 
then updates an arbitrary number of reference class fields.

     .clone <-
         function(x, ...)
     {
         args <- list(...)
         x <- x$copy()
         for (nm in names(args))
             x$field(nm, args[[nm]])
         x
     }

The `x$copy()` creates a new instance of the reference class, with references to 
the original fields. The `for` loop replaces the fields that we've provided, but 
only in our new copy.

We use `.clone` whenever we want to modify a field of `x`; only the modified 
fields are copied.

## Example

Here's a reference class representation of your HTCexp class as simplified in 
your email, with three fields.

     .HTCexp <- setRefClass("HTCexp",
         fields = list(data = "matrix", x_intervals = "GenomicRanges",
           y_intervals = "GenomicRanges"))

A relatively recent addition to R is that `setRefClass` (and `setClass`) return 
a generator function that can be used to create instances of the class. We use 
the generator in a function that is exposed to the user, taking the opportunity 
to provide some hints about expected arguments and to simplify object construction

     HTCexp <-
         function(data = matrix(0, 0, 0), x_intervals = GRanges(),
                  y_intervals = x_intervals, ...)
     {
         .HTCexp$new(data=data, x_intervals = x_intervals,
                     y_intervals = y_intervals, ...)
     }

This is enough to create an object

     htc0 <- HTCexp()
     htc <- HTCexp(matrix(1:25, 5), GRanges("A", IRanges(1:5, width=10)))

We then provide a familiar interface to the class, e.g., with some getters:

     setGeneric("htc_data", function(x, ...) standardGeneric("htc_data"))
     setGeneric("x_intervals", function(x, ...) standardGeneric("x_intervals"))
     setGeneric("y_intervals", function(x, ...) standardGeneric("y_intervals"))

     setMethod("htc_data", "HTCexp", function(x, ...) x$data)
     setMethod("x_intervals", "HTCexp", function(x, ...) x$x_intervals)
     setMethod("y_intervals", "HTCexp", function(x, ...) x$y_intervals)

The user accesses components of `HTCexp` with these S4 methods as 
`x_intervals(htc)`, but internally we access the data using reference class 
techniques, e.g, `x$x_intervals`.

If we write a method or function that modifies our object, we need to maintain 
the illusion of copy-on-change by creating a clone updated with the data that we 
modify

     setMethod("[", "HTCexp", function(x, i, j, ..., drop=TRUE) {
         .clone(x, data=htc_data(x)[i,j, ...], x_intervals=x_intervals(x)[i],
               y_intervals=y_intervals(x)[j])
     })

Neat, `htc[1:2, 3:4]`. If you think about it, we aren't necessarily reducing the 
amount of copying involved -- we modified all fields of the object. In practice, 
I believe that we _have_ substantially reduced copying, for complicated reasons.

Let's implement the `Math` group generic (see `?GroupGenericFunctions`), so that 
we can call functions like `log` on our `HTCexp` object (I have no idea whether 
this is sensible in your example...).

     setMethod("Math", "HTCexp", function(x) {
         data <- callGeneric(htc_data(x))
         .clone(x, data=data)
     })

When the user says, e.g., `log(htc)`, we are updating the `data` field of the 
object, the GRanges in `x_intervals` and `y_intervals` are not copied.

## Design decisions

The design I offered above made the `HTCexp` class itself a copy-on-change 
reference class. In contrast, `SummarizedExperiment` creates a 
`ShallowSimpleListAssays` copy-on-change reference class, and uses this as a 
slot in the plain S4 `SummarizedExperiment`. There is no technical problem with 
this, and the implementation seemed to be effective -- the 'big' data is in the 
`Assays` slot, and we've gained significant performance without adding too much 
complexity. Your own use case was motivated by identical `GRanges` x- and 
y-intervals. Perhaps the solution for you is to construct a `GRangesRef` class 
and use these as slots in your HTCexp class?

As an additional note, there are differences between reference and S4 classes. 
One thing I've stumbled over is that validity method (`setValidity`) are not 
automatically invoked for reference classes; one approach is to write such 
methods, and invoke them at appropriate locations (e.g., after object 
construction in the `HTCexp` function).


Hope this helps; I'd like to add this as a How To page to the Bioconductor web 
site http://bioconductor.org/developers/ so if there are comments, corrections, 
etc., please don't hesitate to let me know.

Martin

> nicolas >
> Le 12 avr. 2013 à 19:44, Martin Morgan <mtmorgan at fhcrc.org> a écrit :
>
>> On 04/12/2013 10:02 AM, Servant Nicolas wrote:
>>> Dear all,
>>>
>>> I have a S4 object (HTCexp from HITC package), composed of one big matrix, and two genomicRanges objects, A and B which describe the matrix raws and columns.
>>> I thinking about a way to decrease the memory size of this object.
>>> I also have methods to  get/set the matrix and the two GRanges, namely intdata(), x_intervals(), y_intervals().
>>>
>>> In case of symetric matrix, the two GRanges can be the same, so I was interested in simply creating in this case, a pointer from B to A. How can I do it in R please ??
>>> Second, I'm wondering if it exists other matrix-like object optimized for big matrix (5000 x 5000). I quicky saw the Matrix object from the CRAN, useful for sparse matrix.
>>> Any suggestion would be appreciated !
>>
>> This is not a super-big object, so perhaps you're running in to problems with R's propensity to copy data? An easy solution might be to re-use the SummarizedExperiment class, which addresses this issue by placing the 'assays' data in a reference class.
>>
>>     library(GenomicRanges)
>>
>>     .HTCexp = setClass("HTCexp", contains="SummarizedExperiment",
>>       representation=representation(y_intervals="GenomicRanges"))
>>
>>     HTCexp <-
>>         function(intdata = matrix(0, 0, 0), x_intervals=GRanges(),
>>                  y_intervals=GRanges(), ...)
>>     {
>>         .HTCexp(SummarizedExperiment(intdata, rowData=x_intervals),
>>                 y_intervals=y_intervals, ...)
>>     }
>>
>> which already gives
>>
>>> HTCexp()
>> class: HTCexp
>> dim: 0 0
>> exptData(0):
>> assays(1): ''
>> rownames: NULL
>> rowData metadata column names(0):
>> colnames: NULL
>> colData names(0):
>>> m <- matrix(0, 5000, 5000,
>> +             dimnames=list(seq_len(5000), seq_len(5000)))
>>> g <- GRanges("A", IRanges(1:5000, width=0))
>>> HTCexp(m, g, g)
>> class: HTCexp
>> dim: 5000 5000
>> exptData(0):
>> assays(1): ''
>> rownames: NULL
>> rowData metadata column names(0):
>> colnames(5000): 1 2 ... 4999 5000
>> colData names(0):
>>
>> I think you'd need to implement "[" and a 'y_intervals' accessors
>>
>>
>>     setGeneric("y_intervals", function(x, ...) standardGeneric("y_intervals"))
>>
>>     setMethod("y_intervals", "HTCexp", function(x, ...) {
>>         x at y_intervals
>>     })
>>
>>     setMethod("[", "HTCexp", function(x, i, j, ..., drop=TRUE) {
>>         ## not sure that this is complete...
>>         if (missing(i) && missing(j))
>>             x
>>         else {
>>             se <- as(x, "SummarizedExperiment")
>>             if (missing(i))
>>                 initialize(x, se[,j], y_intervals=y_intervals(x)[j])
>>             else if (missing(j))
>>                 initialize(x, se[i,])
>>             else
>>                 initialize(x, se[i,j], y_intervals=y_intervals(x)[j])
>>         }
>>     })
>>
>> Martin
>>
>>>
>>> Thank you
>>> Regards
>>> Nicolas
>>>
>>>     [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-devel mailing list