[Bioc-devel] pointer and big matrix in R
Martin Morgan
mtmorgan at fhcrc.org
Sun Apr 14 03:53:26 CEST 2013
On 04/12/2013 12:08 PM, Nicolas Servant wrote:
> Thank you Martin, I will have a look to the SummarizedExperiment class. I'mj
> not sure to really understand what you call "a reference class" ? does it
> mean that there is no easy way to simply creater a pointer between A and B ?
Practically the answer is no, there is no easy way to create a pointer between A
and B.
## Background
R as copy-on-change semantics so actually
x <- seq(1, 10)
y <- x
has associated the symbol `y` with the address pointed to by `x` without any
memory copy (in this sense `y` is a pointer to `x`), but at the same time marked
the memory location at `x` such that if either `x` or `y` is changed, a memory
copy occurs. S4 is implemented in a way that makes it difficult to exploit this
copy-on-change efficiency.
The 'classic' approach to creating reference semantics (where two symbols
actually reference the same data) has been to assign a variable to an
environment, and to pass the environment around. An R environment has
reference-based semantics.
e = new.env(parent=emptyenv())
e[["x"]] = seq(1, 10)
f = e ## changes to e[["x"]] are reflected in f[["x"]] and vice versa
This approach has been made more formal with so-called reference classes,
`?ReferenceClasses`
.R = setRefClass("R", fields=list(x="matrix"))
r = .R$new(x=matrix(1:10, 5))
s = r
r$x[] = log(r$x[]) ## changes to r$x change s$x
The main draw-back to reference classes is that changing `r` surprises the R
user with its 'action-at-a-distance' -- `s` also changes. The user really does
not expect the value of one R variable to change when the value of another R
variable is changed. Also while the `$` accessor and contained methods of
reference classes are more familiar to non-R programmers, they are not
necessarily the way R users are expecting to interact with objects (e.g., the S3
object `m = matrix()` is queried with `dim(m)`, not `m$dim()`).
## Implementing copy-on-change reference classes
A solution is to implement a reference class that nonetheless has copy-on-change
semantics, so that one gets the benefit of pass-by-reference without the
surprise of action-at-a-distance. The implementation can be done so that the
user interacts with R objects in a way that is familiar to R users.
The design pattern is actually quite simple; the following uses the convention
that `.` prefixes functions that are not meant to be called directly by the
user. We start with a function that makes a _shallow_ copy of a reference class,
then updates an arbitrary number of reference class fields.
.clone <-
function(x, ...)
{
args <- list(...)
x <- x$copy()
for (nm in names(args))
x$field(nm, args[[nm]])
x
}
The `x$copy()` creates a new instance of the reference class, with references to
the original fields. The `for` loop replaces the fields that we've provided, but
only in our new copy.
We use `.clone` whenever we want to modify a field of `x`; only the modified
fields are copied.
## Example
Here's a reference class representation of your HTCexp class as simplified in
your email, with three fields.
.HTCexp <- setRefClass("HTCexp",
fields = list(data = "matrix", x_intervals = "GenomicRanges",
y_intervals = "GenomicRanges"))
A relatively recent addition to R is that `setRefClass` (and `setClass`) return
a generator function that can be used to create instances of the class. We use
the generator in a function that is exposed to the user, taking the opportunity
to provide some hints about expected arguments and to simplify object construction
HTCexp <-
function(data = matrix(0, 0, 0), x_intervals = GRanges(),
y_intervals = x_intervals, ...)
{
.HTCexp$new(data=data, x_intervals = x_intervals,
y_intervals = y_intervals, ...)
}
This is enough to create an object
htc0 <- HTCexp()
htc <- HTCexp(matrix(1:25, 5), GRanges("A", IRanges(1:5, width=10)))
We then provide a familiar interface to the class, e.g., with some getters:
setGeneric("htc_data", function(x, ...) standardGeneric("htc_data"))
setGeneric("x_intervals", function(x, ...) standardGeneric("x_intervals"))
setGeneric("y_intervals", function(x, ...) standardGeneric("y_intervals"))
setMethod("htc_data", "HTCexp", function(x, ...) x$data)
setMethod("x_intervals", "HTCexp", function(x, ...) x$x_intervals)
setMethod("y_intervals", "HTCexp", function(x, ...) x$y_intervals)
The user accesses components of `HTCexp` with these S4 methods as
`x_intervals(htc)`, but internally we access the data using reference class
techniques, e.g, `x$x_intervals`.
If we write a method or function that modifies our object, we need to maintain
the illusion of copy-on-change by creating a clone updated with the data that we
modify
setMethod("[", "HTCexp", function(x, i, j, ..., drop=TRUE) {
.clone(x, data=htc_data(x)[i,j, ...], x_intervals=x_intervals(x)[i],
y_intervals=y_intervals(x)[j])
})
Neat, `htc[1:2, 3:4]`. If you think about it, we aren't necessarily reducing the
amount of copying involved -- we modified all fields of the object. In practice,
I believe that we _have_ substantially reduced copying, for complicated reasons.
Let's implement the `Math` group generic (see `?GroupGenericFunctions`), so that
we can call functions like `log` on our `HTCexp` object (I have no idea whether
this is sensible in your example...).
setMethod("Math", "HTCexp", function(x) {
data <- callGeneric(htc_data(x))
.clone(x, data=data)
})
When the user says, e.g., `log(htc)`, we are updating the `data` field of the
object, the GRanges in `x_intervals` and `y_intervals` are not copied.
## Design decisions
The design I offered above made the `HTCexp` class itself a copy-on-change
reference class. In contrast, `SummarizedExperiment` creates a
`ShallowSimpleListAssays` copy-on-change reference class, and uses this as a
slot in the plain S4 `SummarizedExperiment`. There is no technical problem with
this, and the implementation seemed to be effective -- the 'big' data is in the
`Assays` slot, and we've gained significant performance without adding too much
complexity. Your own use case was motivated by identical `GRanges` x- and
y-intervals. Perhaps the solution for you is to construct a `GRangesRef` class
and use these as slots in your HTCexp class?
As an additional note, there are differences between reference and S4 classes.
One thing I've stumbled over is that validity method (`setValidity`) are not
automatically invoked for reference classes; one approach is to write such
methods, and invoke them at appropriate locations (e.g., after object
construction in the `HTCexp` function).
Hope this helps; I'd like to add this as a How To page to the Bioconductor web
site http://bioconductor.org/developers/ so if there are comments, corrections,
etc., please don't hesitate to let me know.
Martin
> nicolas >
> Le 12 avr. 2013 à 19:44, Martin Morgan <mtmorgan at fhcrc.org> a écrit :
>
>> On 04/12/2013 10:02 AM, Servant Nicolas wrote:
>>> Dear all,
>>>
>>> I have a S4 object (HTCexp from HITC package), composed of one big matrix, and two genomicRanges objects, A and B which describe the matrix raws and columns.
>>> I thinking about a way to decrease the memory size of this object.
>>> I also have methods to get/set the matrix and the two GRanges, namely intdata(), x_intervals(), y_intervals().
>>>
>>> In case of symetric matrix, the two GRanges can be the same, so I was interested in simply creating in this case, a pointer from B to A. How can I do it in R please ??
>>> Second, I'm wondering if it exists other matrix-like object optimized for big matrix (5000 x 5000). I quicky saw the Matrix object from the CRAN, useful for sparse matrix.
>>> Any suggestion would be appreciated !
>>
>> This is not a super-big object, so perhaps you're running in to problems with R's propensity to copy data? An easy solution might be to re-use the SummarizedExperiment class, which addresses this issue by placing the 'assays' data in a reference class.
>>
>> library(GenomicRanges)
>>
>> .HTCexp = setClass("HTCexp", contains="SummarizedExperiment",
>> representation=representation(y_intervals="GenomicRanges"))
>>
>> HTCexp <-
>> function(intdata = matrix(0, 0, 0), x_intervals=GRanges(),
>> y_intervals=GRanges(), ...)
>> {
>> .HTCexp(SummarizedExperiment(intdata, rowData=x_intervals),
>> y_intervals=y_intervals, ...)
>> }
>>
>> which already gives
>>
>>> HTCexp()
>> class: HTCexp
>> dim: 0 0
>> exptData(0):
>> assays(1): ''
>> rownames: NULL
>> rowData metadata column names(0):
>> colnames: NULL
>> colData names(0):
>>> m <- matrix(0, 5000, 5000,
>> + dimnames=list(seq_len(5000), seq_len(5000)))
>>> g <- GRanges("A", IRanges(1:5000, width=0))
>>> HTCexp(m, g, g)
>> class: HTCexp
>> dim: 5000 5000
>> exptData(0):
>> assays(1): ''
>> rownames: NULL
>> rowData metadata column names(0):
>> colnames(5000): 1 2 ... 4999 5000
>> colData names(0):
>>
>> I think you'd need to implement "[" and a 'y_intervals' accessors
>>
>>
>> setGeneric("y_intervals", function(x, ...) standardGeneric("y_intervals"))
>>
>> setMethod("y_intervals", "HTCexp", function(x, ...) {
>> x at y_intervals
>> })
>>
>> setMethod("[", "HTCexp", function(x, i, j, ..., drop=TRUE) {
>> ## not sure that this is complete...
>> if (missing(i) && missing(j))
>> x
>> else {
>> se <- as(x, "SummarizedExperiment")
>> if (missing(i))
>> initialize(x, se[,j], y_intervals=y_intervals(x)[j])
>> else if (missing(j))
>> initialize(x, se[i,])
>> else
>> initialize(x, se[i,j], y_intervals=y_intervals(x)[j])
>> }
>> })
>>
>> Martin
>>
>>>
>>> Thank you
>>> Regards
>>> Nicolas
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioc-devel
mailing list