[Bioc-devel] Unexpected behaviour with Assays and Vector classes

Hervé Pagès hpages at fredhutch.org
Tue Nov 17 08:39:50 CET 2015


Hi Aaron,

On 11/16/2015 09:39 AM, Hervé Pagès wrote:
> Hi Aaron,
>
> On 11/15/2015 12:37 PM, Aaron Lun wrote:
>> Hi Herve,
>>
>> I would have expected GRanges behaviour, where the metadata is affected
>> by the replacement. For example, with GRanges objects, I often use the
>> metadata to store statistics or descriptors relevant to each genomic
>> interval, e.g., peak scores, log-fold changes and so on. If I did a
>> replacement to change the genomic coordinates, I would expect that the
>> associated metadata would come along for the ride. The same argument
>> would be applied to other classes derived from Vector.
>>
>> I should add that I encountered this problem in the context of doing
>> subset replacement for metadata. To re-use your IRanges example:
>>
>>> require(IRanges)
>>> ir <- IRanges(11:14, 20, names=letters[1:4])
>>> mcols(ir) <- DataFrame(stuff=1:4)
>>> mcols(ir[1])$stuff <- 2
>>> mcols(ir)$stuff
>> [1] 1 2 3 4
>>
>> I would expect that the first entry of "stuff" would now be 2. But
>> because metadata doesn't get replaced during subset replacement, it
>> remains at 1. This is problematic, because if I wanted to update a
>> subset of entries in the metadata, I would use something similar to the
>> above replacement and expect it to change the existing object.
>
> Good point. The same argument could actually be used for the names:
>
>    > x <- c(a=1, b=2, c=3, d=4)
>    > names(x[1]) <- "ZZ"
>    > x
>    a b c d
>    1 2 3 4
>
> Anyway, you convinced me that we should not try to follow too closely
> the names model for how we treat metadata columns. If nobody objects,
> I'll change the behavior of the various "replaceROWS" methods in the
> S4Vectors stack and make sure that they all transfer the metadata
> columns from 'value' to 'x'.

This is done in S4Vectors 0.9.9 and IRanges 2.5.6 in devel, and in
S4Vectors 0.8.3 and IRanges 2.4.2 in release.

Cheers,
H.

>
> Cheers,
> H.
>
>>
>> Cheers,
>>
>> Aaron
>>
>> Hervé Pagès wrote:
>>> Hi Aaron,
>>>
>>> On 11/15/2015 10:59 AM, Aaron Lun wrote:
>>>> Hello all,
>>>>
>>>> I've encountered some unexpected behaviour with some of the base
>>>> classes
>>>> while developing stuff for genomic interactions. The first issue lies
>>>> with the subset replacement in the Vector class. Let's say I make a
>>>> derived class "foo" inheriting from Vector, as below:
>>>>
>>>>   > require(S4Vectors)
>>>>   > setClass("foo", contains="Vector", slots=c(blah="integer"))
>>>>   > setMethod("parallelSlotNames", "foo", function(x) {
>>>>   >     c("blah", callNextMethod())
>>>>   > })
>>>> [1] "parallelSlotNames"
>>>>   > setMethod("c", "foo", function(x, ..., recursive=TRUE) {
>>>>   >    new.blah <- do.call(c, lapply(list(x, ...), FUN=slot,
>>>> name="blah"))
>>>>   >    new.mcols <- do.call(rbind, lapply(list(x, ...), FUN=mcols))
>>>>   >    new("foo", blah=new.blah, metadata=metadata(x),
>>>>   >        elementMetadata=new.mcols)
>>>>   > })
>>>> [1] "c"
>>>>
>>>> Construction gives what you'd expect:
>>>>
>>>>   > a <- new("foo", blah=1:5, elementMetadata=DataFrame(stuff=1:5))
>>>>   > a at blah
>>>> [1] 1 2 3 4 5
>>>>   > mcols(a)$stuff
>>>> [1] 1 2 3 4 5
>>>>
>>>> However, if I try to do subset replacement, I get this:
>>>>
>>>>   > a[1] <- a[2]
>>>>   > a at blah
>>>> [1] 2 2 3 4 5
>>>>   > mcols(a)$stuff
>>>> [1] 1 2 3 4 5
>>>>
>>>> So, "blah" is replaced properly, but "elementMetadata" is not. This is
>>>> attributable to a line in "replaceROWS" which preserves the mcols of
>>>> the
>>>> original object during replacement (also for "names"). Should this line
>>>> be removed to give expected behaviour for the elementMetadata?
>>>
>>> For the treatment of the metadata columns we are usually mimicking
>>> how names are treated in base R:
>>>
>>>> x <- c(a=1, b=2, c=3, d=4)
>>>> x[1] <- x[2]
>>>> x
>>> a b c d
>>> 2 2 3 4
>>>
>>> The names are not affected.
>>>
>>> IRanges objects are following that model:
>>>
>>>> library(IRanges)
>>>> ir <- IRanges(11:14, 20, names=letters[1:4])
>>>> mcols(ir) <- DataFrame(stuff=1:4)
>>>> ir
>>> IRanges of length 4
>>>      start end width names
>>> [1]    11  20    10     a
>>> [2]    12  20     9     b
>>> [3]    13  20     8     c
>>> [4]    14  20     7     d
>>>> mcols(ir)
>>> DataFrame with 4 rows and 1 column
>>>        stuff
>>>    <integer>
>>> 1         1
>>> 2         2
>>> 3         3
>>> 4         4
>>>> ir[1] <- ir[2]
>>>> ir
>>> IRanges of length 4
>>>      start end width names
>>> [1]    12  20     9     a
>>> [2]    12  20     9     b
>>> [3]    13  20     8     c
>>> [4]    14  20     7     d
>>>> mcols(ir)
>>> DataFrame with 4 rows and 1 column
>>>        stuff
>>>    <integer>
>>> 1         1
>>> 2         2
>>> 3         3
>>> 4         4
>>>
>>> However it seems that GRanges objects are not:
>>>
>>>> gr <- GRanges("chr1", ir)
>>>> gr
>>> GRanges object with 4 ranges and 1 metadata column:
>>>      seqnames    ranges strand |     stuff
>>>         <Rle> <IRanges>  <Rle> | <integer>
>>>    a     chr1  [12, 20]      * |         1
>>>    b     chr1  [12, 20]      * |         2
>>>    c     chr1  [13, 20]      * |         3
>>>    d     chr1  [14, 20]      * |         4
>>>    -------
>>>    seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>> gr[1] <- gr[2]
>>>> gr
>>> GRanges object with 4 ranges and 1 metadata column:
>>>      seqnames    ranges strand |     stuff
>>>         <Rle> <IRanges>  <Rle> | <integer>
>>>    a     chr1  [12, 20]      * |         2
>>>    b     chr1  [12, 20]      * |         2
>>>    c     chr1  [13, 20]      * |         3
>>>    d     chr1  [14, 20]      * |         4
>>>    -------
>>>    seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>
>>> So we have an inconsistency within our Vector-based classes.
>>> We need to fix that. It seems that you would have expected the
>>> metadata columns to be altered by [<-. Is this what most people feel?
>>>
>>>>
>>>> The other issue is that r/cbind'ing doesn't seem to work properly for
>>>> unnamed multi-matrix Assays objects. Consider:
>>>>
>>>>   > require(SummarizedExperiment)
>>>>   > whee <- Assays(list(x1=matrix(1, 3, 4), x2=matrix(2, 3, 4)))
>>>>   > whee2 <- Assays(list(x1=matrix(3, 3, 4), x2=matrix(4, 3, 4)))
>>>>   > rbind(whee, whee2)
>>>> Reference class object of class "ShallowSimpleListAssays"
>>>> Field "data":
>>>> List of length 2
>>>> names(2): x1 x2
>>>>   >
>>>>   > names(whee) <- names(whee2) <- NULL
>>>>   > rbind(whee, whee2)
>>>> Reference class object of class "ShallowSimpleListAssays"
>>>> Field "data":
>>>> List of length 1
>>>>
>>>> So, unnaming and rbind'ing results in the loss of a matrix. This is the
>>>> same issue I reported for unnamed multi-matrix assays when rbinding
>>>> multiple SummarizedExperiment objects; I recall that being resolved by
>>>> r/cbind'ing based on position. Should this be done here as well? If
>>>> not,
>>>> perhaps we should force people to name their assays.
>>>
>>> Maybe Martin or Val can chime in for this one.
>>>
>>> Thanks,
>>> H.
>>>
>>>>
>>>> Cheers,
>>>>
>>>> Aaron
>>>>
>>>>   > sessionInfo()
>>>> R Under development (unstable) (2015-10-30 r69588)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>> Running under: CentOS release 6.4 (Final)
>>>>
>>>> locale:
>>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats4    stats     graphics  grDevices utils
>>>> datasets
>>>> [8] methods   base
>>>>
>>>> other attached packages:
>>>> [1] SummarizedExperiment_1.1.2 Biobase_2.31.0
>>>> [3] GenomicRanges_1.23.3       GenomeInfoDb_1.7.3
>>>> [5] IRanges_2.5.5              S4Vectors_0.9.8
>>>> [7] BiocGenerics_0.17.1
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] zlibbioc_1.17.0 XVector_0.11.0
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list