[Bioc-devel] Unexpected behaviour with Assays and Vector classes

Hervé Pagès hpages at fredhutch.org
Mon Nov 16 18:39:30 CET 2015


Hi Aaron,

On 11/15/2015 12:37 PM, Aaron Lun wrote:
> Hi Herve,
>
> I would have expected GRanges behaviour, where the metadata is affected
> by the replacement. For example, with GRanges objects, I often use the
> metadata to store statistics or descriptors relevant to each genomic
> interval, e.g., peak scores, log-fold changes and so on. If I did a
> replacement to change the genomic coordinates, I would expect that the
> associated metadata would come along for the ride. The same argument
> would be applied to other classes derived from Vector.
>
> I should add that I encountered this problem in the context of doing
> subset replacement for metadata. To re-use your IRanges example:
>
>> require(IRanges)
>> ir <- IRanges(11:14, 20, names=letters[1:4])
>> mcols(ir) <- DataFrame(stuff=1:4)
>> mcols(ir[1])$stuff <- 2
>> mcols(ir)$stuff
> [1] 1 2 3 4
>
> I would expect that the first entry of "stuff" would now be 2. But
> because metadata doesn't get replaced during subset replacement, it
> remains at 1. This is problematic, because if I wanted to update a
> subset of entries in the metadata, I would use something similar to the
> above replacement and expect it to change the existing object.

Good point. The same argument could actually be used for the names:

   > x <- c(a=1, b=2, c=3, d=4)
   > names(x[1]) <- "ZZ"
   > x
   a b c d
   1 2 3 4

Anyway, you convinced me that we should not try to follow too closely
the names model for how we treat metadata columns. If nobody objects,
I'll change the behavior of the various "replaceROWS" methods in the
S4Vectors stack and make sure that they all transfer the metadata
columns from 'value' to 'x'.

Cheers,
H.

>
> Cheers,
>
> Aaron
>
> Hervé Pagès wrote:
>> Hi Aaron,
>>
>> On 11/15/2015 10:59 AM, Aaron Lun wrote:
>>> Hello all,
>>>
>>> I've encountered some unexpected behaviour with some of the base classes
>>> while developing stuff for genomic interactions. The first issue lies
>>> with the subset replacement in the Vector class. Let's say I make a
>>> derived class "foo" inheriting from Vector, as below:
>>>
>>>   > require(S4Vectors)
>>>   > setClass("foo", contains="Vector", slots=c(blah="integer"))
>>>   > setMethod("parallelSlotNames", "foo", function(x) {
>>>   >     c("blah", callNextMethod())
>>>   > })
>>> [1] "parallelSlotNames"
>>>   > setMethod("c", "foo", function(x, ..., recursive=TRUE) {
>>>   >    new.blah <- do.call(c, lapply(list(x, ...), FUN=slot,
>>> name="blah"))
>>>   >    new.mcols <- do.call(rbind, lapply(list(x, ...), FUN=mcols))
>>>   >    new("foo", blah=new.blah, metadata=metadata(x),
>>>   >        elementMetadata=new.mcols)
>>>   > })
>>> [1] "c"
>>>
>>> Construction gives what you'd expect:
>>>
>>>   > a <- new("foo", blah=1:5, elementMetadata=DataFrame(stuff=1:5))
>>>   > a at blah
>>> [1] 1 2 3 4 5
>>>   > mcols(a)$stuff
>>> [1] 1 2 3 4 5
>>>
>>> However, if I try to do subset replacement, I get this:
>>>
>>>   > a[1] <- a[2]
>>>   > a at blah
>>> [1] 2 2 3 4 5
>>>   > mcols(a)$stuff
>>> [1] 1 2 3 4 5
>>>
>>> So, "blah" is replaced properly, but "elementMetadata" is not. This is
>>> attributable to a line in "replaceROWS" which preserves the mcols of the
>>> original object during replacement (also for "names"). Should this line
>>> be removed to give expected behaviour for the elementMetadata?
>>
>> For the treatment of the metadata columns we are usually mimicking
>> how names are treated in base R:
>>
>>> x <- c(a=1, b=2, c=3, d=4)
>>> x[1] <- x[2]
>>> x
>> a b c d
>> 2 2 3 4
>>
>> The names are not affected.
>>
>> IRanges objects are following that model:
>>
>>> library(IRanges)
>>> ir <- IRanges(11:14, 20, names=letters[1:4])
>>> mcols(ir) <- DataFrame(stuff=1:4)
>>> ir
>> IRanges of length 4
>>      start end width names
>> [1]    11  20    10     a
>> [2]    12  20     9     b
>> [3]    13  20     8     c
>> [4]    14  20     7     d
>>> mcols(ir)
>> DataFrame with 4 rows and 1 column
>>        stuff
>>    <integer>
>> 1         1
>> 2         2
>> 3         3
>> 4         4
>>> ir[1] <- ir[2]
>>> ir
>> IRanges of length 4
>>      start end width names
>> [1]    12  20     9     a
>> [2]    12  20     9     b
>> [3]    13  20     8     c
>> [4]    14  20     7     d
>>> mcols(ir)
>> DataFrame with 4 rows and 1 column
>>        stuff
>>    <integer>
>> 1         1
>> 2         2
>> 3         3
>> 4         4
>>
>> However it seems that GRanges objects are not:
>>
>>> gr <- GRanges("chr1", ir)
>>> gr
>> GRanges object with 4 ranges and 1 metadata column:
>>      seqnames    ranges strand |     stuff
>>         <Rle> <IRanges>  <Rle> | <integer>
>>    a     chr1  [12, 20]      * |         1
>>    b     chr1  [12, 20]      * |         2
>>    c     chr1  [13, 20]      * |         3
>>    d     chr1  [14, 20]      * |         4
>>    -------
>>    seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>> gr[1] <- gr[2]
>>> gr
>> GRanges object with 4 ranges and 1 metadata column:
>>      seqnames    ranges strand |     stuff
>>         <Rle> <IRanges>  <Rle> | <integer>
>>    a     chr1  [12, 20]      * |         2
>>    b     chr1  [12, 20]      * |         2
>>    c     chr1  [13, 20]      * |         3
>>    d     chr1  [14, 20]      * |         4
>>    -------
>>    seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>
>> So we have an inconsistency within our Vector-based classes.
>> We need to fix that. It seems that you would have expected the
>> metadata columns to be altered by [<-. Is this what most people feel?
>>
>>>
>>> The other issue is that r/cbind'ing doesn't seem to work properly for
>>> unnamed multi-matrix Assays objects. Consider:
>>>
>>>   > require(SummarizedExperiment)
>>>   > whee <- Assays(list(x1=matrix(1, 3, 4), x2=matrix(2, 3, 4)))
>>>   > whee2 <- Assays(list(x1=matrix(3, 3, 4), x2=matrix(4, 3, 4)))
>>>   > rbind(whee, whee2)
>>> Reference class object of class "ShallowSimpleListAssays"
>>> Field "data":
>>> List of length 2
>>> names(2): x1 x2
>>>   >
>>>   > names(whee) <- names(whee2) <- NULL
>>>   > rbind(whee, whee2)
>>> Reference class object of class "ShallowSimpleListAssays"
>>> Field "data":
>>> List of length 1
>>>
>>> So, unnaming and rbind'ing results in the loss of a matrix. This is the
>>> same issue I reported for unnamed multi-matrix assays when rbinding
>>> multiple SummarizedExperiment objects; I recall that being resolved by
>>> r/cbind'ing based on position. Should this be done here as well? If not,
>>> perhaps we should force people to name their assays.
>>
>> Maybe Martin or Val can chime in for this one.
>>
>> Thanks,
>> H.
>>
>>>
>>> Cheers,
>>>
>>> Aaron
>>>
>>>   > sessionInfo()
>>> R Under development (unstable) (2015-10-30 r69588)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>> Running under: CentOS release 6.4 (Final)
>>>
>>> locale:
>>>    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>    [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel  stats4    stats     graphics  grDevices utils     datasets
>>> [8] methods   base
>>>
>>> other attached packages:
>>> [1] SummarizedExperiment_1.1.2 Biobase_2.31.0
>>> [3] GenomicRanges_1.23.3       GenomeInfoDb_1.7.3
>>> [5] IRanges_2.5.5              S4Vectors_0.9.8
>>> [7] BiocGenerics_0.17.1
>>>
>>> loaded via a namespace (and not attached):
>>> [1] zlibbioc_1.17.0 XVector_0.11.0
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list