[Bioc-devel] Unexpected behaviour with Assays and Vector classes

Aaron Lun alun at wehi.edu.au
Sun Nov 15 21:37:07 CET 2015


Hi Herve,

I would have expected GRanges behaviour, where the metadata is affected
by the replacement. For example, with GRanges objects, I often use the
metadata to store statistics or descriptors relevant to each genomic
interval, e.g., peak scores, log-fold changes and so on. If I did a
replacement to change the genomic coordinates, I would expect that the
associated metadata would come along for the ride. The same argument
would be applied to other classes derived from Vector.

I should add that I encountered this problem in the context of doing
subset replacement for metadata. To re-use your IRanges example:

> require(IRanges)
> ir <- IRanges(11:14, 20, names=letters[1:4])
> mcols(ir) <- DataFrame(stuff=1:4)
> mcols(ir[1])$stuff <- 2
> mcols(ir)$stuff
[1] 1 2 3 4

I would expect that the first entry of "stuff" would now be 2. But
because metadata doesn't get replaced during subset replacement, it
remains at 1. This is problematic, because if I wanted to update a
subset of entries in the metadata, I would use something similar to the
above replacement and expect it to change the existing object.

Cheers,

Aaron

Hervé Pagès wrote:
> Hi Aaron,
>
> On 11/15/2015 10:59 AM, Aaron Lun wrote:
>> Hello all,
>>
>> I've encountered some unexpected behaviour with some of the base classes
>> while developing stuff for genomic interactions. The first issue lies
>> with the subset replacement in the Vector class. Let's say I make a
>> derived class "foo" inheriting from Vector, as below:
>>
>>  > require(S4Vectors)
>>  > setClass("foo", contains="Vector", slots=c(blah="integer"))
>>  > setMethod("parallelSlotNames", "foo", function(x) {
>>  >     c("blah", callNextMethod())
>>  > })
>> [1] "parallelSlotNames"
>>  > setMethod("c", "foo", function(x, ..., recursive=TRUE) {
>>  >    new.blah <- do.call(c, lapply(list(x, ...), FUN=slot,
>> name="blah"))
>>  >    new.mcols <- do.call(rbind, lapply(list(x, ...), FUN=mcols))
>>  >    new("foo", blah=new.blah, metadata=metadata(x),
>>  >        elementMetadata=new.mcols)
>>  > })
>> [1] "c"
>>
>> Construction gives what you'd expect:
>>
>>  > a <- new("foo", blah=1:5, elementMetadata=DataFrame(stuff=1:5))
>>  > a at blah
>> [1] 1 2 3 4 5
>>  > mcols(a)$stuff
>> [1] 1 2 3 4 5
>>
>> However, if I try to do subset replacement, I get this:
>>
>>  > a[1] <- a[2]
>>  > a at blah
>> [1] 2 2 3 4 5
>>  > mcols(a)$stuff
>> [1] 1 2 3 4 5
>>
>> So, "blah" is replaced properly, but "elementMetadata" is not. This is
>> attributable to a line in "replaceROWS" which preserves the mcols of the
>> original object during replacement (also for "names"). Should this line
>> be removed to give expected behaviour for the elementMetadata?
>
> For the treatment of the metadata columns we are usually mimicking
> how names are treated in base R:
>
> > x <- c(a=1, b=2, c=3, d=4)
> > x[1] <- x[2]
> > x
> a b c d
> 2 2 3 4
>
> The names are not affected.
>
> IRanges objects are following that model:
>
> > library(IRanges)
> > ir <- IRanges(11:14, 20, names=letters[1:4])
> > mcols(ir) <- DataFrame(stuff=1:4)
> > ir
> IRanges of length 4
>     start end width names
> [1]    11  20    10     a
> [2]    12  20     9     b
> [3]    13  20     8     c
> [4]    14  20     7     d
> > mcols(ir)
> DataFrame with 4 rows and 1 column
>       stuff
>   <integer>
> 1         1
> 2         2
> 3         3
> 4         4
> > ir[1] <- ir[2]
> > ir
> IRanges of length 4
>     start end width names
> [1]    12  20     9     a
> [2]    12  20     9     b
> [3]    13  20     8     c
> [4]    14  20     7     d
> > mcols(ir)
> DataFrame with 4 rows and 1 column
>       stuff
>   <integer>
> 1         1
> 2         2
> 3         3
> 4         4
>
> However it seems that GRanges objects are not:
>
> > gr <- GRanges("chr1", ir)
> > gr
> GRanges object with 4 ranges and 1 metadata column:
>     seqnames    ranges strand |     stuff
>        <Rle> <IRanges>  <Rle> | <integer>
>   a     chr1  [12, 20]      * |         1
>   b     chr1  [12, 20]      * |         2
>   c     chr1  [13, 20]      * |         3
>   d     chr1  [14, 20]      * |         4
>   -------
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
> > gr[1] <- gr[2]
> > gr
> GRanges object with 4 ranges and 1 metadata column:
>     seqnames    ranges strand |     stuff
>        <Rle> <IRanges>  <Rle> | <integer>
>   a     chr1  [12, 20]      * |         2
>   b     chr1  [12, 20]      * |         2
>   c     chr1  [13, 20]      * |         3
>   d     chr1  [14, 20]      * |         4
>   -------
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> So we have an inconsistency within our Vector-based classes.
> We need to fix that. It seems that you would have expected the
> metadata columns to be altered by [<-. Is this what most people feel?
>
>>
>> The other issue is that r/cbind'ing doesn't seem to work properly for
>> unnamed multi-matrix Assays objects. Consider:
>>
>>  > require(SummarizedExperiment)
>>  > whee <- Assays(list(x1=matrix(1, 3, 4), x2=matrix(2, 3, 4)))
>>  > whee2 <- Assays(list(x1=matrix(3, 3, 4), x2=matrix(4, 3, 4)))
>>  > rbind(whee, whee2)
>> Reference class object of class "ShallowSimpleListAssays"
>> Field "data":
>> List of length 2
>> names(2): x1 x2
>>  >
>>  > names(whee) <- names(whee2) <- NULL
>>  > rbind(whee, whee2)
>> Reference class object of class "ShallowSimpleListAssays"
>> Field "data":
>> List of length 1
>>
>> So, unnaming and rbind'ing results in the loss of a matrix. This is the
>> same issue I reported for unnamed multi-matrix assays when rbinding
>> multiple SummarizedExperiment objects; I recall that being resolved by
>> r/cbind'ing based on position. Should this be done here as well? If not,
>> perhaps we should force people to name their assays.
>
> Maybe Martin or Val can chime in for this one.
>
> Thanks,
> H.
>
>>
>> Cheers,
>>
>> Aaron
>>
>>  > sessionInfo()
>> R Under development (unstable) (2015-10-30 r69588)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: CentOS release 6.4 (Final)
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats4    stats     graphics  grDevices utils     datasets
>> [8] methods   base
>>
>> other attached packages:
>> [1] SummarizedExperiment_1.1.2 Biobase_2.31.0
>> [3] GenomicRanges_1.23.3       GenomeInfoDb_1.7.3
>> [5] IRanges_2.5.5              S4Vectors_0.9.8
>> [7] BiocGenerics_0.17.1
>>
>> loaded via a namespace (and not attached):
>> [1] zlibbioc_1.17.0 XVector_0.11.0
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>



More information about the Bioc-devel mailing list