[Bioc-devel] Unexpected behaviour with Assays and Vector classes
Aaron Lun
alun at wehi.edu.au
Sun Nov 15 21:37:07 CET 2015
Hi Herve,
I would have expected GRanges behaviour, where the metadata is affected
by the replacement. For example, with GRanges objects, I often use the
metadata to store statistics or descriptors relevant to each genomic
interval, e.g., peak scores, log-fold changes and so on. If I did a
replacement to change the genomic coordinates, I would expect that the
associated metadata would come along for the ride. The same argument
would be applied to other classes derived from Vector.
I should add that I encountered this problem in the context of doing
subset replacement for metadata. To re-use your IRanges example:
> require(IRanges)
> ir <- IRanges(11:14, 20, names=letters[1:4])
> mcols(ir) <- DataFrame(stuff=1:4)
> mcols(ir[1])$stuff <- 2
> mcols(ir)$stuff
[1] 1 2 3 4
I would expect that the first entry of "stuff" would now be 2. But
because metadata doesn't get replaced during subset replacement, it
remains at 1. This is problematic, because if I wanted to update a
subset of entries in the metadata, I would use something similar to the
above replacement and expect it to change the existing object.
Cheers,
Aaron
Hervé Pagès wrote:
> Hi Aaron,
>
> On 11/15/2015 10:59 AM, Aaron Lun wrote:
>> Hello all,
>>
>> I've encountered some unexpected behaviour with some of the base classes
>> while developing stuff for genomic interactions. The first issue lies
>> with the subset replacement in the Vector class. Let's say I make a
>> derived class "foo" inheriting from Vector, as below:
>>
>> > require(S4Vectors)
>> > setClass("foo", contains="Vector", slots=c(blah="integer"))
>> > setMethod("parallelSlotNames", "foo", function(x) {
>> > c("blah", callNextMethod())
>> > })
>> [1] "parallelSlotNames"
>> > setMethod("c", "foo", function(x, ..., recursive=TRUE) {
>> > new.blah <- do.call(c, lapply(list(x, ...), FUN=slot,
>> name="blah"))
>> > new.mcols <- do.call(rbind, lapply(list(x, ...), FUN=mcols))
>> > new("foo", blah=new.blah, metadata=metadata(x),
>> > elementMetadata=new.mcols)
>> > })
>> [1] "c"
>>
>> Construction gives what you'd expect:
>>
>> > a <- new("foo", blah=1:5, elementMetadata=DataFrame(stuff=1:5))
>> > a at blah
>> [1] 1 2 3 4 5
>> > mcols(a)$stuff
>> [1] 1 2 3 4 5
>>
>> However, if I try to do subset replacement, I get this:
>>
>> > a[1] <- a[2]
>> > a at blah
>> [1] 2 2 3 4 5
>> > mcols(a)$stuff
>> [1] 1 2 3 4 5
>>
>> So, "blah" is replaced properly, but "elementMetadata" is not. This is
>> attributable to a line in "replaceROWS" which preserves the mcols of the
>> original object during replacement (also for "names"). Should this line
>> be removed to give expected behaviour for the elementMetadata?
>
> For the treatment of the metadata columns we are usually mimicking
> how names are treated in base R:
>
> > x <- c(a=1, b=2, c=3, d=4)
> > x[1] <- x[2]
> > x
> a b c d
> 2 2 3 4
>
> The names are not affected.
>
> IRanges objects are following that model:
>
> > library(IRanges)
> > ir <- IRanges(11:14, 20, names=letters[1:4])
> > mcols(ir) <- DataFrame(stuff=1:4)
> > ir
> IRanges of length 4
> start end width names
> [1] 11 20 10 a
> [2] 12 20 9 b
> [3] 13 20 8 c
> [4] 14 20 7 d
> > mcols(ir)
> DataFrame with 4 rows and 1 column
> stuff
> <integer>
> 1 1
> 2 2
> 3 3
> 4 4
> > ir[1] <- ir[2]
> > ir
> IRanges of length 4
> start end width names
> [1] 12 20 9 a
> [2] 12 20 9 b
> [3] 13 20 8 c
> [4] 14 20 7 d
> > mcols(ir)
> DataFrame with 4 rows and 1 column
> stuff
> <integer>
> 1 1
> 2 2
> 3 3
> 4 4
>
> However it seems that GRanges objects are not:
>
> > gr <- GRanges("chr1", ir)
> > gr
> GRanges object with 4 ranges and 1 metadata column:
> seqnames ranges strand | stuff
> <Rle> <IRanges> <Rle> | <integer>
> a chr1 [12, 20] * | 1
> b chr1 [12, 20] * | 2
> c chr1 [13, 20] * | 3
> d chr1 [14, 20] * | 4
> -------
> seqinfo: 1 sequence from an unspecified genome; no seqlengths
> > gr[1] <- gr[2]
> > gr
> GRanges object with 4 ranges and 1 metadata column:
> seqnames ranges strand | stuff
> <Rle> <IRanges> <Rle> | <integer>
> a chr1 [12, 20] * | 2
> b chr1 [12, 20] * | 2
> c chr1 [13, 20] * | 3
> d chr1 [14, 20] * | 4
> -------
> seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> So we have an inconsistency within our Vector-based classes.
> We need to fix that. It seems that you would have expected the
> metadata columns to be altered by [<-. Is this what most people feel?
>
>>
>> The other issue is that r/cbind'ing doesn't seem to work properly for
>> unnamed multi-matrix Assays objects. Consider:
>>
>> > require(SummarizedExperiment)
>> > whee <- Assays(list(x1=matrix(1, 3, 4), x2=matrix(2, 3, 4)))
>> > whee2 <- Assays(list(x1=matrix(3, 3, 4), x2=matrix(4, 3, 4)))
>> > rbind(whee, whee2)
>> Reference class object of class "ShallowSimpleListAssays"
>> Field "data":
>> List of length 2
>> names(2): x1 x2
>> >
>> > names(whee) <- names(whee2) <- NULL
>> > rbind(whee, whee2)
>> Reference class object of class "ShallowSimpleListAssays"
>> Field "data":
>> List of length 1
>>
>> So, unnaming and rbind'ing results in the loss of a matrix. This is the
>> same issue I reported for unnamed multi-matrix assays when rbinding
>> multiple SummarizedExperiment objects; I recall that being resolved by
>> r/cbind'ing based on position. Should this be done here as well? If not,
>> perhaps we should force people to name their assays.
>
> Maybe Martin or Val can chime in for this one.
>
> Thanks,
> H.
>
>>
>> Cheers,
>>
>> Aaron
>>
>> > sessionInfo()
>> R Under development (unstable) (2015-10-30 r69588)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: CentOS release 6.4 (Final)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats4 stats graphics grDevices utils datasets
>> [8] methods base
>>
>> other attached packages:
>> [1] SummarizedExperiment_1.1.2 Biobase_2.31.0
>> [3] GenomicRanges_1.23.3 GenomeInfoDb_1.7.3
>> [5] IRanges_2.5.5 S4Vectors_0.9.8
>> [7] BiocGenerics_0.17.1
>>
>> loaded via a namespace (and not attached):
>> [1] zlibbioc_1.17.0 XVector_0.11.0
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
More information about the Bioc-devel
mailing list