[BioC] Add extra columns to GRanges Metadata

Wed Feb 22 07:24:04 CET 2012

Hi Herve,

My opinion:
I am happy to continue with
values(gr) <- cbind(vales(gr), DataFrame(colA, colB))  

.... now it is clear it is best practice (I was originally not certain
on that). I'm just fine with "values" no other alias seems necessary.

I would be delighted however not to have to bother with ordering down
the track, that was confusing when I encountered it. Your solution for
that sounds very reasonable.

Thanks
Paul

-----Original Message-----
From: Hervé Pagès <hpages at fhcrc.org>
To: p.leo at uq.edu.au
Cc: bioconductor <bioconductor at r-project.org>
Subject: Re: [BioC] Add extra columns to GRanges Metadata
Date: Tue, 21 Feb 2012 21:14:19 -0800

On 02/20/2012 10:35 PM, Paul Leo wrote:
> Well in the end I did this the way I always do it: extracted the
> elementMetaData modified it and reassigned it in it's entirety.
>
> It works fine but it seems a little ham fisted hence my original post.
>
> I had to do this in the end ANYWAY because  c(,) , it appears,  requires
> the elementMetadata of the GRanges in the same order. see example below.
> Something I seemed to have avoided until now.
>
> Given this I don't *think*  that the other solns will work anyway. But
> there are indeed other issues that I did not appreciate that you point
> out. Many thanks .
>
> I guess I'm after this  functionality since for combining  mutation data
> from various sources/programs , SNP , indels, large indels, CNV etc, the
> quality,genotype, annotation data I put in the elementMetaData. To take
> along for the analysis and filtering.
>
> I could have used GrangeList... but  in the final disease modelling  is
> not convenient to use a Granges object
>
>
> So the ordering thing is a little bit of a pain. But the problem is
> solvable with the current data structures (which are just fab- thanks).
>
> Though it would be handy to add extra metadata in a computationally
> rapid way and not to worry about their order.

Yes c() is a little bit rigid as it requires the colnames() of
elementMetadata(gr1) and elementMetadata(gr2) to be identical.
Maybe we could make it a little more forgiving by allowing
the colnames to be the same but not necessarily in the same order.
Then c(gr1, gr2) would reorder the cols in gr2 like in gr1
before doing the combining. Does that sound reasonable?

Also I'm with Martin on why we should resist the temptation to
make 'gr$foo' a convenience for 'elementMetadata(gr)$foo'.
Yes elementMetadata is a long and ugly accessor name (and also
a long and ugly slot name) and this is why values() was added
as an alias for elementMetadata() a couple of years ago on
popular demand. If that's not short enough, we can add another
alias like emd():

    emd(gr) <- cbind(emd(gr), DataFrame(colA, colB))

I would actually prefer this alias (or anything else that is
short enough) over values(). The values in a GRanges object are
its elements (the ranges), like the values in a numeric vector are
the numbers in that vector.

Cheers,
H.

>
> Cheers
> Paul
>
>
>
>
>
>
> PS: Grange combning when metaData in different orders example:
>
> gr1<-GRanges(seqnames =
>            Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
>            ranges =
>            IRanges(1:10, end = 7:16, names = head(letters, 10)),
>            strand =
>            Rle(strand(c("-", "+", "*", "+", "-")),
>                c(1, 2, 2, 3, 2)),
>            score = 1:10,
>            GC = seq(1, 0, length=10))
> gr2<-GRanges(seqnames =
>            Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
>            ranges =
>            IRanges(1:10, end = 7:16, names = head(letters, 10)),
>            strand =
>            Rle(strand(c("-", "+", "*", "+", "-")),
>                c(1, 2, 2, 3, 2)),
>            GC = seq(1, 0, length=10),
>            score = 1:10
>            )
>
> gr1
> gr2
>
> gr1<-GRanges(seqnames =
>            Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
>            ranges =
>            IRanges(1:10, end = 7:16, names = head(letters, 10)),
>            strand =
>            Rle(strand(c("-", "+", "*", "+", "-")),
>                c(1, 2, 2, 3, 2)),
>            score = 1:10,
>            GC = seq(1, 0, length=10))
> gr2<-GRanges(seqnames =
>            Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
>            ranges =
>            IRanges(1:10, end = 7:16, names = head(letters, 10)),
>            strand =
>            Rle(strand(c("-", "+", "*", "+", "-")),
>                c(1, 2, 2, 3, 2)),
>            GC = seq(1, 0, length=10),
>            score = 1:10
>            )
> gr1
> gr2
>
> gr.all<-c(gr1,gr2)
> Error in .Method(..., deparse.level = deparse.level) :
>    column names for arg 2 do not match those of first arg
>
> gr2<-GRanges(seqnames =
>            Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
>            ranges =
>            IRanges(1:10, end = 7:16, names = head(letters, 10)),
>            strand =
>            Rle(strand(c("-", "+", "*", "+", "-")),
>                c(1, 2, 2, 3, 2)),
>            score = 1:10,GC = seq(1, 0, length=10)
>            )
> gr.all<-c(gr1,gr2)
>
>
>> sessionInfo()
> R Under development (unstable) (2011-11-23 r57740)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>   [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods
> base
>
> other attached packages:
>   [1] multicore_0.1-7        biomaRt_2.11.1
> Rsamtools_1.7.23
>   [4] Biostrings_2.23.6      GenomicFeatures_1.7.10
> AnnotationDbi_1.17.14
>   [7] Biobase_2.15.3         GenomicRanges_1.7.15
> IRanges_1.13.20
> [10] BiocGenerics_0.1.4
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-4.1     BSgenome_1.23.2    DBI_0.2-5
> RCurl_1.9-5
> [5] RSQLite_0.11.1     rtracklayer_1.15.6 tools_2.15.0
> XML_3.9-2
> [9] zlibbioc_1.1.0
>>
>
> -----Original Message-----
> From: Michael Lawrence<lawrence.michael at gene.com>
> To: Martin Morgan<mtmorgan at fhcrc.org>
> Cc: Michael Lawrence<lawrence.michael at gene.com>, bioconductor
> <bioconductor at r-project.org>
> Subject: Re: [BioC] Add extra columns to GRanges Metadata
> Date: Mon, 20 Feb 2012 16:38:13 -0800
>
> On Mon, Feb 20, 2012 at 4:33 PM, Martin Morgan<mtmorgan at fhcrc.org>  wrote:
>
>> On 02/20/2012 11:50 AM, Tengfei Yin wrote:
>>
>>> +1 for Tim's methods.....especially the $ and $<- method, it's always in a
>>> wishlist for me :)
>>>
>>> well, maybe "$ " could be sometimes confusing when people trying to treat
>>> GRanges as a general format similar to data.frame, and do something like
>>> gr$seqnames<- blabla, the good part about current API is that it remind me
>>> the difference bettween  elementMetadata and three required fields... but
>>>
>>
>> GRanges extends Vector and
>>
>>> gr = GRanges("A", IRanges(1:10, 20), X=1:10, Y=1:10)
>>> length(gr)
>> [1] 10
>>> length(values(gr))
>> [1] 2
>>
>> shows the geometry -- one would expect gr$foo to access the range named
>> 'foo' rather than the column of elementMetadata named 'foo'. I think this
>> is the crux of the resistance.
>>
>>
> Right, thanks for reiterating this. The question is whether the convenience
> of the shortcut outweighs the conceptual inconsistency. Probably depends
> heavily on one's perspective. I would favor the perspective of the frequent
> user.
>
>
>> When Tim says
>>
>>
>> On 02/20/2012 11:45 AM, Tim Triche, Jr. wrote:
>>> if these don't belong in BiocGenerics, I don't know what does :-D
>>
>> I'm not sure if the reference is to "intersect" (which is already in
>> BiocGenerics), "%i%" etc., or to $ and $<-. For $ and $<-, they don't need
>> to be in BiocGenerics because they are already made generics by the methods
>> package -- BiocGenerics is meant to provide a place to define generics that
>> would otherwise be defined independently in multiple packages.
>>
>> Martin
>>
>>
>>   it will be convenient if general $ method support replacement? even for
>>> seqnames, strand and ranges part? with default checking?
>>>
>>> other methods mentioned by Tim are also handy if default of those function
>>> is under assumption.
>>>
>>> On Mon, Feb 20, 2012 at 1:04 PM, Michael Lawrence<lawrence.michael@**
>>> gene.com<lawrence.michael at gene.com>
>>>
>>>> wrote:
>>>>
>>>
>>>   On Mon, Feb 20, 2012 at 10:36 AM, Steve Lianoglou<
>>>> mailinglist.honeypot at gmail.com**>   wrote:
>>>>
>>>>   Hi Tim,
>>>>>
>>>>> These functions are quite handy ... definitely going to poach them.
>>>>>
>>>>>
>>>>
>>>> I wonder if we should keep poaching these in the wild or if Bioc Core
>>>> will
>>>> ever bow to popular pressure and allow their inclusion in the base
>>>> packages? ;-)
>>>>
>>>>
>>>>   Also ... you need an editor with better code folding support ;-)
>>>>>
>>>>> Pual,
>>>>>
>>>>> Aslo, to do what you want to do, namely:
>>>>>
>>>>>
>>>>>>
>>>>>   extra.blank<-matrix(data=NA,**nrow=length(a.grs),ncol=**
>>>> length(missing.meta.cols))
>>>>
>>>>> old.meta<-as.data.frame(**values(a.grs))
>>>>>> new.meta<-cbind(old.meta,**extra.blank)
>>>>>> values(a.grs)<-new.meta
>>>>>>
>>>>>
>>>>> Using vanilla GenomicRanges functionality, you just do this:
>>>>>
>>>>> R>   values(a.grs)<- cbind(values(a.grs), DataFrame(extra.blank))
>>>>>
>>>>> HTH,
>>>>>
>>>>> -steve
>>>>>
>>>>> --
>>>>> Steve Lianoglou
>>>>> Graduate Student: Computational Systems Biology
>>>>>   | Memorial Sloan-Kettering Cancer Center
>>>>>   | Weill Medical College of Cornell University
>>>>> Contact Info: http://cbio.mskcc.org/~lianos/**contact<http://cbio.mskcc.org/%7Elianos/contact>
>>>>>
>>>>> ______________________________**_________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.**science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>>
>>>>>
>>>>          [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.**science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>>
>>>
>>>
>>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor