[Bioc-devel] 'semantically rich' subsetting of SummarizedExperiments
Michael Lawrence
lawrence.michael at gene.com
Tue Oct 14 07:25:25 CEST 2014
On Mon, Oct 13, 2014 at 9:44 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi,
>
> On 10/11/2014 02:25 PM, Vincent Carey wrote:
>
>> On Sat, Oct 11, 2014 at 5:17 PM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>>
>>> But what would it do exactly?
>>>
>>> Probably would want to be able to extract a gene list from a TxDb, then
>>> extract the desired type of structure from the TxDb.
>>>
>>> Not too bad right now, but it would be nice to leverage the identifier
>>> type information on the gene list object.
>>>
>>> Currently:
>>> tx <- transcripts(txdb, vals=list(gene_id=genes))
>>>
>>> Proposed:
>>> tx <- transcripts(txdb[GeneList])
>>>
>>>
>> yes, that makes sense. i don't go to txdb's as naturally as i should.
>>
>
> Also coming a little late to the party, but I also have a preference
> for Kasper's proposal of using subsetByXXX.
>
> Supporting 'txdb[GeneList]' is arbitrarily making gene ids special,
> when a TxDb contains other ids (transcript and exon ids).
>
>
My proposal was in the context of having formal vectors of IDs, as Gabe has
done (only internally so far): basically, extending a character vector to
track the type of ID. GSEABase has something similar. I agree that plain old
character vectors make no sense here.
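A rough sketch of the idea, using only base R's methods package. GeneSymbol
is the name from Gabe's example earlier in the thread; EntrezId, lookupTx()
and the toy tx_table are hypothetical names made up for illustration. This is
not Gabe's implementation, nor the GSEABase or TxDb API.

```r
library(methods)

# Classed identifiers: character vectors whose class records the ID type.
setClass("GeneSymbol", contains = "character")
setClass("EntrezId", contains = "character")

GeneSymbol <- function(x) new("GeneSymbol", as.character(x))
EntrezId <- function(x) new("EntrezId", as.character(x))

# Toy annotation table standing in for a TxDb-like object (hypothetical).
tx_table <- data.frame(
  tx_id = c("tx1", "tx2", "tx3"),
  symbol = c("BRAF", "BRCA1", "TP53"),
  entrez = c("673", "672", "7157"),
  stringsAsFactors = FALSE
)

# The generic dispatches on the ID class, so the caller never has to say
# which column the IDs refer to.
setGeneric("lookupTx", function(ids, tab) standardGeneric("lookupTx"))

setMethod("lookupTx", "GeneSymbol", function(ids, tab) {
  tab[tab$symbol %in% as.character(ids), , drop = FALSE]
})

setMethod("lookupTx", "EntrezId", function(ids, tab) {
  tab[tab$entrez %in% as.character(ids), , drop = FALSE]
})

# Plain character vectors carry no type information, so refuse to guess.
setMethod("lookupTx", "character", function(ids, tab) {
  stop("plain character vector: ID type is ambiguous")
})

by_symbol <- lookupTx(GeneSymbol(c("BRAF", "BRCA1")), tx_table)
by_entrez <- lookupTx(EntrezId("7157"), tx_table)
```

Because the classes extend character, the objects can still be passed to
code that expects a character vector; only the dispatch changes.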
>
> Also, if you push this concept a little, you quickly run into
> some semantic headaches:
>
> - First, let's keep in mind that for a common track like the
> "UCSC Genes" track, a lot of transcripts are not linked to any
> gene.
>
> - Then, allowing subsetting a TxDb by a character vector means
> a TxDb has names. At least conceptually. So it's tempting to
> also support 'names(txdb)' (would return all the gene ids).
>
> - Finally, the names being unique, it seems natural to expect that
> 'txdb[names(txdb)]' is a no-op. But it won't because
> 'txdb[names(txdb)]' will drop all the transcripts that are not
> linked to a gene.
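A toy illustration of the round-trip problem described above, in base R. The
data frame and the helpers gene_names() and subset_by_gene() are hypothetical
stand-ins for names(txdb) and txdb[ids]; no real TxDb is involved.

```r
# Toy transcript table: tx3 is not linked to any gene (NA gene_id).
tx <- data.frame(
  tx_id = c("tx1", "tx2", "tx3"),
  gene_id = c("g1", "g2", NA),
  stringsAsFactors = FALSE
)

# Hypothetical analogue of names(txdb): all known gene ids.
gene_names <- function(tx) unique(tx$gene_id[!is.na(tx$gene_id)])

# Hypothetical analogue of txdb[ids]: keep transcripts of the given genes.
subset_by_gene <- function(tx, ids) {
  tx[tx$gene_id %in% ids, , drop = FALSE]
}

# Subsetting by "all names" is not a no-op: the unlinked transcript is lost.
roundtrip <- subset_by_gene(tx, gene_names(tx))
nrow(tx)         # 3
nrow(roundtrip)  # 2
```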
>
> But before any TxDb subsetting can happen (via [ or subsetByXXX), we
> need to bring back the classic (and healthier) pass-by-value semantics
> on these objects. (Right now TxDb is a reference class and thus TxDb
> objects have pass-by-reference semantics.)
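For readers less familiar with the distinction, a small sketch in base R of
how the two semantics differ. RefDb and ValDb are toy classes invented for
this example; they only mimic the behavior and are not the TxDb class.

```r
library(methods)

# Reference class: every copy points at the same underlying state.
RefDb <- setRefClass("RefDb", fields = list(genes = "character"))

# Ordinary S4 class: modifying a copy leaves the original untouched.
setClass("ValDb", representation(genes = "character"))

drop_first_ref <- function(db) {
  db$genes <- db$genes[-1]  # modifies the caller's object in place
  invisible(db)
}

drop_first_val <- function(db) {
  db@genes <- db@genes[-1]  # modifies a local copy only
  db
}

r <- RefDb$new(genes = c("g1", "g2"))
drop_first_ref(r)
r$genes    # the original object has changed

v <- new("ValDb", genes = c("g1", "g2"))
v2 <- drop_first_val(v)
v@genes    # the original object is unchanged
v2@genes   # the subset lives in the returned copy
```

With value semantics, a subsetting operation can return a smaller object
without any risk of altering the one it was called on.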
>
> H.
>
>
>
>>
>>
>>>
>>>
>>> On Sat, Oct 11, 2014 at 10:49 AM, Martin Morgan <mtmorgan at fhcrc.org>
>>> wrote:
>>>
>>>> On 10/11/2014 08:41 AM, Vincent Carey wrote:
>>>>
>>>>> Is there anything on the order of as([GeneSet], "GRanges") around?
>>>>
>>>> no, I don't think so; obviously of use and following a common theme.
>>>>
>>>> Martin
>>>>
>>>>
>>>>
>>>> On Sat, Sep 20, 2014 at 11:34 PM, Gabe Becker <becker.gabe at gene.com>
>>>>> wrote:
>>>>>
>>>>>> Sean and Vincent,
>>>>>>
>>>>>> The goal of what we are doing builds off of what Martin has in GSEABase.
>>>>>> We were looking to see how much benefit we can get with something
>>>>>> lighter-weight that lies between indistinguishable character vectors and
>>>>>> the full machinery of GeneSets.
>>>>>>
>>>>>> Either way, it seems like formalizing the semantic information is a way to
>>>>>> do what you want. Furthermore, these classed id objects can be created
>>>>>> automatically when there is contextual information e.g. during queries to
>>>>>> databases (or db-like objects), and then simply added to metadata
>>>>>> DataFrames and re-used.
>>>>>>
>>>>>> ~G
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Sep 20, 2014 at 12:19 PM, Sean Davis <sdavis2 at mail.nih.gov>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Sat, Sep 20, 2014 at 3:11 PM, Gabe Becker <becker.gabe at gene.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey all,
>>>>>>>>
>>>>>>>> We are in the (very) early stages of experimenting with something that
>>>>>>>> seems relevant here: classed identifiers. We are using them for
>>>>>>>> database/mart queries, but the same concept could be useful for the cases
>>>>>>>> you're describing I think.
>>>>>>>>
>>>>>>>> E.g.
>>>>>>>>
>>>>>>>> mysyms = GeneSymbol(c("BRAF", "BRCA1"))
>>>>>>>> mysyms
>>>>>>>> An object of class "GeneSymbol"
>>>>>>>> [1] "BRAF" "BRCA1"
>>>>>>>>
>>>>>>>> yourSE[mysyms, ]
>>>>>>>> ...
>>>>>>>>
>>>>>>>>
>>>>>>> This approach has the flavor of some of the functionality that Martin
>>>>>>> put together for the GSEABase package (EntrezIdentifier, etc.).
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> This approach has the benefit of being declarative instead of heuristic
>>>>>>>> (people won't be able to accidentally invoke it), while still giving most
>>>>>>>> of the convenience I believe you are looking for.
>>>>>>>>
>>>>>>>> The object classes inherit directly from character, so should "just work"
>>>>>>>> most of the time, but as I said it's early days; lots more testing for
>>>>>>>> functionality and usefulness is needed.
>>>>>>>>
>>>>>>>> ~G
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Sep 20, 2014 at 11:38 AM, Vincent Carey <
>>>>>>>> stvjc at channing.harvard.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> OK by me to leave [ alone. We could start with subsetByEntrez,
>>>>>>>>> subsetByKEGG, subsetBySymbol, subsetByGOTERM, subsetByGOID.
>>>>>>>>>
>>>>>>>>> Utilities to generate GRanges for queries in each of these vocabularies
>>>>>>>>> should, perhaps, be in the OrganismDb space? Once those are in place
>>>>>>>>> no additional infrastructure is necessary?
>>>>>>>>>
>>>>>>>>> On Sat, Sep 20, 2014 at 12:49 PM, Tim Triche, Jr.
>>>>>>>>> <tim.triche at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Agreed with Sean, having tried implementing the "magical" alternative
>>>>>>>>>>
>>>>>>>>>> --t
>>>>>>>>>>
>>>>>>>>>> On Sep 20, 2014, at 9:31 AM, Sean Davis <sdavis2 at mail.nih.gov>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, Vince.
>>>>>>>>>>>
>>>>>>>>>>> I'm coming a little late to the party, but I agree with Kasper's
>>>>>>>>>>> sentiment that the less "magical" approach of using subsetByXXX
>>>>>>>>>>> might be the cleaner way to go for the time being.
>>>>>>>>>>>
>>>>>>>>>>> Sean
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 20, 2014 at 10:42 AM, Vincent Carey
>>>>>>>>>>> <stvjc at channing.harvard.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/vjcitn/biocMultiAssay/blob/master/vignettes/SEresolver.Rnw
>>>>>>>>>>>> shows some modifications to [ that allow subsetting of SE by
>>>>>>>>>>>> gene or pathway name
>>>>>>>>>>>>
>>>>>>>>>>>> it may be premature to work at the [ level. Kasper suggested
>>>>>>>>>>>> defining a suite of subsetBy operations that would accomplish this
>>>>>>>>>>>>
>>>>>>>>>>>> i think we could get something along these lines into the release
>>>>>>>>>>>> without too much more work. votes?
>>>>>>>>>>>>
>>>>>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Computational Biologist
>>>>>>>> Genentech Research
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>>>>
>>>
>>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>