[Bioc-devel] 'semantically rich' subsetting of SummarizedExperiments

Tue Oct 14 07:25:25 CEST 2014

On Mon, Oct 13, 2014 at 9:44 PM, Hervé Pagès <hpages at fhcrc.org> wrote:

> Hi,
>
> On 10/11/2014 02:25 PM, Vincent Carey wrote:
>
>> On Sat, Oct 11, 2014 at 5:17 PM, Michael Lawrence <
>> lawrence.michael at gene.com
>>
>>> wrote:
>>>
>>
>>  But what would it do exactly?
>>>
>>> Probably would want to be able to extract a gene list from a TxDb, then
>>> extract the desired type of structure from the TxDb.
>>>
>>> Not too bad right now, but it would be nice to leverage the identifier
>>> type information on the gene list object.
>>>
>>> Currently:
>>> tx <- transcripts(txdb, vals=list(gene_id=genes))
>>>
>>> Proposed:
>>> tx <- transcripts(txdb[GeneList])
>>>
>>>
>> yes, that makes sense.  i don't go to txdb's as naturally as i should.
>>
>
> Also coming a little late to the party, but I also have a preference
> for Kasper's proposal of using subsetByXXX.
>
> Supporting 'txdb[GeneList]' is arbitrarily making gene ids special,
> when a TxDb contains other ids (transcript and exon ids).
>
>
My proposal was in the context of having formal vectors of IDs, as Gabe has
done (only internally so far): basically, extending a character vector to
track the type of ID. GSEABase has something similar. I agree that subsetting
by plain old character vectors makes no sense here.
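
To make the idea concrete, here is a minimal sketch of a typed ID vector
driving `[` dispatch. Everything in it is an assumption for illustration:
the GeneSymbol constructor only mimics the example quoted further down,
ToyExperiment is a made-up stand-in (it is neither a TxDb nor a
SummarizedExperiment), and none of this is Gabe's internal code or the
GSEABase API.

```r
library(methods)

# Hypothetical lightweight ID class: a character vector that knows its ID type.
setClass("GeneSymbol", contains = "character")
GeneSymbol <- function(x) new("GeneSymbol", as.character(x))

# Toy stand-in for an annotated container: a matrix plus one symbol per row.
setClass("ToyExperiment",
         representation(assay = "matrix", symbol = "character"))

# Dispatch on the class of the index, so that only a declared GeneSymbol
# triggers symbol-based lookup.
setMethod("[", signature(x = "ToyExperiment", i = "GeneSymbol"),
          function(x, i, j, ..., drop = TRUE) {
              keep <- x@symbol %in% i@.Data
              new("ToyExperiment",
                  assay = x@assay[keep, , drop = FALSE],
                  symbol = x@symbol[keep])
          })

te <- new("ToyExperiment",
          assay = matrix(1:6, nrow = 3),
          symbol = c("BRAF", "BRCA1", "TP53"))
mysyms <- GeneSymbol(c("BRAF", "BRCA1"))
picked <- te[mysyms]
nrow(picked@assay)   # 2
```

The method is registered only for a GeneSymbol index, so a plain character
vector does not match it; the lookup has to be asked for explicitly.
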

> Also, if you push this concept a little, you quickly run into
> some semantic headaches:
>
>   - First, let's keep in mind that for a common track like the
>     "UCSC Genes" track, a lot of transcripts are not linked to any
>     gene.
>
>   - Then, allowing subsetting a TxDb by a character vector means
>     a TxDb has names. At least conceptually. So it's tempting to
>     also support 'names(txdb)' (would return all the gene ids).
>
>   - Finally, the names being unique, it seems natural to expect that
>     'txdb[names(txdb)]' is a no-op. But it won't be, because
>     'txdb[names(txdb)]' will drop all the transcripts that are not
>     linked to a gene.
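
The round-trip problem in the list above can be reproduced with a toy
table. This is only a sketch under stated assumptions: the data frame
stands in for a TxDb, and toy_names() and toy_subset() are hypothetical
helpers playing the roles of 'names(txdb)' and 'txdb[ids]'; neither exists
in GenomicFeatures.

```r
# Toy transcript table; NA marks a transcript that is not linked to any gene.
tx <- data.frame(tx_name = c("tx1", "tx2", "tx3", "tx4"),
                 gene_id = c("g1", "g1", NA, "g2"),
                 stringsAsFactors = FALSE)

# Hypothetical 'names': the unique gene ids present in the table.
toy_names <- function(tx) unique(tx$gene_id[!is.na(tx$gene_id)])

# Hypothetical gene-id subsetting: keep transcripts whose gene was requested.
toy_subset <- function(tx, ids) tx[tx$gene_id %in% ids, , drop = FALSE]

roundtrip <- toy_subset(tx, toy_names(tx))
nrow(tx)          # 4
nrow(roundtrip)   # 3: tx3 has no gene, so the round trip drops it
```
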
>
> But before any TxDb subsetting can happen (via [ or subsetByXXX), we
> need to bring back the classic (and healthier) pass-by-value semantics
> on these objects. (Right now TxDb is a reference class and thus TxDb
> objects have pass-by-reference semantics.)
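
The difference between the two semantics can be shown with two toy
classes from the methods package. RefBox and ValBox are hypothetical names
used only for this sketch; they are not related to the actual TxDb
implementation.

```r
library(methods)

# Reference class: assignment does not copy, so both names share one object.
RefBox <- setRefClass("RefBox", fields = list(ids = "character"))
a <- RefBox$new(ids = c("g1", "g2"))
b <- a
b$ids <- "g1"      # "subsetting" b in place also changes a
length(a$ids)      # 1

# Ordinary S4 class: modifying the copy leaves the original untouched.
setClass("ValBox", representation(ids = "character"))
v <- new("ValBox", ids = c("g1", "g2"))
w <- v
w@ids <- "g1"
length(v@ids)      # 2
```

With reference semantics, a subsetting operation that modifies the object
would silently affect every other variable pointing at it, which is why
value semantics is the safer base for [ or subsetByXXX.
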
>
> H.
>
>
>
>>
>>
>>>
>>>
>>> On Sat, Oct 11, 2014 at 10:49 AM, Martin Morgan <mtmorgan at fhcrc.org>
>>> wrote:
>>>
>>>  On 10/11/2014 08:41 AM, Vincent Carey wrote:
>>>>
>>>>  Is there anything on the order of as([GeneSet], "GRanges") around?
>>>>>
>>>>>
>>>> no, I don't think so; obviously of use and following a common theme.
>>>> Martin
>>>>
>>>>
>>>>
>>>>  On Sat, Sep 20, 2014 at 11:34 PM, Gabe Becker <becker.gabe at gene.com>
>>>>> wrote:
>>>>>
>>>>>> Sean and Vincent,
>>>>>>
>>>>>> The goal of what we are doing builds off of what Martin has in GSEABase.
>>>>>> We were looking to see how much benefit we can get with something
>>>>>> lighter-weight that lies between indistinguishable character vectors and
>>>>>> the full machinery of GeneSets.
>>>>>>
>>>>>> Either way, it seems like formalizing the semantic information is a way to
>>>>>> do what you want. Furthermore, these classed id objects can be created
>>>>>> automatically when there is contextual information e.g. during queries to
>>>>>> databases (or db-like objects), and then simply added to metadata
>>>>>> DataFrames and re-used.
>>>>>>
>>>>>> ~G
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Sep 20, 2014 at 12:19 PM, Sean Davis <sdavis2 at mail.nih.gov>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Sat, Sep 20, 2014 at 3:11 PM, Gabe Becker <becker.gabe at gene.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey all,
>>>>>>>>
>>>>>>>> We are in the (very) early stages of experimenting with something that
>>>>>>>> seems relevant here: classed identifiers. We are using them for
>>>>>>>> database/mart queries, but the same concept could be useful for the cases
>>>>>>>> you're describing I think.
>>>>>>>>
>>>>>>>> E.g.
>>>>>>>>
>>>>>>>>   > mysyms = GeneSymbol(c("BRAF", "BRCA1"))
>>>>>>>>   > mysyms
>>>>>>>>   An object of class "GeneSymbol"
>>>>>>>>   [1] "BRAF"  "BRCA1"
>>>>>>>>   > yourSE[mysyms, ]
>>>>>>>>   ...
>>>>>>>>
>>>>>>>>
>>>>>>> This approach has the flavor of some of the functionality that Martin put
>>>>>>> together for the GSEABase package (EntrezIdentifier, etc.).
>>>>>>>
>>>>>>> Sean
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> This approach has the benefit of being declarative instead of heuristic
>>>>>>>> (people won't be able to accidentally invoke it), while still giving most
>>>>>>>> of the convenience I believe you are looking for.
>>>>>>>>
>>>>>>>> The object classes inherit directly from character, so should "just work"
>>>>>>>> most of the time, but as I said it's early days; lots more testing for
>>>>>>>> functionality and usefulness is needed.
>>>>>>>>
>>>>>>>> ~G
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Sep 20, 2014 at 11:38 AM, Vincent Carey <
>>>>>>>> stvjc at channing.harvard.edu>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> OK by me to leave [ alone.  We could start with subsetByEntrez,
>>>>>>>>> subsetByKEGG, subsetBySymbol, subsetByGOTERM, subsetByGOID.
>>>>>>>>>
>>>>>>>>> Utilities to generate GRanges for queries in each of these vocabularies
>>>>>>>>> should, perhaps, be in the OrganismDb space?  Once those are in place
>>>>>>>>> no additional infrastructure is necessary?
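
A rough sketch of what two members of such a subsetBy family could look
like on a plain matrix follows. The function bodies, the lookup table
sym2entrez, and the toy data are all assumptions made for illustration; a
real version would operate on a SummarizedExperiment and query an
annotation resource rather than a hard-coded vector.

```r
# Toy assay matrix with Entrez-style ids as row names (values are made up).
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("673", "672", "7157"), c("s1", "s2")))

# Hypothetical symbol-to-Entrez lookup, standing in for an annotation query.
sym2entrez <- c(BRAF = "673", BRCA1 = "672", TP53 = "7157")

# The function name says which vocabulary the ids are in; no guessing needed.
subsetByEntrez <- function(x, ids) {
    x[rownames(x) %in% ids, , drop = FALSE]
}

# Translate symbols to the native id type, then reuse the native subsetter.
subsetBySymbol <- function(x, symbols, map = sym2entrez) {
    subsetByEntrez(x, unname(map[symbols]))
}

res <- subsetBySymbol(m, c("BRAF", "TP53"))
rownames(res)   # "673" "7157"
```

Each vocabulary only needs its own translation step; the actual subsetting
is shared, which fits the suggestion that the mapping utilities could live
elsewhere.
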
>>>>>>>>>
>>>>>>>>> On Sat, Sep 20, 2014 at 12:49 PM, Tim Triche, Jr. <
>>>>>>>>>
>>>>>>>>>  tim.triche at gmail.com>
>>>>>>>>
>>>>>>>>  wrote:
>>>>>>>>>
>>>>>>>>>> Agreed with Sean, having tried implementing the "magical" alternative
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --t
>>>>>>>>>>
>>>>>>>>>>   On Sep 20, 2014, at 9:31 AM, Sean Davis <sdavis2 at mail.nih.gov>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  wrote:
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>>>> Hi, Vince.
>>>>>>>>>>>
>>>>>>>>>>> I'm coming a little late to the party, but I agree with Kasper's
>>>>>>>>>>> sentiment that the less "magical" approach of using subsetByXXX might be
>>>>>>>>>>> the cleaner way to go for the time being.
>>>>>>>>>>>
>>>>>>>>>>> Sean
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 20, 2014 at 10:42 AM, Vincent Carey <
>>>>>>>>>>>
>>>>>>>>>>>  stvjc at channing.harvard.edu>
>>>>>>>>>>
>>>>>>>>>>  wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/vjcitn/biocMultiAssay/blob/master/vignettes/SEresolver.Rnw
>>>>>>>>>>>>
>>>>>>>>>>>> shows some modifications to [ that allow subsetting of SE by
>>>>>>>>>>>> gene or pathway name
>>>>>>>>>>>>
>>>>>>>>>>>> it may be premature to work at the [ level.  Kasper suggested defining
>>>>>>>>>>>> a suite of subsetBy operations that would accomplish this
>>>>>>>>>>>>
>>>>>>>>>>>> i think we could get something along these lines into the release without
>>>>>>>>>>>> too much more work.  votes?
>>>>>>>>>>>>
>>>>>>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Computational Biologist
>>>>>>>> Genentech Research
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> Computational Biologist
>>>>>> Genentech Research
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
>>>>
>>>
>>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
