[Bioc-devel] 'semantically rich' subsetting of SummarizedExperiments

Hervé Pagès hpages at fhcrc.org
Tue Oct 14 06:44:42 CEST 2014


Hi,

On 10/11/2014 02:25 PM, Vincent Carey wrote:
> On Sat, Oct 11, 2014 at 5:17 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
>
>> But what would it do exactly?
>>
>> Probably would want to be able to extract a gene list from a TxDb, then
>> extract the desired type of structure from the TxDb.
>>
>> Not too bad right now, but it would be nice to leverage the identifier
>> type information on the gene list object.
>>
>> Currently:
>> tx <- transcripts(txdb, vals=list(gene_id=genes))
>>
>> Proposed:
>> tx <- transcripts(txdb[GeneList])
>>
>
> yes, that makes sense.  i don't go to txdb's as naturally as i should.

I'm also coming a little late to the party, but I too prefer Kasper's
proposal of using subsetByXXX.

Supporting 'txdb[GeneList]' arbitrarily makes gene ids special, even
though a TxDb also contains other ids (transcript and exon ids).

Also, if you push this concept a little, you quickly run into some
semantic headaches:

   - First, let's keep in mind that for a common track like the
     "UCSC Genes" track, a lot of transcripts are not linked to any
     gene.

   - Then, allowing a TxDb to be subsetted by a character vector
     implies that a TxDb has names, at least conceptually. So it's
     tempting to also support 'names(txdb)' (which would return all
     the gene ids).

   - Finally, the names being unique, it seems natural to expect that
     'txdb[names(txdb)]' is a no-op. But it won't be, because
     'txdb[names(txdb)]' will drop all the transcripts that are not
     linked to a gene.
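
To make the last point concrete, here is a minimal sketch in base R.
It uses a plain data.frame as a toy stand-in for a TxDb, and the
helpers toy_names() and toy_subset() are hypothetical illustrations of
what 'names(txdb)' and 'txdb[ids]' might do. None of this is the
GenomicFeatures API.

```r
# Toy stand-in for a TxDb: a table of transcripts, one of which has no gene.
# (Hypothetical data; not a real TxDb.)
toy_tx <- data.frame(
  tx_name = c("tx1", "tx2", "tx3", "tx4"),
  gene_id = c("g1", "g1", "g2", NA),   # tx4 is not linked to any gene
  stringsAsFactors = FALSE
)

# Hypothetical 'names()': the unique gene ids present in the object.
toy_names <- function(x) unique(x$gene_id[!is.na(x$gene_id)])

# Hypothetical '[': keep only the transcripts linked to the requested genes.
toy_subset <- function(x, ids) x[x$gene_id %in% ids, , drop = FALSE]

# Subsetting by all the names is expected to be a no-op ...
roundtrip <- toy_subset(toy_tx, toy_names(toy_tx))

nrow(toy_tx)                   # 4
nrow(roundtrip)                # 3: tx4 was silently dropped
identical(roundtrip, toy_tx)   # FALSE, so it is not a no-op
```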

But before any TxDb subsetting can happen (via [ or subsetByXXX), we
need to bring back the classic (and healthier) pass-by-value semantics
for these objects. (Right now TxDb is a reference class, so TxDb
objects have pass-by-reference semantics.)
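
The difference matters for subsetting because, with a reference class,
modifying what looks like a copy also modifies the original. A small
sketch with two made-up classes (RefBox and ValBox are hypothetical
names, not the TxDb implementation) shows the contrast:

```r
library(methods)

# Reference class: assignment does not copy, so both names share state.
RefBox <- setRefClass("RefBox", fields = list(ids = "character"))
a <- RefBox$new(ids = c("g1", "g2", "g3"))
b <- a                 # 'b' refers to the same object as 'a'
b$ids <- b$ids[1]      # "subset" b in place
length(a$ids)          # 1: 'a' changed too

# Ordinary S4 class: the usual R copy-on-modify (pass-by-value) behaviour.
setClass("ValBox", representation(ids = "character"))
v <- new("ValBox", ids = c("g1", "g2", "g3"))
w <- v
w@ids <- w@ids[1]      # only 'w' is modified
length(v@ids)          # 3: 'v' is untouched
```

With value semantics, a subsetting operation can return a new object
and leave its input alone, which is what users expect from [.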

H.

>
>
>>
>>
>>
>> On Sat, Oct 11, 2014 at 10:49 AM, Martin Morgan <mtmorgan at fhcrc.org>
>> wrote:
>>
>>> On 10/11/2014 08:41 AM, Vincent Carey wrote:
>>>
>>>> Is there anything on the order of as([GeneSet], "GRanges") around?
>>>>
>>>
>>> no, I don't think so; obviously of use and following a common theme.
>>> Martin
>>>
>>>
>>>
>>>> On Sat, Sep 20, 2014 at 11:34 PM, Gabe Becker <becker.gabe at gene.com>
>>>> wrote:
>>>>
>>>>> Sean and Vincent,
>>>>>
>>>>> The goal of what we are doing builds off of what Martin has in GSEABase.
>>>>> We were looking to see how much benefit we can get with something
>>>>> lighter-weight that lies between indistinguishable character vectors and
>>>>> the full machinery of GeneSets.
>>>>>
>>>>> Either way, it seems like formalizing the semantic information is a way to
>>>>> do what you want. Furthermore, these classed id objects can be created
>>>>> automatically when there is contextual information e.g. during queries to
>>>>> databases (or db-like objects), and then simply added to metadata
>>>>> DataFrames and re-used.
>>>>>
>>>>> ~G
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Sep 20, 2014 at 12:19 PM, Sean Davis <sdavis2 at mail.nih.gov>
>>>>> wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> On Sat, Sep 20, 2014 at 3:11 PM, Gabe Becker <becker.gabe at gene.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey all,
>>>>>>>
>>>>>>> We are in the (very) early stages of experimenting with something that
>>>>>>> seems relevant here: classed identifiers. We are using them for
>>>>>>> database/mart queries, but the same concept could be useful for the cases
>>>>>>> you're describing I think.
>>>>>>>
>>>>>>> E.g.
>>>>>>>
>>>>>>> > mysyms = GeneSymbol(c("BRAF", "BRCA1"))
>>>>>>> > mysyms
>>>>>>> An object of class "GeneSymbol"
>>>>>>> [1] "BRAF"  "BRCA1"
>>>>>>>
>>>>>>> > yourSE[mysyms, ]
>>>>>>> ...
>>>>>>>
>>>>>>>
>>>>>> This approach has the flavor of some of the functionality that Martin put
>>>>>> together for the GSEABase package (EntrezIdentifier, etc.).
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> This approach has the benefit of being declarative instead of heuristic
>>>>>>> (people won't be able to accidentally invoke it), while still giving most
>>>>>>> of the convenience I believe you are looking for.
>>>>>>>
>>>>>>> The object classes inherit directly from character, so should "just work"
>>>>>>> most of the time, but as I said it's early days; lots more testing for
>>>>>>> functionality and usefulness is needed.
>>>>>>>
>>>>>>> ~G
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Sep 20, 2014 at 11:38 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
>>>>>>>
>>>>>>>> OK by me to leave [ alone.  We could start with subsetByEntrez,
>>>>>>>> subsetByKEGG, subsetBySymbol, subsetByGOTERM, subsetByGOID.
>>>>>>>>
>>>>>>>> Utilities to generate GRanges for queries in each of these vocabularies
>>>>>>>> should, perhaps, be in the OrganismDb space?  Once those are in place
>>>>>>>> no additional infrastructure is necessary?
>>>>>>>>
>>>>>>>> On Sat, Sep 20, 2014 at 12:49 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Agreed with Sean, having tried implementing the "magical" alternative
>>>>>>>>>
>>>>>>>>> --t
>>>>>>>>>
>>>>>>>>> On Sep 20, 2014, at 9:31 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, Vince.
>>>>>>>>>>
>>>>>>>>>> I'm coming a little late to the party, but I agree with Kasper's sentiment
>>>>>>>>>> that the less "magical" approach of using subsetByXXX might be the cleaner
>>>>>>>>>> way to go for the time being.
>>>>>>>>>>
>>>>>>>>>> Sean
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 20, 2014 at 10:42 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> https://github.com/vjcitn/biocMultiAssay/blob/master/vignettes/SEresolver.Rnw
>>>>>>>>>>>
>>>>>>>>>>> shows some modifications to [ that allow subsetting of SE by
>>>>>>>>>>> gene or pathway name
>>>>>>>>>>>
>>>>>>>>>>> it may be premature to work at the [ level.  Kasper suggested defining
>>>>>>>>>>> a suite of subsetBy operations that would accomplish this
>>>>>>>>>>>
>>>>>>>>>>> i think we could get something along these lines into the release without
>>>>>>>>>>> too much more work.  votes?
>>>>>>>>>>>
>>>>>>>>>>>          [[alternative HTML version deleted]]
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>>>
>>>>>>> --
>>>>>>> Computational Biologist
>>>>>>> Genentech Research
>>>>>
>>>>> --
>>>>> Computational Biologist
>>>>> Genentech Research
>>>>>
>>>
>>> --
>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N.
>>> PO Box 19024 Seattle, WA 98109
>>>
>>> Location: Arnold Building M1 B861
>>> Phone: (206) 667-2793
>>>
>>
>>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


