[BioC] GSEA using Broad genesets

zrl zrl1974 at gmail.com
Wed Feb 10 23:21:18 CET 2010


Thank you Martin.
If we calculate the statistic for each gene set, it is possible that
several probes map to the same gene. How will GSEABase deal with
the calculation of the statistic for a gene set with multiple probes mapped
to the same gene? (Or maybe this question should be directed at the
"Category" package, since I always use its "gseattperm".)

Thank you again for your detailed explanation and patience.

Qiudao
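
The thread does not settle on a rule for collapsing probes that share a gene. As a minimal sketch in base R, assuming one row per gene symbol is wanted before any per-gene statistic is computed, two common choices are averaging the probes or keeping the most variable one. The expression values, probe names, and gene names below are invented for illustration.

```r
# Toy expression matrix: 5 probes x 4 samples (made-up values).
expr <- matrix(c(1, 2, 3, 4,
                 3, 5, 7, 9,
                 10, 10, 10, 10,
                 2, 8, 2, 8,
                 5, 5, 6, 6),
               nrow = 5, byrow = TRUE,
               dimnames = list(paste0("probe", 1:5), paste0("s", 1:4)))

# Hypothetical probe-to-symbol map; probes 1-2 and probes 3-4 share a gene.
symbol <- c(probe1 = "GENE_A", probe2 = "GENE_A",
            probe3 = "GENE_B", probe4 = "GENE_B",
            probe5 = "GENE_C")

# Option 1: mean of all probes per gene.
# rowsum() adds rows within each group; dividing by group size gives the mean.
sums <- rowsum(expr, group = symbol)
n <- as.vector(table(symbol)[rownames(sums)])
byMean <- sums / n

# Option 2: keep the single most variable probe per gene.
v <- apply(expr, 1, var)
keep <- sapply(split(names(v), symbol[names(v)]),
               function(p) p[which.max(v[p])])
byVar <- expr[unname(keep), , drop = FALSE]
rownames(byVar) <- names(keep)
```

Either result has one row per symbol, so a set's statistic counts each gene once. Which rule is appropriate depends on the analysis; neither is prescribed by GSEABase.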


On Wed, Feb 10, 2010 at 3:05 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 02/10/2010 10:47 AM, zrl wrote:
>> Thank you Martin, these are what I want. I like the second method to
>> create incidence matrix.
>> My last question is in GSEABase when we do this:
>>
>> "gsc <- GeneSetCollection(bcrneg_filt1, setType=KEGGCollection())"
>>
>> how does GSEABase collapse the affy probes to gene symbols?
>> (max,mean,median or not at all)
>
> Remember that the gene set is a collection of symbols; expression
> doesn't have anything to do with its construction. GeneSetCollection()
> uses featureNames(bcrneg_filt1), and then the map between affy probe ids
> and KEGG pathways provided by the relevant Bioconductor annotation
> package, e.g., hgu95av2.db, hgu95av2PATH. The issue that comes up is
> when a probeset id maps to several pathways:
>
>
> > featureNames(sample.ExpressionSet[201,])
> [1] "31440_at"
> > hgu95av2PATH[[featureNames(sample.ExpressionSet)[201]]]
>  [1] "04310" "04520" "04916" "05200" "05210" "05213" "05215" "05216" "05217"
> [10] "05221" "05412"
>
> and then the probeset id 31440_at is assigned to the 11 sets representing
> these different KEGG pathways.
>
> > GeneSetCollection(sample.ExpressionSet[201,], setType=KEGGCollection())
> GeneSetCollection
>  names: 04310, 04520, ..., 05412 (11 total)
>  unique identifiers: 31440_at (1 total)
>  types in collection:
>    geneIdType: AnnotationIdentifier (1 total)
>    collectionType: KEGGCollection (1 total)
>
> Martin
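
The one-probe-to-many-pathways behaviour described above can be imitated in base R. This is only a sketch of the idea, not GSEABase's implementation: the list stands in for an annotation lookup, and the second probe id and all pathway memberships are invented.

```r
# Hypothetical probe-to-pathway map, shaped like the result of an
# annotation lookup; "99999_at" and the memberships are made up.
probe2path <- list(
  "31440_at" = c("04310", "04520", "05200"),
  "99999_at" = c("04310", "05210")
)

# Invert the map: one gene set per pathway, holding the probe ids in it.
# rep() repeats each probe id once per pathway, matching unlist()'s order.
sets <- split(rep(names(probe2path), lengths(probe2path)),
              unlist(probe2path, use.names = FALSE))

# Incidence matrix: rows are pathways, columns are probe set ids.
probes <- names(probe2path)
inc <- t(sapply(sets, function(ids) as.integer(probes %in% ids)))
colnames(inc) <- probes
```

A probe that maps to several pathways simply has a 1 in several rows of its column; no expression values are involved.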
>
>>
>>
>> So, if we use a downloaded database such as ****.symbols.gmt,
>> how should we collapse the probes to symbols?
>>
>> Sorry to bother you so much. Thank you very much.
>>
>> Qiudao
>>
>>
>>
>>
>>
>>
>> On Wed, Feb 10, 2010 at 9:47 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>> On 02/07/2010 03:25 PM, zrl wrote:
>>>> Hi Martin,
>>>>
>>>> Thank you for answering my question. Sorry I didn't make my question clear.
>>>> In the case of "gsc <- GeneSetCollection(bcrneg_filt1,
>>>> setType=KEGGCollection())" and "Am<-incidence(gsc)", we use KEGG as
>>>> the reference to create gene sets for bcrneg_filt1, then create an
>>>> incidence matrix.
>>>>
>>>> My question is: what if I use a downloaded gene set database such as
>>>> "c3.all.v2.5.symbols.gmt" as the reference to create gene sets for the
>>>> ExpressionSet bcrneg_filt1, and then create an incidence matrix? Do I have
>>>> to do this manually (I mean, identify the genes in the eset, then
>>>> match them against c3.all.v2.5.symbols.gmt to create gene sets), or is
>>>> there a direct command for this?
>>>
>>> Hi --
>>>
>>> > c3gsc = getGmt("~/tmp/c3.all.v2.5.symbols.gmt",
>>> +                 geneIdType=SymbolIdentifier())
>>>
>>> It's possible to ask for the intersection of a gene set collection with
>>> specific gene identifiers, so
>>>
>>> > c3gsc & c("DLC1", "FLJ39378")
>>>
>>> so for an Affy array like bcrneg_filt1 a command like
>>>
>>>  library(Biobase)
>>>  library(annotate)   # provides getSYMBOL()
>>>  data(sample.ExpressionSet)
>>>  eset = sample.ExpressionSet[250:300,]
>>>  symbolIds = getSYMBOL(featureNames(eset), annotation(eset))
>>>
>>> gets the gene symbols, and
>>>
>>>  c3gsc1 = c3gsc & symbolIds
>>>
>>> does the subset. But it might be just as easy to
>>>
>>>  m = incidence(c3gsc)
>>>  m1 = m[,colnames(m) %in% symbolIds]
>>>  m1 = m1[rowSums(m1) != 0, ]
>>>
>>> (the & operator alters the names of the gene sets, and keeps empty sets,
>>> so further processing would probably be needed).
>>>
>>> Hope that helps.
>>>
>>> Martin
>>>
>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Feb 7, 2010 at 9:11 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>>> On 02/06/2010 04:05 PM, zrl wrote:
>>>>>> Dear list,
>>>>>>
>>>>>> I have a question regarding using Broad gene sets for GSEA analysis.
>>>>>>
>>>>>> As we know, we have "gsc <- GeneSetCollection(bcrneg_filt1,
>>>>>> setType=KEGGCollection())" and "Am<-incidence(gsc)" to generate
>>>>>> an incidence matrix for further analysis.
>>>>>>
>>>>>> I have learned to get the geneset file from Broad such as: "c3gsc2 <-
>>>>>> getGmt("/path/to/c3.all.v2.5.symbols.gmt",
>>>>>> collectionType=BroadCollection(category="c3"),
>>>>>> geneIdType=SymbolIdentifier())"
>>>>>>
>>>>>> My question is how to use c3gsc2 and bcrneg_filt1 to create a new
>>>>>> incidence matrix. Do I have to do this manually, or is there a
>>>>>> command which can do this?
>>>>>
>>>>> Hi Qiudao
>>>>>
>>>>> bcrneg_filt1 is a subset of an ExpressionSet, and is just another source
>>>>> for creating a gene set collection. Here you're using
>>>>> c3.all.v2.5.symbols.gmt as a source for your gene set collection. The
>>>>> incidence matrix is
>>>>>
>>>>> > m <- incidence(c3gsc2)
>>>>> > class(m)
>>>>> [1] "matrix"
>>>>> > dim(m)
>>>>> [1]   837 15718
>>>>> > m[1:5, 1:5]
>>>>>                        DLC1 FLJ39378 PTGS1 RORC VPRBP
>>>>> RGAGGAARY_V$PU1_Q6         1        1     1    1     1
>>>>> KRCTCNNNNMANAGC_UNKNOWN    0        0     0    0     0
>>>>> AAAYWAACM_V$HFH4_01        0        0     0    0     0
>>>>> YYCATTCAWW_UNKNOWN         0        0     0    0     0
>>>>> CYTAGCAAY_UNKNOWN          0        0     0    0     0
>>>>>
>>>>> with rows as set names and columns as symbols.
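
To make the rows-as-sets, columns-as-symbols layout concrete, here is a base R sketch that builds the same kind of matrix from a GMT file, where each line holds a set name, a description, and then the member symbols, separated by tabs. It is an illustration of the layout, not how getGmt() or incidence() work internally; the set names and memberships are invented.

```r
# Write a tiny GMT file: name, description, then members (tab-separated).
gmt <- tempfile(fileext = ".gmt")
writeLines(c("SET_A\tna\tDLC1\tFLJ39378\tPTGS1",
             "SET_B\tna\tRORC\tDLC1"), gmt)

# Split each line into fields; field 1 is the name, fields 3+ the members.
fields <- strsplit(readLines(gmt), "\t", fixed = TRUE)
sets <- lapply(fields, function(f) f[-(1:2)])
names(sets) <- vapply(fields, `[`, character(1), 1)

# Incidence matrix: 1 where the symbol (column) belongs to the set (row).
symbols <- unique(unlist(sets))
m <- t(vapply(sets, function(s) as.integer(symbols %in% s),
              integer(length(symbols))))
colnames(m) <- symbols
```

Here `dim(m)` is 2 by 4, mirroring the 837 by 15718 shape reported above for the full c3 collection.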
>>>>>
>>>>> Martin
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Qiudao
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>
>>>>> --
>>>>> Martin Morgan
>>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N.
>>>>> PO Box 19024 Seattle, WA 98109
>>>>>
>>>>> Location: Arnold Building M1 B861
>>>>> Phone: (206) 667-2793
>>>>>
>>>
>>>
>
>


