[BioC] GSEA using Broad genesets
Martin Morgan
mtmorgan at fhcrc.org
Wed Feb 10 21:05:34 CET 2010
On 02/10/2010 10:47 AM, zrl wrote:
> Thank you Martin, these are what I want. I like the second method to
> create incidence matrix.
> My last question is in GSEABase when we do this:
>
> "gsc <- GeneSetCollection(bcrneg_filt1, setType=KEGGCollection())"
>
> how does GSEABase collapse the affy probes to gene symbols?
> (max,mean,median or not at all)
Remember that the gene set is a collection of identifiers; expression
values have nothing to do with its construction. GeneSetCollection()
uses featureNames(bcrneg_filt1), and then the map between Affy probe ids
and KEGG pathways provided by the relevant Bioconductor annotation
package, e.g., hgu95av2PATH in hgu95av2.db. The probe ids are kept as
they are (see geneIdType in the output below), so nothing is collapsed.
The issue that comes up is when a probeset id maps to several pathways:
> featureNames(sample.ExpressionSet[201,])
[1] "31440_at"
> hgu95av2PATH[[featureNames(sample.ExpressionSet)[201]]]
[1] "04310" "04520" "04916" "05200" "05210" "05213" "05215" "05216" "05217"
[10] "05221" "05412"
and then the probeset id 31440_at is assigned to the 11 sets representing
these different KEGG pathways.
> GeneSetCollection(sample.ExpressionSet[201,], setType=KEGGCollection())
GeneSetCollection
  names: 04310, 04520, ..., 05412 (11 total)
  unique identifiers: 31440_at (1 total)
  types in collection:
    geneIdType: AnnotationIdentifier (1 total)
    collectionType: KEGGCollection (1 total)
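
As a rough illustration (a sketch under stated assumptions, not
GSEABase's actual implementation), the base R code below builds sets
from a probe-to-pathway list. The id 31440_at and its 11 pathways are
taken from the output above; the second probe, hypo_probe_at, and the
object names probe2path, path2probe and inc are made up for the
example. No expression values appear anywhere, so there is nothing to
collapse: each probe id is simply a member of every set it maps to.

```r
# Toy probe-to-pathway map. "31440_at" and its 11 KEGG ids come from the
# output above; "hypo_probe_at" and its two pathways are hypothetical.
probe2path <- list(
    "31440_at" = c("04310", "04520", "04916", "05200", "05210", "05213",
                   "05215", "05216", "05217", "05221", "05412"),
    "hypo_probe_at" = c("04310", "05200")
)

# Invert the map: one character vector of probe ids per pathway.
probes <- rep(names(probe2path), lengths(probe2path))
path2probe <- split(probes, unlist(probe2path, use.names = FALSE))

# Incidence matrix: rows are sets, columns are probe ids, 1 = membership.
ids <- names(probe2path)
inc <- t(vapply(path2probe, function(p) as.integer(ids %in% p),
                integer(length(ids))))
colnames(inc) <- ids

dim(inc)          # sets by probe ids
inc["04310", ]    # both probes belong to this set
```

With real data the list would come from the annotation package (for
example via mget() on hgu95av2PATH) rather than being typed in by hand.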
Martin
>
>
> So, if we use a downloaded database such as ****.symbols.gmt,
> how should we collapse the probes to symbols?
>
> Sorry to bother you so much. Thank you very much.
>
> Qiudao
>
>
>
>
>
>
> On Wed, Feb 10, 2010 at 9:47 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 02/07/2010 03:25 PM, zrl wrote:
>>> Hi Martin,
>>>
>>> Thank you for answering my question. Sorry I didn't make my question clear.
>>> In the case of "gsc <- GeneSetCollection(bcrneg_filt1,
>>> setType=KEGGCollection())" and "Am<-incidence(gsc)", we use KEGG as
>>> reference to create gene sets of bcrneg_filt1, then create an
>>> incidence matrix.
>>>
>>> My question is what if I use a downloaded gene set database such as
>>> "c3.all.v2.5.symbols.gmt" as reference to create gene sets for the
>>> ExpressionSet bcrneg_filt1, then create an incidence matrix. Do I have
>>> to do this manually? (I mean, identifying the genes in eset, then
>>> matching them against c3.all.v2.5.symbols.gmt to create gene sets) or is
>>> there a direct command doing this?
>>
>> Hi --
>>
>>> c3gsc = getGmt("~/tmp/c3.all.v2.5.symbols.gmt",
>> + geneIdType=SymbolIdentifier())
>>
>> It's possible to ask for the intersection of a gene set collection with
>> specific gene identifiers, so
>>
>>> c3gsc & c("DLC1", "FLJ39378")
>>
>> so for an Affy array like bcrneg_filt1 a command like
>>
>> library(Biobase)
>> library(annotate)  # provides getSYMBOL()
>> data(sample.ExpressionSet)
>> eset = sample.ExpressionSet[250:300,]
>> symbolIds = getSYMBOL(featureNames(eset), annotation(eset))
>>
>> gets the gene symbols, and
>>
>> c3gsc1 = c3gsc & symbolIds
>>
>> does the subset. But it might be just as easy to
>>
>> m = incidence(c3gsc)
>> m1 = m[,colnames(m) %in% symbolIds]
>> m1 = m1[rowSums(m1) != 0, ]
>>
>> (the & operator alters the names of the gene sets, and keeps empty sets,
>> so further processing would probably be needed).
>>
>> Hope that helps.
>>
>> Martin
>>
>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Feb 7, 2010 at 9:11 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>>> On 02/06/2010 04:05 PM, zrl wrote:
>>>>> Dear list,
>>>>>
>>>>> I have a question regarding using Broad gene sets for GSEA analysis.
>>>>>
>>>>> As we know, we have "gsc <- GeneSetCollection(bcrneg_filt1,
>>>>> setType=KEGGCollection())" and "Am<-incidence(gsc)" to generate
>>>>> incidence matrix for further analysis.
>>>>>
>>>>> I have learned to get the geneset file from Broad such as: "c3gsc2 <-
>>>>> getGmt("/path/to/c3.all.v2.5.symbols.gmt",
>>>>> collectionType=BroadCollection(category="c3"),
>>>>> geneIdType=SymbolIdentifier())"
>>>>>
>>>>> My question is how to use c3gsc2 and bcrneg_filt1 to create a new
>>>>> incidence matrix? Do I have to do this manually, or is there a
>>>>> command that can do this?
>>>>
>>>> Hi Qiudao
>>>>
>>>> bcrneg_filt1 is a subset of an ExpressionSet, and is just another source
>>>> for creating a gene set collection. Here you're using
>>>> c3.all.v2.5.symbols.gmt as a source for your gene set collection. The
>>>> incidence matrix is
>>>>
>>>>> m <- incidence(c3gsc2)
>>>>> class(m)
>>>> [1] "matrix"
>>>>> dim(m)
>>>> [1] 837 15718
>>>>> m[1:5, 1:5]
>>>>                         DLC1 FLJ39378 PTGS1 RORC VPRBP
>>>> RGAGGAARY_V$PU1_Q6         1        1     1    1     1
>>>> KRCTCNNNNMANAGC_UNKNOWN    0        0     0    0     0
>>>> AAAYWAACM_V$HFH4_01        0        0     0    0     0
>>>> YYCATTCAWW_UNKNOWN         0        0     0    0     0
>>>> CYTAGCAAY_UNKNOWN          0        0     0    0     0
>>>>
>>>> with rows as set names and columns as symbols.
>>>>
>>>> Martin
>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Qiudao
>>>>>
>>>>
>>>>
>>>> --
>>>> Martin Morgan
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793
>>>>
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793