[Bioc-devel] A geneSet data class for facilitating GSEA

Sean Davis sdavis2 at mail.nih.gov
Wed Mar 14 11:50:39 CET 2007


GSEA, both the specific method and the general concept, is becoming more 
prevalent and important in data analysis.  There have been several mentions 
of including various "gene lists" for use with Category or other methods.  Is 
there interest in making a generic geneSet class for storing such 
information?  (Or does it already exist and I just haven't seen it?)  I bring 
this up because I think it could be quite useful to have a general solution 
for the community (as the eSet class has become).  A class could range from 
something as simple as a vector of Entrez Gene IDs to something more 
complicated (but perhaps a bit more useful for general consumption) like:

identifier: an identifier for the set (perhaps from a public database like 
MSigDB)
title: one-line title
description: free text description
species: The species to which the dataset applies
URL: from where the data were derived
MIAME: class "MIAME" object
protocol: (could be in MIAME, also) description of methods to produce genelist 
from raw data source
idType: what type of ID is stored (Entrez, RefSeq, Ensembl, etc.)?
geneList: vector of IDs
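
To make the field list above concrete, here is a minimal, language-neutral 
sketch written in Python; an actual Bioconductor version would presumably be 
an S4 class instead.  The class name GeneSet, the snake_case field names, the 
"Entrez" default, and the de-duplication step are all assumptions made for 
illustration, and the MIAME slot is stood in for by a plain dict.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GeneSet:
    """Hypothetical container mirroring the proposed slots."""

    identifier: str              # identifier for the set (e.g. from a public database)
    title: str                   # one-line title
    gene_list: List[str]         # vector of IDs
    id_type: str = "Entrez"      # type of ID stored (assumed default)
    description: str = ""        # free-text description
    species: str = ""            # species to which the set applies
    url: str = ""                # where the data were derived from
    miame: Optional[dict] = None # stand-in for a "MIAME" object
    protocol: str = ""           # how the gene list was produced

    def __post_init__(self) -> None:
        # Store IDs as strings and drop duplicates, keeping first-seen order.
        self.gene_list = list(dict.fromkeys(str(g) for g in self.gene_list))

    def __len__(self) -> int:
        # Number of distinct IDs in the set.
        return len(self.gene_list)
```

Keeping the IDs as strings avoids mixing numeric Entrez IDs with text 
accessions in the same slot; whether duplicates should be dropped silently or 
flagged is a design choice left open here.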

A simple wrapper data structure (even just a list) could then be used to 
distribute the geneSets.  Some methods could then be defined for converting 
to an incidence matrix for use by Category, etc.  But I think the most 
important points from above are 1) maintaining some metadata about the 
genelists and 2) standardization to reduce duplicated work.  Individual 
groups would then instantiate the geneSets using whatever means they see fit 
(parsing MSigDB, IPI files, etc.).
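
The conversion to an incidence matrix mentioned above could look roughly like 
the following sketch (Python, for illustration only).  The function name 
incidence_matrix, the dict-of-lists input, and the orientation (one row per 
gene set, one column per gene) are assumptions, not an existing API.

```python
from typing import Dict, List, Sequence, Tuple


def incidence_matrix(
    gene_sets: Dict[str, Sequence[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (set identifiers, gene IDs, 0/1 matrix) for a collection of sets."""
    set_ids = list(gene_sets)
    # Column universe: every ID that appears in at least one set, sorted.
    genes = sorted({g for ids in gene_sets.values() for g in ids})
    column = {g: j for j, g in enumerate(genes)}

    # matrix[i][j] is 1 if gene j belongs to set i, else 0.
    matrix = [[0] * len(genes) for _ in set_ids]
    for i, sid in enumerate(set_ids):
        for g in gene_sets[sid]:
            matrix[i][column[g]] = 1
    return set_ids, genes, matrix
```

For example, two hypothetical sets {"SET_A": ["10", "20"], "SET_B": ["20", 
"30"]} would yield the rows [1, 1, 0] and [0, 1, 1] over the columns "10", 
"20", "30".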

Any thoughts?

Sean


