[BioC] How map probeset_id to gene_symbols or other annotation information?

Mon Aug 10 18:52:45 CEST 2009

Hi Peng,

There is in fact a lot of documentation inside each package if you
know how to look for it.  One form is the manual pages.  Once the
package is loaded, you can list the objects they document like this:

ls("package:mogene10stprobeset.db")

You can then read the manual page for an object by typing ? followed
by its name, for example:

?mogene10stprobesetENTREZID
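
Once you have found a mapping object in that listing, using it might
look something like the sketch below.  It assumes the package is
installed and that a SYMBOL map (mogene10stprobesetSYMBOL) is among the
objects that ls() reports; mappedkeys(), mget() and toTable() come from
AnnotationDbi.  Treat it as a starting point rather than the one right
way to do it.

```r
# Guarded so that the snippet does nothing if the package is not installed.
if (require("mogene10stprobeset.db", character.only = TRUE)) {
  # Probeset IDs that have at least one gene symbol mapped to them
  ids <- mappedkeys(mogene10stprobesetSYMBOL)

  # Look up the symbols for the first few probesets (returns a named list)
  symbols <- mget(head(ids), mogene10stprobesetSYMBOL, ifnotfound = NA)

  # Or pull the whole probeset-to-symbol map into a data frame
  tab <- toTable(mogene10stprobesetSYMBOL)
  head(tab)
}
```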

Finally, almost every Bioconductor package has some sort of vignette
associated with it.  In the case of the annotation packages, there
are three vignettes loaded with AnnotationDbi (which will always be
loaded before any annotation package, so they will always be there if
you look).  You can open a vignette with the openVignette() command
like this:

openVignette()

And then just pick the number of the vignette that you would like to
read.  A vignette gives a much more comprehensive overview of the
purpose of the package, with even more examples than the manual
pages.  Both of these resources are critical if you want to be able to
use R.  I would recommend that you look at them in addition to reading
the R user manual that was mentioned before.

With respect to the annotation packages, they are not simply a repeat of
what is in the csv files from Affymetrix.  In fact, we don't even know
where Affymetrix gets the data in those files, nor do we use most of it
in building the annotation packages.  Instead we go directly to the
source whenever possible and get most of our information from places
like NCBI and the EBI.  The only information that we get from Affymetrix
is the basic probe to gene mapping data (in the form of probe to Entrez
Gene, GenBank accession etc.), which we then map onto the information
from primary sources such as NCBI in order to tie the other data to the
probes.  You are of course free to use whichever information source you
prefer, but please be advised that they are probably not equivalent.
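
If you do go with the csv file, the lookup itself can be sketched in
base R as below.  The two data frames are tiny made-up stand-ins for
what read.csv() and read.table() would return; the column names
probeset_id and gene_symbol and all of the values are assumptions for
illustration, not the real layout of the Affymetrix file.

```r
# Hypothetical stand-in for the annotation table read with read.csv()
annot <- data.frame(
  probeset_id = c("10338001", "10338002", "10338003"),
  gene_symbol = c("GeneA", "GeneB", NA),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for read.table("genes.txt"): IDs in the first column
genes <- data.frame(
  V1 = c("10338003", "10338001", "10399999"),
  V2 = c(7.1, 8.4, 5.9),
  stringsAsFactors = FALSE
)

# match() gives, for each ID in genes, its row in annot (NA if it is absent)
idx <- match(genes$V1, annot$probeset_id)

# Use those row positions to pull the symbol for each ID
genes$symbol <- annot$gene_symbol[idx]
genes
```

IDs with no entry in the annotation table, or with no symbol recorded,
come back as NA, so it is worth counting those before going further.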

  Marc

Peng Yu wrote:
> On Sun, Aug 9, 2009 at 4:46 PM, Kasper Daniel Hansen <khansen at stat.berkeley.edu> wrote:
>   
>> On Aug 9, 2009, at 13:06 , Peng Yu wrote:
>>
>>     
>>> On Sun, Aug 9, 2009 at 12:03 PM, Sean Davis<seandavi at gmail.com> wrote:
>>>       
>>>> Hi, Peng.
>>>>
>>>> I don't mean to sound rude, but everyone on this list is quite busy.  You
>>>> will need to make time to do some of your own research, unfortunately.
>>>> As an exercise and an answer to your question, check out the Table of
>>>> Contents of the R Data Import/Export.  If there is still a question about
>>>> what section is most appropriate, feel free to post back to the list the
>>>> code you have tried, any error messages, and the output of sessionInfo().
>>>> And, yes, you will benefit from at least skimming the entire manual--you
>>>> will learn quite a bit.
>>>>         
>>> Hi Sean,
>>>
>>> I have been skimming the manual. One thing I am not sure about is
>>> whether I should spend a few days learning all the material you
>>> mentioned when I could use some other language that I am more
>>> familiar with and solve the problem quickly. I would like to answer my
>>> question today if possible. However, I completely understand that I
>>> should read all the manuals that you mentioned in the long run.
>>>
>>> I have thought of using perl to solve my problem. But I think that it
>>> is still better to figure out a way to do so in R as well. The code in
>>> perl would not be long, so I think the code in R would not be long,
>>> either. It doesn't seem that it would take an experienced R user a
>>> long time to figure out the R commands to map all the probeset_id to
>>> gene names or ensembl ids, does it?
>>>
>>> I know that I could use
>>> read.csv("MoGene-1_0-st-v1.na29.mm9.probeset.csv") to read the file,
>>> which gives a data frame. But how to extract the useful columns from
>>> the data frame? How to construct a mapping between the entry in one
>>> column to the entry in another column? I should use
>>> read.table("genes.txt") to read "genes.txt", right? How to replace its
>>> first column with the appropriate gene names or ensembl id using the
>>> mapping?
>>>
>>> It seems that MoGene-1_0-st-v1.na29.mm9.probeset.csv should have
>>> enough annotation information for my problem. Why do I need
>>> "mogene10stprobeset.db"?
>>>       
>> Peng,
>>
>> Let me quote Wolfgang Huber: "the purpose of this mailing list is not for
>> other people to do your homework for you".  I don't think anyone is very
>> inclined to help you, if you don't spend some time yourself reading about
>> the language.  Some of the questions you ask above are stuff you ought to
>> know after spending 10 minutes with "An introduction to R".
>>
>> I believe in using the right tools for the job, and if you think you can do
>> your stuff in a few hours using Perl, I think you should use Perl.  If you
>> want access to some of the powers and time saving features of R, you need to
>> devote some time to learning it.  But you cannot expect to do even simple
>> stuff in a new language without spending some initial time on it.
>>     
>
> Hi Kasper
>
> I don't think that I want somebody to do the homework for me. One
> thing that frustrates me about reading R documentation is that the
> useful information is often scattered in different places, which makes
> it hard for a new user to piece together. One example is
> mogene10stprobeset.db, whose documentation doesn't mention AnnotationDbi.
> I feel that learning from examples, complemented by reading the R
> documentation, is a more efficient way.
>
> BTW, do you know why "mogene10stprobeset.db" is needed if I have
> MoGene-1_0-st-v1.na29.mm9.probeset.csv already?
>
> Regards,
> Peng