[BioC] biomaRt manual
Steffen Durinck
durincks at mail.nih.gov
Thu Mar 29 18:03:23 CEST 2007
Hi Weiwei,
There are duplicates because Ensembl maps everything at the transcript
level. If you add ensembl_transcript_id to your query, you will get a
better understanding of this. For example:
getBM(attributes=c("affy_hg_u95a", "entrezgene","ensembl_transcript_id"), filters="affy_hg_u95a", values="32864_at", mart=human)
gives:
  affy_hg_u95a entrezgene ensembl_transcript_id
1     32864_at       6736       ENST00000383070
2     32864_at       6736       ENST00000327563
That is, the same Entrez Gene ID is returned once for each transcript
identifier it maps to. Note that if
you have questions about the content of the Ensembl database/webservice,
you can also contact them directly at helpdesk at ensembl.org. The biomaRt
package only provides an interface between their webservices and R, and
as such we have little control over the data their webservice returns.
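
If you only need one row per probeset/gene pair, one possible approach is to drop the transcript column and call unique() on what is left. The snippet below is only a sketch: `res` is a hand-built data frame that mimics the output above (it is not a live query), and `probe2gene` is just an illustrative name.

```r
# Toy data frame mimicking the getBM() result shown above
# (built by hand so that no connection to Ensembl is needed).
res <- data.frame(
  affy_hg_u95a          = c("32864_at", "32864_at"),
  entrezgene            = c(6736, 6736),
  ensembl_transcript_id = c("ENST00000383070", "ENST00000327563"),
  stringsAsFactors      = FALSE
)

# Keep only the probeset and gene columns, then collapse identical rows.
probe2gene <- unique(res[, c("affy_hg_u95a", "entrezgene")])

print(probe2gene)
```

The same unique() call should work on the two-column result of your original query, since the repeated rows there are exact duplicates.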
Best,
Steffen
Weiwei Shi wrote:
> Here is another question:
>
>> length(unique(ids2))
>>
> [1] 12558
>
>> length(ids2)
>>
> [1] 12558
>
>> head(ids2)
>>
> [1] "31307_at" "31308_at" "31309_r_at" "31310_at" "31311_at"
> [6] "31312_at"
>
>> t1 <- getBM(attributes=c("affy_hg_u95a", "entrezgene"), filters="affy_hg_u95a", values=(ids2), mart=human)
>> dim(t1)
>>
> [1] 26360 2
>
>> t1[1:20,]
>>
>    affy_hg_u95a entrezgene
>  1     32864_at       6736
>  2     32864_at       6736
>  3     41214_at       6192
>  4     41214_at       6192
>  5     31534_at       7544
>  6     31534_at       7544
>  7     36367_at      83259
>  8     36367_at      83259
>  9     36367_at      83259
> 10     36367_at      83259
> 11      1199_at         NA
> 12   35929_s_at      64591
> 13   35929_s_at      64591
> 14   35929_s_at         NA
>
> Please look at rows 12-14.
> Why are there so many duplicates? Why are rows 12-14 inconsistent
> with each other?
>
> Thanks to all you "hardworking" people for the prompt replies so far.
> I am in China at the moment, and it should be about 6am in the US.
>
> Cheers,
>
> Weiwei
>
>
>
> On 3/29/07, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>> On Thursday 29 March 2007 07:28, James W. MacDonald wrote:
>>
>>> Hi Weiwei,
>>>
>>> Weiwei Shi wrote:
>>>
>>>> Sorry :) When I was composing the following email, I did not realize
>>>> there were already a couple of replies. I read the manual carefully, but I
>>>> still have some questions, like this one:
>>>>
>>>> For example,
>>>>
>>>>
>>>>> getBM(attributes=c("affy_hg_u95a", "entrezgene"), filters="affy_hg_u95a",
>>>>> values=head(ids2), mart=human)
>>>>>
>>>> affy_hg_u95a entrezgene
>>>> 1 31308_at NA
>>>> 2 31310_at 2741
>>>> 3 31312_at 9312
>>>>
>>>>
>>>>> head(ids2)
>>>>>
>>>> [1] "31307_at" "31308_at" "31309_r_at" "31310_at" "31311_at"
>>>> [6] "31312_at"
>>>>
>>>>
>>>>> getBM(attributes=c("affy_hg_u95a", "entrezgene"), filters="affy_hg_u95a",
>>>>> values="31307_at", mart=human)
>>>>>
>>>> NULL
>>>>
>>>> I am confused by "NULL" and "NA". What is the difference between
>>>> them?
>>>>
>>> Steffen Durinck will know better, but I believe NULL means that Ensembl
>>> doesn't think that probeset maps to anything (e.g., there is nothing
>>> available), and NA means that there is no Entrez Gene ID for that probeset.
>>>
>>> For instance, if you pull the Entrez Gene ID for 31307_at from the
>>> hgu95aENTREZID environment, it lists 9594, but if you search Entrez Gene
>>> for that ID it says it has been discontinued.
>>>
>>>
>>>> Another question is how to make more than 8000 queries run faster. I have
>>>> read some of the previous posts on this.
>>>>
>> Make sure that you really need to make 8000 queries. It is much faster to
>> make one or a few large queries than to make many small ones.
>>
>> Sean
>>
>>
>
>
>
--
Steffen Durinck, Ph.D.
Oncogenomics Section
Pediatric Oncology Branch
National Cancer Institute, National Institutes of Health
URL: http://home.ccr.cancer.gov/oncology/oncogenomics/
Phone: 301-402-8103
Address:
Advanced Technology Center,
8717 Grovemont Circle
Gaithersburg, MD 20877