[BioC] from using biomaRt and r10kcod

Tue May 15 07:14:08 CEST 2007

Hi Weiwei and James,

(Sorry Weiwei, the first time I sent this email only to you, although
I meant to send it to the list as well.)

On May 15, 2007, at 5:29 AM, Weiwei Shi wrote:
> Hi, there:
>
> I happened to re-address this question of codelink probe id to human
> entrezgene id. I describe my question using an example:
>
> by using r10kcod package, you can find probe "GE16490" mapped to
> "502674", which I assume it is rat entrezgene id. However, when I use
> biomaRt to convert all rat entrezgene id in this array to human ones,
> I found the following maps involving 502674:
>
>          id MappedID rat.count human.count
> 4167 296197    11034         1           2
> 7021 502674    11034         1           2
>

I'm not too familiar with the biomaRt package, but I guess what this
result is telling you is that you have two rat Entrez Gene IDs,
296197 and 502674 (each appearing only once), which map to one human
Entrez Gene ID, 11034 (appearing twice, once for each rat ID).
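
If that reading is right, the two count columns can be reproduced by
tabulating each ID column. Below is a minimal sketch in base R. The
data frame name `map` and the way it is built here are my own
assumptions for illustration; this is not biomaRt output, and I do not
know how the original `t0` table was actually constructed.

```r
# Hypothetical two-column mapping (rat Entrez ID -> human Entrez ID),
# typed in by hand from the two rows quoted above.
map <- data.frame(
  id       = c("296197", "502674"),
  MappedID = c("11034", "11034"),
  stringsAsFactors = FALSE
)

# Number of rows in which each rat ID and each human ID appears.
rat.n   <- table(map$id)
human.n <- table(map$MappedID)

# Attach the counts to every row, mirroring the posted columns.
map$rat.count   <- as.integer(rat.n[map$id])
map$human.count <- as.integer(human.n[map$MappedID])
map
```

Under this assumption, rat.count is simply how often a rat ID occurs in
the table and human.count how often the human ID occurs.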

> so, basically, 296197, 502674 and 11034 are all associated with
> protein "destrin". To be accurate, 296197 is a rat protein which is
> similar to destrin.
>
> However, as shown in
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=gene
> , the other two (11034 and *502674*) are human ids (if I am wrong
> here, please correct me).
>

Well, for me, searching for 502674 in Entrez Gene brings up a link to
the rat destrin gene:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=search&term=502674

Clicking on this entry, I can see the information about the Dstn
(destrin) gene. At the bottom of the page there are mappings to
different sequences (Related sequences). One is CB785830.1 and the
other is CF111187.1. The latter is the one used in r10kcod to map from
Codelink probe to GenBank,

GE16490 -> CF111187.1

and then this is used to map to Entrez Gene, if I understand how
AnnBuilder works (which may not be the case). Of course, I also use
the mappings provided by the manufacturer from probe IDs to Entrez
Gene and UniGene, but for this particular probe there is no such
mapping in the current files (last updated March 31, 2006, so they
are pretty old).
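
The two-step lookup described above can be sketched in base R with
plain named vectors. The vector names `probe2genbank` and
`genbank2entrez` are hypothetical and hold only the single entry
discussed in this thread; the real annotation package stores and
exposes these mappings differently.

```r
# Hypothetical lookup tables as named character vectors, filled in by
# hand with the one probe discussed here.
probe2genbank  <- c(GE16490 = "CF111187.1")
genbank2entrez <- c("CF111187.1" = "502674")

# Step 1: probe -> GenBank accession.
probe <- "GE16490"
acc   <- probe2genbank[[probe]]

# Step 2: GenBank accession -> Entrez Gene ID.
entrez <- genbank2entrez[[acc]]
entrez
```

The point is only that the Entrez Gene ID for a probe depends on which
GenBank accession the probe is tied to in the first step.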

In fact, those files also include information about homologues in the
other two organisms (among human, mouse and rat). For the human probes
that map to Entrez Gene 11034, the files list rat Entrez Gene 502674
as the homologue, in agreement with the biomaRt results.

> so my questions are:
>
> 1. whether 502674 is a rat entrezgene id or human one?
>

I would definitely say that it is a rat id.

> 2. r10kcod is wrong or ncbi is wrong or my understanding is wrong (i
> assume the last one :)
>

Neither is wrong from my point of view, but let's first check whether
we are seeing the same thing when we look up 502674 in Entrez Gene.

> 3. i found many many-2-many maps in this process of rat to human
> entrezgene ids. Like the following:
>
>> t0[t0[,1]== 396527,]
>>
>          id MappedID rat.count human.count
> 6608 396527    54576         9           4
> 6609 396527    54575         9           4
> 6610 396527    54600         9           4
> 6611 396527    54577         9           4
> 6612 396527    54578         9           4
> 6613 396527    54579         9           4
> 6614 396527    54657         9           4
> 6615 396527    54659         9           4
> 6616 396527    54658         9           4
>
>> t0[t0[,2]== 54576,]
>>
>          id MappedID rat.count human.count
> 2494 113992    54576         9           4
> 6608 396527    54576         9           4
> 6617 396551    54576         9           4
> 6626 396552    54576         9           4
>
>> t0[t0[,2]== 54577,]
>>
>          id MappedID rat.count human.count
> 2497 113992    54577         9           4
> 6611 396527    54577         9           4
> 6620 396551    54577         9           4
> 6629 396552    54577         9           4
>
> so, basically all the ids are related to different polypeptides
> associated with UDP glucuronosyltransferase 1 family. Are there some
> other situations causing this many2many mappings?
>
>

As for this, James has already answered (thanks for that). The probes
are 30 base pairs long, so it is not strange but, on the contrary,
very common to find probes mapping to multiple genes, which may have
related or unrelated functions. It is less common in the Codelink
arrays to have multiple probes mapping to the same gene, but again,
you can have multiple probes mapping to different GenBank IDs that
correspond to the same Entrez Gene identifier. The fact that
paralogous, orthologous and sometimes even unrelated sequences can
share the same 30 base pair oligonucleotide makes this a very complex
problem with no easy solution.
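
To get a feeling for how widespread these cases are in a given table,
each row can be flagged by whether its rat ID and its human ID have
more than one partner. This is a rough sketch in base R, again with a
hypothetical data frame `map` typed in from a subset of the rows quoted
in this thread. The `type` labels are my own row-level shorthand, not
anything produced by biomaRt.

```r
# Rows typed in from the tables quoted in this thread (a subset only).
map <- data.frame(
  id = c("296197", "502674",
         rep("396527", 3), rep("113992", 2)),
  MappedID = c("11034", "11034",
               "54576", "54577", "54578", "54576", "54577"),
  stringsAsFactors = FALSE
)

# How many distinct partners does each ID have on the other side?
n.human <- tapply(map$MappedID, map$id, function(x) length(unique(x)))
n.rat   <- tapply(map$id, map$MappedID, function(x) length(unique(x)))

# Per-row flags: does the rat ID map to several human IDs, and is the
# human ID mapped from several rat IDs?
rat.multi   <- as.vector(n.human[map$id]) > 1
human.multi <- as.vector(n.rat[map$MappedID]) > 1

map$type <- ifelse(rat.multi & human.multi, "many-to-many",
            ifelse(rat.multi,   "one-to-many",
            ifelse(human.multi, "many-to-one", "one-to-one")))
table(map$type)
```

Because only a subset of rows is used, the tallies will differ from the
counts in the posted table; running the same steps on the full mapping
would show how many rows fall into each class.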

Regards,

Diego.

-----------------------------------------------
  Diego Diez, PhD.

  Bioknowledge systems, Kanehisa lab.
  Bioinformatics center,
  Institute for Chemical Research,
  Kyoto University.
  Gokasho, Uji, Kyoto 611-0011 JAPAN.

  e-mail:  diez at kuicr.kyoto-u.ac.jp
  url:     http://web.kuicr.kyoto-u.ac.jp/~diez
  tlf:     +81-774-38-3296
  fax:     +81-774-38-3269
-----------------------------------------------

> Sorry for the long questions,
>
> Regards,
>
> -- 
> Weiwei Shi, Ph.D
> Research Scientist
> GeneGO, Inc.
>
> "Did you always know?"
> "No, I did not. But I believed..."
> ---Matrix III