[BioC] R: how to find the VALIDATED pair (miRNA, gene-3'UTR-sequence)

Thu Jun 25 15:56:27 CEST 2009

Hi Maura,

On Jun 25, 2009, at 7:01 AM, <mauede at alice.it> <mauede at alice.it> wrote:

> Thank you very much.
> Now I have to push my inquiry a little bit further ... sorry for  
> being pedantic.
> Do you know, or can you help me find out, the correspondence naming
> convention between the BioMart databases and miRecords and TarBase?
> Thanks to your help I learnt how to find the association between  
> miRNA and gene-3UTR region. For instance:
>
> Similarity	hsa-miR-130a	miRanda	miRNA_target	2	120825363	120825385	+	.	16.5359	1.687830e-02	ENST00000295228	INHBB
>
<snip>
>
> 14697198	Homo sapiens	human	MCSF	NM_000757.3	2	Homo sapiens	hsa-miR-130a
> 14697198	Homo sapiens	human	MCSF	NM_000757.3	2	Homo sapiens	hsa-miR-130a
> 16549775	Homo sapiens	human	MAFB	NM_005461.3		Homo sapiens	hsa-miR-130a
> 17957028	Homo sapiens	human	GAX	NM_005924.4		Homo sapiens	hsa-miR-130a
> 17957028	Homo sapiens	human	GAX	NM_005924.4		Homo sapiens	hsa-miR-130a
> 17957028	Homo sapiens	human	GAX	NM_005924.4		Homo sapiens	hsa-miR-130a
> 17957028	Homo sapiens	human	HOXA5	NM_019102.2		Homo sapiens	hsa-miR-130a
>
> It looks like the miRNA naming convention is the same for the BioMart and
> miRecords databases.
> My problem is the apparently different gene naming convention.
> How can I map the gene identifiers used in the BioMart databases to the
> gene identifiers used in miRecords?
> Without such *hopefully* 1-1 mapping function I cannot use the  
> information across databases.

The gene IDs from your first result (eg: ENSTXXXX) are Ensembl
transcript IDs. The IDs used in your second example (eg: NM_000757.3,
etc) are RefSeq IDs. The .X suffix in NM_*.3, NM_*.4, etc is the
version number, so the actual RefSeq accession number for
NM_000757.3 is NM_000757. Make sense?
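
In case it's useful, here is a minimal sketch of stripping the version suffix in R before querying. The `versioned` vector is just the accessions from your miRecords rows typed in by hand, and the variable names are my own:

```r
# Versioned RefSeq IDs as they appear in the miRecords rows (typed in by hand)
versioned <- c("NM_000757.3", "NM_000757.3", "NM_005461.3", "NM_005924.4",
               "NM_005924.4", "NM_005924.4", "NM_019102.2")

# Drop a trailing ".<digits>" version suffix, leaving the bare accession
accessions <- sub("\\.[0-9]+$", "", versioned)

# Repeated accessions add nothing to a lookup, so keep each one once
unique.accessions <- unique(accessions)
```

IDs that carry no version suffix pass through `sub()` unchanged, so it should be safe to run on a mixed list.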

OK, now that we know that, you can use biomaRt (once again!) to
create yourself a map of RefSeq <--> Ensembl transcript IDs. I don't
think you'll get an exact 1-1 mapping as you'd like (ID mapping is
rarely that easy, but you might get lucky), so you'll probably need
some further processing, but look here:

R> library(biomaRt)
R> hmart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
R> refseqs <- c("NM_000757", "NM_000757", "NM_005461", "NM_005924",
                "NM_005924", "NM_005924", "NM_019102")
R> gene.map <- getBM(attributes=c('hgnc_symbol', 'ensembl_gene_id',
                                   'ensembl_transcript_id', 'refseq_dna'),
                      filters='refseq_dna', values=refseqs, mart=hmart)

R> gene.map
   hgnc_symbol ensembl_gene_id ensembl_transcript_id refseq_dna
1        CSF1 ENSG00000184371       ENST00000369802  NM_000757
2        MAFB ENSG00000204103       ENST00000396967  NM_005461
3       MEOX2 ENSG00000106511       ENST00000262041  NM_005924
4       HOXA5 ENSG00000106004       ENST00000222726  NM_019102

That should get you pretty close to where you want to be.
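
To see whether you got lucky with the 1-1 mapping, one option is to count how many Ensembl transcripts each RefSeq accession maps to. This is only a sketch: the data frame below is retyped from the output above so it runs without querying BioMart, and `n.tx`, `ambiguous` and `one.to.one` are names I made up. It also only checks the RefSeq -> transcript direction.

```r
# The mapping table, retyped from the getBM() output above
gene.map <- data.frame(
  hgnc_symbol           = c("CSF1", "MAFB", "MEOX2", "HOXA5"),
  ensembl_gene_id       = c("ENSG00000184371", "ENSG00000204103",
                            "ENSG00000106511", "ENSG00000106004"),
  ensembl_transcript_id = c("ENST00000369802", "ENST00000396967",
                            "ENST00000262041", "ENST00000222726"),
  refseq_dna            = c("NM_000757", "NM_005461",
                            "NM_005924", "NM_019102"),
  stringsAsFactors = FALSE
)

# Count rows (i.e. Ensembl transcripts) per RefSeq accession
n.tx <- table(gene.map$refseq_dna)

# Accessions that hit more than one transcript need a closer look
ambiguous <- names(n.tx)[n.tx > 1]

# Keep only the rows whose RefSeq accession maps to a single transcript
one.to.one <- gene.map[!(gene.map$refseq_dna %in% ambiguous), ]
```

Note that the gene names differ between the sources as well (miRecords lists MCSF and GAX where the HGNC symbols are CSF1 and MEOX2), so joining on the RefSeq accession is probably safer than joining on the gene name.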

Hope that helps,
-steve

--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University

Contact Info: http://cbio.mskcc.org/~lianos/contact