[BioC] R: how to find the VALIDATED pair (miRNA, gene-3'UTR-sequence)
Steve Lianoglou
mailinglist.honeypot at gmail.com
Thu Jun 25 15:56:27 CEST 2009
Hi Maura,
On Jun 25, 2009, at 7:01 AM, <mauede at alice.it> <mauede at alice.it> wrote:
> Thank you very much.
> Now I have to push my inquiry a little bit further ... sorry for
> being pedantic.
> Do you know, or can you help me find out, the correspondence naming
> convention between the BioMart databases and miRecords and TarBase ?
> Thanks to your help I learnt how to find the association between
> miRNA and gene-3UTR region. For instance:
>
> Similarity hsa-miR-130a miRanda miRNA_target 2 120825363 120825385
> + . 16.5359 1.687830e-02 ENST00000295228 INHBB
>
<snip>
>
> 14697198 Homo sapiens human MCSF NM_000757.3 2 Homo sapiens hsa-miR-130a
> 14697198 Homo sapiens human MCSF NM_000757.3 2 Homo sapiens hsa-miR-130a
> 16549775 Homo sapiens human MAFB NM_005461.3 Homo sapiens hsa-miR-130a
> 17957028 Homo sapiens human GAX NM_005924.4 Homo sapiens hsa-miR-130a
> 17957028 Homo sapiens human GAX NM_005924.4 Homo sapiens hsa-miR-130a
> 17957028 Homo sapiens human GAX NM_005924.4 Homo sapiens hsa-miR-130a
> 17957028 Homo sapiens human HOXA5 NM_019102.2 Homo sapiens hsa-miR-130a
>
> It looks like the miRNA naming convention is the same for the BioMart
> and miRecords databases.
> My problem is the apparently different gene naming convention.
> How can I map the gene identifier used in BioMart databases to the
> gene identifiers used in miRecords ?
> Without such *hopefully* 1-1 mapping function I cannot use the
> information across databases.
The gene IDs from your first result (e.g. ENSTXXXX) are Ensembl
transcript IDs. The IDs used in your second example (e.g. NM_000757.3)
are RefSeq IDs. The .X suffix in NM_*.3, NM_*.4, etc. is a version
number, so the actual RefSeq accession for NM_000757.3 is NM_000757.
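Stripping the version suffix is a one-line regular expression. A minimal sketch, assuming the versioned IDs sit in a character vector (the names `mirecords.ids` and `strip.version` are made up for illustration):

```r
# Hypothetical vector of versioned RefSeq accessions, as listed by miRecords
mirecords.ids <- c("NM_000757.3", "NM_000757.3", "NM_005461.3",
                   "NM_005924.4", "NM_019102.2")

# Drop a trailing ".<digits>" version suffix; IDs without one are unchanged
strip.version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Unversioned, de-duplicated accessions, ready to use as a biomaRt filter
refseqs <- unique(strip.version(mirecords.ids))
print(refseqs)
```

The pattern is anchored at the end of the string, so only the final ".N" is removed and the rest of the accession is left alone.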
OK, now that we know that, you can use biomaRt (once again!) to
build yourself a map of RefSeq <--> Ensembl transcript IDs. I don't
think you'll get an exact 1-1 mapping as you'd like (ID mapping is
rarely that easy, but you might get lucky), so you'll probably need
some further processing, but look here:
R> library(biomaRt)
R> hmart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
R> refseqs <- c("NM_000757", "NM_000757", "NM_005461", "NM_005924",
                "NM_005924", "NM_005924", "NM_019102")
R> gene.map <- getBM(attributes=c('hgnc_symbol', 'ensembl_gene_id',
                                  'ensembl_transcript_id', 'refseq_dna'),
                     filters='refseq_dna', values=refseqs, mart=hmart)
R> gene.map
  hgnc_symbol ensembl_gene_id ensembl_transcript_id refseq_dna
1        CSF1 ENSG00000184371       ENST00000369802  NM_000757
2        MAFB ENSG00000204103       ENST00000396967  NM_005461
3       MEOX2 ENSG00000106511       ENST00000262041  NM_005924
4       HOXA5 ENSG00000106004       ENST00000222726  NM_019102
That should get you pretty close to where you want to be.
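For the "further processing" part, here is a rough sketch of how you might check whether the mapping really is 1-1 and then attach the Ensembl IDs to the miRecords rows. The `gene.map` data frame is typed in by hand from the output above so the example runs offline; the `mirecords` data frame and its column names are made up for illustration and only hold the distinct gene/RefSeq pairs from your listing.

```r
# Mapping copied by hand from the getBM() output above (no network needed)
gene.map <- data.frame(
  hgnc_symbol = c("CSF1", "MAFB", "MEOX2", "HOXA5"),
  ensembl_gene_id = c("ENSG00000184371", "ENSG00000204103",
                      "ENSG00000106511", "ENSG00000106004"),
  ensembl_transcript_id = c("ENST00000369802", "ENST00000396967",
                            "ENST00000262041", "ENST00000222726"),
  refseq_dna = c("NM_000757", "NM_005461", "NM_005924", "NM_019102"),
  stringsAsFactors = FALSE
)

# Hypothetical table of distinct miRecords entries (column names are assumed)
mirecords <- data.frame(
  mirecords_gene = c("MCSF", "MAFB", "GAX", "HOXA5"),
  refseq_versioned = c("NM_000757.3", "NM_005461.3",
                       "NM_005924.4", "NM_019102.2"),
  mirna = "hsa-miR-130a",
  stringsAsFactors = FALSE
)

# Join key: RefSeq accession without the trailing version number
mirecords$refseq_dna <- sub("\\.[0-9]+$", "", mirecords$refseq_versioned)

# 1-1 means each accession maps to one transcript and vice versa
one.to.one <- !any(duplicated(gene.map$refseq_dna)) &&
  !any(duplicated(gene.map$ensembl_transcript_id))

# Left join keeps miRecords rows even when no Ensembl match was found
merged <- merge(mirecords, gene.map, by = "refseq_dna", all.x = TRUE)

# Accessions that did not map to any Ensembl transcript
unmapped <- merged$refseq_dna[is.na(merged$ensembl_transcript_id)]

print(one.to.one)
print(merged[, c("mirecords_gene", "hgnc_symbol", "ensembl_transcript_id")])
```

Note that the gene names differ between the two sources (MCSF vs. CSF1, GAX vs. MEOX2), which is why joining on the accession is safer than joining on the symbol. With a larger ID list you should expect `one.to.one` to come back FALSE and `unmapped` to be non-empty, and will have to decide how to handle those cases.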
Hope that helps,
-steve
--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact