[BioC] RefSeq coordinates from biomaRt
Dave Tang
davetingpongtang at gmail.com
Mon Nov 25 09:47:23 CET 2013

Hello,
I've been using biomaRt to fetch the genomic coordinates of RefSeq
transcripts (perhaps incorrectly?). The coordinates it returns don't match
those shown in the UCSC Genome Browser, where NM_033453 is at
chr20:3190006-3204516:
library("biomaRt")
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
getBM(attributes=c('refseq_mrna','chromosome_name','start_position','end_position','strand'),
filters = 'refseq_mrna', values = 'NM_033453', mart = ensembl)
     refseq_mrna chromosome_name start_position end_position strand
1   NM_033453              20        3189514      3204516      1
The coordinates seem to match this Ensembl transcript (ENST00000483354)  
instead:
getBM(attributes=c('ensembl_transcript_id','chromosome_name','start_position','end_position','strand'),
filters = 'ensembl_transcript_id', values = 'ENST00000483354', mart = ensembl)
     ensembl_transcript_id chromosome_name start_position end_position strand
1       ENST00000483354              20        3189514      3204516      1
Here's another RefSeq model, NM_181493, which should be mapped to  
chr20:3190134-3204516:
getBM(attributes=c('refseq_mrna','chromosome_name','start_position','end_position','strand'),
filters = 'refseq_mrna', values = 'NM_181493', mart = ensembl)
     refseq_mrna chromosome_name start_position end_position strand
1   NM_181493              20        3189514      3204516      1
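To put the two cases side by side, here is a small sketch that uses only the numbers quoted in this message (nothing is fetched from a server). The data frame and its column names (cmp, ucsc_start, biomart_start and so on) are just labels I made up for illustration:

```r
# Coordinates copied by hand from this message; no network access is needed.
cmp <- data.frame(
  refseq_mrna   = c("NM_033453", "NM_181493"),
  ucsc_start    = c(3190006, 3190134),   # UCSC Genome Browser positions
  ucsc_end      = c(3204516, 3204516),
  biomart_start = c(3189514, 3189514),   # start_position from getBM()
  biomart_end   = c(3204516, 3204516),   # end_position from getBM()
  stringsAsFactors = FALSE
)

# Difference between the UCSC and biomaRt coordinates for each RefSeq ID.
cmp$start_offset <- cmp$ucsc_start - cmp$biomart_start
cmp$end_offset   <- cmp$ucsc_end - cmp$biomart_end

print(cmp[, c("refseq_mrna", "start_offset", "end_offset")])
```

The end coordinates agree in both cases; only the starts differ, and biomaRt reports the same start for both RefSeq IDs.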
So it seems that each RefSeq ID is reported with the coordinates of the
longest Ensembl transcript model that covers it, rather than with its own
coordinates. I searched the web and looked through the other available
marts, but nothing obvious turned up. How should I go about obtaining
RefSeq coordinates using biomaRt? Or is biomaRt Ensembl-centric?
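One possibility worth checking: in the Ensembl mart, start_position and end_position are described as gene-level attributes ("Gene start (bp)" and "Gene end (bp)"), which could explain why both RefSeq IDs come back with the same interval. The sketch below asks for transcript_start and transcript_end instead. Treat it as a sketch under assumptions: the helper name fetch_transcript_coords is hypothetical, it needs network access to the Ensembl mart, and the coordinates it returns belong to the Ensembl transcripts cross-referenced to each RefSeq ID, so they may still differ from the RefSeq alignments shown at UCSC.

```r
# Hypothetical helper (not part of biomaRt): fetch transcript-level
# coordinates for the Ensembl transcripts linked to the given RefSeq mRNA IDs.
fetch_transcript_coords <- function(refseq_ids, mart = NULL) {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("the biomaRt package is required")
  }
  # Connect to the human Ensembl dataset unless a mart object is supplied.
  if (is.null(mart)) {
    mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  }
  biomaRt::getBM(
    attributes = c("refseq_mrna", "ensembl_transcript_id", "chromosome_name",
                   "transcript_start", "transcript_end", "strand"),
    filters = "refseq_mrna",
    values  = refseq_ids,
    mart    = mart
  )
}

# Example call (commented out because it queries a remote server):
# fetch_transcript_coords(c("NM_033453", "NM_181493"))
```

Including ensembl_transcript_id in the attributes makes it visible when one RefSeq ID is linked to more than one Ensembl transcript, in which case several rows come back per ID.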
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Australia.1252
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] biomaRt_2.16.0
loaded via a namespace (and not attached):
[1] RCurl_1.95-4.1 tools_3.0.2    XML_3.98-1.1
Cheers,
-- 
Dave
    
    