[Bioc-sig-seq] unexpected genes names list using getBM{biomaRt}

James W. MacDonald jmacdon at med.umich.edu
Tue Dec 8 16:51:44 CET 2009


Hi Ramzi,

Ramzi TEMANNI wrote:
> Hi James,
> I found out that I'm using UCSC hg19 ref but Biomart uses GRCh37 ref and 
> that's why the output generates many genes.
> i'm trying to find out a way to convert coordinates but seems I have to 
> align again using the GRCh37 ref.

I don't think that is the problem, as UCSC hg19 is based on GRCh37. I 
think there might be a problem on the Biomart server side. As far as I 
can tell, the query string produced by biomaRt is correct, but the 
Biomart server returns more than I think it should.

In fact, if you do the query for each chromosome individually, you 
either get one or zero genes returned. It's only if you do the query for 
more than one chromosomal region at a time that you get these extra genes:

 > getBM("hgnc_symbol", c("chromosome_name","start","end"), 
list(t.cpd[3,1], t.cpd[3,2], t.cpd[3,2]+50), enseembl)
[1] hgnc_symbol
<0 rows> (or 0-length row.names)
 > getBM("hgnc_symbol", c("chromosome_name","start","end"), 
list(t.cpd[4,1], t.cpd[4,2], t.cpd[4,2]+50), enseembl)
   hgnc_symbol
1      MPPED2
 > getBM("hgnc_symbol", c("chromosome_name","start","end"), 
list(t.cpd[5,1], t.cpd[5,2], t.cpd[5,2]+50), enseembl)
   hgnc_symbol
1       REEP1
 > tmp <- getBM("hgnc_symbol", c("chromosome_name","start","end"), 
list(t.cpd[4:5,1], t.cpd[4:5,2], t.cpd[4:5,2]+50), enseembl)
 > dim(tmp)
[1] 946   1
 > head(tmp)
   hgnc_symbol
1       ZFP91
2        CNTF
3    C11orf76
4     RPLP0P2
5    C11orf64
6      OR4D8P
 > grep("REEP1", tmp[,1], value=T)
[1] "REEP1"

The XMLQuery that is sent to the Biomart server looks like this (if 
Rhoda Kinsella or Steffen Durinck are following this at all).

Browse[2]> xmlQuery
[1] "<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query 
virtualSchemaName = 'default' uniqueRows = '1' count = '0' 
datasetConfigVersion = '0.6' requestid= \"biomaRt\"> <Dataset name = 
'hsapiens_gene_ensembl'><Attribute name = 'hgnc_symbol'/><Filter name = 
'chromosome_name' value = '11,2' /><Filter name = 'start' value = 
'30576841,86479831' /><Filter name = 'end' value = '30576891,86479881' 
/></Dataset></Query>"

Best,

Jim



> 
> Regards,
> Ramzi
> ----------------------------------------------------------------
> 
> 
> On Mon, Dec 7, 2009 at 4:32 PM, Ramzi TEMANNI <ramzi.temanni at gmail.com 
> <mailto:ramzi.temanni at gmail.com>> wrote:
> 
>     Hi James,
> 
>     Thanks for your reply,
>     As i have short reads of 50b, I've modified the code accordigly to
>     your suggestion:
>     gn.m1<-getBM(attributes= c("hgnc_symbol"),
>            filters=c("chromosome_name","
>     start",*"end"*),
>           
>     values=list(t.cpd[1:10,1],as.numeric(t.cpd[1:10,2]),*as.numeric(t.cpd[1:10,2])+50*),
>     mart=ensembl)
> 
>     But still having more genes than expected:
> 
>     hgnc_symbol
>     1      UBE2G1
>     2      DUSP5P
>     3   HIST3H2BA
>     4     ZNF847P
>     5     ACTBP11
>     6     ZC3H11B
>     ........
>     8488     PPP2R2C
>     8489        WFS1
>     8490    SNORD73A
>     8491     SNORA24
>     8492     SNORA26
> 
>     Did I miss something  ?
> 
>     Thanks for the remark regarding the position, you are right ! I did
>     not notice that, have to get back to the bowtie alignment file and
>     see what is the reason behind that. I'm using bowtie with the last
>     hg19 ref.
> 
>     Thanks again for your comment. 
> 
>     Best Regards,
>     Ramzi
> 
>     ----------------------------------------------------------------
> 
> 
> 
>     On Mon, Dec 7, 2009 at 4:20 PM, James W. MacDonald
>     <jmacdon at med.umich.edu <mailto:jmacdon at med.umich.edu>> wrote:
> 
>         Hi Ramzi,
> 
> 
>         Ramzi TEMANNI wrote:
> 
>             Hi,
>             I want to extract the gene names knowing the chromosome and
>             the position for
>             each genes:
> 
>                 t.cpd[1:10,1:2]
> 
>                  CHR.M1 POS.M1
>              [1,] "12"   "140059033"
>              [2,] "19"   "164634640"
>              [3,] "10"   "32347784"
>              [4,] "11"   "30576841"
>              [5,] "2"    "86479831"
>              [6,] "12"   "237019866"
>              [7,] "4"    "76487174"
>              [8,] "20"   "136121868"
>              [9,] "2"    "6255547"
>             [10,] "1"    "67658137"
> 
>             i use the following commands:
>             library(biomaRt)
>             mart = useMart("ensembl")
>             ensembl = useDataset("hsapiens_gene_ensembl", mart = mart)
>             gn.m1<-getBM(attributes= c("hgnc_symbol"),
>                   filters=c("chromosome_name","start"),
>                   values=list(t.cpd[1:10,1],t.cpd[1:10,2]), mart=ensembl)
> 
>             I'm expecting having a list of 10 genes names, but instead i
>             get 8652 genes:
>             hgnc_symbol
>             1      OR2M1P
>             2      OR2L1P
>             3   HSD17B7P1
>             4     OR14L1P
>             5       OR2W5
>             6       VN1R5
>             ......
>             8649        WFS1
>             8650    SNORD73A
>             8651     SNORA24
>             8652     SNORA26
> 
>             Did I miss something ?
> 
> 
>         Yes. You are giving the start position, but not the end. Without
>         explicitly telling the Biomart server where to stop looking for
>         genes, where do you think it will stop by default?
> 
>         Also, several of your coordinates are nonsensical. For instance,
>         chr12 is only 133851859 bases long, chr20 is 63025520 bases
>         long, etc.
> 
>         Best,
> 
>         Jim
> 
> 
> 
>             Thanks in advance for your help
> 
>             Best Regards,
>             Ramzi
> 
>             ----------------------------------------------------------------
> 
>                    [[alternative HTML version deleted]]
> 
>             _______________________________________________
>             Bioc-sig-sequencing mailing list
>             Bioc-sig-sequencing at r-project.org
>             <mailto:Bioc-sig-sequencing at r-project.org>
>             https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> 
>         -- 
>         James W. MacDonald, M.S.
>         Biostatistician
>         Douglas Lab
>         University of Michigan
>         Department of Human Genetics
>         5912 Buhl
>         1241 E. Catherine St.
>         Ann Arbor MI 48109-5618
>         734-615-7826
>         **********************************************************
>         Electronic Mail is not secure, may not be read every day, and
>         should not be used for urgent or sensitive issues
> 
> 
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues 



More information about the Bioc-sig-sequencing mailing list