[BioC] help with biomart

James W. MacDonald jmacdon at med.umich.edu
Thu Dec 3 15:29:23 CET 2009



Aaron Mackey wrote:
> I would expect that identifiers be treated as strings, not integers 
> (even if an identifier happened to look like an integer most of the time 
> -- it's not guaranteed to be true forever), no?  We don't do any math on 
> the identifiers, so why cast them to integers?

I would think this is another maintenance issue, although I don't 
presume to speak for Steffen on why things work as they do. But think 
for a moment about what you are asking.

The getBM() function is an all-purpose query device intended to get 
stuff from a database. Some of this stuff is numeric, and some of it is 
character. Some of the stuff that might be better treated as character 
looks like it is integer.
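
To make the "looks like it is integer" point concrete, here is a minimal sketch in base R. It is not a claim about what getBM() does internally; the tab-delimited text is a made-up stand-in for a server response, reusing the ID and gene from this thread. It only shows how R's default type guessing treats an identifier that happens to consist of digits, and how forcing character columns avoids that.

```r
# Made-up tab-delimited text standing in for a server response
# (assumption: one header line, two columns).
response <- "illumina_humanwg_6_v1\tensembl_gene_id\n0002350152\tENSG00000165891\n"

# Default type guessing: the ID column looks numeric, so it is read as
# integer and the leading zeros are lost.
guessed <- read.table(text = response, header = TRUE, sep = "\t",
                      stringsAsFactors = FALSE)

# Forcing every column to character keeps the ID exactly as it was sent.
as_text <- read.table(text = response, header = TRUE, sep = "\t",
                      colClasses = "character")

class(guessed$illumina_humanwg_6_v1)   # "integer"
as_text$illumina_humanwg_6_v1          # "0002350152"
```

The catch is the one described in this message: reading everything as character would also turn genuinely numeric attributes (positions, lengths, scores) into strings, so a general-purpose function cannot simply do this for every column.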

There are multiple different Biomart servers that can be queried using 
getBM():

 > listMarts()
                      biomart
1                    ensembl
2                        snp
3        functional_genomics
4                       vega
5                        msd
6           bacterial_mart_3
7              fungal_mart_3
8             metazoa_mart_3
9               plant_mart_3
10            protist_mart_3
11                      htgt
12                  REACTOME
13          wormbase_current
14                     dicty
15                 rgd__mart
16             ipi_rat__mart
17                SSLP__mart
18                  g4public
19                     pride
20               intermart-1
21              uniprot_mart
22 ensembl_expressionmart_48
23                 biomartDB
24        Eurexpress Biomart
25      pepseekerGOLD_mart06
26                 Potato_01
27            Sweetpotato_01
28     Pancreatic_Expression
29      ENSEMBL_MART_ENSEMBL
30         GRAMENE_MARKER_30
31            GRAMENE_MAP_30
32                  QTL_MART

not to mention the archived Biomart servers that can be queried.

So what you are asking is for Steffen to go through all those databases, 
look for all the data that appear to be integer, but should be kept as 
character, and then set up some functionality to make sure this happens.

In addition, he would then need to maintain this over time as different 
data are added to the various Biomart servers that can be queried (and 
as new Biomart servers come on line).

Sounds like fun to me!
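
For one attribute whose width you happen to know, a user-side workaround is far simpler than the general fix described above. The sketch below is only that: `pad_id()` is a hypothetical helper, not part of biomaRt, and the default width of 10 is an assumption taken from the single Illumina ID seen in this thread, not a documented property of Illumina IDs.

```r
# Hypothetical helper (not part of biomaRt): pad integer-like IDs back to
# a fixed width with leading zeros.
# Assumption: all IDs of interest share the same width (10 here).
pad_id <- function(x, width = 10) {
  out <- formatC(as.integer(x), width = width, format = "d", flag = "0")
  out[is.na(x)] <- NA_character_   # keep missing IDs missing
  out
}

# The values seen in the query result earlier in this thread
ids <- c(NA, 2350152)
pad_id(ids)
```

The padded strings can then be supplied as the `values` argument when filtering on the same attribute. If the width assumption is wrong for a given chip, the padded IDs will simply fail to match, which is exactly the maintenance problem described above.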

Best,

Jim


> 
> -Aaron
> 
> On Wed, Dec 2, 2009 at 10:38 AM, James W. MacDonald
> <jmacdon at med.umich.edu> wrote:
> 
> 
> 
>     Tereza Roca wrote:
> 
>         thank you. this is bizarre... but now it works...
>         shouldn't this be fixed so that output of your query agrees with
>         the input you have to provide for the query?
> 
> 
>     Theoretically, yes. However, it would be more difficult than you
>     would think. Note that R drops leading zeros when it reads a number:
> 
>      > 0002350152
>     [1] 2350152
> 
>     So when the Biomart server returns the Illumina ID and R reads it
>     as a number, the leading zeros are lost. One could use sprintf() to
>     restore the leading zeros, but you have to know the width of the
>     field a priori:
> 
>      > sprintf("%010d", 0002350152)
>     [1] "0002350152"
> 
>     There may be a trick to check the number of digits for a number
>     prior to R truncating leading zeros, but if there is one, I don't
>     know what it would be. If not, you would have to hard code the above
>     sprintf() call to use 10 digits, assuming that all Illumina IDs have
>     10 digits. But what if not all Illumina IDs are 10 digits? What if
>     they change to 7 digits in a future chip?
> 
>     As you can imagine, hard coding things like this can lead to a
>     nightmare for the maintainer. So although this isn't an ideal
>     situation, I can't imagine things will change.
> 
>     Best,
> 
>     Jim
> 
> 
> 
> 
> 
> 
>         ------------------------------------------------------------------------
>         *From:* James W. MacDonald <jmacdon at med.umich.edu>
>         *To:* Tereza Roca <rocatereza at yahoo.co.uk>
>         *Cc:* bioconductor at stat.math.ethz.ch
>         *Sent:* Wed, 2 December, 2009 14:41:50
>         *Subject:* Re: [BioC] help with biomart
> 
>         Hi Tereza,
> 
> 
>         Tereza Roca wrote:
>          > I found something wrong with biomart: if I request an
>          > Illumina ID from an Ensembl gene ID I obtain the following:
>          >
>          >> getBM(attributes = c("ensembl_gene_id","illumina_humanwg_6_v1"),
>          >>       filters="ensembl_gene_id", values = "ENSG00000165891",
>          >>       mart = ensembl)
>          >  ensembl_gene_id illumina_humanwg_6_v1
>          > 1 ENSG00000165891                    NA
>          > 2 ENSG00000165891              2350152
>          >
>          > this is fine (although why is there an NA?)
>          >
>          > but if I request the reverse (from Illumina ID to gene ID) I
>          > don't obtain anything:
>          >
>          >> getBM(attributes = c("illumina_humanwg_6_v1","ensembl_gene_id"),
>          >>       filters="illumina_humanwg_6_v1", values = c("2350152"),
>          >>       mart = ensembl)
>          > [1] illumina_humanwg_6_v1 ensembl_gene_id
>          > <0 rows> (or 0-length row.names)
>          >
>          > Is this an error, or am I making a mistake in the way I
>          > request it? Please advise.
> 
>         Well, your query isn't quite right, but I don't know whether I
>         would call it a mistake or a weird 'feature' of how Illumina
>         IDs are coded in the Biomart database. I figured this out by
>         running your first query directly at the Biomart server, which
>         returned 0002350152 for the Illumina ID.
> 
>          > getBM(attributes = c("illumina_humanwg_6_v1","ensembl_gene_id"),
>                  filters = "illumina_humanwg_6_v1", values = "0002350152",
>                  mart = mart)
>          illumina_humanwg_6_v1 ensembl_gene_id
>         1              2350152 ENSG00000165891
> 
> 
>         Best,
> 
>         Jim
> 
> 
> 
> 
>          >
>          > thank you
>          >
>          > Tereza
>          >
>          > _______________________________________________
>          > Bioconductor mailing list
>          > Bioconductor at stat.math.ethz.ch
>          > https://stat.ethz.ch/mailman/listinfo/bioconductor
>          > Search the archives:
>          > http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
>         -- James W. MacDonald, M.S.
>         Biostatistician
>         Douglas Lab
>         University of Michigan
>         Department of Human Genetics
>         5912 Buhl
>         1241 E. Catherine St.
>         Ann Arbor MI 48109-5618
>         734-615-7826
>         **********************************************************
>         Electronic Mail is not secure, may not be read every day, and
>         should not be used for urgent or sensitive issues
> 
> 
> 
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues 


