[BioC] help with biomart
James W. MacDonald
jmacdon at med.umich.edu
Thu Dec 3 15:29:23 CET 2009
Aaron Mackey wrote:
> I would expect that identifiers be treated as strings, not integers
> (even if an identifier happened to look like an integer most of the time
> -- it's not guaranteed to be true forever), no? We don't do any math on
> the identifiers, so why cast them to integers?
I would think this is another maintenance issue, although I don't
presume to speak for Steffen on why things work as they do. But think
for a moment about what you are asking.
The getBM() function is an all-purpose query device intended to get
stuff from a database. Some of this stuff is numeric, and some of it is
character. Some of the stuff that might be better treated as character
looks like it is integer.
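
To illustrate the point in base R only (this is a sketch, not a description of how getBM() converts its results internally, which I have not checked): when digit-only text goes through automatic type conversion, the leading zeros are lost, whereas the same values kept as character retain them. The second ID below is made up for the example.

```r
# Text values as they might arrive from a server. The first ID is the one
# from this thread; the second is a hypothetical ID for illustration.
raw <- c("0002350152", "0000070162")

# Automatic conversion sees only digits and guesses integer,
# so the leading zeros disappear.
guessed <- type.convert(raw, as.is = TRUE)
print(guessed)

# Left as character, the identifiers survive unchanged, and the
# field width can still be inspected.
kept <- as.character(raw)
print(kept)
print(unique(nchar(kept)))
```

The width is only recoverable while the values are still text; once they have been converted to numbers, that information is gone.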
There are multiple different Biomart servers that can be queried using
getBM():
> listMarts()
biomart
1 ensembl
2 snp
3 functional_genomics
4 vega
5 msd
6 bacterial_mart_3
7 fungal_mart_3
8 metazoa_mart_3
9 plant_mart_3
10 protist_mart_3
11 htgt
12 REACTOME
13 wormbase_current
14 dicty
15 rgd__mart
16 ipi_rat__mart
17 SSLP__mart
18 g4public
19 pride
20 intermart-1
21 uniprot_mart
22 ensembl_expressionmart_48
23 biomartDB
24 Eurexpress Biomart
25 pepseekerGOLD_mart06
26 Potato_01
27 Sweetpotato_01
28 Pancreatic_Expression
29 ENSEMBL_MART_ENSEMBL
30 GRAMENE_MARKER_30
31 GRAMENE_MAP_30
32 QTL_MART
not to mention the archived Biomart servers that can be queried.
So what you are asking is for Steffen to go through all those databases,
look for all the data that appear to be integer, but should be kept as
character, and then set up some functionality to make sure this happens.
In addition, he would then need to maintain this over time as different
data are added to the various Biomart servers that can be queried (and
as new Biomart servers come on line).
Sounds like fun to me!
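
To make the maintenance burden concrete, here is a minimal sketch of what the hard-coded approach might look like. Both `id_widths` and `pad_ids()` are hypothetical names invented for this example and are not part of biomaRt; the 10-digit width is just the assumption from the example in this thread, and every other attribute on every other server would need its own hand-maintained entry.

```r
# Hypothetical lookup table: attribute name -> assumed fixed ID width.
# Only this one entry is suggested by the thread; it is an assumption,
# not a documented property of the Illumina IDs.
id_widths <- c(illumina_humanwg_6_v1 = 10L)

# Hypothetical helper: zero-pad IDs for attributes listed in the table.
pad_ids <- function(ids, attribute, widths = id_widths) {
  if (!attribute %in% names(widths)) {
    # Unknown attribute: no width on record, so return the values as text.
    return(as.character(ids))
  }
  out <- formatC(as.integer(ids), width = widths[[attribute]],
                 flag = "0", format = "d")
  out[is.na(ids)] <- NA_character_   # keep missing values missing
  out
}

print(pad_ids(c(2350152, NA), "illumina_humanwg_6_v1"))
```

A user who knows the width for one chip could pad the IDs this way before passing them as `values` to getBM(), but the table would be wrong the moment an ID format changes.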
Best,
Jim
>
> -Aaron
>
> On Wed, Dec 2, 2009 at 10:38 AM, James W. MacDonald
> <jmacdon at med.umich.edu> wrote:
>
>
>
> Tereza Roca wrote:
>
> thank you. this is bizarre... but now it works...
> shouldn't this be fixed so that output of your query agrees with
> the input you have to provide for the query?
>
>
> Theoretically, yes. However it would be more difficult than you
> would think. Note that R automatically truncates leading zeros:
>
> > 0002350152
> [1] 2350152
>
> So when the Biomart server returns the Illumina ID (as a number), R
> automatically truncates the leading zeros. One could use sprintf()
> to keep the leading zeros, but you have to know a priori the width
> of the field:
>
> > sprintf("%010d", 0002350152)
> [1] "0002350152"
>
> There may be a trick to check the number of digits for a number
> prior to R truncating leading zeros, but if there is one, I don't
> know what it would be. If not, you would have to hard code the above
> sprintf() call to use 10 digits, assuming that all Illumina IDs have
> 10 digits. But what if not all Illumina IDs are 10 digits? What if
> they change to 7 digits in a future chip?
>
> As you can imagine, hard coding things like this can lead to a
> nightmare for the maintainer. So although this isn't an ideal
> situation, I can't imagine things will change.
>
> Best,
>
> Jim
>
>
>
>
>
>
> ------------------------------------------------------------------------
> *From:* James W. MacDonald <jmacdon at med.umich.edu>
> *To:* Tereza Roca <rocatereza at yahoo.co.uk>
> *Cc:* bioconductor at stat.math.ethz.ch
> *Sent:* Wed, 2 December, 2009 14:41:50
> *Subject:* Re: [BioC] help with biomart
>
> Hi Tereza,
>
>
> Tereza Roca wrote:
> > I found something wrong with biomart: if I request an Illumina ID
> > from an Ensembl gene ID I obtain the following:
> > > getBM(attributes = c("ensembl_gene_id", "illumina_humanwg_6_v1"),
> > +       filters = "ensembl_gene_id", values = "ENSG00000165891",
> > +       mart = ensembl)
> >   ensembl_gene_id illumina_humanwg_6_v1
> > 1 ENSG00000165891                    NA
> > 2 ENSG00000165891               2350152
> >
> > this is fine (although why is there an NA?)
> >
> > but if I request the contrary (from Illumina to gene ID) I don't
> > obtain anything:
> >
> > > getBM(attributes = c("illumina_humanwg_6_v1", "ensembl_gene_id"),
> > +       filters = "illumina_humanwg_6_v1", values = c("2350152"),
> > +       mart = ensembl)
> > [1] illumina_humanwg_6_v1 ensembl_gene_id
> > <0 rows> (or 0-length row.names)
> >
> > is this an error? or am I making some mistakes in the way I
> > request it? Please advise
>
> Well, you aren't doing the correct query, but I don't know if I
> would call it a mistake (or a weird 'feature' of how Illumina
> IDs are coded in the Biomart database). I figured this out by
> doing your first query at the Biomart server, which returned
> 0002350152 for the Illumina ID.
>
> > getBM(attributes = c("illumina_humanwg_6_v1", "ensembl_gene_id"),
> +       filters = "illumina_humanwg_6_v1", "0002350152", mart)
>   illumina_humanwg_6_v1 ensembl_gene_id
> 1               2350152 ENSG00000165891
>
>
> Best,
>
> Jim
>
>
>
>
> >
> > thank you
> >
> > Tereza
> >
> >
> >
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and
> should not be used for urgent or sensitive issues
>
>
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues