[BioC] help with biomart

James W. MacDonald jmacdon at med.umich.edu
Wed Dec 2 16:38:30 CET 2009



Tereza Roca wrote:
> thank you. this is bizarre... but now it works...
> shouldn't this be fixed so that output of your query agrees with the 
> input you have to provide for the query?

Theoretically, yes. However, it would be more difficult than you might 
think. Note that R drops leading zeros when it parses a number:

 > 0002350152
[1] 2350152

So when the Illumina ID returned by the Biomart server is read into R 
as a number, the leading zeros are dropped. One could use sprintf() to 
put the leading zeros back, but you have to know a priori the width of 
the field:

 > sprintf("%010d", 0002350152)
[1] "0002350152"
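
As a rough sketch of the sprintf() approach applied to a whole column 
(pad_id and its width argument are hypothetical names, not part of 
biomaRt, and the default width of 10 is only an assumption based on the 
single ID above), a small helper could pad the IDs and leave NA values 
alone:

```r
# Hypothetical helper: zero-pad numeric IDs to a fixed width.
# The width of 10 is an assumption, not a documented Illumina ID length.
pad_id <- function(x, width = 10) {
  out <- rep(NA_character_, length(x))
  ok <- !is.na(x)
  # Build a format such as "%010d" and apply it to the non-missing IDs
  out[ok] <- sprintf(paste0("%0", width, "d"), as.integer(x[ok]))
  out
}

# Values as they appear in the getBM() result quoted below
ids <- c(NA, 2350152)
padded <- pad_id(ids)
print(padded)
```

The padded strings could then be passed as the values argument of the 
reverse query, provided the assumed width is right for the chip.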

There may be a trick to check the number of digits for a number prior to 
R truncating leading zeros, but if there is one, I don't know what it 
would be. If not, you would have to hard code the above sprintf() call 
to use 10 digits, assuming that all Illumina IDs have 10 digits. But 
what if not all Illumina IDs are 10 digits? What if they change to 7 
digits in a future chip?
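
One thing that does hold in plain R, for what it is worth: if an ID is 
read as a character string rather than a number, the zeros survive and 
nchar() reports the field width. The tab-delimited line below is a 
made-up stand-in for a server response, and this is only a sketch; I am 
not claiming it reflects how getBM() handles the result internally.

```r
# Made-up tab-delimited text standing in for a server response
raw <- "ENSG00000165891\t0002350152"

# Default type conversion: the ID column becomes a number
as_num <- read.table(text = raw, sep = "\t")

# Forcing character columns keeps the ID exactly as it was sent
as_chr <- read.table(text = raw, sep = "\t", colClasses = "character")

print(as_num$V2)         # numeric, leading zeros gone
print(as_chr$V2)         # character, leading zeros kept
print(nchar(as_chr$V2))  # width of the field as sent
```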

As you can imagine, hard coding things like this can lead to a nightmare 
for the maintainer. So although this isn't an ideal situation, I can't 
imagine things will change.

Best,

Jim



> 
> 
> 
> ------------------------------------------------------------------------
> *From:* James W. MacDonald <jmacdon at med.umich.edu>
> *To:* Tereza Roca <rocatereza at yahoo.co.uk>
> *Cc:* bioconductor at stat.math.ethz.ch
> *Sent:* Wed, 2 December, 2009 14:41:50
> *Subject:* Re: [BioC] help with biomart
> 
> Hi Tereza,
> 
> 
> Tereza Roca wrote:
>  > I found something wrong with biomart: if I request an Illumina ID
>  > from an Ensembl gene ID I obtain the following:
>  >> getBM(attributes = c("ensembl_gene_id","illumina_humanwg_6_v1"), filters="ensembl_gene_id", values = "ENSG00000165891", mart = ensembl)
>  >  ensembl_gene_id illumina_humanwg_6_v1
>  > 1 ENSG00000165891                    NA
>  > 2 ENSG00000165891              2350152
>  >
>  > this is fine (although why is there an NA?)
>  >
>  > but if I request the contrary (from Illumina ID to gene ID) I don't
>  > obtain anything:
>  >
>  >> getBM(attributes = c("illumina_humanwg_6_v1","ensembl_gene_id"), filters="illumina_humanwg_6_v1", values = c("2350152"), mart = ensembl)
>  > [1] illumina_humanwg_6_v1 ensembl_gene_id
>  > <0 rows> (or 0-length row.names)
>  >
>  > Is this an error? Or am I making some mistake in the way I request
>  > it? Please advise.
> 
> Well, you aren't doing the correct query, but I don't know if I would 
> call it a mistake (or a weird 'feature' of how Illumina IDs are coded in 
> the Biomart database). I figured this out by doing your first query at 
> the Biomart server, which returned 0002350152 for the Illumina ID.
> 
>  > getBM(attributes = c("illumina_humanwg_6_v1","ensembl_gene_id"), filters="illumina_humanwg_6_v1", "0002350152",mart)
>   illumina_humanwg_6_v1 ensembl_gene_id
> 1              2350152 ENSG00000165891
> 
> 
> Best,
> 
> Jim
> 
> 
> 
> 
>  >
>  > thank you
>  >
>  > Tereza
>  >
>  >
>  >
> 
> -- James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not 
> be used for urgent or sensitive issues
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues 

