[BioC] biomaRt and Ensembl probe set filter....

Jesper Ryge Jesper.Ryge at ki.se
Mon Jan 26 11:20:26 CET 2009

Hi everybody

q1. I have been using biomaRt to filter Affymetrix probe sets prior to statistical testing with 
tools such as limma or cyberT. That is, I only include probe sets that are annotated in Ensembl. 
In this way I get rid of probe sets that do not align correctly to the intended genes - at least 
that was my intention. I know this has been debated before (cdf files and filtering of 
misaligned probe sets), and I find this to be the easiest way to exclude probes that might 
hybridize to the wrong transcripts.
I now find that since 2007 the number of annotated probe sets on the Affymetrix Rat 230_2 
has decreased from 17931 to 12919 out of 31099 (I was redoing some analysis and found 
this discrepancy between the analysis I did in 2007 and the one run against the new 
Ensembl database). I find that to be a rather drastic decrease, but perhaps it is not? In 
essence I "lose" a lot of probe sets, but if those that are filtered out are "false positives" it 
is of course worth it! That was my logic so far, at least... So, first I would like to know whether 
anybody considers this strategy wise or unwise. It just seems a bit surprising to me that the 
probe sets on the Affy chips mismatch to such a large extent that only about 40% of them 
(12919 of 31099) remain in the analysis.
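
To make the filtering step concrete, here is a minimal sketch of what I mean. It is not my exact code. The helper name filter_annotated and the toy probe set and gene IDs are made up for illustration, and I am assuming that the Ensembl attribute for this chip is called "affy_rat230_2" and that the annotation table comes from a getBM() call like the one shown in the comment.

```r
# Hypothetical helper: keep only the rows of an expression matrix whose
# probe set IDs occur in an annotation table.
filter_annotated <- function(expr, annot, id_col = "affy_rat230_2") {
  ids <- unique(annot[[id_col]])
  ids <- ids[!is.na(ids) & ids != ""]   # drop missing/empty probe set IDs
  expr[rownames(expr) %in% ids, , drop = FALSE]
}

# In a real analysis the table would come from biomaRt, e.g. (assumed
# attribute names, needs a working mart connection):
#   annot <- getBM(attributes = c("affy_rat230_2", "ensembl_gene_id"), mart = mart)

# Toy stand-ins: all IDs below are placeholders, not real annotations.
expr <- matrix(1:8, nrow = 4,
               dimnames = list(c("ps1_at", "ps2_at", "ps3_at", "ps4_at"),
                               c("sample1", "sample2")))
annot <- data.frame(affy_rat230_2   = c("ps1_at", "ps3_at", ""),
                    ensembl_gene_id = c("GENE_A", "GENE_B", "GENE_C"),
                    stringsAsFactors = FALSE)

filtered <- filter_annotated(expr, annot)
```

Only the rows of the filtered matrix would then be passed on to limma or cyberT.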

I then wanted to check this decrease in annotated Affy probe sets, which leads me to question 
2, a pure biomaRt issue:

q2. I wish to access earlier Ensembl versions to check, and possibly plot, the 
decrease in annotated probe sets for the Rat 230_2 chip over time, but I run into an error:

> mart <- useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
Error in useMart("ensembl_mart_46", dataset = "rnorvegicus_gene_ensembl",  : 
  Incorrect BioMart name, use the listMarts function to see which BioMart databases are 

even though the marts are listed in the archive:

> listMarts(archive=T)
                       biomart                     version
1              ensembl_mart_47   ENSEMBL GENES 47 (SANGER)
2     genomic_features_mart_47            Genomic Features
3                  snp_mart_47                         SNP
4                 vega_mart_47                        Vega
5     compara_mart_homology_47            Compara homology
6  compara_mart_multiple_ga_47 Compara multiple alignments
7  compara_mart_pairwise_ga_47 Compara pairwise alignments
8              ensembl_mart_46   ENSEMBL GENES 46 (SANGER)
9     genomic_features_mart_46            Genomic Features
10                 snp_mart_46                         SNP
11                vega_mart_46                        Vega
12    compara_mart_homology_46            Compara homology
13 compara_mart_multiple_ga_46 Compara multiple alignments
14 compara_mart_pairwise_ga_46 Compara pairwise alignments
15             ensembl_mart_45   ENSEMBL GENES 45 (SANGER)
16                 snp_mart_45                         SNP
17                vega_mart_45                        Vega
18    compara_mart_homology_45            Compara homology
19 compara_mart_multiple_ga_45 Compara multiple alignments
20 compara_mart_pairwise_ga_45 Compara pairwise alignments
21             ensembl_mart_44   ENSEMBL GENES 44 (SANGER)
22                 snp_mart_44                         SNP
23                vega_mart_44                        Vega
24    compara_mart_homology_44            Compara homology
25 compara_mart_pairwise_ga_44 Compara pairwise alignments
26             ensembl_mart_43   ENSEMBL GENES 43 (SANGER)
27                 snp_mart_43                         SNP
28                vega_mart_43                        Vega
29    compara_mart_homology_43            Compara homology
30 compara_mart_pairwise_ga_43 Compara pairwise alignments
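
For the graph mentioned in q2, the counting step itself would be simple once one annotation table per Ensembl release is available. A rough sketch follows. The helper count_annotated, the release labels and the probe set IDs are placeholders of my own, and the column name "affy_rat230_2" is again an assumption.

```r
# Hypothetical helper: number of distinct, non-empty probe set IDs in one
# annotation table (for example one getBM() result per Ensembl release).
count_annotated <- function(annot, id_col = "affy_rat230_2") {
  ids <- annot[[id_col]]
  length(unique(ids[!is.na(ids) & ids != ""]))
}

# Toy tables keyed by release; labels and IDs are placeholders.
by_release <- list(
  release_A = data.frame(affy_rat230_2 = c("ps1_at", "ps2_at", "ps3_at", "ps3_at"),
                         stringsAsFactors = FALSE),
  release_B = data.frame(affy_rat230_2 = c("ps1_at", "", "ps3_at"),
                         stringsAsFactors = FALSE)
)

# Named integer vector: one count per release.
counts <- sapply(by_release, count_annotated)

# Uncomment to draw the counts per release:
# barplot(counts, ylab = "annotated probe sets")
```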

> sessionInfo()
R version 2.8.0 (2008-10-20) 


attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] rat2302cdf_2.3.0 biomaRt_1.99.2   affy_1.20.0      Biobase_2.2.1   

loaded via a namespace (and not attached):
[1] RCurl_0.94-0         XML_1.98-1           affyio_1.10.1       
[4] preprocessCore_1.4.0

Jesper Ryge, PhD student
Karolinska Institutet
Dep. of Neuroscience

