[BioC] biomaRt and Ensembl probe set filter....

James W. MacDonald jmacdon at med.umich.edu
Mon Jan 26 15:16:38 CET 2009


Hi Jesper,

Jesper Ryge wrote:
> Hi everybody
> 
> q1. I have been using biomaRt to filter Affymetrix probe sets prior to statistical testing with 
> limma or cyberT. That is, I only include probe sets that are annotated in Ensembl. This way 
> I get rid of probe sets that do not align correctly to the intended genes; at least that 
> was my intention. I know this has been debated before (i.e. CDF files and filtering of 
> misaligned probe sets), and I find this to be the easiest way to exclude probes that might 
> hybridize to the wrong transcripts.
> I now find that since 2007 the number of annotated probe sets on the Affymetrix Rat 230_2 
> has decreased from 17931 to 12919 out of 31099 (I was redoing some analysis and found 
> this discrepancy between the analysis I did in 2007 and the one run against the new 
> Ensembl database). I find that to be a rather drastic decrease, but perhaps that's not so? In 
> essence I "lose" a lot of probes, but if those that are filtered out are "false positives" it is of 
> course worth it! That was my logic so far at least... So, first I would like to know if anybody 
> considers this strategy wise or unwise? It just seems a bit surprising to me that the probe sets 
> on the Affy chips mismatch to such a large extent that only roughly a third of the probes 
> remain in the analysis.

I think you are making a pretty strong assumption here. Do you know how 
Ensembl is annotating Affy probe set IDs to transcripts? It seems to me that 
you are assuming that Ensembl is somehow checking to see what transcript 
the probes are complementary to, whereas they may in fact be simply 
taking data from Affy and accepting them verbatim. I personally have no 
idea, but would want to know that before I filtered data in this way.
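
To put numbers on the filter being discussed, here is a minimal sketch in R. The counts are the ones quoted above; they work out to about 58% of probe sets annotated in 2007 and about 42% now, which is somewhat more than a third. The helper annotated_probesets() is hypothetical, and the attribute name "affy_rat230_2" is an assumption that should be checked against listAttributes(mart) for the mart in use. The helper only reports which probe set IDs Ensembl returns a gene for; it says nothing about how Ensembl arrived at that mapping.

```r
# Counts reported in the original message for the Rat 230_2 chip.
total_probesets <- 31099
annotated_2007  <- 17931
annotated_2009  <- 12919

# Fraction of the chip that survives the "annotated in Ensembl" filter.
frac_2007 <- annotated_2007 / total_probesets  # about 0.58
frac_2009 <- annotated_2009 / total_probesets  # about 0.42
lost      <- annotated_2007 - annotated_2009   # 5012 probe sets

# Hypothetical helper: return the probe set IDs for which the given mart
# reports an Ensembl gene. The default attribute name is an assumption;
# verify it with listAttributes(mart) before relying on it.
annotated_probesets <- function(mart, probe_ids, attribute = "affy_rat230_2") {
  res <- biomaRt::getBM(attributes = c(attribute, "ensembl_gene_id"),
                        filters    = attribute,
                        values     = probe_ids,
                        mart       = mart)
  unique(res[[attribute]])
}
```

With a working mart object, something like length(annotated_probesets(mart, all_ids)) / length(all_ids), where all_ids is the full vector of probe set IDs on the chip, would give the retained fraction for that Ensembl version.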


> 
> I then wanted to check this decrease in annotated Affy probe sets, which leads me to question 
> 2, a pure biomaRt issue:
> 
> q2. I wish to access earlier Ensembl versions to check, and possibly make a graph of, the 
> decrease in annotated probe sets for the Rat 230_2 chip over time. But I run into a 
> problem:
> 
>> mart <- useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
> Error in useMart("ensembl_mart_46", dataset = "rnorvegicus_gene_ensembl",  : 
>   Incorrect BioMart name, use the listMarts function to see which BioMart databases are 
> available
> 
> though they are listed in the archive:

I don't know if this is the problem, but you have mixed a devel version 
of biomaRt into your release version of R. This works for me with a 
release version of biomaRt:

 > mart <- useMart("ensembl_mart_46", dataset="rnorvegicus_gene_ensembl", archive=TRUE)
Checking attributes and filters ... ok
 > sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_1.16.0                     fortunes_1.3-6
[3] RMySQL_0.6-1                       DBI_0.2-4
[5] BSgenome.Hsapiens.UCSC.hg18_1.3.11 BSgenome_1.10.1
[7] Biostrings_2.10.1                  IRanges_1.0.2

loaded via a namespace (and not attached):
[1] grid_2.8.0         lattice_0.17-15    Matrix_0.999375-16 RCurl_0.92-0
[5] tools_2.8.0        XML_1.94-0.1
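
As a quick way to spot this kind of mix-up: Bioconductor package versions follow an x.y.z scheme in which an even y marks a release version and an odd y a devel version. By that rule the biomaRt_1.99.2 in your sessionInfo() is a devel build, while the biomaRt_1.16.0 shown here is a release build. A minimal sketch of the check, where is_devel_version() is a hypothetical helper and not part of any package:

```r
# Hypothetical helper: TRUE if a Bioconductor version string "x.y.z"
# has an odd middle component, which by convention marks a devel version.
is_devel_version <- function(version) {
  parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
  parts[2] %% 2 == 1
}

devel_in_question <- is_devel_version("1.99.2")  # version from the question
release_here      <- is_devel_version("1.16.0")  # version from this reply
```

On an installed package the version string can be taken from packageDescription("biomaRt")$Version.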

Best,

Jim


> 
>> listMarts(archive=T)
>                        biomart                     version
> 1              ensembl_mart_47   ENSEMBL GENES 47 (SANGER)
> 2     genomic_features_mart_47            Genomic Features
> 3                  snp_mart_47                         SNP
> 4                 vega_mart_47                        Vega
> 5     compara_mart_homology_47            Compara homology
> 6  compara_mart_multiple_ga_47 Compara multiple alignments
> 7  compara_mart_pairwise_ga_47 Compara pairwise alignments
> 8              ensembl_mart_46   ENSEMBL GENES 46 (SANGER)
> 9     genomic_features_mart_46            Genomic Features
> 10                 snp_mart_46                         SNP
> 11                vega_mart_46                        Vega
> 12    compara_mart_homology_46            Compara homology
> 13 compara_mart_multiple_ga_46 Compara multiple alignments
> 14 compara_mart_pairwise_ga_46 Compara pairwise alignments
> 15             ensembl_mart_45   ENSEMBL GENES 45 (SANGER)
> 16                 snp_mart_45                         SNP
> 17                vega_mart_45                        Vega
> 18    compara_mart_homology_45            Compara homology
> 19 compara_mart_multiple_ga_45 Compara multiple alignments
> 20 compara_mart_pairwise_ga_45 Compara pairwise alignments
> 21             ensembl_mart_44   ENSEMBL GENES 44 (SANGER)
> 22                 snp_mart_44                         SNP
> 23                vega_mart_44                        Vega
> 24    compara_mart_homology_44            Compara homology
> 25 compara_mart_pairwise_ga_44 Compara pairwise alignments
> 26             ensembl_mart_43   ENSEMBL GENES 43 (SANGER)
> 27                 snp_mart_43                         SNP
> 28                vega_mart_43                        Vega
> 29    compara_mart_homology_43            Compara homology
> 30 compara_mart_pairwise_ga_43 Compara pairwise alignments
> 
>> sessionInfo()
> R version 2.8.0 (2008-10-20) 
> i386-apple-darwin9.5.0 
> 
> locale:
> C
> 
> attached base packages:
> [1] tools     stats     graphics  grDevices utils     datasets  methods  
> [8] base     
> 
> other attached packages:
> [1] rat2302cdf_2.3.0 biomaRt_1.99.2   affy_1.20.0      Biobase_2.2.1   
> 
> loaded via a namespace (and not attached):
> [1] RCurl_0.94-0         XML_1.98-1           affyio_1.10.1       
> [4] preprocessCore_1.4.0
> 
> cheers,
> Jesper Ryge, PhD student
> karolinska Institutet
> Dep. of Neuroscience

-- 
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-0646
734-936-8662


