[BioC] biomaRt and Ensembl probe set filter....
James W. MacDonald
jmacdon at med.umich.edu
Mon Jan 26 15:16:38 CET 2009
Hi Jesper,
Jesper Ryge wrote:
> Hi everybody
>
> q1. I have been using biomaRt to filter Affymetrix probe sets prior to statistical testing with
> limma or cyberT. That is, I only include probe sets that are annotated in Ensembl. This way
> I get rid of probe sets that do not align correctly to the intended genes; at least that
> was my intention. I know this has been debated before (CDF files and filtering of
> misaligned probe sets), and I find this to be the easiest way to exclude probes that might
> hybridize to the wrong transcripts.
> I now find that since 2007 the number of annotated probe sets on the Affymetrix Rat 230_2
> array has decreased from 17931 to 12919 out of 31099 (I was redoing some analysis and found
> this discrepancy between the analysis I did in 2007 and the one run against the current
> Ensembl database). That seems a rather drastic decrease, but perhaps it is not? In
> essence I "lose" a lot of probe sets, but if those that are filtered out are "false positives" it is of
> course worth it! That was my logic so far, at least. So, first I would like to know whether anybody
> considers this strategy wise or unwise. It just seems surprising to me that the probe sets
> on the Affy chips mismatch to such a large extent that only about 40% of them
> remain in the analysis.
I think you are making a pretty strong assumption here. Do you know how
Ensembl maps Affy probe set IDs to transcripts? It seems to me that
you are assuming that Ensembl somehow checks which transcript
the probes are complementary to, whereas they may in fact simply be
taking the annotation from Affy and accepting it verbatim. I personally have no
idea, but I would want to know that before filtering data in this way.
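For scale, the counts quoted above can be turned into percentages, and the filter itself is just a row subset. This is a minimal sketch in base R, not the poster's actual code: the probe set IDs and expression values are made up, and annotated_ids is a hypothetical stand-in for whatever an Ensembl query would return.

```r
# Counts quoted in the original post (Rat 230_2 probe sets).
total_probesets <- 31099
annotated <- c("2007" = 17931, "2009" = 12919)

# Percentage of the array that survives an "annotated in Ensembl" filter.
pct_annotated <- round(100 * annotated / total_probesets, 1)

# Toy illustration of the filter. All IDs and values here are invented.
set.seed(1)
expr <- matrix(rnorm(12), nrow = 4,
               dimnames = list(c("ps_a", "ps_b", "ps_c", "ps_d"),
                               paste0("s", 1:3)))
annotated_ids <- c("ps_a", "ps_c", "ps_x")  # hypothetical query result

# Keep only rows whose probe set ID appears in the annotation.
keep <- rownames(expr) %in% annotated_ids
filtered <- expr[keep, , drop = FALSE]
```

Note that the subset says nothing about why a probe set is missing from the annotation, which is the question raised above.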
>
> I then wanted to check this decrease in affy annotated probe sets which leads me to question
> 2, a pure biomaRt issue:
>
> q2. I wish to access earlier Ensembl versions to check, and possibly make a graph of, the
> decrease in annotated probe sets for the Rat 230_2 chip over time. But I run into a
> problem:
>
>> mart <- useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
> Error in useMart("ensembl_mart_46", dataset = "rnorvegicus_gene_ensembl", :
> Incorrect BioMart name, use the listMarts function to see which BioMart databases are
> available
>
> though they are listed in the archive:
I don't know if this is the problem, but your sessionInfo() shows a devel
version of biomaRt (1.99.2) installed in a release version of R (2.8.0). The
call works for me with the release version of biomaRt (1.16.0):
> mart <- useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
Checking attributes and filters ... ok
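Once the archive marts connect, the per-version tally asked about in q2 could be organised along these lines. This is only a sketch under stated assumptions: count_annotated and fake_fetch are hypothetical names, and the stub returns made-up IDs so the code runs without a network connection. In a live session the fetch function might wrap useMart(..., archive=TRUE) and getBM(); the attribute name for the Rat 230_2 array is not given in this thread and would need to be looked up with listAttributes().

```r
# Hypothetical helper: for each mart name, count the distinct probe set
# IDs returned by a user-supplied fetch function.
count_annotated <- function(mart_names, fetch_ids) {
  vapply(mart_names,
         function(m) length(unique(fetch_ids(m))),
         integer(1))
}

# Archive mart names as they appear in the listMarts(archive=T) output.
mart_names <- paste0("ensembl_mart_", 43:47)

# Stub standing in for a real query: returns invented IDs, no network.
fake_fetch <- function(mart_name) {
  n <- as.integer(sub("ensembl_mart_", "", mart_name)) - 40L
  paste0("ps_", seq_len(n))
}

counts <- count_annotated(mart_names, fake_fetch)
```

The result is a named integer vector, one entry per mart, so barplot(counts) would give the graph over versions.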
>
> sessionInfo()
R version 2.8.0 (2008-10-20)
i386-pc-mingw32
locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_1.16.0 fortunes_1.3-6
[3] RMySQL_0.6-1 DBI_0.2-4
[5] BSgenome.Hsapiens.UCSC.hg18_1.3.11 BSgenome_1.10.1
[7] Biostrings_2.10.1 IRanges_1.0.2
loaded via a namespace (and not attached):
[1] grid_2.8.0 lattice_0.17-15 Matrix_0.999375-16 RCurl_0.92-0
[5] tools_2.8.0 XML_1.94-0.1
Best,
Jim
>
>> listMarts(archive=T)
> biomart version
> 1 ensembl_mart_47 ENSEMBL GENES 47 (SANGER)
> 2 genomic_features_mart_47 Genomic Features
> 3 snp_mart_47 SNP
> 4 vega_mart_47 Vega
> 5 compara_mart_homology_47 Compara homology
> 6 compara_mart_multiple_ga_47 Compara multiple alignments
> 7 compara_mart_pairwise_ga_47 Compara pairwise alignments
> 8 ensembl_mart_46 ENSEMBL GENES 46 (SANGER)
> 9 genomic_features_mart_46 Genomic Features
> 10 snp_mart_46 SNP
> 11 vega_mart_46 Vega
> 12 compara_mart_homology_46 Compara homology
> 13 compara_mart_multiple_ga_46 Compara multiple alignments
> 14 compara_mart_pairwise_ga_46 Compara pairwise alignments
> 15 ensembl_mart_45 ENSEMBL GENES 45 (SANGER)
> 16 snp_mart_45 SNP
> 17 vega_mart_45 Vega
> 18 compara_mart_homology_45 Compara homology
> 19 compara_mart_multiple_ga_45 Compara multiple alignments
> 20 compara_mart_pairwise_ga_45 Compara pairwise alignments
> 21 ensembl_mart_44 ENSEMBL GENES 44 (SANGER)
> 22 snp_mart_44 SNP
> 23 vega_mart_44 Vega
> 24 compara_mart_homology_44 Compara homology
> 25 compara_mart_pairwise_ga_44 Compara pairwise alignments
> 26 ensembl_mart_43 ENSEMBL GENES 43 (SANGER)
> 27 snp_mart_43 SNP
> 28 vega_mart_43 Vega
> 29 compara_mart_homology_43 Compara homology
> 30 compara_mart_pairwise_ga_43 Compara pairwise alignments
>
>> sessionInfo()
> R version 2.8.0 (2008-10-20)
> i386-apple-darwin9.5.0
>
> locale:
> C
>
> attached base packages:
> [1] tools stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] rat2302cdf_2.3.0 biomaRt_1.99.2 affy_1.20.0 Biobase_2.2.1
>
> loaded via a namespace (and not attached):
> [1] RCurl_0.94-0 XML_1.98-1 affyio_1.10.1
> [4] preprocessCore_1.4.0
>
> cheers,
> Jesper Ryge, PhD student
> Karolinska Institutet
> Dep. of Neuroscience
>
--
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-0646
734-936-8662