[BioC] biomaRt and Ensembl probe set filter....

Mon Jan 26 15:24:47 CET 2009

James W. MacDonald wrote:
> Hi Jesper,
> 
> Jesper Ryge wrote:
>> Hi everybody
>>
>> q1. I have been using biomaRt to filter Affymetrix probe sets prior to
>> statistical testing with tools such as limma or cyberT. That is, I only
>> include probe sets that are annotated in Ensembl. This way I get rid of
>> probe sets that do not align correctly to the intended genes, or at
>> least that was my intention. I know this has been debated before
>> (i.e. cdf files and filtering of misaligned probe sets), and I
>> find this to be the easiest way to exclude probes that might hybridize
>> to the wrong transcripts.
>> I now find that since 2007 the number of annotated probe sets on the
>> Affymetrix Rat 230_2 has decreased from 17931 to 12919 out of 31099 (I
>> was redoing some analysis and found this discrepancy between the
>> analysis I did in 2007 and the one conducted on the new Ensembl
>> database). I find that to be a rather drastic decrease, but perhaps
>> that's not so? In essence I "lose" a lot of probes, but if those that
>> are filtered are "false positives" it is of course worth it! That was
>> my logic so far at least... So, first I would like to know if
>> anybody considers this strategy wise/unwise? It just seems to me a bit
>> surprising that the probe sets on the Affy chips mismatch to such a
>> large extent that only roughly a third of the probes remain in the
>> analysis?
> 
> I think you are making a pretty strong assumption here. Do you know how 
> Ensembl is mapping Affy probe set IDs to transcripts? It seems to me that
> you are assuming that Ensembl is somehow checking to see what transcript 
> the probes are complementary to, whereas they may in fact be simply 
> taking data from Affy and accepting them verbatim. I personally have no 
> idea, but would want to know that before I filtered data in this way.
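
For reference, the counts quoted in the question can be turned into
fractions with a few lines of R. This is only arithmetic on the three
numbers given above; the vector names are my own labels. By this
calculation about 42% of the probe sets are still annotated (down from
about 58%), which is somewhat more than a third.

```r
# Counts quoted in the original question (Affymetrix Rat 230_2)
total_probesets <- 31099
annotated <- c(ensembl_2007 = 17931, ensembl_current = 12919)

# Fraction of all probe sets carrying an Ensembl annotation
fraction_annotated <- annotated / total_probesets
print(round(fraction_annotated, 3))

# Probe sets that were annotated in 2007 but are no longer annotated
lost <- annotated[["ensembl_2007"]] - annotated[["ensembl_current"]]
lost_share <- lost / annotated[["ensembl_2007"]]
print(lost)
print(round(lost_share, 3))
```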

From http://www.ensembl.org/info/docs/microarray_probe_set_mapping.html

Step One: Genome Sequence Mapping

In the first step individual probes (oligonucleotides) are mapped to the 
genome sequence. The Ensembl analysis and annotation pipeline uses the 
Exonerate sequence comparison and alignment tool (Slater et al., 2005) 
and tolerates only 1 bp mismatch between the probe and the genome 
sequence assembly. Probes that hit to 100 or more locations (e.g. 
suspected Alu repeats) are discarded and not stored in the database.
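
To make the two criteria in this step concrete, here is a minimal R
sketch. It is not the Ensembl pipeline: the data frame and its column
names (probe_id, mismatches, n_hits) are made up for illustration, and
the real pipeline evaluates individual Exonerate alignments rather than
a per-probe summary. Only the two thresholds come from the text above.

```r
# Hypothetical per-probe alignment summary (illustrative values only)
alignments <- data.frame(
  probe_id   = c("p1", "p2", "p3", "p4"),
  mismatches = c(0, 1, 2, 0),    # mismatches against the genome sequence
  n_hits     = c(1, 3, 1, 250),  # number of genomic locations hit
  stringsAsFactors = FALSE
)

max_mismatches <- 1    # "tolerates only 1 bp mismatch"
max_hits       <- 100  # probes hitting 100 or more locations are discarded

# Keep probes that satisfy both criteria
keep <- alignments$mismatches <= max_mismatches &
  alignments$n_hits < max_hits
mapped_probes <- alignments[keep, ]
print(mapped_probes$probe_id)
```

In this toy input, p3 is dropped for having two mismatches and p4 for
hitting too many locations.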

Step Two: Ensembl Transcript Mapping

In the second step, we aim to associate microarray probe sets with 
Ensembl transcript predictions (ENST...). Individual probes are grouped 
into probe sets and generally it is required that more than 50% of the 
probes in a probe set hit a given transcript sequence. Probe set sizes 
are determined dynamically on a per probe set basis, rather than taking 
the array-wide value documented by the manufacturer. Transcript cDNA 
sequences are extended by the length of the UTR. Where annotated UTRs 
are absent a default UTR length is used, calculated for both five and 
three prime UTRs as the highest of either the mean or the median of all 
annotated UTRs for a given species. Probes mapping across exon 
boundaries are not currently captured as the transcript annotations are 
based on the genomic mappings from step one.
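
The "more than 50% of the probes in a probe set" rule can be sketched in
R as follows. Again, this is only an illustration under my own
assumptions: the two data frames, their column names and the IDs are
hypothetical, and the UTR extension described above is not modelled. The
probe set size is counted from the data, per probe set, as the text
describes.

```r
# Hypothetical probe-to-probe-set membership
probes <- data.frame(
  probe_id    = c("p1", "p2", "p3", "p4", "p5", "p6", "p7"),
  probeset_id = c("ps_A", "ps_A", "ps_A", "ps_A", "ps_B", "ps_B", "ps_B"),
  stringsAsFactors = FALSE
)

# Hypothetical probe-to-transcript hits (after the genome mapping step)
hits <- data.frame(
  probe_id      = c("p1", "p2", "p3", "p5"),
  transcript_id = c("T1", "T1", "T1", "T2"),
  stringsAsFactors = FALSE
)

# Probe set sizes determined per probe set from the data
set_size <- table(probes$probeset_id)

# Count distinct probes hitting each (probe set, transcript) pair
hits <- unique(merge(hits, probes, by = "probe_id"))
counts <- aggregate(probe_id ~ probeset_id + transcript_id,
                    data = hits, FUN = length)
names(counts)[names(counts) == "probe_id"] <- "n_probes_hit"

# Fraction of the probe set that hits the transcript
counts$fraction <- counts$n_probes_hit /
  as.numeric(set_size[as.character(counts$probeset_id)])

# Keep associations supported by more than half of the probes
annotated <- counts[counts$fraction > 0.5, ]
print(annotated)
```

Here ps_A (3 of 4 probes hit T1) keeps its annotation, while ps_B (1 of
3 probes hit T2) would end up unannotated, which is the kind of probe
set the filter in the original question removes.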

> 
>>
>> I then wanted to check this decrease in Affy annotated probe sets,
>> which leads me to question 2, a pure biomaRt issue:
>>
>> q2. I wish to access earlier Ensembl versions to check and possibly
>> make a graph of the decrease of the annotated probe sets for the rat
>> 230_2 chip over time, but I run into a problem:
>>
>>> mart <- 
>>> useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
>> Error in useMart("ensembl_mart_46", dataset = 
>> "rnorvegicus_gene_ensembl",  :   Incorrect BioMart name, use the 
>> listMarts function to see which BioMart databases are available
>>
>> though they are listed in the archive:
> 
> I don't know if this is the problem, but you have mixed a devel version 
> of biomaRt in your release version of R. This works for me with a 
> release version of biomaRt:
> 
> mart <- 
> useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
> Checking attributes and filters ... ok
>  >
>  > sessionInfo()
> R version 2.8.0 (2008-10-20)
> i386-pc-mingw32
> 
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United 
> States.1252;LC_MONETARY=English_United 
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] biomaRt_1.16.0                     fortunes_1.3-6
> [3] RMySQL_0.6-1                       DBI_0.2-4
> [5] BSgenome.Hsapiens.UCSC.hg18_1.3.11 BSgenome_1.10.1
> [7] Biostrings_2.10.1                  IRanges_1.0.2
> 
> loaded via a namespace (and not attached):
> [1] grid_2.8.0         lattice_0.17-15    Matrix_0.999375-16 RCurl_0.92-0
> [5] tools_2.8.0        XML_1.94-0.1
> 
> Best,
> 
> Jim
> 
> 
>>
>>> listMarts(archive=T)
>>                        biomart                     version
>> 1              ensembl_mart_47   ENSEMBL GENES 47 (SANGER)
>> 2     genomic_features_mart_47            Genomic Features
>> 3                  snp_mart_47                         SNP
>> 4                 vega_mart_47                        Vega
>> 5     compara_mart_homology_47            Compara homology
>> 6  compara_mart_multiple_ga_47 Compara multiple alignments
>> 7  compara_mart_pairwise_ga_47 Compara pairwise alignments
>> 8              ensembl_mart_46   ENSEMBL GENES 46 (SANGER)
>> 9     genomic_features_mart_46            Genomic Features
>> 10                 snp_mart_46                         SNP
>> 11                vega_mart_46                        Vega
>> 12    compara_mart_homology_46            Compara homology
>> 13 compara_mart_multiple_ga_46 Compara multiple alignments
>> 14 compara_mart_pairwise_ga_46 Compara pairwise alignments
>> 15             ensembl_mart_45   ENSEMBL GENES 45 (SANGER)
>> 16                 snp_mart_45                         SNP
>> 17                vega_mart_45                        Vega
>> 18    compara_mart_homology_45            Compara homology
>> 19 compara_mart_multiple_ga_45 Compara multiple alignments
>> 20 compara_mart_pairwise_ga_45 Compara pairwise alignments
>> 21             ensembl_mart_44   ENSEMBL GENES 44 (SANGER)
>> 22                 snp_mart_44                         SNP
>> 23                vega_mart_44                        Vega
>> 24    compara_mart_homology_44            Compara homology
>> 25 compara_mart_pairwise_ga_44 Compara pairwise alignments
>> 26             ensembl_mart_43   ENSEMBL GENES 43 (SANGER)
>> 27                 snp_mart_43                         SNP
>> 28                vega_mart_43                        Vega
>> 29    compara_mart_homology_43            Compara homology
>> 30 compara_mart_pairwise_ga_43 Compara pairwise alignments
>>
>>> sessionInfo()
>> R version 2.8.0 (2008-10-20)
>> i386-apple-darwin9.5.0
>>
>> locale:
>> C
>>
>> attached base packages:
>> [1] tools     stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>> [1] rat2302cdf_2.3.0 biomaRt_1.99.2   affy_1.20.0      Biobase_2.2.1
>>
>> loaded via a namespace (and not attached):
>> [1] RCurl_0.94-0         XML_1.98-1           affyio_1.10.1
>> [4] preprocessCore_1.4.0
>>
>> cheers,
>> Jesper Ryge, PhD student
>> Karolinska Institutet
>> Dep. of Neuroscience