[BioC] biomaRt and Ensembl probe set filter....

Mon Jan 26 19:02:10 CET 2009

Hi James,

----- Original Message -----
From: Cei Abreu-Goodger <cei at ebi.ac.uk>
Date: Monday, January 26, 2009 3:24 pm
Subject: Re: [BioC] biomaRt and Ensembl probe set filter....
To: "James W. MacDonald" <jmacdon at med.umich.edu>
Cc: Jesper.Ryge at ki.se, bioC <bioconductor at stat.math.ethz.ch>

> 
> 
> James W. MacDonald wrote:
> > Hi Jesper,
> > 
> > Jesper Ryge wrote:
> >> Hi everybody
> >>
> >> q1. I have been using biomaRt to filter Affymetrix probe sets prior to
> >> statistical testing such as limma or cyberT. That is, I only include
> >> probe sets that are annotated in Ensembl. In this sense I get rid of
> >> probe sets that do not align correctly to the intended genes - at
> >> least that was my intention. I know this has been debated before,
> >> i.e. CDF files and filtering of misaligned probe sets, and I
> >> find this to be the easiest way to exclude probes that might hybridize
> >> to the wrong transcripts.
> >> I now find that since 2007 the number of annotated probe sets on the
> >> Affymetrix Rat 230_2 has decreased from 17931 -> 12919 out of 31099 (I
> >> was redoing some analysis and found this discrepancy between the
> >> analysis I did in 2007 and the one conducted on the new Ensembl
> >> database). I find that to be a rather drastic decrease, but perhaps
> >> that's not so? In essence I "lose" a lot of probes, but if those that
> >> are filtered are "false positives" it is of course worth it! That was
> >> my logic so far at least... So, first I would like to know if
> >> anybody considers this strategy wise/unwise? It just seems to me a bit
> >> surprising that the probe sets on the Affy chips mismatch to such a
> >> large extent that only roughly a third of the probes remain in the
> >> analysis?
> > 
> > I think you are making a pretty strong assumption here. Do you know how
> > Ensembl is annotating Affy Probe IDs to transcripts? It seems to me that
> > you are assuming that Ensembl is somehow checking to see what transcript
> > the probes are complementary to, whereas they may in fact be simply
> > taking data from Affy and accepting them verbatim. I personally have no
> > idea, but would want to know that before I filtered data in this way.
> From http://www.ensembl.org/info/docs/microarray_probe_set_mapping.html
> Step One: Genome Sequence Mapping
> 
> In the first step individual probes (oligonucleotides) are mapped to the
> genome sequence. The Ensembl analysis and annotation pipeline uses the
> Exonerate sequence comparison and alignment tool (Slater et al., 2005)
> and tolerates only 1 bp mismatch between the probe and the genome
> sequence assembly. Probes that hit to 100 or more locations (e.g.
> suspected Alu repeats) are discarded and not stored in the database.
> 
> Step Two: Ensembl Transcript Mapping
> 
> In the second step, we aim to associate microarray probe sets with
> Ensembl transcript predictions (ENST...). Individual probes are grouped
> into probe sets and generally it is required that more than 50% of the
> probes in a probe set hit a given transcript sequence. Probe set sizes
> are determined dynamically on a per probe set basis, rather than taking
> the array-wide value documented by the manufacturer. Transcript cDNA
> sequences are extended by the length of the UTR. Where annotated UTRs
> are absent a default UTR length is used, calculated for both five and
> three prime UTRs as the highest of either the mean or the median of all
> annotated UTRs for a given species. Probes mapping across exon
> boundaries are not currently captured as the transcript annotations are
> based on the genomic mappings from step one.
> 
> 
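
A minimal Python sketch of how the two quoted rules could be combined. It is
only an illustration, not Ensembl's pipeline: the function name, the input
dictionaries and the toy probe IDs are all hypothetical, the alignment itself
(Exonerate, 1 bp mismatch, UTR extension) is assumed to have been done already,
and "probe set size" is assumed here to mean the number of probes listed for
that probe set in the input.

```python
from collections import defaultdict

MAX_GENOME_HITS = 100   # quoted rule: probes hitting 100 or more locations are discarded
MIN_FRACTION = 0.5      # quoted rule: more than 50% of the probes must hit the transcript


def map_probe_sets(probe_to_set, genome_hits, transcript_hits):
    """Return {probe_set_id: sorted list of transcript ids} (hypothetical helper).

    probe_to_set:    {probe_id: probe_set_id}
    genome_hits:     {probe_id: number of genomic locations the probe hits}
    transcript_hits: {probe_id: set of transcript ids the probe hits}
    """
    # Assumption: probe set size is counted from the probes listed in the
    # input, not taken from an array-wide manufacturer value.
    set_size = defaultdict(int)
    for probe_set in probe_to_set.values():
        set_size[probe_set] += 1

    # Step one: keep probes that hit the genome at least once
    # and fewer than 100 times.
    kept = [p for p in probe_to_set
            if 0 < genome_hits.get(p, 0) < MAX_GENOME_HITS]

    # Step two: count, per probe set, how many kept probes hit each transcript.
    counts = defaultdict(lambda: defaultdict(int))
    for probe in kept:
        for transcript in transcript_hits.get(probe, ()):
            counts[probe_to_set[probe]][transcript] += 1

    # A transcript is assigned only if strictly more than half of the
    # probe set's probes hit it.
    return {
        probe_set: sorted(t for t, n in counts[probe_set].items()
                          if n / size > MIN_FRACTION)
        for probe_set, size in set_size.items()
    }


# Toy data (invented for illustration only).
probe_to_set = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "B"}
genome_hits = {"p1": 1, "p2": 1, "p3": 250, "p4": 1, "p5": 0}
transcript_hits = {"p1": {"T1"}, "p2": {"T1", "T2"}, "p3": {"T1"}, "p4": {"T3"}}

result = map_probe_sets(probe_to_set, genome_hits, transcript_hits)
print(result)
```

In this toy case probe set A keeps transcript T1 (2 of 3 probes) while probe
set B ends up with no transcript (1 of 2 probes is not more than 50%), which
is the kind of probe set that would show up as "not annotated" in a biomaRt
query.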

Hm, it seems to me that the Ensembl team is making a decent effort to filter out non-specific
probes... but I'm not convinced either way yet: to filter or not? If I run a test for
significantly differentially expressed genes (e.g. limma) on the full data set, I end up with a
list of genes that contains probe sets discarded by Ensembl (not annotated to a gene
or transcript, although the probe set alignment information is available). This can be because
they are not specific, or because they have mismatching probes that potentially cross-hybridize
to several other transcripts. These are obviously not very trustworthy probe sets, and they were
my initial reason for eliminating them from my data set prior to any statistical analysis (so
that they do not affect the false discovery rate)... Does anybody have any comments on or
experience with this? Surely the probes as designed by Affymetrix are not perfect,
and as the sequence quality in the databases increases it makes sense to filter out the probes
that turn out to have sequence similarity to regions not originally intended. But how?
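
To illustrate why the order of filtering and testing matters for the false
discovery rate, here is a small Python sketch with invented p-values. The
probe set names, the p-values and the set of "annotated" probe sets are all
made up, and the Benjamini-Hochberg adjustment is written out by hand rather
than taken from limma, so treat it as a sketch of the idea only.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented raw p-values for six probe sets.
raw = {"ps1": 0.001, "ps2": 0.012, "ps3": 0.03,
       "ps4": 0.2, "ps5": 0.6, "ps6": 0.9}
# Hypothetical result of the annotation filter: only these have a transcript.
annotated = {"ps1", "ps2", "ps3"}

names = list(raw)
full = dict(zip(names, bh_adjust([raw[n] for n in names])))

kept = [n for n in names if n in annotated]
filtered = dict(zip(kept, bh_adjust([raw[n] for n in kept])))

for n in kept:
    print(n, round(full[n], 4), round(filtered[n], 4))
```

With these numbers ps3 has an adjusted p-value of about 0.06 when all six
probe sets are tested and about 0.03 when only the three annotated ones are,
so it crosses a 0.05 cutoff only after filtering. The comparison is only fair
because the annotation filter is chosen without looking at the expression
data; a filter that depended on the test results would be a different matter.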

cheers,
jesper