[BioC] biomaRt and Ensembl probe set filter....
Cei Abreu-Goodger
cei at ebi.ac.uk
Mon Jan 26 15:24:47 CET 2009
James W. MacDonald wrote:
> Hi Jesper,
>
> Jesper Ryge wrote:
>> Hi everybody
>>
>> q1. I have been using biomaRt to filter Affymetrix probe sets prior to
>> statistical testing with tools such as limma or cyberT. That is, I only
>> include probe sets that are annotated in Ensembl. In this way I get rid
>> of probe sets that do not align correctly to the intended genes - at
>> least that was my intention. I know this has been debated before
>> (cdf files and filtering of misaligned probe sets), and I
>> find this to be the easiest way to exclude probes that might hybridize
>> to the wrong transcripts.
>> I now find that since 2007 the number of annotated probe sets on the
>> Affymetrix Rat 230_2 has decreased from 17931 to 12919 out of 31099 (I
>> was redoing some analysis and found this discrepancy between the
>> analysis I did in 2007 and the one run against the new Ensembl
>> database). I find that to be a rather drastic decrease, but perhaps
>> that's not so? In essence I "lose" a lot of probes, but if those that
>> are filtered out are "false positives" it is of course worth it! That
>> was my logic so far at least... So, first I would like to know whether
>> anybody considers this strategy wise or unwise. It just seems a bit
>> surprising to me that the probe sets on the Affy chips mismatch to such
>> a large extent that only roughly a third of the probes remain in the
>> analysis?
>
> I think you are making a pretty strong assumption here. Do you know how
> Ensembl is annotating Affy Probe IDs to transcript? It seems to me that
> you are assuming that Ensembl is somehow checking to see what transcript
> the probes are complementary to, whereas they may in fact be simply
> taking data from Affy and accepting them verbatim. I personally have no
> idea, but would want to know that before I filtered data in this way.
From http://www.ensembl.org/info/docs/microarray_probe_set_mapping.html:

Step One: Genome Sequence Mapping
In the first step individual probes (oligonucleotides) are mapped to the
genome sequence. The Ensembl analysis and annotation pipeline uses the
Exonerate sequence comparison and alignment tool (Slater et al., 2005)
and tolerates only 1 bp mismatch between the probe and the genome
sequence assembly. Probes that hit to 100 or more locations (e.g.
suspected Alu repeats) are discarded and not stored in the database.

Step Two: Ensembl Transcript Mapping
In the second step, we aim to associate microarray probe sets with
Ensembl transcript predictions (ENST...). Individual probes are grouped
into probe sets and generally it is required that more than 50% of the
probes in a probe set hit a given transcript sequence. Probe set sizes
are determined dynamically on a per probe set basis, rather than taking
the array-wide value documented by the manufacturer. Transcript cDNA
sequences are extended by the length of the UTR. Where annotated UTRs
are absent a default UTR length is used, calculated for both five and
three prime UTRs as the highest of either the mean or the median of all
annotated UTRs for a given species. Probes mapping across exon
boundaries are not currently captured as the transcript annotations are
based on the genomic mappings from step one.
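
To make the step-two rule concrete, here is a minimal base R sketch of how I read it: a probe set is assigned to a transcript only when more than 50% of its probes hit that transcript, with the probe set size counted from the data. This is not the Ensembl pipeline code. The function name map_probe_sets, the column names and all IDs are made up for illustration.

```r
# Toy probe-to-probe-set table; the set size is counted from this table
# rather than taken from an array-wide value. All IDs are hypothetical.
probes <- data.frame(
  probe_set = c(rep("ps1", 4), rep("ps2", 4)),
  probe     = c("p1", "p2", "p3", "p4", "q1", "q2", "q3", "q4"),
  stringsAsFactors = FALSE
)

# Toy probe-to-transcript hits (one row per probe/transcript pair).
hits <- data.frame(
  probe_set  = c("ps1", "ps1", "ps1", "ps1", "ps2", "ps2", "ps2"),
  probe      = c("p1",  "p2",  "p3",  "p1",  "q1",  "q2",  "q1"),
  transcript = c("T1",  "T1",  "T1",  "T2",  "T3",  "T3",  "T4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep probe set/transcript pairs where strictly more
# than `min_fraction` of the probes in the set hit the transcript.
map_probe_sets <- function(probes, hits, min_fraction = 0.5) {
  set_size <- table(probes$probe_set)
  hits <- unique(hits[, c("probe_set", "probe", "transcript")])
  n_hit <- aggregate(probe ~ probe_set + transcript, data = hits, FUN = length)
  names(n_hit)[3] <- "n_probes_hit"
  n_hit$fraction <- n_hit$n_probes_hit /
    as.numeric(set_size[as.character(n_hit$probe_set)])
  n_hit[n_hit$fraction > min_fraction, ]
}

mapped <- map_probe_sets(probes, hits)
print(mapped)
```

In this toy example ps1 is assigned to T1 (3 of 4 probes), while ps2 ends up with no transcript at all because exactly half of its probes hit T3. Under this reading, a probe set can lose its annotation between releases simply because the transcript models changed.
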
>
>>
>> I then wanted to check this decrease in Affy-annotated probe sets,
>> which leads me to question 2, a pure biomaRt issue:
>>
>> q2. I wish to access earlier Ensembl versions to check, and possibly
>> make a graph of, the decrease in annotated probe sets for the rat
>> 230_2 chip over time. But I run into a problem:
>>
>> > mart <- useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
>> Error in useMart("ensembl_mart_46", dataset =
>> "rnorvegicus_gene_ensembl", : Incorrect BioMart name, use the
>> listMarts function to see which BioMart databases are available
>>
>> though they are listed in the archive:
>
> I don't know if this is the problem, but you have mixed a devel version
> of biomaRt in your release version of R. This works for me with a
> release version of biomaRt:
>
> > mart <- useMart("ensembl_mart_46",dataset="rnorvegicus_gene_ensembl",archive=T)
> Checking attributes and filters ... ok
>
> > sessionInfo()
> R version 2.8.0 (2008-10-20)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] biomaRt_1.16.0 fortunes_1.3-6
> [3] RMySQL_0.6-1 DBI_0.2-4
> [5] BSgenome.Hsapiens.UCSC.hg18_1.3.11 BSgenome_1.10.1
> [7] Biostrings_2.10.1 IRanges_1.0.2
>
> loaded via a namespace (and not attached):
> [1] grid_2.8.0 lattice_0.17-15 Matrix_0.999375-16 RCurl_0.92-0
> [5] tools_2.8.0 XML_1.94-0.1
>
> Best,
>
> Jim
>
>
>>
>> > listMarts(archive=T)
>>    biomart                      version
>> 1  ensembl_mart_47              ENSEMBL GENES 47 (SANGER)
>> 2  genomic_features_mart_47     Genomic Features
>> 3  snp_mart_47                  SNP
>> 4  vega_mart_47                 Vega
>> 5  compara_mart_homology_47     Compara homology
>> 6  compara_mart_multiple_ga_47  Compara multiple alignments
>> 7  compara_mart_pairwise_ga_47  Compara pairwise alignments
>> 8  ensembl_mart_46              ENSEMBL GENES 46 (SANGER)
>> 9  genomic_features_mart_46     Genomic Features
>> 10 snp_mart_46                  SNP
>> 11 vega_mart_46                 Vega
>> 12 compara_mart_homology_46     Compara homology
>> 13 compara_mart_multiple_ga_46  Compara multiple alignments
>> 14 compara_mart_pairwise_ga_46  Compara pairwise alignments
>> 15 ensembl_mart_45              ENSEMBL GENES 45 (SANGER)
>> 16 snp_mart_45                  SNP
>> 17 vega_mart_45                 Vega
>> 18 compara_mart_homology_45     Compara homology
>> 19 compara_mart_multiple_ga_45  Compara multiple alignments
>> 20 compara_mart_pairwise_ga_45  Compara pairwise alignments
>> 21 ensembl_mart_44              ENSEMBL GENES 44 (SANGER)
>> 22 snp_mart_44                  SNP
>> 23 vega_mart_44                 Vega
>> 24 compara_mart_homology_44     Compara homology
>> 25 compara_mart_pairwise_ga_44  Compara pairwise alignments
>> 26 ensembl_mart_43              ENSEMBL GENES 43 (SANGER)
>> 27 snp_mart_43                  SNP
>> 28 vega_mart_43                 Vega
>> 29 compara_mart_homology_43     Compara homology
>> 30 compara_mart_pairwise_ga_43  Compara pairwise alignments
>>
>> > sessionInfo()
>> R version 2.8.0 (2008-10-20)
>> i386-apple-darwin9.5.0
>>
>> locale:
>> C
>>
>> attached base packages:
>> [1] tools     stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>> [1] rat2302cdf_2.3.0 biomaRt_1.99.2   affy_1.20.0      Biobase_2.2.1
>>
>> loaded via a namespace (and not attached):
>> [1] RCurl_0.94-0         XML_1.98-1           affyio_1.10.1
>> [4] preprocessCore_1.4.0
>>
>> cheers,
>> Jesper Ryge, PhD student
>> Karolinska Institutet
>> Dep. of Neuroscience
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
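
For q2, once the archive marts can be reached, counting the annotated probe sets per release could look roughly like the sketch below. The helper count_annotated and the toy data frames are made up for illustration. The attribute name affy_rat230_2 is an assumption on my part and should be checked with listAttributes(). The biomaRt calls are left as comments because they need a working connection.

```r
# Hypothetical helper: count how many of the chip's probe sets appear in a
# getBM()-style result. `bm` is assumed to be a data frame with one column
# holding probe set IDs; `column` names that column (assumed attribute name).
count_annotated <- function(bm, all_ids, column = "affy_rat230_2") {
  ids <- unique(bm[[column]])
  ids <- ids[!is.na(ids) & ids != ""]   # drop genes without a probe set
  sum(all_ids %in% ids)
}

# Toy stand-ins for two releases (made-up IDs, not real query results).
all_ids <- c("A_at", "B_at", "C_at", "D_at")
release_old <- data.frame(affy_rat230_2 = c("A_at", "B_at", "B_at", "C_at", ""),
                          stringsAsFactors = FALSE)
release_new <- data.frame(affy_rat230_2 = c("A_at", "C_at", NA),
                          stringsAsFactors = FALSE)

counts <- sapply(list(old = release_old, new = release_new),
                 count_annotated, all_ids = all_ids)
print(counts)

# With a working connection, each data frame might come from something like:
#   mart <- useMart("ensembl_mart_46", dataset = "rnorvegicus_gene_ensembl",
#                   archive = TRUE)
#   bm   <- getBM(attributes = c("ensembl_gene_id", "affy_rat230_2"),
#                 mart = mart)
# and all_ids from the cdf environment, e.g. ls(rat2302cdf).
```

Plotting counts against the release number would then give the graph of annotated probe sets over time.
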