[BioC] filtering probes in affymetrix data

Thu Feb 13 21:23:24 CET 2014

Thank you Jim.  I think my R version is up to date and I am making sure to use "library()".  I started the whole thing over and now I have this new error message, at an earlier step:

library(pd.mogene.2.0.st)
> con <- db(pd.mogene.2.0.st)
> dbGetQuery(con, "select * from type_dict;")
   type                   type_id
1     1                      main
2     2             control->affx
3     3             control->chip
4     4 control->bgp->antigenomic
5     5     control->bgp->genomic
6     6            normgene->exon
7     7          normgene->intron
8     8  rescue->FLmRNA->unmapped
9     9  control->affx->bac_spike
10   10            oligo_spike_in
11   11           r1_bac_spike_at
> table(dbGetQuery(con, "select type from featureSet;")[,1])

     1      2      4      7      9 
263551     18     23   5331     18 
> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner join 
+ featureSet on core_mps.fsetid=featureSet.fsetid where
+ featureSet.type='4';")
> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
+ probs=0.95)
> library(genefilter)
> minval <- max(bkg)
> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt <- 
Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt"
> ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt <- eset[ind,] 
Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt"
> ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt <- 
Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt"
>

Here is my sessionInfo() output:

R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocInstaller_1.12.0                  genefilter_1.44.0                     mogene20sttranscriptcluster.db_2.13.0
 [4] org.Mm.eg.db_2.10.1                   AnnotationDbi_1.24.0                  pd.mogene.2.0.st_2.12.0              
 [7] RSQLite_0.11.4                        DBI_0.2-7                             oligo_1.26.1                         
[10] Biostrings_2.30.1                     XVector_0.2.0                         IRanges_1.20.6                       
[13] Biobase_2.22.0                        oligoClasses_1.24.0                   BiocGenerics_0.8.0                   

loaded via a namespace (and not attached):
 [1] affxparser_1.34.0     affyio_1.30.0         annotate_1.40.0       bit_1.1-11            codetools_0.2-8       ff_2.2-12            
 [7] foreach_1.4.1         GenomicRanges_1.14.4  iterators_1.0.6       preprocessCore_1.24.0 splines_3.0.2         stats4_3.0.2         
[13] survival_2.37-7       tools_3.0.2           XML_3.98-1.1          xtable_1.7-1          zlibbioc_1.8.0       
>

I appreciate your help...
Julia

-----Original Message-----
From: James W. MacDonald [mailto:jmacdon at uw.edu] 
Sent: Thursday, February 13, 2014 12:08 PM
To: Sabet, Julia A
Cc: bioconductor at r-project.org
Subject: Re: [BioC] filtering probes in affymetrix data

Hi Julia,

You should always include the output from sessionInfo() with any questions, so we can see what versions you are running, and what you have loaded.

My guess is you are using an old version of R, prior to the introduction of that function, or you forgot to do library(affycoretools).

Best,

Jim

On Thursday, February 13, 2014 12:03:54 PM, Sabet, Julia A wrote:
> Thank you so much, Jim.  I did everything you recommended and everything seemed to be working and then I installed the affycoretools package and when I did:
> eset.filt <- getMainProbes(eset.filt)
>
> This error resulted:
> Error: could not find function "getMainProbes"
>
> What should I do?
> Thanks!
> Julia
>
>
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at uw.edu]
> Sent: Thursday, February 13, 2014 9:36 AM
> To: Sabet, Julia A
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] filtering probes in affymetrix data
>
> Hi Julia,
>
> There are several different things you can do. I'll show you one possibility.
>
> First, note that there are multiple different control probes on this 
> array that aren't intended to measure differential expression, and 
> should be excluded. So first let's look at the possible types of
> probesets:
>
>> library(pd.mogene.2.0.st)
>> con <- db(pd.mogene.2.0.st)
>> dbGetQuery(con, "select * from type_dict;")
>     type                   type_id
> 1     1                      main
> 2     2             control->affx
> 3     3             control->chip
> 4     4 control->bgp->antigenomic
> 5     5     control->bgp->genomic
> 6     6            normgene->exon
> 7     7          normgene->intron
> 8     8  rescue->FLmRNA->unmapped
> 9     9  control->affx->bac_spike
> 10   10            oligo_spike_in
> 11   11           r1_bac_spike_at
>
> These are all the possible types of probesets, but we don't have all of them on this array. To see which ones we do have we can do:
>
>
>> table(dbGetQuery(con, "select type from featureSet;")[,1])
>
>       1      2      4      7      9
> 263551     18     23   5331     18
>
> So we only have these probeset types:
>
> 1     1                      main
> 2     2             control->affx
> 4     4 control->bgp->antigenomic
> 7     7          normgene->intron
> 9     9  control->affx->bac_spike
>
> And the 'main' probesets are those that we want to use for 
> differential expression. Now one thing you could do is to say that the 
> antigenomic probesets should give a good measure of background, as 
> they are supposed to have sequences that don't exist in mice. So you 
> could just extract those probesets, get some measure and use that as 
> the lower limit of what you think is expressed or not. That's pretty 
> naive, as a probe with higher GC content will have higher background 
> than one with a lower GC content, but worrying about that is way 
> beyond what I am prepared to go into.
>
> Now we can get the probeset IDs for the antigenomic probesets
>
> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner join 
> featureSet on core_mps.fsetid=featureSet.fsetid where
> featureSet.type='4';")
>
> And then extract those probesets and get a summary statistic.
>
> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
> probs=0.95)
>
> Which will give us the 95th percentile of these background probes. You 
> could then use the kOverA function in genefilter to filter out any 
> probesets where all samples are below the background values. The idea 
> being that you want to filter out any probesets unless k samples have 
> expression levels >= A. So if you have 10 samples, where 5 are 
> controls and 5 are treated, you would do something like
>
> minval <- max(bkg)
> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt <- 
> eset[ind,]
>
> You should also filter out all the non-main probesets. You can do that 
> using getMainProbes() in the affycoretools package
>
> eset.filt <- getMainProbes(eset.filt)
>
> Best,
>
> Jim
>
>
>
>
> On Wednesday, February 12, 2014 10:16:31 PM, Sabet, Julia A wrote:
>> Hello all,
>> I am totally new to R/Bioconductor and have begun processing data from my Affymetrix Mouse Gene 2.0 ST arrays.  I normalized the data like this:
>>
>> library(pd.mogene.2.0.st)
>> eset <- rma(affyRaw)
>>
>> and added gene annotation and I am following the limma user's guide, 
>> which recommends removing "probes that appear not be expressed in any of the experimental conditions."  I have read on previous posts that filtering may not be necessary.  Should I filter, and if so, how?  Using what code?
>>
>> Thank you!
>> Julia Sabet
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099