[BioC] filtering probes in affymetrix data

Wed Feb 19 21:49:09 CET 2014

Hi Julia,

Here is an example. Here I am using a different array, and I only have 
three samples per group, but otherwise the steps are the same.

 > eset <- rma(dat)
 > eset.filt <- getMainProbes(eset)
 > library(genefilter)
 > ind <- genefilter(eset.filt, filterfun(kOverA(3, 6)))
 > eset.dblfilt <- eset.filt[ind,]
 > dim(eset)
Features  Samples
    70523       12
 > dim(eset.filt)
Features  Samples
    68993       12
 > dim(eset.dblfilt)
Features  Samples
    13445       12

Best,

Jim

On 2/19/2014 3:13 PM, Sabet, Julia A wrote:
> Thank you.  I couldn't figure out how to filter out controls.  What would the code be to filter out both controls and background using kOverA as you suggested?
> Thanks
> Julia
>   
>
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at uw.edu]
> Sent: Thursday, February 13, 2014 6:21 PM
> To: Sabet, Julia A
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] filtering probes in affymetrix data
>
> OK, well that's probably excessive. I sort of wonder about Affy and their control probes. I originally wrote getMainProbes because the intronic controls were always popping up in lists of differentially expressed probesets. I wonder if these antigenomic probes are actually measuring something.
>
> You might be better served to just filter out the controls right after running rma(), and then use something less conservative like
>
> filt <- filterfun(kOverA(12, 6)
>
> or you could look at the other options in genefilter like varFilter().
>
> Best,
>
> Jim
>
> On Thursday, February 13, 2014 5:48:59 PM, Sabet, Julia A wrote:
>> It gives me this:
>> Features  Samples
>>       2426       36
>>
>> -----Original Message-----
>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>> Sent: Thursday, February 13, 2014 5:47 PM
>> To: Sabet, Julia A
>> Cc: bioconductor at r-project.org
>> Subject: Re: [BioC] filtering probes in affymetrix data
>>
>> How many probesets do you have left after the genefilter step (e.g., what does dim(eset.filt) give you)?
>>
>> On Thursday, February 13, 2014 4:19:11 PM, Sabet, Julia A wrote:
>>> Could it be because I did the kOverA function twice, since the first time I put a 5 and I meant to put a 12?  I thought it would just get overrided but maybe it didn't?
>>>
>>> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner
>>> join
>>> + featureSet on core_mps.fsetid=featureSet.fsetid where
>>> + featureSet.type='4';")
>>>> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
>>> + probs=0.95)
>>>> minval <- max(bkg)
>>>> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt <-
>>>> eset[ind,] ind <- genefilter(eset, filterfun(kOverA(12, minval)))
>>>> eset.filt <- eset[ind,]
>>>> library(affycoretools)
>>> eset.filt <- getMainProbes(eset.filt) Error in orig[[nm]][i, , ...,
>>> drop = drop] :
>>>      (subscript) logical subscript too long
>>>
>>> -----Original Message-----
>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>> Sent: Thursday, February 13, 2014 4:05 PM
>>> To: Sabet, Julia A
>>> Cc: bioconductor at r-project.org
>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>
>>> OK, what do you get if you type
>>>
>>> getMainProbes
>>>
>>> at an R prompt? It shouldn't be subsetting the ExpressionSet twice like that.
>>>
>>> Jim
>>>
>>>
>>>
>>> On Thursday, February 13, 2014 3:59:38 PM, Sabet, Julia A wrote:
>>>> I get this:
>>>>
>>>>> traceback()
>>>> 3: input[ind, ]
>>>> 2: input[ind, ]
>>>> 1: getMainProbes(eset.filt)
>>>> -----Original Message-----
>>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>>> Sent: Thursday, February 13, 2014 3:57 PM
>>>> To: Sabet, Julia A
>>>> Cc: bioconductor at r-project.org
>>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>>
>>>> What do you get when you run
>>>>
>>>> traceback()
>>>>
>>>> right after that error?
>>>>
>>>> Jim
>>>>
>>>>
>>>> On 2/13/2014 3:51 PM, Sabet, Julia A wrote:
>>>>> Thank you both.  Solved that problem.  Now when I do the next line of code, I get another error message:
>>>>>
>>>>>> eset.filt <- getMainProbes(eset.filt)
>>>>> Error in orig[[nm]][i, , ..., drop = drop] :
>>>>>        (subscript) logical subscript too long
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>>>> Sent: Thursday, February 13, 2014 3:44 PM
>>>>> To: Sabet, Julia A
>>>>> Cc: bioconductor at r-project.org
>>>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>>>
>>>>> Hi Julia,
>>>>>
>>>>> On 2/13/2014 3:23 PM, Sabet, Julia A wrote:
>>>>>> Thank you Jim.  I think my R version is up to date and I am making sure to use "library()".  I started the whole thing over and now I have this new error message, at an earlier step:
>>>>>>
>>>>>> library(pd.mogene.2.0.st)
>>>>>>> con <- db(pd.mogene.2.0.st)
>>>>>>> dbGetQuery(con, "select * from type_dict;")
>>>>>>          type                   type_id
>>>>>> 1     1                      main
>>>>>> 2     2             control->affx
>>>>>> 3     3             control->chip
>>>>>> 4     4 control->bgp->antigenomic
>>>>>> 5     5     control->bgp->genomic
>>>>>> 6     6            normgene->exon
>>>>>> 7     7          normgene->intron
>>>>>> 8     8  rescue->FLmRNA->unmapped
>>>>>> 9     9  control->affx->bac_spike
>>>>>> 10   10            oligo_spike_in
>>>>>> 11   11           r1_bac_spike_at
>>>>>>> table(dbGetQuery(con, "select type from featureSet;")[,1])
>>>>>>            1      2      4      7      9
>>>>>> 263551     18     23   5331     18
>>>>>>> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner
>>>>>>> join
>>>>>> + featureSet on core_mps.fsetid=featureSet.fsetid where
>>>>>> + featureSet.type='4';")
>>>>>>> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
>>>>>> + probs=0.95)
>>>>>>> library(genefilter)
>>>>>>> minval <- max(bkg)
>>>>>>> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt
>>>>>>> <-
>>>>> The above line has a bit extra at the end that R doesn't like.
>>>>>
>>>>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt"
>>>>> And this is your hint. Error messages are your friends.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>>> ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt
>>>>>>> <- eset[ind,]
>>>>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt"
>>>>>>> ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt
>>>>>>> <-
>>>>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt"
>>>>>> Here is my sessionInfo() output:
>>>>>>
>>>>>> R version 3.0.2 (2013-09-25)
>>>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
>>>>>> [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> other attached packages:
>>>>>>        [1] BiocInstaller_1.12.0                  genefilter_1.44.0                     mogene20sttranscriptcluster.db_2.13.0
>>>>>>        [4] org.Mm.eg.db_2.10.1                   AnnotationDbi_1.24.0                  pd.mogene.2.0.st_2.12.0
>>>>>>        [7] RSQLite_0.11.4                        DBI_0.2-7                             oligo_1.26.1
>>>>>> [10] Biostrings_2.30.1                     XVector_0.2.0                         IRanges_1.20.6
>>>>>> [13] Biobase_2.22.0                        oligoClasses_1.24.0                   BiocGenerics_0.8.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>>        [1] affxparser_1.34.0     affyio_1.30.0         annotate_1.40.0       bit_1.1-11            codetools_0.2-8       ff_2.2-12
>>>>>>        [7] foreach_1.4.1         GenomicRanges_1.14.4  iterators_1.0.6       preprocessCore_1.24.0 splines_3.0.2         stats4_3.0.2
>>>>>> [13] survival_2.37-7       tools_3.0.2           XML_3.98-1.1          xtable_1.7-1          zlibbioc_1.8.0
>>>>>> I appreciate your help...
>>>>>> Julia
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>>>>> Sent: Thursday, February 13, 2014 12:08 PM
>>>>>> To: Sabet, Julia A
>>>>>> Cc: bioconductor at r-project.org
>>>>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>>>>
>>>>>> Hi Julia,
>>>>>>
>>>>>> You should always include the output from sessionInfo() with any questions, so we can see what versions you are running, and what you have loaded.
>>>>>>
>>>>>> My guess is you are using an old version of R, prior to the introduction of that function, or you forgot to do library(affycoretools).
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>> On Thursday, February 13, 2014 12:03:54 PM, Sabet, Julia A wrote:
>>>>>>> Thank you so much, Jim.  I did everything you recommended and everything seemed to be working and then I installed the affycoretools package and when I did:
>>>>>>> eset.filt <- getMainProbes(eset.filt)
>>>>>>>
>>>>>>> This error resulted:
>>>>>>> Error: could not find function "getMainProbes"
>>>>>>>
>>>>>>> What should I do?
>>>>>>> Thanks!
>>>>>>> Julia
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>>>>>> Sent: Thursday, February 13, 2014 9:36 AM
>>>>>>> To: Sabet, Julia A
>>>>>>> Cc: bioconductor at r-project.org
>>>>>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>>>>>
>>>>>>> Hi Julia,
>>>>>>>
>>>>>>> There are several different things you can do. I'll show you one possibility.
>>>>>>>
>>>>>>> First, note that there are multiple different control probes on
>>>>>>> this array that aren't intended to measure differential
>>>>>>> expression, and should be excluded. So first let's look at the
>>>>>>> possible types of
>>>>>>> probesets:
>>>>>>>
>>>>>>>> library(pd.mogene.2.0.st)
>>>>>>>> con <- db(pd.mogene.2.0.st)
>>>>>>>> dbGetQuery(con, "select * from type_dict;")
>>>>>>>           type                   type_id
>>>>>>> 1     1                      main
>>>>>>> 2     2             control->affx
>>>>>>> 3     3             control->chip
>>>>>>> 4     4 control->bgp->antigenomic
>>>>>>> 5     5     control->bgp->genomic
>>>>>>> 6     6            normgene->exon
>>>>>>> 7     7          normgene->intron
>>>>>>> 8     8  rescue->FLmRNA->unmapped
>>>>>>> 9     9  control->affx->bac_spike
>>>>>>> 10   10            oligo_spike_in
>>>>>>> 11   11           r1_bac_spike_at
>>>>>>>
>>>>>>> These are all the possible types of probesets, but we don't have all of them on this array. To see which ones we do have we can do:
>>>>>>>
>>>>>>>
>>>>>>>> table(dbGetQuery(con, "select type from featureSet;")[,1])
>>>>>>>             1      2      4      7      9
>>>>>>> 263551     18     23   5331     18
>>>>>>>
>>>>>>> So we only have these probeset types:
>>>>>>>
>>>>>>> 1     1                      main
>>>>>>> 2     2             control->affx
>>>>>>> 4     4 control->bgp->antigenomic
>>>>>>> 7     7          normgene->intron
>>>>>>> 9     9  control->affx->bac_spike
>>>>>>>
>>>>>>> And the 'main' probesets are those that we want to use for
>>>>>>> differential expression. Now one thing you could do is to say
>>>>>>> that the antigenomic probesets should give a good measure of
>>>>>>> background, as they are supposed to have sequences that don't
>>>>>>> exist in mice. So you could just extract those probesets, get
>>>>>>> some measure and use that as the lower limit of what you think is expressed or not.
>>>>>>> That's pretty naive, as a probe with higher GC content will have
>>>>>>> higher background than one with a lower GC content, but worrying
>>>>>>> about that is way beyond what I am prepared to go into.
>>>>>>>
>>>>>>> Now we can get the probeset IDs for the antigenomic probesets
>>>>>>>
>>>>>>> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner
>>>>>>> join featureSet on core_mps.fsetid=featureSet.fsetid where
>>>>>>> featureSet.type='4';")
>>>>>>>
>>>>>>> And then extract those probesets and get a summary statistic.
>>>>>>>
>>>>>>> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
>>>>>>> probs=0.95)
>>>>>>>
>>>>>>> Which will give us the 95th percentile of these background probes.
>>>>>>> You could then use the kOverA function in genefilter to filter
>>>>>>> out any probesets where all samples are below the background values.
>>>>>>> The idea being that you want to filter out any probesets unless k
>>>>>>> samples have expression levels >= A. So if you have 10 samples,
>>>>>>> where 5 are controls and 5 are treated, you would do something
>>>>>>> like
>>>>>>>
>>>>>>> minval <- max(bkg)
>>>>>>> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt
>>>>>>> <- eset[ind,]
>>>>>>>
>>>>>>> You should also filter out all the non-main probesets. You can do
>>>>>>> that using getMainProbes() in the affycoretools package
>>>>>>>
>>>>>>> eset.filt <- getMainProbes(eset.filt)
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wednesday, February 12, 2014 10:16:31 PM, Sabet, Julia A wrote:
>>>>>>>> Hello all,
>>>>>>>> I am totally new to R/Bioconductor and have begun processing data from my Affymetrix Mouse Gene 2.0 ST arrays.  I normalized the data like this:
>>>>>>>>
>>>>>>>> library(pd.mogene.2.0.st)
>>>>>>>> eset <- rma(affyRaw)
>>>>>>>>
>>>>>>>> and added gene annotation and I am following the limma user's
>>>>>>>> guide, which recommends removing "probes that appear not be expressed in any of the experimental conditions."  I have read on previous posts that filtering may not be necessary.  Should I filter, and if so, how?  Using what code?
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>> Julia Sabet
>>>>>>>>
>>>>>>>> 	[[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conducto
>>>>>>>> r
>>>>>>> --
>>>>>>> James W. MacDonald, M.S.
>>>>>>> Biostatistician
>>>>>>> University of Washington
>>>>>>> Environmental and Occupational Health Sciences
>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>> Seattle WA 98105-6099
>>>>>> --
>>>>>> James W. MacDonald, M.S.
>>>>>> Biostatistician
>>>>>> University of Washington
>>>>>> Environmental and Occupational Health Sciences
>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>> Seattle WA 98105-6099
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> University of Washington
>>>>> Environmental and Occupational Health Sciences
>>>>> 4225 Roosevelt Way NE, # 100
>>>>> Seattle WA 98105-6099
>>>>>
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>>>
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099