[BioC] filtering probes in affymetrix data
James W. MacDonald
jmacdon at uw.edu
Wed Feb 19 21:49:09 CET 2014
Hi Julia,
Here is an example. Here I am using a different array, and I only have
three samples per group, but otherwise the steps are the same.
> eset <- rma(dat)
> eset.filt <- getMainProbes(eset)
> library(genefilter)
> ind <- genefilter(eset.filt, filterfun(kOverA(3, 6)))
> eset.dblfilt <- eset.filt[ind,]
> dim(eset)
Features Samples
70523 12
> dim(eset.filt)
Features Samples
68993 12
> dim(eset.dblfilt)
Features Samples
13445 12
Best,
Jim
On 2/19/2014 3:13 PM, Sabet, Julia A wrote:
> Thank you. I couldn't figure out how to filter out controls. What would the code be to filter out both controls and background using kOverA as you suggested?
> Thanks
> Julia
>
>
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at uw.edu]
> Sent: Thursday, February 13, 2014 6:21 PM
> To: Sabet, Julia A
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] filtering probes in affymetrix data
>
> OK, well that's probably excessive. I sort of wonder about Affy and their control probes. I originally wrote getMainProbes because the intronic controls were always popping up in lists of differentially expressed probesets. I wonder if these antigenomic probes are actually measuring something.
>
> You might be better served to just filter out the controls right after running rma(), and then use something less conservative like
>
> filt <- filterfun(kOverA(12, 6)
>
> or you could look at the other options in genefilter like varFilter().
>
> Best,
>
> Jim
>
> On Thursday, February 13, 2014 5:48:59 PM, Sabet, Julia A wrote:
>> It gives me this:
>> Features Samples
>> 2426 36
>>
>> -----Original Message-----
>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>> Sent: Thursday, February 13, 2014 5:47 PM
>> To: Sabet, Julia A
>> Cc: bioconductor at r-project.org
>> Subject: Re: [BioC] filtering probes in affymetrix data
>>
>> How many probesets do you have left after the genefilter step (e.g., what does dim(eset.filt) give you)?
>>
>> On Thursday, February 13, 2014 4:19:11 PM, Sabet, Julia A wrote:
>>> Could it be because I did the kOverA function twice, since the first time I put a 5 and I meant to put a 12? I thought it would just get overrided but maybe it didn't?
>>>
>>> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner
>>> join
>>> + featureSet on core_mps.fsetid=featureSet.fsetid where
>>> + featureSet.type='4';")
>>>> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
>>> + probs=0.95)
>>>> minval <- max(bkg)
>>>> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt <-
>>>> eset[ind,] ind <- genefilter(eset, filterfun(kOverA(12, minval)))
>>>> eset.filt <- eset[ind,]
>>>> library(affycoretools)
>>> eset.filt <- getMainProbes(eset.filt) Error in orig[[nm]][i, , ...,
>>> drop = drop] :
>>> (subscript) logical subscript too long
>>>
>>> -----Original Message-----
>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>> Sent: Thursday, February 13, 2014 4:05 PM
>>> To: Sabet, Julia A
>>> Cc: bioconductor at r-project.org
>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>
>>> OK, what do you get if you type
>>>
>>> getMainProbes
>>>
>>> at an R prompt? It shouldn't be subsetting the ExpressionSet twice like that.
>>>
>>> Jim
>>>
>>>
>>>
>>> On Thursday, February 13, 2014 3:59:38 PM, Sabet, Julia A wrote:
>>>> I get this:
>>>>
>>>>> traceback()
>>>> 3: input[ind, ]
>>>> 2: input[ind, ]
>>>> 1: getMainProbes(eset.filt)
>>>> -----Original Message-----
>>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>>> Sent: Thursday, February 13, 2014 3:57 PM
>>>> To: Sabet, Julia A
>>>> Cc: bioconductor at r-project.org
>>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>>
>>>> What do you get when you run
>>>>
>>>> traceback()
>>>>
>>>> right after that error?
>>>>
>>>> Jim
>>>>
>>>>
>>>> On 2/13/2014 3:51 PM, Sabet, Julia A wrote:
>>>>> Thank you both. Solved that problem. Now when I do the next line of code, I get another error message:
>>>>>
>>>>>> eset.filt <- getMainProbes(eset.filt)
>>>>> Error in orig[[nm]][i, , ..., drop = drop] :
>>>>> (subscript) logical subscript too long
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>>>> Sent: Thursday, February 13, 2014 3:44 PM
>>>>> To: Sabet, Julia A
>>>>> Cc: bioconductor at r-project.org
>>>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>>>
>>>>> Hi Julia,
>>>>>
>>>>> On 2/13/2014 3:23 PM, Sabet, Julia A wrote:
>>>>>> Thank you Jim. I think my R version is up to date and I am making sure to use "library()". I started the whole thing over and now I have this new error message, at an earlier step:
>>>>>>
>>>>>> library(pd.mogene.2.0.st)
>>>>>>> con <- db(pd.mogene.2.0.st)
>>>>>>> dbGetQuery(con, "select * from type_dict;")
>>>>>> type type_id
>>>>>> 1 1 main
>>>>>> 2 2 control->affx
>>>>>> 3 3 control->chip
>>>>>> 4 4 control->bgp->antigenomic
>>>>>> 5 5 control->bgp->genomic
>>>>>> 6 6 normgene->exon
>>>>>> 7 7 normgene->intron
>>>>>> 8 8 rescue->FLmRNA->unmapped
>>>>>> 9 9 control->affx->bac_spike
>>>>>> 10 10 oligo_spike_in
>>>>>> 11 11 r1_bac_spike_at
>>>>>>> table(dbGetQuery(con, "select type from featureSet;")[,1])
>>>>>> 1 2 4 7 9
>>>>>> 263551 18 23 5331 18
>>>>>>> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner
>>>>>>> join
>>>>>> + featureSet on core_mps.fsetid=featureSet.fsetid where
>>>>>> + featureSet.type='4';")
>>>>>>> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
>>>>>> + probs=0.95)
>>>>>>> library(genefilter)
>>>>>>> minval <- max(bkg)
>>>>>>> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt
>>>>>>> <-
>>>>> The above line has a bit extra at the end that R doesn't like.
>>>>>
>>>>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt"
>>>>> And this is your hint. Error messages are your friends.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>>> ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt
>>>>>>> <- eset[ind,]
>>>>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt"
>>>>>>> ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt
>>>>>>> <-
>>>>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt"
>>>>>> Here is my sessionInfo() output:
>>>>>>
>>>>>> R version 3.0.2 (2013-09-25)
>>>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
>>>>>> [4] LC_NUMERIC=C LC_TIME=English_United States.1252
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] BiocInstaller_1.12.0 genefilter_1.44.0 mogene20sttranscriptcluster.db_2.13.0
>>>>>> [4] org.Mm.eg.db_2.10.1 AnnotationDbi_1.24.0 pd.mogene.2.0.st_2.12.0
>>>>>> [7] RSQLite_0.11.4 DBI_0.2-7 oligo_1.26.1
>>>>>> [10] Biostrings_2.30.1 XVector_0.2.0 IRanges_1.20.6
>>>>>> [13] Biobase_2.22.0 oligoClasses_1.24.0 BiocGenerics_0.8.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] affxparser_1.34.0 affyio_1.30.0 annotate_1.40.0 bit_1.1-11 codetools_0.2-8 ff_2.2-12
>>>>>> [7] foreach_1.4.1 GenomicRanges_1.14.4 iterators_1.0.6 preprocessCore_1.24.0 splines_3.0.2 stats4_3.0.2
>>>>>> [13] survival_2.37-7 tools_3.0.2 XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0
>>>>>> I appreciate your help...
>>>>>> Julia
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>>>>> Sent: Thursday, February 13, 2014 12:08 PM
>>>>>> To: Sabet, Julia A
>>>>>> Cc: bioconductor at r-project.org
>>>>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>>>>
>>>>>> Hi Julia,
>>>>>>
>>>>>> You should always include the output from sessionInfo() with any questions, so we can see what versions you are running, and what you have loaded.
>>>>>>
>>>>>> My guess is you are using an old version of R, prior to the introduction of that function, or you forgot to do library(affycoretools).
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>> On Thursday, February 13, 2014 12:03:54 PM, Sabet, Julia A wrote:
>>>>>>> Thank you so much, Jim. I did everything you recommended and everything seemed to be working and then I installed the affycoretools package and when I did:
>>>>>>> eset.filt <- getMainProbes(eset.filt)
>>>>>>>
>>>>>>> This error resulted:
>>>>>>> Error: could not find function "getMainProbes"
>>>>>>>
>>>>>>> What should I do?
>>>>>>> Thanks!
>>>>>>> Julia
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>>>>>> Sent: Thursday, February 13, 2014 9:36 AM
>>>>>>> To: Sabet, Julia A
>>>>>>> Cc: bioconductor at r-project.org
>>>>>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>>>>>
>>>>>>> Hi Julia,
>>>>>>>
>>>>>>> There are several different things you can do. I'll show you one possibility.
>>>>>>>
>>>>>>> First, note that there are multiple different control probes on
>>>>>>> this array that aren't intended to measure differential
>>>>>>> expression, and should be excluded. So first let's look at the
>>>>>>> possible types of
>>>>>>> probesets:
>>>>>>>
>>>>>>>> library(pd.mogene.2.0.st)
>>>>>>>> con <- db(pd.mogene.2.0.st)
>>>>>>>> dbGetQuery(con, "select * from type_dict;")
>>>>>>> type type_id
>>>>>>> 1 1 main
>>>>>>> 2 2 control->affx
>>>>>>> 3 3 control->chip
>>>>>>> 4 4 control->bgp->antigenomic
>>>>>>> 5 5 control->bgp->genomic
>>>>>>> 6 6 normgene->exon
>>>>>>> 7 7 normgene->intron
>>>>>>> 8 8 rescue->FLmRNA->unmapped
>>>>>>> 9 9 control->affx->bac_spike
>>>>>>> 10 10 oligo_spike_in
>>>>>>> 11 11 r1_bac_spike_at
>>>>>>>
>>>>>>> These are all the possible types of probesets, but we don't have all of them on this array. To see which ones we do have we can do:
>>>>>>>
>>>>>>>
>>>>>>>> table(dbGetQuery(con, "select type from featureSet;")[,1])
>>>>>>> 1 2 4 7 9
>>>>>>> 263551 18 23 5331 18
>>>>>>>
>>>>>>> So we only have these probeset types:
>>>>>>>
>>>>>>> 1 1 main
>>>>>>> 2 2 control->affx
>>>>>>> 4 4 control->bgp->antigenomic
>>>>>>> 7 7 normgene->intron
>>>>>>> 9 9 control->affx->bac_spike
>>>>>>>
>>>>>>> And the 'main' probesets are those that we want to use for
>>>>>>> differential expression. Now one thing you could do is to say
>>>>>>> that the antigenomic probesets should give a good measure of
>>>>>>> background, as they are supposed to have sequences that don't
>>>>>>> exist in mice. So you could just extract those probesets, get
>>>>>>> some measure and use that as the lower limit of what you think is expressed or not.
>>>>>>> That's pretty naive, as a probe with higher GC content will have
>>>>>>> higher background than one with a lower GC content, but worrying
>>>>>>> about that is way beyond what I am prepared to go into.
>>>>>>>
>>>>>>> Now we can get the probeset IDs for the antigenomic probesets
>>>>>>>
>>>>>>> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner
>>>>>>> join featureSet on core_mps.fsetid=featureSet.fsetid where
>>>>>>> featureSet.type='4';")
>>>>>>>
>>>>>>> And then extract those probesets and get a summary statistic.
>>>>>>>
>>>>>>> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
>>>>>>> probs=0.95)
>>>>>>>
>>>>>>> Which will give us the 95th percentile of these background probes.
>>>>>>> You could then use the kOverA function in genefilter to filter
>>>>>>> out any probesets where all samples are below the background values.
>>>>>>> The idea being that you want to filter out any probesets unless k
>>>>>>> samples have expression levels >= A. So if you have 10 samples,
>>>>>>> where 5 are controls and 5 are treated, you would do something
>>>>>>> like
>>>>>>>
>>>>>>> minval <- max(bkg)
>>>>>>> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt
>>>>>>> <- eset[ind,]
>>>>>>>
>>>>>>> You should also filter out all the non-main probesets. You can do
>>>>>>> that using getMainProbes() in the affycoretools package
>>>>>>>
>>>>>>> eset.filt <- getMainProbes(eset.filt)
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wednesday, February 12, 2014 10:16:31 PM, Sabet, Julia A wrote:
>>>>>>>> Hello all,
>>>>>>>> I am totally new to R/Bioconductor and have begun processing data from my Affymetrix Mouse Gene 2.0 ST arrays. I normalized the data like this:
>>>>>>>>
>>>>>>>> library(pd.mogene.2.0.st)
>>>>>>>> eset <- rma(affyRaw)
>>>>>>>>
>>>>>>>> and added gene annotation and I am following the limma user's
>>>>>>>> guide, which recommends removing "probes that appear not be expressed in any of the experimental conditions." I have read on previous posts that filtering may not be necessary. Should I filter, and if so, how? Using what code?
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>> Julia Sabet
>>>>>>>>
>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conducto
>>>>>>>> r
>>>>>>> --
>>>>>>> James W. MacDonald, M.S.
>>>>>>> Biostatistician
>>>>>>> University of Washington
>>>>>>> Environmental and Occupational Health Sciences
>>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>>> Seattle WA 98105-6099
>>>>>> --
>>>>>> James W. MacDonald, M.S.
>>>>>> Biostatistician
>>>>>> University of Washington
>>>>>> Environmental and Occupational Health Sciences
>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>> Seattle WA 98105-6099
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> University of Washington
>>>>> Environmental and Occupational Health Sciences
>>>>> 4225 Roosevelt Way NE, # 100
>>>>> Seattle WA 98105-6099
>>>>>
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>>>
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list