[BioC] filtering probes in affymetrix data

Thu Feb 13 22:05:07 CET 2014

OK, what do you get if you type

getMainProbes

at an R prompt? It shouldn't be subsetting the ExpressionSet twice like 
that.

Jim

On Thursday, February 13, 2014 3:59:38 PM, Sabet, Julia A wrote:
> I get this:
>
>> traceback()
> 3: input[ind, ]
> 2: input[ind, ]
> 1: getMainProbes(eset.filt)
>>
>
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at uw.edu]
> Sent: Thursday, February 13, 2014 3:57 PM
> To: Sabet, Julia A
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] filtering probes in affymetrix data
>
> What do you get when you run
>
> traceback()
>
> right after that error?
>
> Jim
>
>
> On 2/13/2014 3:51 PM, Sabet, Julia A wrote:
>> Thank you both.  Solved that problem.  Now when I do the next line of code, I get another error message:
>>
>>> eset.filt <- getMainProbes(eset.filt)
>> Error in orig[[nm]][i, , ..., drop = drop] :
>>     (subscript) logical subscript too long
>>
>>
>>
>> -----Original Message-----
>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>> Sent: Thursday, February 13, 2014 3:44 PM
>> To: Sabet, Julia A
>> Cc: bioconductor at r-project.org
>> Subject: Re: [BioC] filtering probes in affymetrix data
>>
>> Hi Julia,
>>
>> On 2/13/2014 3:23 PM, Sabet, Julia A wrote:
>>> Thank you Jim.  I think my R version is up to date and I am making sure to use "library()".  I started the whole thing over and now I have this new error message, at an earlier step:
>>>
>>> library(pd.mogene.2.0.st)
>>>> con <- db(pd.mogene.2.0.st)
>>>> dbGetQuery(con, "select * from type_dict;")
>>>       type                   type_id
>>> 1     1                      main
>>> 2     2             control->affx
>>> 3     3             control->chip
>>> 4     4 control->bgp->antigenomic
>>> 5     5     control->bgp->genomic
>>> 6     6            normgene->exon
>>> 7     7          normgene->intron
>>> 8     8  rescue->FLmRNA->unmapped
>>> 9     9  control->affx->bac_spike
>>> 10   10            oligo_spike_in
>>> 11   11           r1_bac_spike_at
>>>> table(dbGetQuery(con, "select type from featureSet;")[,1])
>>>         1      2      4      7      9
>>> 263551     18     23   5331     18
>>>> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner
>>>> join
>>> + featureSet on core_mps.fsetid=featureSet.fsetid where
>>> + featureSet.type='4';")
>>>> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
>>> + probs=0.95)
>>>> library(genefilter)
>>>> minval <- max(bkg)
>>>> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt <-
>> The above line has a bit extra at the end that R doesn't like.
>>
>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt"
>> And this is your hint. Error messages are your friends.
>>
>> Best,
>>
>> Jim
>>
>>
>>>> ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt <-
>>>> eset[ind,]
>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt"
>>>> ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt <-
>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt"
>>> Here is my sessionInfo() output:
>>>
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
>>> [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>>     [1] BiocInstaller_1.12.0                  genefilter_1.44.0                     mogene20sttranscriptcluster.db_2.13.0
>>>     [4] org.Mm.eg.db_2.10.1                   AnnotationDbi_1.24.0                  pd.mogene.2.0.st_2.12.0
>>>     [7] RSQLite_0.11.4                        DBI_0.2-7                             oligo_1.26.1
>>> [10] Biostrings_2.30.1                     XVector_0.2.0                         IRanges_1.20.6
>>> [13] Biobase_2.22.0                        oligoClasses_1.24.0                   BiocGenerics_0.8.0
>>>
>>> loaded via a namespace (and not attached):
>>>     [1] affxparser_1.34.0     affyio_1.30.0         annotate_1.40.0       bit_1.1-11            codetools_0.2-8       ff_2.2-12
>>>     [7] foreach_1.4.1         GenomicRanges_1.14.4  iterators_1.0.6       preprocessCore_1.24.0 splines_3.0.2         stats4_3.0.2
>>> [13] survival_2.37-7       tools_3.0.2           XML_3.98-1.1          xtable_1.7-1          zlibbioc_1.8.0
>>> I appreciate your help...
>>> Julia
>>>
>>> -----Original Message-----
>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>> Sent: Thursday, February 13, 2014 12:08 PM
>>> To: Sabet, Julia A
>>> Cc: bioconductor at r-project.org
>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>
>>> Hi Julia,
>>>
>>> You should always include the output from sessionInfo() with any questions, so we can see what versions you are running, and what you have loaded.
>>>
>>> My guess is you are using an old version of R, prior to the introduction of that function, or you forgot to do library(affycoretools).
>>>
>>> Best,
>>>
>>> Jim
>>>
>>> On Thursday, February 13, 2014 12:03:54 PM, Sabet, Julia A wrote:
>>>> Thank you so much, Jim.  I did everything you recommended and everything seemed to be working and then I installed the affycoretools package and when I did:
>>>> eset.filt <- getMainProbes(eset.filt)
>>>>
>>>> This error resulted:
>>>> Error: could not find function "getMainProbes"
>>>>
>>>> What should I do?
>>>> Thanks!
>>>> Julia
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>>> Sent: Thursday, February 13, 2014 9:36 AM
>>>> To: Sabet, Julia A
>>>> Cc: bioconductor at r-project.org
>>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>>
>>>> Hi Julia,
>>>>
>>>> There are several different things you can do. I'll show you one possibility.
>>>>
>>>> First, note that there are multiple different control probes on this
>>>> array that aren't intended to measure differential expression, and
>>>> should be excluded. So first let's look at the possible types of
>>>> probesets:
>>>>
>>>>> library(pd.mogene.2.0.st)
>>>>> con <- db(pd.mogene.2.0.st)
>>>>> dbGetQuery(con, "select * from type_dict;")
>>>>        type                   type_id
>>>> 1     1                      main
>>>> 2     2             control->affx
>>>> 3     3             control->chip
>>>> 4     4 control->bgp->antigenomic
>>>> 5     5     control->bgp->genomic
>>>> 6     6            normgene->exon
>>>> 7     7          normgene->intron
>>>> 8     8  rescue->FLmRNA->unmapped
>>>> 9     9  control->affx->bac_spike
>>>> 10   10            oligo_spike_in
>>>> 11   11           r1_bac_spike_at
>>>>
>>>> These are all the possible types of probesets, but we don't have all of them on this array. To see which ones we do have we can do:
>>>>
>>>>
>>>>> table(dbGetQuery(con, "select type from featureSet;")[,1])
>>>>          1      2      4      7      9
>>>> 263551     18     23   5331     18
>>>>
>>>> So we only have these probeset types:
>>>>
>>>> 1     1                      main
>>>> 2     2             control->affx
>>>> 4     4 control->bgp->antigenomic
>>>> 7     7          normgene->intron
>>>> 9     9  control->affx->bac_spike
>>>>
>>>> And the 'main' probesets are those that we want to use for
>>>> differential expression. Now one thing you could do is to say that
>>>> the antigenomic probesets should give a good measure of background,
>>>> as they are supposed to have sequences that don't exist in mice. So
>>>> you could just extract those probesets, get some measure and use
>>>> that as the lower limit of what you think is expressed or not.
>>>> That's pretty naive, as a probe with higher GC content will have
>>>> higher background than one with a lower GC content, but worrying
>>>> about that is way beyond what I am prepared to go into.
>>>>
>>>> Now we can get the probeset IDs for the antigenomic probesets
>>>>
>>>> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner
>>>> join featureSet on core_mps.fsetid=featureSet.fsetid where
>>>> featureSet.type='4';")
>>>>
>>>> And then extract those probesets and get a summary statistic.
>>>>
>>>> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
>>>> probs=0.95)
>>>>
>>>> Which will give us the 95th percentile of these background probes.
>>>> You could then use the kOverA function in genefilter to filter out
>>>> any probesets where all samples are below the background values. The
>>>> idea being that you want to filter out any probesets unless k
>>>> samples have expression levels >= A. So if you have 10 samples,
>>>> where 5 are controls and 5 are treated, you would do something like
>>>>
>>>> minval <- max(bkg)
>>>> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt <-
>>>> eset[ind,]
>>>>
>>>> You should also filter out all the non-main probesets. You can do
>>>> that using getMainProbes() in the affycoretools package
>>>>
>>>> eset.filt <- getMainProbes(eset.filt)
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>>
>>>> On Wednesday, February 12, 2014 10:16:31 PM, Sabet, Julia A wrote:
>>>>> Hello all,
>>>>> I am totally new to R/Bioconductor and have begun processing data from my Affymetrix Mouse Gene 2.0 ST arrays.  I normalized the data like this:
>>>>>
>>>>> library(pd.mogene.2.0.st)
>>>>> eset <- rma(affyRaw)
>>>>>
>>>>> and added gene annotation and I am following the limma user's
>>>>> guide, which recommends removing "probes that appear not be expressed in any of the experimental conditions."  I have read on previous posts that filtering may not be necessary.  Should I filter, and if so, how?  Using what code?
>>>>>
>>>>> Thank you!
>>>>> Julia Sabet
>>>>>
>>>>> 	[[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099