[BioC] How to read a subset of the .CEL files

Tue Jun 27 22:38:34 CEST 2006

Hi Greg,

I use some nice functions written by Ariel Chernomoretz (thanks!!) and 
posted to the Bioconductor mailing list on April 21, 2005:
http://article.gmane.org/gmane.science.biology.informatics.conductor/4258/match=, 
although I do have a caveat/question about the effects of removing probe 
sets on gcrma values (see below).

You can use the main function 'RemoveProbes' to remove individual probes 
and/or entire probe sets, and I routinely use it to remove the non-soybean 
probe sets from Affy's soybean array. I made one change to the code of 
'RemoveProbes', which was to add a call at the beginning to automatically 
load the cdf and probe packages if they weren't loaded already:

require(cdfpackagename,character.only=T)
require(probepackagename,character.only=T)

If you contact me off-list I can send you my revised code and a few 
pointers if you're having trouble figuring out how to get 'RemoveProbes' to 
work. Basically, all you need to input is a list of the probe set names 
that you want to remove.

CAVEAT: Ariel pointed out in a post on April 20, 2005 
(http://article.gmane.org/gmane.science.biology.informatics.conductor/4242/match=) 
that removing even one probe will slightly affect the resulting gcrma 
values, because of the sampling of affinity values, likely in the internal 
function 'bg.parameters.ns'. I have found this effect to be greatly 
increased when I remove the non-soybean probe sets (~40% of the array) 
before running 'gcrma': 88% of the remaining probe sets have smaller gcrma 
values than before, and 25% have been decreased 1.4 fold or more. The 
changes occur in the background adjustment step, not in the normalization 
or summarization steps. If you remove a random 40% of the probe sets, there 
isn't the same magnitude (90% <  1.05 fold) or bias in direction (45% 
smaller) of changes . That's when I noticed that the distribution of 
expression values in my samples for the soybean genes was very different 
from the distribution of non-soybean genes. Nearly 80% of the soybean genes 
are called present in my samples but only 5% of the non-soybean genes are 
called present.

My question/concern is why a change in the distribution of probe values can 
have a large effect on the gc-based background correction? The background 
corrections used by RMA and MAS5 aren't affected by removing the 
non-soybean genes, and neither is the background correction+normalization 
of VSN. It seems to be that there might be some inherent assumptions in the 
gc-based method that aren't in the other methods about the distribution of 
probe values. Which gcrma values are "better"? Oops - I just compared the 
limma results using both sets of gcrma values, and it appears that whatever 
changes are occurring don't really affect the significant gene lists nor 
the magnitude of direction of changes between two treatment groups, so 
perhaps this caveat of mine is unnecessary! It might be important for users 
of soybean and other large arrays, if they run 'gcrma' on random subsets of 
arrays instead of all together in order to overcome memory limitations - 
make sure you are consistent in cutting out the probe sets before running 
'gcrma', else your values will not be comparable between subsets!!

Cheers,
Jenny

At 05:30 AM 6/26/2006, Henrik Bengtsson wrote:
>See the affxparser package, e.g. readCelUnits(filenames,
>units=c(1,600:612,45)).  At the moment,you have to take it from there
>yourself.
>
>Henrik Bengtsson
>
>On 6/24/06, James W. MacDonald <jmacdon at med.umich.edu> wrote:
> > Hi Greg,
> >
> > Alvord, Greg (DMS) [Contr] wrote:
> > >
> > >
> > > Dear List -
> > >
> > >
> > >
> > > I am new to BioConductor and R, working under Windows with a gig of RAM,
> > > version R-2.2.1 of R.  I have successfully read in six .CEL files and
> > > created the following AffyBatch object.
> > >
> > >
> > >
> > >
> > >>soy.ab
> > >
> > >
> > > AffyBatch object
> > >
> > > size of arrays=1164x1164 features (63516 kb)
> > >
> > > cdf=Soybean (61170 affyids)
> > >
> > > number of samples=6
> > >
> > > number of genes=61170
> > >
> > > annotation=soybean
> > >
> > >
> > >
> > > The investigator for whom I'm working is interested in an analysis of
> > > differential gene expression on a subset of affyids in this AffyBatch
> > > object, specifically in 37,744 of the 61,170 affyids (indicated above)
> > > that relate specifically to the soybean genome.  I have learned that the
> > > relevant species of interest is labeled 'Glycine max'.  I obtained this
> > > information from another source and have not (due to my ignorance) been
> > > able to identify any slot in soy.ab AffyBatch object that identifies
> > > this species.  Here is a table of the species on the soy.ab AffyBatch
> > > object (which I obtained from another source).
> > >
> > >
> > >
> > >
> > >>cbind(table(Species))
> > >
> > >
> > >                                           [,1]
> > >
> > > Alfalfa mosaic virus                         3
> > >
> > > Bean pod mottle virus strain G-7             2
> > >
> > > Bean pod mottle virus strain K-Hancock1      1
> > >
> > > Clover yellow vein virus                     1
> > >
> > > Glycine max                              37744
> > >
> > > Heterodera glycines                       7539
> > >
> > > Phytophthora sojae                       15864
> > >
> > > S. saman                                     4
> > >
> > > Southern bean mosaic virus strain SBMV-S     1
> > >
> > > Soybean mosaic virus                         1
> > >
> > > Soybean mosaic virus strain G5               3
> > >
> > > Soybean mosaic virus strain G7               1
> > >
> > > Soybean mosaic virus strain N                1
> > >
> > > Tobacco ringspot virus                       2
> > >
> > > Tobacco streak virus                         3
> > >
> > >
> > >
> > >
> > >
> > > I want to select from the soy.ab AffyBatch object the relevant
> > > information for the species 'Glycine max' only.  I have created a data
> > > frame containing those Affy.ID's for species 'Glycine max', e.g.,
> > >
> > >
> > >
> > >
> > >>Glycine.max.Species.AffyID.df[c(1:3,37742:37744),]
> > >
> > >
> > >           Species                Affy.ID
> > >
> > > 8     Glycine max         AFFX-BioB-3_at
> > >
> > > 9     Glycine max         AFFX-BioB-5_at
> > >
> > > 10    Glycine max         AFFX-BioB-M_at
> > >
> > > 37749 Glycine max soybean_rRNA_838_RC_at
> > >
> > > 37750 Glycine max    soybean_rRNA_918_at
> > >
> > > 37751 Glycine max soybean_rRNA_918_RC_at
> > >
> > >
> > >
> > >
> > >>dim(Glycine.max.Species.AffyID.df)
> > >
> > >
> > > [1] 37744     2
> > >
> > >
> > >
> > > How do I extract/create an AffyBatch object containing only the
> > > appropriate Affy.ID's related to the 'Glycine max' species?
> >
> > An AffyBatch object isn't the best for subsetting this way. Better would
> > be to compute expression values using rma() or your favorite method, and
> > then subset.
> >
> > eset <- rma(soy.ab)
> > subsetted.exprset <- eset[Glycine.max.Species.AffyID.df[,2],]
> >
> > HTH,
> >
> > Jim
> >
> > --
> > James W. MacDonald
> > University of Michigan
> > Affymetrix and cDNA Microarray Core
> > 1500 E Medical Center Drive
> > Ann Arbor MI 48109
> > 734-647-5623
> >
> >
> >
> > **********************************************************
> > Electronic Mail is not secure, may not be read every day, and should 
> not be used for urgent or sensitive issues.
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> >
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives: 
>http://news.gmane.org/gmane.science.biology.informatics.conductor

Jenny Drnevich, Ph.D.

Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA

ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu