[BioC] How to read a subset of the .CEL files
Jenny Drnevich
drnevich at uiuc.edu
Tue Jun 27 22:38:34 CEST 2006
Hi Greg,
I use some nice functions written by Ariel Chernomoretz (thanks!!) and
posted to the Bioconductor mailing list on April 21, 2005:
http://article.gmane.org/gmane.science.biology.informatics.conductor/4258/match=,
although I do have a caveat/question about the effects of removing probe
sets on gcrma values (see below).
You can use the main function 'RemoveProbes' to remove individual probes
and/or entire probe sets, and I routinely use it to remove the non-soybean
probe sets from Affy's soybean array. I made one change to the code of
'RemoveProbes', which was to add a call at the beginning to automatically
load the cdf and probe packages if they weren't loaded already:
require(cdfpackagename,character.only=T)
require(probepackagename,character.only=T)
If you contact me off-list I can send you my revised code and a few
pointers if you're having trouble figuring out how to get 'RemoveProbes' to
work. Basically, all you need to input is a list of the probe set names
that you want to remove.
CAVEAT: Ariel pointed out in a post on April 20, 2005
(http://article.gmane.org/gmane.science.biology.informatics.conductor/4242/match=)
that removing even one probe will slightly affect the resulting gcrma
values, because of the sampling of affinity values, likely in the internal
function 'bg.parameters.ns'. I have found this effect to be greatly
increased when I remove the non-soybean probe sets (~40% of the array)
before running 'gcrma': 88% of the remaining probe sets have smaller gcrma
values than before, and 25% have been decreased 1.4 fold or more. The
changes occur in the background adjustment step, not in the normalization
or summarization steps. If you remove a random 40% of the probe sets, there
isn't the same magnitude (90% < 1.05 fold) or bias in direction (45%
smaller) of changes . That's when I noticed that the distribution of
expression values in my samples for the soybean genes was very different
from the distribution of non-soybean genes. Nearly 80% of the soybean genes
are called present in my samples but only 5% of the non-soybean genes are
called present.
My question/concern is why a change in the distribution of probe values can
have a large effect on the gc-based background correction? The background
corrections used by RMA and MAS5 aren't affected by removing the
non-soybean genes, and neither is the background correction+normalization
of VSN. It seems to be that there might be some inherent assumptions in the
gc-based method that aren't in the other methods about the distribution of
probe values. Which gcrma values are "better"? Oops - I just compared the
limma results using both sets of gcrma values, and it appears that whatever
changes are occurring don't really affect the significant gene lists nor
the magnitude of direction of changes between two treatment groups, so
perhaps this caveat of mine is unnecessary! It might be important for users
of soybean and other large arrays, if they run 'gcrma' on random subsets of
arrays instead of all together in order to overcome memory limitations -
make sure you are consistent in cutting out the probe sets before running
'gcrma', else your values will not be comparable between subsets!!
Cheers,
Jenny
At 05:30 AM 6/26/2006, Henrik Bengtsson wrote:
>See the affxparser package, e.g. readCelUnits(filenames,
>units=c(1,600:612,45)). At the moment,you have to take it from there
>yourself.
>
>Henrik Bengtsson
>
>On 6/24/06, James W. MacDonald <jmacdon at med.umich.edu> wrote:
> > Hi Greg,
> >
> > Alvord, Greg (DMS) [Contr] wrote:
> > >
> > >
> > > Dear List -
> > >
> > >
> > >
> > > I am new to BioConductor and R, working under Windows with a gig of RAM,
> > > version R-2.2.1 of R. I have successfully read in six .CEL files and
> > > created the following AffyBatch object.
> > >
> > >
> > >
> > >
> > >>soy.ab
> > >
> > >
> > > AffyBatch object
> > >
> > > size of arrays=1164x1164 features (63516 kb)
> > >
> > > cdf=Soybean (61170 affyids)
> > >
> > > number of samples=6
> > >
> > > number of genes=61170
> > >
> > > annotation=soybean
> > >
> > >
> > >
> > > The investigator for whom I'm working is interested in an analysis of
> > > differential gene expression on a subset of affyids in this AffyBatch
> > > object, specifically in 37,744 of the 61,170 affyids (indicated above)
> > > that relate specifically to the soybean genome. I have learned that the
> > > relevant species of interest is labeled 'Glycine max'. I obtained this
> > > information from another source and have not (due to my ignorance) been
> > > able to identify any slot in soy.ab AffyBatch object that identifies
> > > this species. Here is a table of the species on the soy.ab AffyBatch
> > > object (which I obtained from another source).
> > >
> > >
> > >
> > >
> > >>cbind(table(Species))
> > >
> > >
> > > [,1]
> > >
> > > Alfalfa mosaic virus 3
> > >
> > > Bean pod mottle virus strain G-7 2
> > >
> > > Bean pod mottle virus strain K-Hancock1 1
> > >
> > > Clover yellow vein virus 1
> > >
> > > Glycine max 37744
> > >
> > > Heterodera glycines 7539
> > >
> > > Phytophthora sojae 15864
> > >
> > > S. saman 4
> > >
> > > Southern bean mosaic virus strain SBMV-S 1
> > >
> > > Soybean mosaic virus 1
> > >
> > > Soybean mosaic virus strain G5 3
> > >
> > > Soybean mosaic virus strain G7 1
> > >
> > > Soybean mosaic virus strain N 1
> > >
> > > Tobacco ringspot virus 2
> > >
> > > Tobacco streak virus 3
> > >
> > >
> > >
> > >
> > >
> > > I want to select from the soy.ab AffyBatch object the relevant
> > > information for the species 'Glycine max' only. I have created a data
> > > frame containing those Affy.ID's for species 'Glycine max', e.g.,
> > >
> > >
> > >
> > >
> > >>Glycine.max.Species.AffyID.df[c(1:3,37742:37744),]
> > >
> > >
> > > Species Affy.ID
> > >
> > > 8 Glycine max AFFX-BioB-3_at
> > >
> > > 9 Glycine max AFFX-BioB-5_at
> > >
> > > 10 Glycine max AFFX-BioB-M_at
> > >
> > > 37749 Glycine max soybean_rRNA_838_RC_at
> > >
> > > 37750 Glycine max soybean_rRNA_918_at
> > >
> > > 37751 Glycine max soybean_rRNA_918_RC_at
> > >
> > >
> > >
> > >
> > >>dim(Glycine.max.Species.AffyID.df)
> > >
> > >
> > > [1] 37744 2
> > >
> > >
> > >
> > > How do I extract/create an AffyBatch object containing only the
> > > appropriate Affy.ID's related to the 'Glycine max' species?
> >
> > An AffyBatch object isn't the best for subsetting this way. Better would
> > be to compute expression values using rma() or your favorite method, and
> > then subset.
> >
> > eset <- rma(soy.ab)
> > subsetted.exprset <- eset[Glycine.max.Species.AffyID.df[,2],]
> >
> > HTH,
> >
> > Jim
> >
> > --
> > James W. MacDonald
> > University of Michigan
> > Affymetrix and cDNA Microarray Core
> > 1500 E Medical Center Drive
> > Ann Arbor MI 48109
> > 734-647-5623
> >
> >
> >
> > **********************************************************
> > Electronic Mail is not secure, may not be read every day, and should
> not be used for urgent or sensitive issues.
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> >
>
>_______________________________________________
>Bioconductor mailing list
>Bioconductor at stat.math.ethz.ch
>https://stat.ethz.ch/mailman/listinfo/bioconductor
>Search the archives:
>http://news.gmane.org/gmane.science.biology.informatics.conductor
Jenny Drnevich, Ph.D.
Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign
330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA
ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at uiuc.edu
More information about the Bioconductor
mailing list