[BioC] set.seed() in gcrma

Wed Feb 13 22:13:26 CET 2008

Hi All,

The sample step occurs in the gcrma.engine/gcrma.engine2 sub function.
A random sample of the entire matrix of PM values is selected to
calculate an linear regression coefficient that is used in the
full_model and affinities model.
I believe the set.seed allows one to obtain the same results for any
reanalysis of exactly the same CEL files.
If one adds or subtracts even one CEL file the sampling of that entire
matrix change and a different result is obtained.

S. Keith Anderson, MS
Statistician, Cancer Center Statistics
Mayo Clinic
email:  anderson.s at mayo.edu

-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch
[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Henrik
Bengtsson
Sent: Wednesday, February 13, 2008 2:41 PM
To: Kasper Daniel Hansen
Cc: William T Barry; bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] set.seed() in gcrma

On Feb 13, 2008 11:54 AM, Kasper Daniel Hansen
<khansen at stat.berkeley.edu> wrote:
> On Feb 13, 2008, at 11:18 AM, Henrik Bengtsson wrote:
>
> > On Feb 13, 2008 10:46 AM, Kasper Daniel Hansen
> > <khansen at stat.berkeley.edu> wrote:
> >> In my opinion gcrma should _not_ play with the seed. That is
> >> something for the user to decide.
> >
> > It's a trade-off between providing reproducible examples and adding
to
> > the overall confusion.  I think it is important to be able to
generate
> > identical results for identical calls, and since it is not clear to
> > most users that the gcRMA parameters are estimated based on a sample
> > one I can see how the current solution avoids the problem of having
> > ask people to set the seed explicitly before calling gcrma() [also I
> > have a hard time to keep track on what is going on in all methods].
>
> When a function sets the seed within itself and then does a sample()
> immediately afterwards what it really does is using a fixed index
> vector for arrays of a certain size. To call this "a random sample"
> is simply wrong. It would be easier and more transparent to just
> choose a regular grid like
>    seg(from = 1, to = maxIndex, length.out = sampleSize)
> Now some people would argue that it seems wrong to use a regular grid
> like this and that the probes should be picked at random. To which I
> repeat what i just wrote: when you do set.seed you are not selecting
> anything at random - but you are obscuring this fact. If it really is
> important to not always use the same probes, the seed should not be
> set and you should live with the fact that you do not get
> reproducible effects.
>
> I agree that for gcRMA it would make sense to have the results be
> highly reproducible (otherwise I foresee many posts about this), but
> that is achieved even better by using a grid as above. Or perhaps do
> something like having an argument called (say) randomSample = FALSE
> as default and then do something like
> if(randomSample)
>    gridIdx = sample(maxIndex, size = sampleSize)
> else
>    gridIdx = seq(from = 1, to = maxIndex, length.out = sampleSize)
> (appropriately modified if you want to exclude certain indices
> corresponding to QC probes). Setting the seed inside a function
> should in my opinion always be discouraged.

I don't disagree with your arguments.  However, there is a risk of
selecting probes/units with common properties if seq() is used.  This
comes down to how Affymetrix has outlined the probes on the particular
chip types being analyzed, and since we don't know what chip types
will show up in the future, I would like to argue that a per-chip-type
pre-determined sample() is safer than a uniform seq().

I've checked the latest BioC devel version of 'gcrma' and I noticed
that in both bg.parameters.ns() and bg.parameters.ns2() set.seed() is
called, but I cannot see any code that actually sample:s anything.
Are those set.seed() left overs or am I missing something?

/Henrik

>
> Kasper
>
>
>
> > One solution would be to move the value of seed to be an argument
with
> > a default value and if NA/NULL the seed is *not* set.  That would
> > provide a backward-compatible upgrade to gcrma().
>
> This is the solution taken by most people who want a similar
> functionality. I still
>
>
>
> > Comments?
> >
> > Henrik
> >
> >>
> >> Kasper
> >>
> >>
> >> On Feb 13, 2008, at 9:49 AM, Zhijin Wu wrote:
> >>
> >>> Bill
> >>> Thank you for that suggestion. We will edit gcrma to restore the
> >>> seed in
> >>> the next release.
> >>> Zhijin
> >>> Henrik Bengtsson wrote:
> >>>> Hi,
> >>>>
> >>>> I think you have spotted something very important, and I agree
that
> >>>> the random seed should be reset by any function that set it in
> >>>> order
> >>>> to get reproducible samples.  We've been working with a port/
> >>>> wrapper
> >>>> for gcrma() but this never struck us - thanks for bringing it up.
> >>>>
> >>>> FYI, I've just sent a question to R-devel, as it is more
> >>>> appropriate
> >>>> there, on how to best push the random generator back to the
> >>>> state it
> >>>> was before calling set.seed().
> >>>>
> >>>> Best,
> >>>>
> >>>> Henrik
> >>>>
> >>>> On Feb 13, 2008 7:53 AM, William T Barry <bill.barry at duke.edu>
> >>>> wrote:
> >>>>> Hello,
> >>>>>
> >>>>> While recently using the gcrma package, I noticed some strange
> >>>>> behavior
> >>>>> that I want to report.  I was applying the pre-processing
> >>>>> algorithm (using
> >>>>> default settings) inside a resampling scheme and noticed that
> >>>>> subsequent
> >>>>> calls of sample() were returning a constant vector of values.
> >>>>> Example
> >>>>> output shown below:
> >>>>>
> >>>>>> temp <- gcrma(celdat[,1:2])
> >>>>> Adjusting for optical effect..Done.
> >>>>> Computing affinities.Done.
> >>>>> Adjusting for non-specific binding..Done.
> >>>>> Normalizing
> >>>>> Calculating Expression
> >>>>>
> >>>>>> sample(1:10000,3)
> >>>>> [1] 7030 5473 8109
> >>>>>
> >>>>>> temp <- gcrma(celdat[,1:2])
> >>>>> Adjusting for optical effect..Done.
> >>>>> Computing affinities.Done.
> >>>>> Adjusting for non-specific binding..Done.
> >>>>> Normalizing
> >>>>> Calculating Expression
> >>>>>
> >>>>>> sample(1:10000,3)
> >>>>> [1] 7030 5473 8109
> >>>>>
> >>>>>> sessionInfo()
> >>>>> R version 2.6.1 (2007-11-26)
> >>>>> x86_64-redhat-linux-gnu
> >>>>>
> >>>>> locale:
> >>>>>
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=e
> >>>>> n_
> >>>>>
US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=
> >>>>> en
> >>>>>
_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_
> >>>>> US
> >>>>> .UTF-8;LC_IDENTIFICATION=C
> >>>>>
> >>>>> attached base packages:
> >>>>> [1] splines   tools     stats     graphics  grDevices utils
> >>>>> datasets
> >>>>> [8] methods   base
> >>>>>
> >>>>> other attached packages:
> >>>>> [1] hu6800probe_2.0.0    hu6800cdf_2.0.0      gcrma_2.10.0
> >>>>> [4] matchprobes_1.10.0   affy_1.16.0
preprocessCore_1.0.0
> >>>>> [7] affyio_1.6.1         Biobase_1.16.2
> >>>>>
> >>>>> loaded via a namespace (and not attached):
> >>>>> [1] rcompgen_0.1-17
> >>>>>
> >>>>> In looking at the gcrma functions, it appears that gcrma.engine2
> >>>>> () is
> >>>>> setting the seed without saving and restoring .Random.seed .
> >>>>> This was
> >>>>> easily corrected in the external loop of the resampling scheme,
> >>>>> but I was
> >>>>> unsure whether this would be considered a 'bug' in gcrma.
> >>>>>
> >>>>> Kind regards,
> >>>>> Bill
> >>>>>
> >>>>>
> >>>>> -------------------------------
> >>>>> William T. Barry, Ph.D.
> >>>>> Dept. of Biostatistics & Bioinformatics, DUMC
> >>>>> Duke Institute for Genome Science & Policy
> >>>>> Hock Plaza, Room 8030
> >>>>> DUMC Box 2717
> >>>>> Durham, NC 27710
> >>>>>
> >>>>> Phone: 919-681-5047
> >>>>> Fax: 919-668-9335
> >>>>>
> >>>>>
> >>>>>         [[alternative HTML version deleted]]
> >>>>>
> >>>>> _______________________________________________
> >>>>> Bioconductor mailing list
> >>>>> Bioconductor at stat.math.ethz.ch
> >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>>> Search the archives: http://news.gmane.org/
> >>>>> gmane.science.biology.informatics.conductor
> >>>>>
> >>>>
> >>>> _______________________________________________
> >>>> Bioconductor mailing list
> >>>> Bioconductor at stat.math.ethz.ch
> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>> Search the archives: http://news.gmane.org/
> >>>> gmane.science.biology.informatics.conductor
> >>>
> >>>
> >>> --
> >>> -------------------------------------------
> >>> Zhijin (Jean) Wu
> >>> Assistant Professor of Biostatistics
> >>> Brown University, Box G-S121
> >>> Providence, RI  02912
> >>>
> >>> Tel: 401 863 1230
> >>> Fax: 401 863 9182
> >>> http://stat.brown.edu/~zwu
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at stat.math.ethz.ch
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives: http://news.gmane.org/
> >>> gmane.science.biology.informatics.conductor
> >>
> >>
>
>

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor