[BioC] Pairs plots in lumi, plots look different?
Kasper Daniel Hansen
khansen at stat.Berkeley.EDU
Thu May 8 21:19:09 CEST 2008
I have two comments on this, one general and one specific.
I'll start with the specific: in this case you are providing a pairs
plot. Presumably to avoid overplotting you subsample the data points.
Depending on what you want to use the plot for this may be quite ok -
but the users need to know this! Clearly in this case the user was
surprised to see it (perhaps it is highlighted on the help page, I
don't know). For certain things - especially for QC I would say - I
would personally prefer to plot all points (perhaps using a smoother
like Wolfgang suggested). If users start interpreting these plots
without knowing that it is only a fraction of the data they see, it is
likely that they will misinterpret them. Setting the seed just
addresses the symptom - that the plots are not "reproducible", not the
underlying problem that these plots may not be suitable for whatever
the original poster had in mind (otherwise he would not care that they
look different). What should be done instead, in my opinion, is:
1) highlight it in the help page
2) provide some title on the plot like "based on 5000 samples" so that
people do not get confused.
3) not set the seed
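
A minimal sketch of points 1) to 3) in R, under stated assumptions: subsample_rows is a hypothetical helper (it is not part of lumi), the data are simulated, and the plot uses the base graphics pairs() on a plain matrix rather than lumi's own method. The idea is only that the plot title tells the reader how many points were drawn, and that no seed is set anywhere.

```r
# Hypothetical helper, not a lumi function: subsample rows without setting
# a seed, and build a label saying how many points are actually shown.
subsample_rows <- function(x, subset = 5000) {
  n <- nrow(x)
  keep <- if (n > subset) sample(n, subset) else seq_len(n)
  list(data = x[keep, , drop = FALSE],
       label = sprintf("based on %d of %d points", length(keep), n))
}

# Simulated data standing in for an expression matrix (assumption).
x <- matrix(rnorm(20000 * 3), ncol = 3,
            dimnames = list(NULL, c("A", "B", "C")))
sub <- subsample_rows(x, subset = 5000)

# Draw only in an interactive session; the title carries the warning.
if (interactive()) pairs(sub$data, main = sub$label, pch = ".")
```

Because the label is computed from the subsample itself, it stays correct when the data have fewer rows than the requested subset.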
And now for the general comment (I guess there are two points in the
following): I believe it is very misleading to set the seed in
essentially any package (see below for one special case though). The
seed is essentially a global variable and when you mess with it, other
parts of the analysis may get affected. If an analysis method depends
on random sampling, the conclusions (or the method) should take this
into account. That means that the conclusions should be completely
unaffected by whatever random numbers were generated. If that is not
the case the analysis is flawed. It can be fixed by changing the method,
by increasing the number of samples, or finally by adjusting the
conclusions of the analysis. In most cases setting the seed for reproducibility
(as was done in gcrma, see older post on the email list) just hides
the problem and worse - typically makes users unaware of the fact that
they need to take the effect of the randomness into account. So my
points are
1) any conclusion based on random sampling should be invariant to this
sampling.
2) setting the seed changes a global variable, which you should never do.
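
Point 2) can be illustrated with a small sketch. The function plot_subset below is hypothetical; it stands in for any package function that calls set.seed() internally. The two runs start from different seeds, yet after the function call the caller's random number stream is identical in both.

```r
# Hypothetical function that sets the seed internally (the pattern argued
# against above).
plot_subset <- function(n, seed = 123) {
  set.seed(seed)          # overwrites the caller's RNG state
  sample(n, 3)
}

# First analysis, started from seed 1.
set.seed(1)
a1 <- runif(1)
idx_a <- plot_subset(100)
a2 <- runif(1)

# Second analysis, started from seed 2.
set.seed(2)
b1 <- runif(1)
idx_b <- plot_subset(100)
b2 <- runif(1)

a1 == b1           # FALSE: the two streams start out different
identical(a2, b2)  # TRUE: the internal set.seed() made them the same
```

Everything the caller draws after plot_subset() is determined by the hidden seed rather than by the seed the caller chose.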
Now, some people add a seed parameter to their functions. If this
parameter has a default value like
..., seed = 123, ...
I believe it is very dangerous, for the reasons given above. If the
default is to not set a seed (perhaps by doing something like
..., seed = NULL, ... or ..., seed = FALSE, ...)
you might as well not include it. There is not much difference between
set.seed(123)
myFunc()
and
myFunc(seed = 123)
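
A short sketch of why the two forms are interchangeable, assuming a myFunc whose seed argument defaults to NULL and simply calls set.seed() when a value is given (the body of myFunc is made up for illustration):

```r
# Hypothetical myFunc: draws n uniform numbers, optionally setting the seed.
myFunc <- function(n = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  runif(n)
}

# Seed set by the caller.
set.seed(123)
r1 <- myFunc()

# Seed passed as an argument.
r2 <- myFunc(seed = 123)

identical(r1, r2)  # TRUE: the argument adds nothing the caller could not do
```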
Finally I can only think of one case where a package might have a good
reason to play with the seed: if you are trying to provide an update
method for a resampling based method, like
update(bootstrapObject, additionalSample = 1000)
and even then it needs to be done with great care.
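
One possible reading of "great care", sketched under assumptions: if a function really has to set a seed, it can save the caller's RNG state first and restore it on exit, so the global stream is left as it was found. The name with_local_seed is hypothetical, and this does not implement the update() idea itself.

```r
# Hypothetical helper: evaluate expr under a fixed seed, then put the
# caller's RNG state back exactly as it was.
with_local_seed <- function(seed, expr) {
  genv <- globalenv()
  had_seed <- exists(".Random.seed", envir = genv, inherits = FALSE)
  if (had_seed) old_seed <- get(".Random.seed", envir = genv)
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = genv)
    } else {
      rm(".Random.seed", envir = genv)
    }
  })
  set.seed(seed)
  force(expr)  # expr is evaluated lazily, so it runs after set.seed()
}

set.seed(42)
boot <- with_local_seed(123, sample(10, 3))
u1 <- runif(1)

set.seed(42)
u2 <- runif(1)

identical(u1, u2)  # TRUE: the caller's stream was not disturbed
```

The draws inside the helper are still fixed by the seed, so the earlier objection about hiding the effect of randomness applies to them unchanged.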
Kasper
On May 8, 2008, at 10:49 AM, Pan Du wrote:
> The "seed" is a function parameter. Users can easily change it.
> Thanks.
>
>
> Pan
>
>
> On 5/8/08 12:41 PM, "Kasper Daniel Hansen" <khansen at stat.Berkeley.EDU>
> wrote:
>
>> Please don't put seed inside functions, it may mess up the random
>> number stream. If someone wants reproducible plots you should either
>> increase the number of points or let him set the seed himself.
>>
>> Kasper
>>
>> On May 8, 2008, at 7:35 AM, Pan Du wrote:
>>
>>>
>>> Yes, as Matthias mentioned, we use a random subset to increase the
>>> efficiency of plotting. To avoid variations over different plots, I
>>> have added the "seed" parameter to these plot functions. Please check
>>> the latest development version of lumi, 1.7.3. Thanks for using lumi!
>>>
>>> Best regards,
>>>
>>>
>>> Pan
>>>
>>>
>>>
>>> On 5/8/08 5:00 AM, "bioconductor-request at stat.math.ethz.ch"
>>> <bioconductor-request at stat.math.ethz.ch> wrote:
>>>
>>>> Message: 1
>>>> Date: Wed, 07 May 2008 12:41:48 +0200
>>>> From: Matthias Kohl <Matthias.Kohl at stamats.de>
>>>> Subject: Re: [BioC] Pairs plots in lumi, plots look different?
>>>> To: Julien Bauer <jb393 at cam.ac.uk>
>>>> Cc: bioconductor at stat.math.ethz.ch
>>>> Message-ID: <4821876C.3020705 at stamats.de>
>>>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>>>
>>>> Hello,
>>>>
>>>> the default pairs plot uses a random subset of the data
>>>>
>>>> pairs(x, ..., logMode = TRUE, subset = 5000)
>>>>
>>>> confer
>>>> library(lumi)
>>>> ?"pairs-methods"
>>>>
>>>> Hence, each call of pairs leads to different results. By setting the
>>>> random seed or the argument "subset" appropriately you could obtain
>>>> identical plots for each call.
>>>>
>>>> Best regards,
>>>> Matthias
>>>>
>>>>
>>>> Julien Bauer wrote:
>>>>> Hello,
>>>>> I am working for a microarray facility at Cambridge University; we
>>>>> have been using the Illumina platform for a while now.
>>>>> Lumi is a great package, but I noticed something rather odd: when
>>>>> using the plot function "pairs" on my data, if I run it again the
>>>>> plots look different; some points are shifted or change location.
>>>>> The rest stays the same, it is just the look of the graphs that
>>>>> changes.
>>>>> My guess is that it is because of the auto scaling of the graph, but
>>>>> I would like to be sure. I looked in the mailing list archive and in
>>>>> the vignette but I couldn't find the answer for this.
>>>>> Thanks in advance for your help,
>>>>>
>>>>> Julien Bauer
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> --
>>>> Dr. Matthias Kohl
>>>> www.stamats.de
>>>>
>>>
>>