[BioC] Pairs plots in lumi, plots look different?

Thu May 8 21:19:09 CEST 2008

I have two comments on this, one general and one specific.

I'll start with the specific: in this case you are providing a pairs  
plot. Presumably to avoid overplotting you subsample the data points.  
Depending on what you want to use the plot for this may be quite ok -  
but the users need to know this! Clearly in this case the user was  
surprised to see it (perhaps it is highlighted on the help page, I  
don't know). For certain things - especially for QC I would say - I  
would personally prefer to plot all points (perhaps using a smoother  
like Wolfgang suggested). If users start interpreting these plots  
without knowing that it is only a fraction of the data they see, it is  
likely that they will misinterpret them. Setting the seed just  
addresses the symptom - that the plots are not "reproducible", not the  
underlying problem that these plots may not be suitable for whatever
the original poster had in mind (otherwise he would not care that they
look different). What in my opinion should be done instead is
1) highlight it in the help page
2) provide some title on the plot like "based on 5000 samples" so that  
people do not get confused.
3) not set the seed
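
A minimal sketch of points 2) and 3), for illustration only. The function
name subsamplePairs and its return value are hypothetical and not part of
lumi; only the "subset" argument mirrors the lumi call quoted further down.
The helper draws from the caller's random number stream and states in the
plot title how many points are shown.

```r
# Hypothetical helper, not part of lumi: subsample rows for a pairs plot
# and say so in the title.
subsamplePairs <- function(x, subset = 5000, ...) {
  x <- as.matrix(x)
  n <- nrow(x)
  shown <- min(subset, n)
  # Draw from the caller's RNG stream; no set.seed() in here.
  idx <- if (shown < n) sample.int(n, shown) else seq_len(n)
  label <- sprintf("based on %d of %d points (random subset)", shown, n)
  pairs(x[idx, , drop = FALSE], main = label, ...)
  invisible(list(index = idx, label = label))
}

# Simulated data standing in for an expression matrix.
x <- matrix(rnorm(3 * 20000), ncol = 3,
            dimnames = list(NULL, c("A", "B", "C")))
res <- subsamplePairs(x, subset = 5000, pch = ".")
res$label   # "based on 5000 of 20000 points (random subset)"
```

A user who wants the same subset twice can still call set.seed() before
the call; that choice stays with the user.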

And now for the general comment (I guess there are two points in the  
following): I believe it is very misleading to set the seed in  
essentially any package (see below for one special case though). The  
seed is essentially a global variable and when you mess with it, other  
parts of the analysis may get affected. If an analysis method depends  
on random sampling, the conclusions (or the method) should take this  
into account. That means that the conclusions should be completely  
unaffected by whatever random numbers were generated. If that is not  
the case the analysis is flawed. It can be fixed by fixing the method,
increasing the number of samples or finally by adjusting the conclusions
of the analysis. In most cases setting the seed for reproducibility
(as was done in gcrma, see older post on the email list) just hides  
the problem and worse - typically makes users unaware of the fact that  
they need to take the effect of the randomness into account. So my  
points are
1) any conclusion based on random sampling should be invariant to this  
sampling.
2) setting the seed affects a global variable which you should never do.
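
To illustrate point 2), here is a small sketch of how a seed set inside a
function leaks into the rest of the session. plotSubset is a made-up
stand-in for any function that fixes the seed internally; it is not taken
from lumi or gcrma.

```r
# Hypothetical function that fixes the seed internally.
plotSubset <- function(n, size, seed = 123) {
  set.seed(seed)        # overwrites the session-wide RNG state
  sample.int(n, size)
}

set.seed(1)             # the user's own choice of seed
a1 <- runif(1)
invisible(plotSubset(1000, 10))
a2 <- runif(3)          # drawn after the call

set.seed(2)             # a different user seed ...
b1 <- runif(1)
invisible(plotSubset(1000, 10))
b2 <- runif(3)

identical(a1, b1)       # FALSE: before the call the two streams differ
identical(a2, b2)       # TRUE: after the call the user's seed is irrelevant
```

Everything the user simulates after the call is now determined by the
function's seed and not by the seed the user chose.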

Now, some people have a seed parameter to their function. In case this
parameter has a default value like
   .., seed = 123,...
I believe it is very dangerous based on the stuff above. If the  
default case of the seed parameter is to not set a seed (perhaps by  
doing something like)
   .., seed = NULL,.. or ..., seed = FALSE, ...
you might as well not include it. There is not much difference between
   set.seed(123)
   myFunc()
and
   myFunc(seed = 123)
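
As a small sketch of that equivalence, assume myFunc is written with an
optional seed along the lines above (the body here is invented for
illustration):

```r
# Hypothetical function with an optional seed argument.
myFunc <- function(n = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  rnorm(n)
}

set.seed(123)
r1 <- myFunc()             # seed set by the user, outside the function

r2 <- myFunc(seed = 123)   # seed set through the argument

identical(r1, r2)          # TRUE
```

Both calls leave the global random number stream in the same state as
well, so the argument adds nothing the user could not already do.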

Finally I can only think of one case where a package might have a good  
reason to play with the seed: if you are trying to provide an update  
method for a resampling based method, like
   update(bootstrapObject, additionalSample = 1000)
and even then it needs to be done with great care.
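
One way such an update method might be written with care is sketched
below. Everything here is an assumption for illustration: the class
"bootMean", its fields and the constructor are invented. The object
remembers the RNG state it ended on; update() continues from that state
and then puts the user's state back, so the rest of the session is not
affected.

```r
# Hypothetical bootstrap object that stores the RNG state it ended on.
bootMean <- function(x, B) {
  stat <- replicate(B, mean(sample(x, replace = TRUE)))
  structure(list(x = x, stat = stat,
                 rngState = get(".Random.seed", envir = globalenv())),
            class = "bootMean")
}

update.bootMean <- function(object, additionalSample, ...) {
  # Make sure a state exists, then remember it and restore it on exit.
  if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE))
    set.seed(NULL)
  userState <- get(".Random.seed", envir = globalenv())
  on.exit(assign(".Random.seed", userState, envir = globalenv()))
  # Continue the stream where the original bootstrap stopped.
  assign(".Random.seed", object$rngState, envir = globalenv())
  extra <- replicate(additionalSample,
                     mean(sample(object$x, replace = TRUE)))
  object$stat <- c(object$stat, extra)
  object$rngState <- get(".Random.seed", envir = globalenv())
  object
}

set.seed(42)
x <- rnorm(50)
b <- bootMean(x, B = 100)
noise <- runif(10)                  # unrelated work by the user
before <- .Random.seed
b2 <- update(b, additionalSample = 50)
unchanged <- identical(before, .Random.seed)   # user's stream untouched

set.seed(42)
x2 <- rnorm(50)
full <- bootMean(x2, B = 150)       # all 150 resamples in one go
same <- identical(b2$stat, full$stat)
```

With this design the updated object should match a single run with all
150 resamples, while the state the user had before calling update() is
left as it was.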

Kasper

On May 8, 2008, at 10:49 AM, Pan Du wrote:

> The "seed" is a function parameter. Users can easily change it.  
> Thanks.
>
>
> Pan
>
>
> On 5/8/08 12:41 PM, "Kasper Daniel Hansen" <khansen at stat.Berkeley.EDU>
> wrote:
>
>> Please don't put seed inside functions, it may mess up the random
>> number stream. If someone wants reproducible plots you should either
>> increase the number of points or let him set the seed himself.
>>
>> Kasper
>>
>> On May 8, 2008, at 7:35 AM, Pan Du wrote:
>>
>>>
>>> Yes, as Matthias mentioned, we use a random subset to increase the
>>> efficiency of plotting. To avoid variations over different plots, I
>>> have added the "seed" parameter to these plot functions. Please check
>>> the latest development version of lumi, 1.7.3. Thanks for using lumi!
>>>
>>> Best regards,
>>>
>>>
>>> Pan
>>>
>>>
>>>
>>> On 5/8/08 5:00 AM, "bioconductor-request at stat.math.ethz.ch"
>>> <bioconductor-request at stat.math.ethz.ch> wrote:
>>>
>>>> Message: 1
>>>> Date: Wed, 07 May 2008 12:41:48 +0200
>>>> From: Matthias Kohl <Matthias.Kohl at stamats.de>
>>>> Subject: Re: [BioC] Pairs plots in lumi, plots look different?
>>>> To: Julien Bauer <jb393 at cam.ac.uk>
>>>> Cc: bioconductor at stat.math.ethz.ch
>>>> Message-ID: <4821876C.3020705 at stamats.de>
>>>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>>>
>>>> Hello,
>>>>
>>>> the default pairs plot uses a random subset of the data
>>>>
>>>> pairs(x, ..., logMode = TRUE, subset = 5000)
>>>>
>>>> confer
>>>> library(lumi)
>>>> ?"pairs-methods"
>>>>
>>>> Hence, each call of pairs leads to different results. By setting the
>>>> random seed or the argument "subset" appropriately you could obtain
>>>> identical plots for each call.
>>>>
>>>> Best regards,
>>>> Matthias
>>>>
>>>>
>>>> Julien Bauer wrote:
>>>>> Hello,
>>>>> I am working for a microarray facility at Cambridge University, we
>>>>> have been using the Illumina platform for a while now.
>>>>> Lumi is a great package but I noticed something rather odd: when
>>>>> using the plot function "pairs" on my data, if I run it again the
>>>>> plots look different; some points are shifted or change location.
>>>>> The rest stays the same, it is just the look of the graphs that
>>>>> changes.
>>>>> My guess is that it is because of the auto scaling of the graph, but
>>>>> I would like to be sure. I looked in the mailing list archive and in
>>>>> the vignette but I couldn't find the answer for this.
>>>>> Thanks in advance for your help,
>>>>>
>>>>> Julien Bauer
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> -- 
>>>> Dr. Matthias Kohl
>>>> www.stamats.de
>>>>
>>>
>>
>