[R] How important is set.seed

Ebert,Timothy Aaron tebert using ufl.edu
Tue Mar 22 18:27:28 CET 2022


Not wrong, just mostly different words.
1) I think of reproducible code as something for teaching or sharing. It can be useful in debugging if I want help (one reason for sharing). In solo debugging of my own code, I have not used set.seed() -- at least not yet. However, my programs are all small, mostly fewer than 100 lines of code.
2) Agreed. 
3) Agreed -- one needs to be very clear on why one is using set.seed(). In many situations it undoes the purpose of using a random number generator.
4) Agreed -- this is why it is so important to publish the version of R and of the packages used when presenting results. A great deal of effort has gone into building and selecting a good RNG. Depending on how the RNG is used, a basic understanding of what defines "good" is valuable. If there are huge numbers of calls to the RNG then periodicity in the RNG may start making a difference. Random.org might be another place for the OP to explore.
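
As a minimal sketch of what recording the R version and RNG settings might look like in base R: the names rng_info, old_kind, and draws are purely illustrative, and the three-argument RNGkind() call assumes R 3.6.0 or later, where the sample.kind argument was introduced.

```r
# Record what a reader would need to regenerate the same random stream.
rng_info <- list(
  r_version = R.version.string,  # full version string of the running R
  rng_kind  = RNGkind()          # generator, normal method, sample method
)
print(rng_info)

# Assumption: R >= 3.6.0. Pinning the kinds explicitly guards against
# defaults that differ between R versions.
old_kind <- RNGkind("Mersenne-Twister", "Inversion", "Rejection")
set.seed(2022)
draws <- sample(10)                  # a random permutation of 1..10

do.call(RNGkind, as.list(old_kind))  # restore the previous settings
```

According to ?RNGkind, the default Mersenne-Twister generator has a period of 2^19937 - 1, so exhausting the period is unlikely to be the first thing to go wrong in ordinary use.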

Tim

-----Original Message-----
From: Bert Gunter <bgunter.4567 using gmail.com> 
Sent: Tuesday, March 22, 2022 12:12 PM
To: Neha gupta <neha.bologna90 using gmail.com>
Cc: Ebert,Timothy Aaron <tebert using ufl.edu>; r-help using r-project.org
Subject: Re: [R] How important is set.seed

OK, I'm somewhat puzzled by this discussion. Maybe I'm just clueless. But...

1. set.seed() is used to make any procedure that uses R's pseudo-random number generator -- including, for example, sampling from a distribution, random data splitting, etc. -- "reproducible". That is, if the procedure is repeated *exactly,* by invoking set.seed() with its original argument values (once!) *before* the procedure begins, exactly the same results should be produced by the procedure. Full stop. It does not matter how many times random number generation occurs within the procedure thereafter -- R preserves the state of the RNG between invocations (but see the notes in ?set.seed for subtle qualifications of this claim).

2. Hence, if no (pseudo-) random number generation is used, set.seed() is irrelevant. Full stop.

3. Hence, if you don't care about reproducibility (you should! -- if for no other reason than debugging), you don't need set.seed().

4. The "randomness" of any sequence of results from any particular set.seed() arguments (including further calls to the RNG) is a complex issue. ?set.seed has some discussion of this, but one needs considerable expertise to make informed choices here. As usual, we untutored users should be guided by the expert recommendations of the Help file.
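
Point 1 can be checked directly. The following is only a sketch: procedure() is a hypothetical stand-in for any code that calls the RNG several times, and the seed value 123 is arbitrary.

```r
# A hypothetical "procedure" that calls the RNG more than once.
procedure <- function() {
  idx   <- sample(100, 70)   # random data split
  noise <- rnorm(5)          # sampling from a distribution
  list(idx = idx, noise = noise)
}

set.seed(123)                # once, before the procedure begins
first <- procedure()

set.seed(123)                # same argument, again exactly once
second <- procedure()

# Same seed and same sequence of RNG calls: the results match.
same_after_reseed <- identical(first, second)

# No reseeding here: the RNG state has advanced, so this run differs.
third <- procedure()
same_without_reseed <- identical(second, third)

print(c(same_after_reseed, same_without_reseed))
```

The third run shows why the seed is set once before the procedure rather than repeatedly inside it: R carries the RNG state forward from one call to the next.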

*** If anything I have said above is wrong, I would greatly appreciate a public response here showing my error.***

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )



On Tue, Mar 22, 2022 at 7:48 AM Neha gupta <neha.bologna90 using gmail.com> wrote:
>
> Hello Tim
>
> In some of the examples I see in tutorials, the random seed is set
> just before model training, e.g. before the train() function of the caret package.
> Should I follow this?
>
> Best regards
> On Tuesday, March 22, 2022, Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
>
> > Ah, so maybe what you need is to think of “set.seed()” as a 
> > treatment in an experiment. You could use a random number generator 
> > to select an appropriate number of seeds, then use those seeds 
> > repeatedly in the different models to see how seed selection 
> > influences outcomes. I am not quite sure how many seeds would 
> > constitute a good sample. For me that would depend on what I find and how long a run takes.
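
One way the "seed as a treatment" idea might be sketched in base R is shown here. Everything in it is an assumption made for illustration: the simulated data, the linear model standing in for "the model", the helper fit_once(), and the choice of 20 seeds.

```r
# Fixed, simulated data so that only the seed-dependent split varies.
set.seed(1)                       # makes the choice of seeds reproducible
dat   <- data.frame(x = rnorm(200))
dat$y <- 2 * dat$x + rnorm(200)
seeds <- sample.int(1e6, 20)      # 20 is an arbitrary number of seeds

# Hypothetical stand-in for "train a model and report a metric".
fit_once <- function(seed, data) {
  set.seed(seed)
  train <- sample(nrow(data), 150)              # random train/test split
  fit   <- lm(y ~ x, data = data[train, ])
  pred  <- predict(fit, newdata = data[-train, ])
  sqrt(mean((data$y[-train] - pred)^2))         # test-set RMSE
}

rmse <- vapply(seeds, fit_once, numeric(1), data = dat)
print(summary(rmse))
print(sd(rmse))                   # spread attributable to the seed alone
```

If the spread across seeds is large relative to the differences between the models being compared, that is a sign the comparison depends on the seed.
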
> >
> >   In parallel processing you set the seed in the master process and then use a
> > random number generator to set the seeds in each worker.
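
A sketch of one way to get reproducible but distinct random streams on each worker, using clusterSetRNGStream() from base R's parallel package rather than hand-picked worker seeds. The two-worker cluster and the seed value 42 are arbitrary assumptions.

```r
library(parallel)

cl <- makeCluster(2)                   # two local worker processes

# Give each worker its own L'Ecuyer-CMRG stream derived from one seed.
clusterSetRNGStream(cl, iseed = 42)
a <- parLapply(cl, 1:2, function(i) runif(3))

# Resetting the streams with the same seed reproduces the same draws.
clusterSetRNGStream(cl, iseed = 42)
b <- parLapply(cl, 1:2, function(i) runif(3))

stopCluster(cl)

runs_match     <- identical(a, b)             # reproducible across runs
workers_differ <- !identical(a[[1]], a[[2]])  # workers do not share a stream
print(c(runs_match, workers_differ))
```
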
> >
> > Tim
> >
> >
> >
> > *From:* Neha gupta <neha.bologna90 using gmail.com>
> > *Sent:* Tuesday, March 22, 2022 6:33 AM
> > *To:* Ebert,Timothy Aaron <tebert using ufl.edu>
> > *Cc:* Jeff Newmiller <jdnewmil using dcn.davis.ca.us>; 
> > r-help using r-project.org
> > *Subject:* Re: How important is set.seed
> >
> > Thank you all.
> >
> >
> >
> > Actually I need set.seed because I have to evaluate the consistency 
> > of the feature selection produced by different models, so I think 
> > it's recommended to use the seed for this.
> >
> >
> >
> > Warm regards
> >
> > On Tuesday, March 22, 2022, Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
> >
> > If you are using the program for data analysis then set.seed() is 
> > not necessary unless you are developing a reproducible example. In a 
> > standard analysis it is mostly counter-productive because one should 
> > then ask whether your presented results are an artifact of a specific 
> > seed that you selected to get a particular result. However, when you 
> > need a reproducible example, are debugging a program, or otherwise 
> > need the same result with every run of the program, set.seed() is an 
> > essential tool.
> > Tim
> >
> > -----Original Message-----
> > From: R-help <r-help-bounces using r-project.org> On Behalf Of Jeff 
> > Newmiller
> > Sent: Monday, March 21, 2022 8:41 PM
> > To: r-help using r-project.org; Neha gupta <neha.bologna90 using gmail.com>; 
> > r-help mailing list <r-help using r-project.org>
> > Subject: Re: [R] How important is set.seed
> >
> > First off, "ML models" do not all use random numbers (for prediction 
> > I would guess very few of them do). Learn and pay attention to what 
> > the functions you are using do.
> >
> > Second, if you use random numbers properly and understand the 
> > precision that your specific use case offers, then you don't need to use set.seed.
> > However, in practice, using set.seed can allow you to temporarily 
> > avoid chasing precision gremlins, or set up specific test cases for 
> > testing code, not results. It is your responsibility to not let this 
> > become a crutch... a randomized simulation that is actually 
> > sensitive to the seed is unlikely to offer an accurate result.
> >
> > Where to put set.seed depends a lot on how you are performing your 
> > simulations. In general each process should set it once, uniquely, at 
> > the beginning, and if you use parallel processing then use the 
> > features of your parallel processing framework to ensure that this 
> > happens. Beware of setting all worker processes to use the same seed.
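
For a single (non-parallel) process, "set it once at the beginning" might look like the sketch below. The built-in iris data, kmeans() as the randomized training step, and the helper run_analysis() are all stand-ins chosen for illustration.

```r
# Hypothetical whole-analysis function: one seed at the top covers both
# the random split and the randomized model fit that follows it.
run_analysis <- function(seed) {
  set.seed(seed)                            # once, before any RNG use
  train <- sample(nrow(iris), 100)          # random train/test split
  km <- kmeans(iris[train, 1:4], centers = 3, nstart = 5)  # random starts
  list(train = train, cluster = unname(km$cluster))
}

r1 <- run_analysis(20220321)
r2 <- run_analysis(20220321)

# The split and the clustering are reproduced together; no second
# set.seed() call is needed just before the model is trained.
print(identical(r1, r2))
```
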
> >
> > On March 21, 2022 5:03:30 PM PDT, Neha gupta 
> > <neha.bologna90 using gmail.com>
> > wrote:
> > >Hello everyone
> > >
> > >I want to know
> > >
> > >(1) In which cases do we need to use set.seed while building ML models?
> > >
> > >(2) Where exactly should we put the set.seed call, i.e. when we split
> > >the data into train/test sets, or just before we train a model?
> > >
> > >Thank you
> > >
> > >       [[alternative HTML version deleted]]
> > >
> > >______________________________________________
> > >R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > >https://stat.ethz.ch/mailman/listinfo/r-help
> > >PLEASE do read the posting guide
> > >http://www.R-project.org/posting-guide.html
> > >and provide commented, minimal, self-contained, reproducible code.
> >
> > --
> > Sent from my phone. Please excuse my brevity.
> >
> >
> >
>
