[R] How important is set.seed

Tue Mar 22 17:33:14 CET 2022

Assuming the train methods "mlp" and "repeatedcv" both draw from the R random number generator, they do different things with those numbers. This question is like asking whether player 1 will consistently win if you play War and gin rummy with identically shuffled decks of cards... all you can tell is that repeated games of War will turn out the same. There is no intrinsic value in setting the seed in this case. If you want to compare results, run each training session with enough variation in seeds that the averaged uncertainty in the results is stable, and only then compare between methods. Simply repeating the train calls without calling set.seed at all will accomplish this.
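
To make that last point concrete, here is a minimal base-R sketch (not caret; the toy data, the holdout_mae() helper and the two formulas are made up for illustration and only stand in for two train() setups). It shows that a fixed seed merely replays one particular run, while unseeded repetition gives a distribution of results per method that can actually be compared.

```r
# Toy data, invented for this example. The seed here only fixes the data set.
set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n), z = rnorm(n))
d$y <- 2 * d$x + rnorm(n)   # y depends on x; z is pure noise

# One hypothetical "training session": random 80/20 split, fit, hold-out MAE.
holdout_mae <- function(formula, data) {
  idx  <- sample(nrow(data), size = floor(0.8 * nrow(data)))
  fit  <- lm(formula, data = data[idx, ])
  pred <- predict(fit, newdata = data[-idx, ])
  mean(abs(data$y[-idx] - pred))
}

# A fixed seed replays exactly the same split, hence the same number.
set.seed(123); run1 <- holdout_mae(y ~ x, d)
set.seed(123); run2 <- holdout_mae(y ~ x, d)
print(identical(run1, run2))

# Re-initialize the RNG so the repetitions below are not tied to a chosen seed.
set.seed(NULL)

reps  <- 100
mae_x <- replicate(reps, holdout_mae(y ~ x, d))
mae_z <- replicate(reps, holdout_mae(y ~ z, d))

# Compare methods through the spread of results, not one seeded run.
print(rbind(
  model_x = c(mean = mean(mae_x), sd = sd(mae_x)),
  model_z = c(mean = mean(mae_z), sd = sd(mae_z))
))
```

The 80/20 split and the 100 repetitions are arbitrary choices; in practice you would increase the number of repetitions until the means and standard deviations stop changing noticeably.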

But this departs from discussion of the R language into a discussion of how caret::train works (not on-topic here)... which I don't know anything about, but you clearly need to understand better.

On March 22, 2022 9:03:21 AM PDT, Neha gupta <neha.bologna90 using gmail.com> wrote:
>Thank you again Tim
>
>d=readARFF("my data")
>
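>set.seed(123)
># assumed: the line creating 'index' is missing from the post; p = 0.8 is a guess
>index <- createDataPartition(d$lneff, p = 0.8, list = FALSE)
>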
>tr <- d[index, ]
>ts <- d[-index, ]
>
>
>ctrl <- trainControl(method = "repeatedcv", number = 10)
>
>set.seed(123)
>ran_search <- train(lneff ~ ., data = tr,
>                    method = "mlp",
>                    tuneLength = 30,
>                    metric = "MAE",
>                    preProc = c("center", "scale", "nzv"),
>                    trControl = ctrl)
>getTrainPerf(ran_search)
>
>
>Would it be good?
>
>On Tue, Mar 22, 2022 at 4:34 PM Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
>
>> My inclination is to follow Jeff’s advice and put it at the beginning of
>> the program.
>>
>> You can always experiment:
>>
>>
>>
>> set.seed(42)
>>
>> rnorm(5,5,5)
>>
>> rnorm(5,5,5)
>>
>> runif(5,0,3)
>>
>>
>>
>> As long as the commands are executed in the order they are written, the
>> outcome is the same every time: set.seed() gives you reproducible
>> outcomes. However, the second rnorm() does not give the same outcome as
>> the first. set.seed() only fixes the starting point of the stream, so if
>> you want the first and second rnorm() calls to give the same results you
>> need another set.seed(42) before the second call.
>>
>>
>>
>> Note also that it does not matter if you pause: whether you run the above
>> code as a chunk or run each command individually, you get the same result
>> (as long as you do it in the sequence written). So if you set the seed,
>> run some code, take a break, and come back to write more code, you might
>> get in trouble because R is still continuing the stream started by the
>> original set.seed() call.
>>
>> To solve this issue use
>>
>> set.seed(Sys.time())
>>
>>
>>
>> Or
>>
>>
>>
>> set.seed(NULL)
>>
>>
>>
>> Some of this is just good programming workflow:
>>
>>
>>
>> Import data
>>
>> Declare variables and constants (set.seed() typically goes here)
>>
>> Define functions
>>
>> Body of code
>>
>> Generate output
>>
>> Clean up (set.seed(NULL) would go here, along with removing unused
>> variables and such)
>>
>>
>>
>> Regards,
>>
>> Tim
>>
>>
>>
>> *From:* Neha gupta <neha.bologna90 using gmail.com>
>> *Sent:* Tuesday, March 22, 2022 10:48 AM
>> *To:* Ebert,Timothy Aaron <tebert using ufl.edu>
>> *Cc:* Jeff Newmiller <jdnewmil using dcn.davis.ca.us>; r-help using r-project.org
>> *Subject:* Re: How important is set.seed
>>
>>
>>
>> Hello Tim
>>
>>
>>
>> In some of the examples I see in tutorials, the random seed is set just
>> before the model training, e.g. before the train function in the caret
>> library. Should I follow this?
>>
>>
>>
>> Best regards
>> On Tuesday, March 22, 2022, Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
>>
>> Ah, so maybe what you need is to think of “set.seed()” as a treatment in
>> an experiment. You could use a random number generator to select an
>> appropriate number of seeds, then use those seeds repeatedly in the
>> different models to see how seed selection influences outcomes. I am not
>> quite sure how many seeds would constitute a good sample. For me that would
>> depend on what I find and how long a run takes.
>>
>>   In parallel processing you set seed in master and then use a random
>> number generator to set seeds in each worker.
>>
>> Tim
>>
>>
>>
>> *From:* Neha gupta <neha.bologna90 using gmail.com>
>> *Sent:* Tuesday, March 22, 2022 6:33 AM
>> *To:* Ebert,Timothy Aaron <tebert using ufl.edu>
>> *Cc:* Jeff Newmiller <jdnewmil using dcn.davis.ca.us>; r-help using r-project.org
>> *Subject:* Re: How important is set.seed
>>
>>
>>
>> Thank you all.
>>
>>
>>
>> Actually I need set.seed because I have to evaluate the consistency of
>> the feature selection produced by different models, so I think for this
>> it's recommended to use the seed.
>>
>>
>>
>> Warm regards
>>
>> On Tuesday, March 22, 2022, Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
>>
>> If you are using the program for data analysis then set.seed() is not
>> necessary unless you are developing a reproducible example. In a standard
>> analysis it is mostly counter-productive, because one should then ask
>> whether the presented results are an artifact of a specific seed that was
>> selected to get a particular result. However, when you need a
>> reproducible example, are debugging a program, or otherwise need the same
>> result from every run of the program, set.seed() is an essential tool.
>> Tim
>>
>> -----Original Message-----
>> From: R-help <r-help-bounces using r-project.org> On Behalf Of Jeff Newmiller
>> Sent: Monday, March 21, 2022 8:41 PM
>> To: r-help using r-project.org; Neha gupta <neha.bologna90 using gmail.com>; r-help
>> mailing list <r-help using r-project.org>
>> Subject: Re: [R] How important is set.seed
>>
>> First off, "ML models" do not all use random numbers (for prediction I
>> would guess very few of them do). Learn and pay attention to what the
>> functions you are using do.
>>
>> Second, if you use random numbers properly and understand the precision
>> that your specific use case offers, then you don't need to use set.seed.
>> However, in practice, using set.seed can allow you to temporarily avoid
>> chasing precision gremlins, or set up specific test cases for testing code,
>> not results. It is your responsibility to not let this become a crutch... a
>> randomized simulation that is actually sensitive to the seed is unlikely to
>> offer an accurate result.
>>
>> Where to put set.seed depends a lot on how you are performing your
>> simulations. In general each process should set it once uniquely at the
>> beginning, and if you use parallel processing then use the features of your
>> parallel processing framework to ensure that this happens. Beware of
>> setting all worker processes to use the same seed.
>>
>> On March 21, 2022 5:03:30 PM PDT, Neha gupta <neha.bologna90 using gmail.com>
>> wrote:
>> >Hello everyone
>> >
>> >I want to know
>> >
>> >(1) In which cases, we need to use set.seed while building ML models?
>> >
>> >(2) Which is the exact location we need to put the set.seed function i.e.
>> >when we split data into train/test sets, or just before we train a model?
>> >
>> >Thank you
>> >
>> >______________________________________________
>> >R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >https://stat.ethz.ch/mailman/listinfo/r-help
>> >PLEASE do read the posting guide
>> >http://www.R-project.org/posting-guide.html
>> >and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>

-- 
Sent from my phone. Please excuse my brevity.