[R] Random Seed Location

Henrik Bengtsson henrik.bengtsson at gmail.com
Mon Mar 5 00:26:14 CET 2018


On Sun, Mar 4, 2018 at 10:18 AM, Paul Gilbert <pgilbert902 at gmail.com> wrote:
> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack001 at sbcglobal.net> wrote:
>
> (Sorry to be a bit slow responding.)
>
> You have not supplied a complete example, which would be good in this case
> because what you are suggesting could be a serious bug in R or a package.
> Serious journals require reproducibility these days. For example, JSS is
> very clear on this point.
>
> To your question
>> My question simply is: should the location of the set.seed command matter,
>> provided that it is applied before any commands which involve randomness
>> (such as partitioning)?
>
> the answer is no, it should not matter. But the proviso is important.
>
> You can determine where things are messing up using something like
>
>  set.seed(654321)
>  zk <- RNGkind()        # [1] "Mersenne-Twister" "Inversion"
>  zk
>  z <- runif(2)
>  z
>  set.seed(654321)
>
>  #  install.packages(c('caret', 'ggplot2', 'e1071'))
>  library(caret)
>  all(runif(2)  == z)   # should be true but it is not always
>
>  set.seed(654321)
>  library(ggplot2)
>  all(runif(2)  == z)   # should be true
>
>  set.seed(654321)
>  library(e1071)
>  all(runif(2)  == z)   # should be true
>
>  all(RNGkind() == zk)  # should be true
>
> On my computer, loading the caret package sometimes, but not always, seems
> to advance the RNG state or change the RNG settings. So you will need to set
> the seed after that package is loaded if you want reproducibility.
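
A minimal sketch of that recommendation (load every package first, then set
the seed immediately before the first command that draws random numbers):

## load packages before touching the RNG; loading them may advance it
library(caret)
library(ggplot2)
library(e1071)

## set the seed only after all library() calls, right before the
## randomness you want to control
set.seed(654321)
n <- 5000
train <- sample(1:n, size = 0.6 * n, replace = FALSE)
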
>
> As Bill Dunlap points out, parallel can introduce much more complicated
> issues. If you are in fact using parallel then we really need a new thread
> with a better subject line, and the discussion will get much messier.
>
> The short answer is that yes, you should be able to get reproducible results
> with parallel computing. If you cannot, then you are almost certainly doing
> something wrong. To publish you really must have reproducible results.
>
> In the example that Bill gave, I think the problem is that set.seed() only
> resets the seed in the main R session; the worker nodes continue to operate
> with their own, unreset RNG streams. To demonstrate this to yourself, run
>
>  library(parallel)
>  cl <- parallel::makeCluster(3)
>  parallel::clusterCall(cl, function()set.seed(100))
>  parallel::clusterCall(cl, function()RNGkind())
>  parallel::clusterCall(cl, function()runif(2)) # same result from all nodes
>                                                # [1] 0.3077661 0.2576725
>
> However, do *NOT* do that in real work. You will be getting the same RNG
> stream from each node. If you are using random numbers and parallel you need
> to read a lot more, and probably consider a variant of the "L'Ecuyer"
> generator or something designed for parallel computing.
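
One minimal sketch of that approach, using parallel::clusterSetRNGStream(),
which gives each node its own "L'Ecuyer-CMRG" stream derived from one seed:

library(parallel)

cl <- makeCluster(3)
## give each node its own, independent L'Ecuyer-CMRG stream
clusterSetRNGStream(cl, iseed = 654321)
## results differ across nodes, but rerunning this whole block with the
## same seed and the same cluster size reproduces the same numbers
parSapply(cl, 1:3, function(i) runif(2))
stopCluster(cl)
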
>
> One special point I will mention because it does not seem to be widely
> appreciated: the number of nodes affects the random stream, so recording the
> number of compute nodes along with the RNG and seed information is important
> for reproducible results. This has the unfortunate consequence that an
> experiment cannot be reproduced on a larger cluster. (If anyone knows
> differently I would very much like to hear.)

[Disclaimer: I'm the author] future.apply::future_lapply(X, ...,
future.seed) etc. can produce identical RNG results regardless of how
'X' is chunked up.  For example,

library(future.apply)

task <- function(i) {
  c(i = i, random = sample.int(10, size = 1), pid = Sys.getpid())
}

y <- list()

plan(multiprocess, workers = 1L)
y[[1]] <- future_sapply(1:10, FUN = task, future.seed = 42)

plan(multiprocess, workers = 2L)
y[[2]] <- future_sapply(1:10, FUN = task, future.seed = 42)

plan(multiprocess, workers = 3L)
y[[3]] <- future_sapply(1:10, FUN = task, future.seed = 42)

gives the exact same random output:

> y

[[1]]
        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
i          1     2     3     4     5     6     7     8     9    10
random     5    10     1     8     7     9     3     5    10     4
pid    31933 31933 31933 31933 31933 31933 31933 31933 31933 31933

[[2]]
        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
i          1     2     3     4     5     6     7     8     9    10
random     5    10     1     8     7     9     3     5    10     4
pid    32141 32141 32141 32141 32141 32142 32142 32142 32142 32142

[[3]]
        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
i          1     2     3     4     5     6     7     8     9    10
random     5    10     1     8     7     9     3     5    10     4
pid    32199 32199 32199 32200 32200 32200 32200 32201 32201 32201

To base the RNG on the current RNG seed (== .GlobalEnv$.Random.seed),
one can use 'future.seed = TRUE'.  For performance reasons, I chose
the default to be 'future.seed = FALSE', because there can be
substantial overhead in setting up reproducible L'Ecuyer RNG
substreams for all elements in 'X'.
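
For example, a quick sketch (same setup as above) of using
'future.seed = TRUE' so that the result is tied to the current .Random.seed:

library(future.apply)
plan(multiprocess, workers = 2L)

set.seed(42)
y1 <- future_sapply(1:10, FUN = function(i) sample.int(10, size = 1),
                    future.seed = TRUE)

set.seed(42)
y2 <- future_sapply(1:10, FUN = function(i) sample.int(10, size = 1),
                    future.seed = TRUE)

identical(y1, y2)  ## TRUE, because both calls start from the same .Random.seed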

I think the snowFT package by Sevcikova & Rossini also provides this
mechanism; Hana Sevcikova is also behind the rlecuyer package.

Hope this helps

/Henrik

>
> Paul Gilbert
>
>
>
>> Hi all,
>>
>> For some odd reason when running naïve Bayes, k-NN, etc., I get slightly
>> different results (e.g., error rates, classification probabilities) from
>> run to run even though I am using the same random seed.
>>
>> Nothing else (input-wise) is changing, but my results are somewhat
>> different from run to run.  The only randomness should be in the
>> partitioning, and I have set the seed before this point.
>>
>> My question simply is: should the location of the set.seed command matter,
>> provided that it is applied before any commands which involve randomness
>> (such as partitioning)?
>>
>> If you need to see the code, it is below:
>>
>> Thank you,
>> Gary
>>
>>
>> A.      Separate the original (in-sample) data from the new
>> (out-of-sample) data.  Set a random seed.
>>
>>> InvestTech <- as.data.frame(InvestTechRevised)
>>> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
>>> InvestTech <- InvestTech[1:5000, ]
>>> set.seed(654321)
>>
>> B.      Install and load the caret, ggplot2 and e1071 packages.
>>
>>> install.packages("caret")
>>> install.packages("ggplot2")
>>> install.packages("e1071")
>>> library(caret)
>>> library(ggplot2)
>>> library(e1071)
>>
>> C.      Bin the predictor variables with approximately equal counts using
>> the cut_number function from the ggplot2 package.  We will use 20 bins.
>>
>>> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
>>> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
>>> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
>>> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)
>>
>> D.      Partition the original (in-sample) data into 60% training and 40%
>> validation sets.
>>
>>> n <- nrow(InvestTech)
>>> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
>>> InvestTechTrain <- InvestTech[train, ]
>>> InvestTechVal <- InvestTech[-train, ]
>>
>> E.      Use the naiveBayes function in the e1071 package to fit the model.
>>
>>> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
>>> prob <- predict(model, newdata = InvestTechVal, type = "raw")
>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>>
>> F.      Use the confusionMatrix function in the caret package to output
>> the confusion matrix.
>>
>>> confMtr <- confusionMatrix(pred, unlist(InvestTechVal[, 3]),
>>                             mode = "everything", positive = "1")
>>> accuracy <- confMtr$overall[1]
>>> valError <- 1 - accuracy
>>> confMtr
>>
>> G.      Classify the 18 new (out-of-sample) readers using the following
>> code.
>>> prob <- predict(model, newdata = outOfSample, type = "raw")
>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>>> cbind(pred, prob, outOfSample[, -3])
>>
>>
>>
>
>
> If your computations involve the parallel package then set.seed(seed)
> may not produce repeatable results.  E.g.,
>
>> cl <- parallel::makeCluster(3)  # Create cluster with 3 nodes on local host
>> set.seed(100); runif(2)
> [1] 0.3077661 0.2576725
>> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>          [,1]     [,2]     [,3]
> [1,] 101.7779 102.5308 103.3459
> [2,] 101.8128 102.6114 103.9102
>>
>> set.seed(100); runif(2)
> [1] 0.3077661 0.2576725
>> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
>          [,1]     [,2]     [,3]
> [1,] 101.1628 102.9643 103.2684
> [2,] 101.9205 102.6937 103.7907
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>


