[R] Random Seed Location
Henrik Bengtsson
henrik.bengtsson at gmail.com
Mon Mar 5 00:26:14 CET 2018
On Sun, Mar 4, 2018 at 10:18 AM, Paul Gilbert <pgilbert902 at gmail.com> wrote:
> On Mon, Feb 26, 2018 at 3:25 PM, Gary Black <gwblack001 at sbcglobal.net>
> wrote:
>
> (Sorry to be a bit slow responding.)
>
> You have not supplied a complete example, which would be good in this case
> because what you are suggesting could be a serious bug in R or a package.
> Serious journals require reproducibility these days. For example, JSS is
> very clear on this point.
>
> To your question
>> My question simply is: should the location of the set.seed command
>> matter,
>> provided that it is applied before any commands which involve randomness
>> (such as partitioning)?
>
> the answer is no, it should not matter. But the proviso is important.
>
> You can determine where things are messing up using something like
>
> set.seed(654321)
> zk <- RNGkind() # [1] "Mersenne-Twister" "Inversion"
> zk
> z <- runif(2)
> z
> set.seed(654321)
>
> # install.packages(c('caret', 'ggplot2', 'e1071'))
> library(caret)
> all(runif(2) == z) # should be true but it is not always
>
> set.seed(654321)
> library(ggplot2)
> all(runif(2) == z) # should be true
>
> set.seed(654321)
> library(e1071)
> all(runif(2) == z) # should be true
>
> all(RNGkind() == zk) # should be true
>
> On my computer package caret seems to sometimes, but not always, do
> something that advances or changes the RNG. So you will need to set the seed
> after that package is loaded if you want reproducibility.
>
> As Bill Dunlap points out, parallel can introduce much more complicated
> issues. If you are in fact using parallel then we really need a new thread
> with a better subject line, and the discussion will get much messier.
>
> The short answer is that, yes you should be able to get reproducible results
> with parallel computing. If you cannot then you are almost certainly doing
> something wrong. To publish you really must have reproducible results.
>
> In the example that Bill gave, I think the problem is that set.seed() only
> resets the seed in the main thread, the nodes continue to operate with
> unreset RNG. To demonstrate this to yourself you can do
>
> library(parallel)
> cl <- parallel::makeCluster(3)
> parallel::clusterCall(cl, function()set.seed(100))
> parallel::clusterCall(cl, function()RNGkind())
> parallel::clusterCall(cl, function()runif(2)) # similar result from all
> nodes
> # [1] 0.3077661 0.2576725
>
> However, do *NOT* do that in real work. You will be getting the same RNG
> stream from each node. If you are using random numbers and parallel you need
> to read a lot more, and probably consider a variant of the "L'Ecuyer"
> generator or something designed for parallel computing.
>
> One special point I will mention because it does not seem to be widely
> appreciated: the number of nodes affects the random stream, so recording the
> number of compute nodes along with the RNG and seed information is important
> for reproducible results. This has the unfortunate consequence that an
> experiment cannot be reproduced on a larger cluster. (If anyone knows
> differently I would very much like to hear.)
[Disclaimer: I'm the author] future.apply::future_lapply(X, ...,
future.seed) etc. can produce identical RNG results regardless of how
'X' is chunked up. For example,
library(future.apply)
task <- function(i) {
c(i = i, random = sample.int(10, size = 1), pid = Sys.getpid())
}
y <- list()
plan(multiprocess, workers = 1L)
y[[1]] <- future_sapply(1:10, FUN = task, future.seed = 42)
plan(multiprocess, workers = 2L)
y[[2]] <- future_sapply(1:10, FUN = task, future.seed = 42)
plan(multiprocess, workers = 3L)
y[[3]] <- future_sapply(1:10, FUN = task, future.seed = 42)
gives the exact same random output:
> y
[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
i 1 2 3 4 5 6 7 8 9 10
random 5 10 1 8 7 9 3 5 10 4
pid 31933 31933 31933 31933 31933 31933 31933 31933 31933 31933
[[2]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
i 1 2 3 4 5 6 7 8 9 10
random 5 10 1 8 7 9 3 5 10 4
pid 32141 32141 32141 32141 32141 32142 32142 32142 32142 32142
[[3]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
i 1 2 3 4 5 6 7 8 9 10
random 5 10 1 8 7 9 3 5 10 4
pid 32199 32199 32199 32200 32200 32200 32200 32201 32201 32201
To base the RNG on the current RNG seed (== .GlobalEnv$.Random.seed),
one can use 'future.seed = TRUE'. For performance reasons, I choose
the default to be 'future.seed = FALSE', because there can be a
substantial overhead in setting up reproducible L'Ecuyer
subRNG-streams for all elements in 'X'.
I think the snowFT package by Sevcikova & Rossini also provides this
mechanism; Hana Sevcikova is also behind the rlecuyer package.
Hope this helps
/Henrik
>
> Paul Gilbert
>
>
>
>> Hi all,
>>
>> For some odd reason when running naïve bayes, k-NN, etc., I get slightly
>> different results (e.g., error rates, classification probabilities) from
>> run
>> to run even though I am using the same random seed.
>>
>> Nothing else (input-wise) is changing, but my results are somewhat
>> different
>> from run to run. The only randomness should be in the partitioning, and I
>> have set the seed before this point.
>>
>> My question simply is: should the location of the set.seed command
>> matter,
>> provided that it is applied before any commands which involve randomness
>> (such as partitioning)?
>>
>> If you need to see the code, it is below:
>>
>> Thank you,
>> Gary
>>
>>
>> A. Separate the original (in-sample) data from the new
>> (out-of-sample)
>> data. Set a random seed.
>>
>>> InvestTech <- as.data.frame(InvestTechRevised)
>>> outOfSample <- InvestTech[5001:nrow(InvestTech), ]
>>> InvestTech <- InvestTech[1:5000, ]
>>> set.seed(654321)
>>
>> B. Install and load the caret, ggplot2 and e1071 packages.
>>
>>> install.packages(“caret”)
>>> install.packages(“ggplot2”)
>>> install.packages(“e1071”)
>>> library(caret)
>>> library(ggplot2)
>>> library(e1071)
>>
>> C. Bin the predictor variables with approximately equal counts using
>> the cut_number function from the ggplot2 package. We will use 20 bins.
>>
>>> InvestTech[, 1] <- cut_number(InvestTech[, 1], n = 20)
>>> InvestTech[, 2] <- cut_number(InvestTech[, 2], n = 20)
>>> outOfSample[, 1] <- cut_number(outOfSample[, 1], n = 20)
>>> outOfSample[, 2] <- cut_number(outOfSample[, 2], n = 20)
>>
>> D. Partition the original (in-sample) data into 60% training and 40%
>> validation sets.
>>
>>> n <- nrow(InvestTech)
>>> train <- sample(1:n, size = 0.6 * n, replace = FALSE)
>>> InvestTechTrain <- InvestTech[train, ]
>>> InvestTechVal <- InvestTech[-train, ]
>>
>> E. Use the naiveBayes function in the e1071 package to fit the model.
>>
>>> model <- naiveBayes(`Purchase (1=yes, 0=no)` ~ ., data = InvestTechTrain)
>>> prob <- predict(model, newdata = InvestTechVal, type = “raw”)
>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>>
>> F. Use the confusionMatrix function in the caret package to output
>> the
>> confusion matrix.
>>
>>> confMtr <- confusionMatrix(pred,unlist(InvestTechVal[, 3]),mode =
>> “everything”, positive = “1”)
>>> accuracy <- confMtr$overall[1]
>>> valError <- 1 – accuracy
>>> confMtr
>>
>> G. Classify the 18 new (out-of-sample) readers using the following
>> code.
>>> prob <- predict(model, newdata = outOfSample, type = “raw”)
>>> pred <- ifelse(prob[, 2] >= 0.3, 1, 0)
>>> cbind(pred, prob, outOfSample[, -3])
>>
>>
>>
>
>
> If your computations involve the parallel package then set.seed(seed)
> may not produce repeatable results. E.g.,
>
>> cl <- parallel::makeCluster(3) # Create cluster with 3 nodes on local
> host
>> set.seed(100); runif(2)
> [1] 0.3077661 0.2576725
>> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
> [,1] [,2] [,3]
> [1,] 101.7779 102.5308 103.3459
> [2,] 101.8128 102.6114 103.9102
>>
>> set.seed(100); runif(2)
> [1] 0.3077661 0.2576725
>> parallel::parSapply(cl, 101:103, function(i)runif(2, i, i+1))
> [,1] [,2] [,3]
> [1,] 101.1628 102.9643 103.2684
> [2,] 101.9205 102.6937 103.7907
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list