[Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
William Dunlap
wdunlap at tibco.com
Fri Nov 3 18:57:25 CET 2017
Another generator is subject to the same problem with the same
probability.
> Filter(function(s){set.seed(s, kind="Knuth-TAOCP-2002");runif(1,17,26)>25.99}, 1:10000)
[1] 280 415 826 1372 2224 2544 3270 3594 3809 4116 4236 5018 5692 7043
7212 7364 7747 9256 9491 9568 9886
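
As a rough check on what "the same probability" means here, the sketch below
works out how many seeds in 1:10000 one would expect to pass the filter if the
first draw of each stream were uniform on (17, 26), and adds a small helper for
looking at later draws in each stream. The helper name `kth_draw` and its
arguments are made up for illustration; it uses only `set.seed`, `runif` and
`RNGkind` from base R, and it resets the global RNG state as a side effect.

```r
# Probability that one Uniform(17, 26) draw exceeds 25.99.
lower <- 17
upper <- 26
cutoff <- 25.99
p_hit <- (upper - cutoff) / (upper - lower)

# Expected number of seeds in 1:10000 whose first draw passes the filter,
# assuming first draws behave like independent uniforms: roughly 11.
n_seeds <- 10000
expected_hits <- n_seeds * p_hit

# Hypothetical helper: the k-th draw of the stream started by each seed.
# Note: this overwrites .Random.seed; the generator kind is restored on exit.
kth_draw <- function(seeds, k = 1, kind = "Mersenne-Twister",
                     min = lower, max = upper) {
  old_kind <- RNGkind()[1]
  on.exit(RNGkind(old_kind), add = TRUE)
  vapply(seeds, function(s) {
    set.seed(s, kind = kind)
    runif(k, min, max)[k]
  }, numeric(1))
}
```

For example, `mean(kth_draw(1:10000, k = 1, kind = "Knuth-TAOCP-2002") > cutoff)`
can be compared with `p_hit`, and `k = 2` gives the same comparison for the
second value of each stream.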
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Fri, Nov 3, 2017 at 10:31 AM, Tirthankar Chakravarty <
tirthankar.lists at gmail.com> wrote:
>
> Bill,
>
> I have clarified this on SO, and I will copy that clarification in here:
>
> "Sure, we tested them on other 8-digit numbers as well & we could not
> replicate. However, these are honest-to-goodness numbers generated by a
> non-adversarial system that has no conception of these numbers being used
> for anything other than a unique key for an entity -- these are not a
> specially constructed edge case. Would be good to know what seeds will and
> will not work, and why."
>
> These numbers are generated by an application that serves a form, and
> associates form IDs in a sequence. The application calls our API depending
> on the form values entered by users, which in turn calls our R code that
> executes some code that needs an RNG. Since the API has to be stateless, to
> be able to replicate the results for possible debugging, we need to draw
> random numbers in a way that we can replicate the results of the API
> response -- we use the form ID as seeds.
>
> I repeat, there is no design or anything adversarial about the way that
> these numbers were generated -- the system generating these numbers and
> the users entering inputs have no conception of our use of an RNG -- this
> is meant to just be a random sequence of form IDs. This issue was
> discovered completely by chance when the output of the API was observed to
> be highly non-random. It is possible that it is a 1/10^8 chance, but that
> is hard to believe, given that the API hit depends on user input. Note also
> that the issue goes away when we use a different RNG as mentioned below.
>
> T
>
> On Fri, Nov 3, 2017 at 9:58 PM, William Dunlap <wdunlap at tibco.com> wrote:
>
>> The random numbers in a stream initialized with one seed should have
>> about the desired distribution. You don't win by changing the seed all the
>> time. Your seeds caused the first numbers of a bunch of streams to be
>> about the same, but the second and subsequent entries in each stream do
>> look uniformly distributed.
>>
>> You didn't say what your 'upstream process' was, but it is easy to come
>> up with seeds that give about the same first value:
>>
>> > Filter(function(s){set.seed(s);runif(1,17,26)>25.99}, 1:10000)
>> [1] 514 532 1951 2631 3974 4068 4229 6092 6432 7264 9090
>>
>>
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Fri, Nov 3, 2017 at 12:49 AM, Tirthankar Chakravarty <
>> tirthankar.lists at gmail.com> wrote:
>>
>>> This is cross-posted from SO (https://stackoverflow.com/q/47079702/1414455),
>>> but I now feel that this needs someone from R-Devel to help understand why
>>> this is happening.
>>>
>>> We are facing a weird situation in our code when using R's [`runif`][1] and
>>> setting seed with `set.seed` with the `kind = NULL` option (which resolves,
>>> unless I am mistaken, to `kind = "default"`; the default being
>>> `"Mersenne-Twister"`).
>>>
>>> We set the seed using (8 digit) unique IDs generated by an upstream system,
>>> before calling `runif`:
>>>
>>>     seeds = c(
>>>       "86548915", "86551615", "86566163", "86577411", "86584144",
>>>       "86584272", "86620568", "86724613", "86756002", "86768593", "86772411",
>>>       "86781516", "86794389", "86805854", "86814600", "86835092", "86874179",
>>>       "86876466", "86901193", "86987847", "86988080")
>>>
>>>     random_values = sapply(seeds, function(x) {
>>>       set.seed(x)
>>>       y = runif(1, 17, 26)
>>>       return(y)
>>>     })
>>>
>>> This gives values that are **extremely** bunched together.
>>>
>>>     > summary(random_values)
>>>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>       25.13   25.36   25.66   25.58   25.83   25.94
>>>
>>> This behaviour of `runif` goes away when we use `kind =
>>> "Knuth-TAOCP-2002"`, and we get values that appear to be much more evenly
>>> spread out.
>>>
>>>     random_values = sapply(seeds, function(x) {
>>>       set.seed(x, kind = "Knuth-TAOCP-2002")
>>>       y = runif(1, 17, 26)
>>>       return(y)
>>>     })
>>>
>>> *Output omitted.*
>>>
>>> ---
>>>
>>> **The most interesting thing here is that this does not happen on Windows
>>> -- only happens on Ubuntu** (`sessionInfo` output for Ubuntu & Windows
>>> below).
>>>
>>> # Windows output: #
>>>
>>>     > seeds = c(
>>>     +   "86548915", "86551615", "86566163", "86577411", "86584144",
>>>     +   "86584272", "86620568", "86724613", "86756002", "86768593", "86772411",
>>>     +   "86781516", "86794389", "86805854", "86814600", "86835092", "86874179",
>>>     +   "86876466", "86901193", "86987847", "86988080")
>>>     >
>>>     > random_values = sapply(seeds, function(x) {
>>>     +   set.seed(x)
>>>     +   y = runif(1, 17, 26)
>>>     +   return(y)
>>>     + })
>>>     >
>>>     > summary(random_values)
>>>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>>       17.32   20.14   23.00   22.17   24.07   25.90
>>>
>>> Can someone help understand what is going on?
>>>
>>> Ubuntu
>>> ------
>>>
>>> R version 3.4.0 (2017-04-21)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>> Running under: Ubuntu 16.04.2 LTS
>>>
>>> Matrix products: default
>>> BLAS: /usr/lib/libblas/libblas.so.3.6.0
>>> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
>>> [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] RMySQL_0.10.8 DBI_0.6-1
>>> [3] jsonlite_1.4 tidyjson_0.2.2
>>> [5] optiRum_0.37.3 lubridate_1.6.0
>>> [7] httr_1.2.1 gdata_2.18.0
>>> [9] XLConnect_0.2-12 XLConnectJars_0.2-12
>>> [11] data.table_1.10.4 stringr_1.2.0
>>> [13] readxl_1.0.0 xlsx_0.5.7
>>> [15] xlsxjars_0.6.1 rJava_0.9-8
>>> [17] sqldf_0.4-10 RSQLite_1.1-2
>>> [19] gsubfn_0.6-6 proto_1.0.0
>>> [21] dplyr_0.5.0 purrr_0.2.4
>>> [23] readr_1.1.1 tidyr_0.6.3
>>> [25] tibble_1.3.0 tidyverse_1.1.1
>>> [27] rBayesianOptimization_1.1.0 xgboost_0.6-4
>>> [29] MLmetrics_1.1.1 caret_6.0-76
>>> [31] ROCR_1.0-7 gplots_3.0.1
>>> [33] effects_3.1-2 pROC_1.10.0
>>> [35] pscl_1.4.9 lattice_0.20-35
>>> [37] MASS_7.3-47 ggplot2_2.2.1
>>>
>>> loaded via a namespace (and not attached):
>>>  [1] splines_3.4.0      foreach_1.4.3      AUC_0.3.0          modelr_0.1.0
>>>  [5] gtools_3.5.0       assertthat_0.2.0   stats4_3.4.0       cellranger_1.1.0
>>>  [9] quantreg_5.33      chron_2.3-50       digest_0.6.10      rvest_0.3.2
>>> [13] minqa_1.2.4        colorspace_1.3-2   Matrix_1.2-10      plyr_1.8.4
>>> [17] psych_1.7.3.21     XML_3.98-1.7       broom_0.4.2        SparseM_1.77
>>> [21] haven_1.0.0        scales_0.4.1       lme4_1.1-13        MatrixModels_0.4-1
>>> [25] mgcv_1.8-17        car_2.1-5          nnet_7.3-12        lazyeval_0.2.0
>>> [29] pbkrtest_0.4-7     mnormt_1.5-5       magrittr_1.5       memoise_1.0.0
>>> [33] nlme_3.1-131       forcats_0.2.0      xml2_1.1.1         foreign_0.8-69
>>> [37] tools_3.4.0        hms_0.3            munsell_0.4.3      compiler_3.4.0
>>> [41] caTools_1.17.1     rlang_0.1.1        grid_3.4.0         nloptr_1.0.4
>>> [45] iterators_1.0.8    bitops_1.0-6       tcltk_3.4.0        gtable_0.2.0
>>> [49] ModelMetrics_1.1.0 codetools_0.2-15   reshape2_1.4.2     R6_2.2.0
>>> [53] knitr_1.15.1       KernSmooth_2.23-15 stringi_1.1.5      Rcpp_0.12.11
>>>
>>>
>>>
>>> Windows
>>> -------
>>>
>>> > sessionInfo()
>>> R version 3.3.2 (2016-10-31)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>> Running under: Windows >= 8 x64 (build 9200)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252    LC_MONETARY=English_India.1252
>>> [4] LC_NUMERIC=C                   LC_TIME=English_India.1252
>>>
>>> attached base packages:
>>> [1] graphics  grDevices utils     datasets  grid      stats     methods   base
>>>
>>> other attached packages:
>>>  [1] bindrcpp_0.2        h2o_3.14.0.3    ggrepel_0.6.5      eulerr_1.1.0         VennDiagram_1.6.17
>>>  [6] futile.logger_1.4.3 scales_0.4.1    FinCal_0.6.3       xml2_1.0.0           httr_1.3.0
>>> [11] wesanderson_0.3.2   wordcloud_2.5   RColorBrewer_1.1-2 htmltools_0.3.6      urltools_1.6.0
>>> [16] timevis_0.4         dtplyr_0.0.1    magrittr_1.5       shiny_1.0.5          RODBC_1.3-14
>>> [21] zoo_1.8-0           sqldf_0.4-10    RSQLite_1.1-2      gsubfn_0.6-6         proto_1.0.0
>>> [26] gdata_2.17.0        stringr_1.2.0   XLConnect_0.2-12   XLConnectJars_0.2-12 data.table_1.10.4
>>> [31] xlsx_0.5.7          xlsxjars_0.6.1  rJava_0.9-8        readxl_0.1.1         googlesheets_0.2.1
>>> [36] jsonlite_1.5        tidyjson_0.2.1  RMySQL_0.10.9      RPostgreSQL_0.4-1    DBI_0.5-1
>>> [41] dplyr_0.7.2         purrr_0.2.3     readr_1.1.1        tidyr_0.7.0          tibble_1.3.3
>>> [46] ggplot2_2.2.0       tidyverse_1.0.0 lubridate_1.6.0
>>>
>>> loaded via a namespace (and not attached):
>>>  [1] gtools_3.5.0     assertthat_0.2.0     triebeard_0.3.0 cellranger_1.1.0 yaml_2.1.14
>>>  [6] slam_0.1-40      lattice_0.20-34      glue_1.1.1      chron_2.3-48     digest_0.6.12.1
>>> [11] colorspace_1.3-1 httpuv_1.3.5         plyr_1.8.4      pkgconfig_2.0.1  xtable_1.8-2
>>> [16] lazyeval_0.2.0   mime_0.5             memoise_1.0.0   tools_3.3.2      hms_0.3
>>> [21] munsell_0.4.3    lambda.r_1.1.9       rlang_0.1.1     RCurl_1.95-4.8   labeling_0.3
>>> [26] bitops_1.0-6     tcltk_3.3.2          gtable_0.2.0    reshape2_1.4.2   R6_2.2.0
>>> [31] bindr_0.1        futile.options_1.0.0 stringi_1.1.2   Rcpp_0.12.12.1
>>>
>>> [1]: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/Uniform.html
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>>
>