[Rd] Extreme bunching of random values from runif with Mersenne-Twister seed
Tirthankar Chakravarty
tirthankar.lists at gmail.com
Fri Nov 3 19:30:03 CET 2017
Bill,
Appreciate the point that both you and Serguei are making, but the sequence
in question is not a selected or filtered set. These are values as observed
in a sequence from a mechanism described below. The probabilities required
to generate this exact sequence in the wild seem staggering to me.
T
On Fri, Nov 3, 2017 at 11:27 PM, William Dunlap <wdunlap at tibco.com> wrote:
> Another other generator is subject to the same problem with the same
> probabilitiy.
>
> > Filter(function(s){set.seed(s, kind="Knuth-TAOCP-2002");runif(1,17,26)>25.99},
> 1:10000)
> [1] 280 415 826 1372 2224 2544 3270 3594 3809 4116 4236 5018 5692 7043
> 7212 7364 7747 9256 9491 9568 9886
>
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Fri, Nov 3, 2017 at 10:31 AM, Tirthankar Chakravarty <
> tirthankar.lists at gmail.com> wrote:
>
>>
>> Bill,
>>
>> I have clarified this on SO, and I will copy that clarification in here:
>>
>> "Sure, we tested them on other 8-digit numbers as well & we could not
>> replicate. However, these are honest-to-goodness numbers generated by a
>> non-adversarial system that has no conception of these numbers being used
>> for anything other than a unique key for an entity -- these are not a
>> specially constructed edge case. Would be good to know what seeds will and
>> will not work, and why."
>>
>> These numbers are generated by an application that serves a form, and
>> associates form IDs in a sequence. The application calls our API depending
>> on the form values entered by users, which in turn calls our R code that
>> executes some code that needs an RNG. Since the API has to be stateless, to
>> be able to replicate the results for possible debugging, we need to draw
>> random numbers in a way that we can replicate the results of the API
>> response -- we use the form ID as seeds.
>>
>> I repeat, there is no design or anything adversarial about the way that
>> these numbers were generated -- the system generating these numbers and
>> the users entering inputs have no conception of our use of an RNG -- this
>> is meant to just be a random sequence of form IDs. This issue was
>> discovered completely by chance when the output of the API was observed to
>> be highly non-random. It is possible that it is a 1/10^8 chance, but that
>> is hard to believe, given that the API hit depends on user input. Note also
>> that the issue goes away when we use a different RNG as mentioned below.
>>
>> T
>>
>> On Fri, Nov 3, 2017 at 9:58 PM, William Dunlap <wdunlap at tibco.com> wrote:
>>
>>> The random numbers in a stream initialized with one seed should have
>>> about the desired distribution. You don't win by changing the seed all the
>>> time. Your seeds caused the first numbers of a bunch of streams to be
>>> about the same, but the second and subsequent entries in each stream do
>>> look uniformly distributed.
>>>
>>> You didn't say what your 'upstream process' was, but it is easy to come
>>> up with seeds that give about the same first value:
>>>
>>> > Filter(function(s){set.seed(s);runif(1,17,26)>25.99}, 1:10000)
>>> [1] 514 532 1951 2631 3974 4068 4229 6092 6432 7264 9090
>>>
>>>
>>>
>>> Bill Dunlap
>>> TIBCO Software
>>> wdunlap tibco.com
>>>
>>> On Fri, Nov 3, 2017 at 12:49 AM, Tirthankar Chakravarty <
>>> tirthankar.lists at gmail.com> wrote:
>>>
>>>> This is cross-posted from SO (https://stackoverflow.com/q/4
>>>> 7079702/1414455),
>>>> but I now feel that this needs someone from R-Devel to help understand
>>>> why
>>>> this is happening.
>>>>
>>>> We are facing a weird situation in our code when using R's [`runif`][1]
>>>> and
>>>> setting seed with `set.seed` with the `kind = NULL` option (which
>>>> resolves,
>>>> unless I am mistaken, to `kind = "default"`; the default being
>>>> `"Mersenne-Twister"`).
>>>>
>>>> We set the seed using (8 digit) unique IDs generated by an upstream
>>>> system,
>>>> before calling `runif`:
>>>>
>>>> seeds = c(
>>>> "86548915", "86551615", "86566163", "86577411", "86584144",
>>>> "86584272", "86620568", "86724613", "86756002", "86768593",
>>>> "86772411",
>>>> "86781516", "86794389", "86805854", "86814600", "86835092",
>>>> "86874179",
>>>> "86876466", "86901193", "86987847", "86988080")
>>>>
>>>> random_values = sapply(seeds, function(x) {
>>>> set.seed(x)
>>>> y = runif(1, 17, 26)
>>>> return(y)
>>>> })
>>>>
>>>> This gives values that are **extremely** bunched together.
>>>>
>>>> > summary(random_values)
>>>> Min. 1st Qu. Median Mean 3rd Qu. Max.
>>>> 25.13 25.36 25.66 25.58 25.83 25.94
>>>>
>>>> This behaviour of `runif` goes away when we use `kind =
>>>> "Knuth-TAOCP-2002"`, and we get values that appear to be much more
>>>> evenly
>>>> spread out.
>>>>
>>>> random_values = sapply(seeds, function(x) {
>>>> set.seed(x, kind = "Knuth-TAOCP-2002")
>>>> y = runif(1, 17, 26)
>>>> return(y)
>>>> })
>>>>
>>>> *Output omitted.*
>>>>
>>>> ---
>>>>
>>>> **The most interesting thing here is that this does not happen on
>>>> Windows
>>>> -- only happens on Ubuntu** (`sessionInfo` output for Ubuntu & Windows
>>>> below).
>>>>
>>>> # Windows output: #
>>>>
>>>> > seeds = c(
>>>> + "86548915", "86551615", "86566163", "86577411", "86584144",
>>>> + "86584272", "86620568", "86724613", "86756002", "86768593",
>>>> "86772411",
>>>> + "86781516", "86794389", "86805854", "86814600", "86835092",
>>>> "86874179",
>>>> + "86876466", "86901193", "86987847", "86988080")
>>>> >
>>>> > random_values = sapply(seeds, function(x) {
>>>> + set.seed(x)
>>>> + y = runif(1, 17, 26)
>>>> + return(y)
>>>> + })
>>>> >
>>>> > summary(random_values)
>>>> Min. 1st Qu. Median Mean 3rd Qu. Max.
>>>> 17.32 20.14 23.00 22.17 24.07 25.90
>>>>
>>>> Can someone help understand what is going on?
>>>>
>>>> Ubuntu
>>>> ------
>>>>
>>>> R version 3.4.0 (2017-04-21)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>> Running under: Ubuntu 16.04.2 LTS
>>>>
>>>> Matrix products: default
>>>> BLAS: /usr/lib/libblas/libblas.so.3.6.0
>>>> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
>>>> [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] parallel stats graphics grDevices utils datasets
>>>> methods base
>>>>
>>>> other attached packages:
>>>> [1] RMySQL_0.10.8 DBI_0.6-1
>>>> [3] jsonlite_1.4 tidyjson_0.2.2
>>>> [5] optiRum_0.37.3 lubridate_1.6.0
>>>> [7] httr_1.2.1 gdata_2.18.0
>>>> [9] XLConnect_0.2-12 XLConnectJars_0.2-12
>>>> [11] data.table_1.10.4 stringr_1.2.0
>>>> [13] readxl_1.0.0 xlsx_0.5.7
>>>> [15] xlsxjars_0.6.1 rJava_0.9-8
>>>> [17] sqldf_0.4-10 RSQLite_1.1-2
>>>> [19] gsubfn_0.6-6 proto_1.0.0
>>>> [21] dplyr_0.5.0 purrr_0.2.4
>>>> [23] readr_1.1.1 tidyr_0.6.3
>>>> [25] tibble_1.3.0 tidyverse_1.1.1
>>>> [27] rBayesianOptimization_1.1.0 xgboost_0.6-4
>>>> [29] MLmetrics_1.1.1 caret_6.0-76
>>>> [31] ROCR_1.0-7 gplots_3.0.1
>>>> [33] effects_3.1-2 pROC_1.10.0
>>>> [35] pscl_1.4.9 lattice_0.20-35
>>>> [37] MASS_7.3-47 ggplot2_2.2.1
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] splines_3.4.0 foreach_1.4.3 AUC_0.3.0
>>>> modelr_0.1.0
>>>> [5] gtools_3.5.0 assertthat_0.2.0 stats4_3.4.0
>>>> cellranger_1.1.0
>>>> [9] quantreg_5.33 chron_2.3-50 digest_0.6.10
>>>> rvest_0.3.2
>>>> [13] minqa_1.2.4 colorspace_1.3-2 Matrix_1.2-10
>>>> plyr_1.8.4
>>>> [17] psych_1.7.3.21 XML_3.98-1.7 broom_0.4.2
>>>> SparseM_1.77
>>>> [21] haven_1.0.0 scales_0.4.1 lme4_1.1-13
>>>> MatrixModels_0.4-1
>>>> [25] mgcv_1.8-17 car_2.1-5 nnet_7.3-12
>>>> lazyeval_0.2.0
>>>> [29] pbkrtest_0.4-7 mnormt_1.5-5 magrittr_1.5
>>>> memoise_1.0.0
>>>> [33] nlme_3.1-131 forcats_0.2.0 xml2_1.1.1
>>>> foreign_0.8-69
>>>> [37] tools_3.4.0 hms_0.3 munsell_0.4.3
>>>> compiler_3.4.0
>>>> [41] caTools_1.17.1 rlang_0.1.1 grid_3.4.0
>>>> nloptr_1.0.4
>>>> [45] iterators_1.0.8 bitops_1.0-6 tcltk_3.4.0
>>>> gtable_0.2.0
>>>> [49] ModelMetrics_1.1.0 codetools_0.2-15 reshape2_1.4.2
>>>> R6_2.2.0
>>>>
>>>> [53] knitr_1.15.1 KernSmooth_2.23-15 stringi_1.1.5
>>>> Rcpp_0.12.11
>>>>
>>>>
>>>>
>>>> Windows
>>>> -------
>>>>
>>>> > sessionInfo()
>>>> R version 3.3.2 (2016-10-31)
>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>> Running under: Windows >= 8 x64 (build 9200)
>>>>
>>>> locale:
>>>> [1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
>>>> LC_MONETARY=English_India.1252
>>>> [4] LC_NUMERIC=C LC_TIME=English_India.1252
>>>>
>>>> attached base packages:
>>>> [1] graphics grDevices utils datasets grid stats
>>>> methods base
>>>>
>>>> other attached packages:
>>>> [1] bindrcpp_0.2 h2o_3.14.0.3 ggrepel_0.6.5
>>>> eulerr_1.1.0 VennDiagram_1.6.17
>>>> [6] futile.logger_1.4.3 scales_0.4.1 FinCal_0.6.3
>>>> xml2_1.0.0 httr_1.3.0
>>>> [11] wesanderson_0.3.2 wordcloud_2.5 RColorBrewer_1.1-2
>>>> htmltools_0.3.6 urltools_1.6.0
>>>> [16] timevis_0.4 dtplyr_0.0.1 magrittr_1.5
>>>> shiny_1.0.5 RODBC_1.3-14
>>>> [21] zoo_1.8-0 sqldf_0.4-10 RSQLite_1.1-2
>>>> gsubfn_0.6-6 proto_1.0.0
>>>> [26] gdata_2.17.0 stringr_1.2.0 XLConnect_0.2-12
>>>> XLConnectJars_0.2-12 data.table_1.10.4
>>>> [31] xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-8
>>>> readxl_0.1.1 googlesheets_0.2.1
>>>> [36] jsonlite_1.5 tidyjson_0.2.1 RMySQL_0.10.9
>>>> RPostgreSQL_0.4-1 DBI_0.5-1
>>>> [41] dplyr_0.7.2 purrr_0.2.3 readr_1.1.1
>>>> tidyr_0.7.0 tibble_1.3.3
>>>> [46] ggplot2_2.2.0 tidyverse_1.0.0 lubridate_1.6.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] gtools_3.5.0 assertthat_0.2.0 triebeard_0.3.0
>>>> cellranger_1.1.0 yaml_2.1.14
>>>> [6] slam_0.1-40 lattice_0.20-34 glue_1.1.1
>>>> chron_2.3-48 digest_0.6.12.1
>>>> [11] colorspace_1.3-1 httpuv_1.3.5 plyr_1.8.4
>>>> pkgconfig_2.0.1 xtable_1.8-2
>>>> [16] lazyeval_0.2.0 mime_0.5 memoise_1.0.0
>>>> tools_3.3.2 hms_0.3
>>>> [21] munsell_0.4.3 lambda.r_1.1.9 rlang_0.1.1
>>>> RCurl_1.95-4.8 labeling_0.3
>>>> [26] bitops_1.0-6 tcltk_3.3.2 gtable_0.2.0
>>>> reshape2_1.4.2 R6_2.2.0
>>>> [31] bindr_0.1 futile.options_1.0.0 stringi_1.1.2
>>>> Rcpp_0.12.12.1
>>>>
>>>> [1]: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/Unif
>>>> orm.html
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>
>>>
>>
>
[[alternative HTML version deleted]]
More information about the R-devel
mailing list