[R] fixed set.seed + kmeans output disagree on distinct platforms
Martin Maechler
m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Wed Sep 4 10:41:34 CEST 2024
>>>>> Bert Gunter
>>>>> on Tue, 3 Sep 2024 23:32:25 -0700 writes:
> I have no clue, but I did note that you are using different versions of
> BLAS/LAPACK on the different platforms. Could that be (part) of the issue?
Good catch! My gut feeling would say "yes!" that is almost surely part of
the issue.
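
To compare what each machine is actually linked against, something along these lines could be run on every platform and the printed lines put side by side. This is only a sketch: `extSoftVersion()`, `La_library()` and `La_version()` are base R functions, while the name `num_libs` is made up for illustration.

```r
# Collect the numerical libraries this R session is using, so that the
# same four lines can be compared across platforms.
num_libs <- c(
  R              = R.version.string,
  BLAS           = extSoftVersion()[["BLAS"]],  # path of the BLAS in use (may be "")
  LAPACK         = La_library(),                # path of the LAPACK in use (may be "")
  LAPACK_version = La_version()
)
print(num_libs)
```

Whether a difference shown here explains the differing centers is a separate question; the snippet only makes the difference visible.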
> Cheers,
> Bert
Additionally, careful reading of the kmeans help page (*before* any post ..)
would have shown
Note:
The clusters are numbered in the returned object, but they are a
_set_ and no ordering is implied. (Their apparent ordering may
differ by platform.)
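
A minimal sketch of what that note means in practice, on made-up data (not the poster's): the same clustering is reached from the same starting centers given in a different order, so only the numbering differs. The helper `canonical_centers()` is hypothetical and simply sorts the rows of the centers matrix before comparing.

```r
# Made-up, well-separated data: three groups of 20 points in 2 dimensions.
set.seed(1)
x <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.1), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.1), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.1), ncol = 2)
)

# Hypothetical helper: sort the rows of a centers matrix by column 1,
# then column 2, ..., and drop the (arbitrary) cluster numbers.
# Note: near-ties in the leading column could still be ordered differently.
canonical_centers <- function(centers) {
  cols <- lapply(seq_len(ncol(centers)), function(j) centers[, j])
  out  <- centers[do.call(order, cols), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Same three starting points, supplied in a different order.
fit_a <- kmeans(x, centers = x[c(1, 21, 41), ])
fit_b <- kmeans(x, centers = x[c(41, 1, 21), ])

# Raw comparison: the rows are numbered differently.
same_as_is <- isTRUE(all.equal(fit_a$centers, fit_b$centers,
                               check.attributes = FALSE))

# Comparison as a set of centers: row order no longer matters.
same_up_to_order <- isTRUE(all.equal(canonical_centers(fit_a$centers),
                                     canonical_centers(fit_b$centers)))

print(c(same_as_is = same_as_is, same_up_to_order = same_up_to_order))
```

Comparing `$centers` (or `$cluster`) directly therefore says little; the centers have to be compared as a set.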
Martin
> On Tue, Sep 3, 2024 at 10:24 PM Iago Giné Vázquez <iago.gine using sjd.es> wrote:
>> Hi all,
>>
>> I built a dataset by processing the same data in the same way on Windows
>> and on Linux.
>>
>> The output of Windows processing is:
>> https://gitlab.com/iagogv/repdata/-/raw/main/exdata.csv?ref_type=heads
>> The output of Linux processing is:
>> https://gitlab.com/iagogv/repdata/-/raw/main/exdata2.csv?ref_type=heads
>>
>> exdata <- as.matrix(read.csv(
>>   "https://gitlab.com/iagogv/repdata/-/raw/main/exdata.csv?ref_type=heads",
>>   header = FALSE))
>> exdata2 <- as.matrix(read.csv(
>>   "https://gitlab.com/iagogv/repdata/-/raw/main/exdata2.csv?ref_type=heads",
>>   header = FALSE))
>>
>> They are not identical (`identical(exdata,exdata2)` is FALSE), but they
>> are essentially equal (`all.equal(exdata,exdata2)` is TRUE). If I run
>>
>> set.seed(20232260)
>> exkmns <- kmeans(exdata, centers = 7, iter.max = 2000, nstart = 750)
>>
>> I get
>>
>> exkmns$centers
>> V1 V2 V3 V4 V5 V6
>> 1 -0.4910731 -0.2662055 0.57928758 0.14267293 -0.03013791 0.106472717
>> 2 0.5301237 0.2815620 -0.23898532 1.00979412 -0.26123328 0.068099931
>> 3 0.2255298 -0.5165964 -0.02498471 -0.20438275 -0.41224195 -0.107538855
>> 4 -0.2616257 0.5680582 0.55387437 -0.09562789 -0.01706577 -0.028248679
>> 5 -0.4820078 -0.1667370 -0.46533618 -0.05271446 0.05477352 0.005236259
>> 6 0.6455994 -0.1396674 0.05988547 -0.15557399 0.62766365 0.031051986
>> 7 0.1072127 0.5538876 -0.33117098 -0.43209203 -0.18646403 -0.081273130
>>
>> both on Windows (1) and on Linux (2, 3), up to row order. If I run on
>> Linux on my computer (2)
>>
>> set.seed(20232260)
>> exkmns2 <- kmeans(exdata2, centers = 7, iter.max = 2000, nstart = 750)
>>
>> then, I get
>>
>> exkmns2$centers
>> V1 V2 V3 V4 V5 V6
>> 1 0.64559941 -0.1396674 0.05988547 -0.15557399 0.62766365 0.03105199
>> 2 -0.26162573 0.5680582 0.55387437 -0.09562789 -0.01706577 -0.02824868
>> 3 0.53012369 0.2815620 -0.23898532 1.00979412 -0.26123328 0.06809993
>> 4 0.03409765 0.3492520 -0.36910409 -0.40721418 -0.21482793 0.03073180
>> 5 -0.58527394 -0.1790337 -0.46778956 0.03573883 0.15473589 -0.07980379
>> 6 -0.49107314 -0.2662055 0.57928758 0.14267293 -0.03013791 0.10647272
>> 7 0.22552984 -0.5165964 -0.02498471 -0.20438275 -0.41224195 -0.10753886
>>
>> so all rows are essentially equal except for rows 5 and 7 of the first
>> output (5 and 4 of the second output). In a bit more detail:
>>
>> * Row 0.2255298 -0.5165964 -0.02498471 -0.20438275 -0.41224195 -0.107538855
>>   belongs to exdata (and exdata2) and is a center in both outputs.
>> * Row 0.1072127 0.5538876 -0.33117098 -0.43209203 -0.18646403 -0.081273130
>>   belongs to the dataset and is a center only in the exdata output.
>> * Row -0.4820078 -0.1667370 -0.46533618 -0.05271446 0.05477352 0.005236259
>>   does not belong to the dataset and is a center only in the exdata output.
>> * Row -0.58527394 -0.1790337 -0.46778956 0.03573883 0.15473589 -0.07980379
>>   belongs to the dataset and is a center only for exdata2 on Linux on my
>>   computer.
>> * Row 0.03409765 0.3492520 -0.36910409 -0.40721418 -0.21482793 0.03073180
>>   does not belong to the dataset and is a center only for exdata2 on Linux
>>   on my computer.
>> * The other 4 rows (1, 2, 4 and 6 of the first output) do not belong to the
>>   dataset and are common centers.
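
A row-by-row comparison like the one above could be automated roughly as follows. This is a sketch only: `match_rows()` is a hypothetical helper, the matrices `a` and `b` are made up, and the tolerance is an assumption (printed centers are rounded, so an exact test would be too strict).

```r
# Hypothetical helper: for each row of `a`, the index of the row of `b`
# that agrees with it within `tol` (NA if no row of `b` is that close).
match_rows <- function(a, b, tol = 1e-6) {
  apply(a, 1, function(r) {
    d <- sqrt(colSums((t(b) - r)^2))  # Euclidean distance from r to each row of b
    j <- which.min(d)
    if (d[j] <= tol) j else NA_integer_
  })
}

# Made-up example: rows 1 and 3 of `a` reappear in `b`, row 2 does not.
a <- rbind(c(0, 0), c(1, 1), c(2, 2))
b <- rbind(c(2, 2), c(0, 0), c(5, 5))
print(match_rows(a, b))
# 2 NA 1
```

With the objects from the post, `match_rows(exkmns$centers, exkmns2$centers)` would pair up the common centers, and `match_rows(exkmns$centers, exdata)` would show which centers coincide with a data row.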
>>
>> Furthermore, if I run
>>
>> set.seed(20232260)
>> exkmns <- kmeans(exdata, centers = 7, iter.max = 2000, nstart = 750)
>>
>> in posit.cloud (3), I get the same result as above. However, if I run
>> (either in posit.cloud or on Windows)
>>
>> set.seed(20232260)
>> exkmns2 <- kmeans(exdata2, centers = 7, iter.max = 2000, nstart = 750)
>>
>> then I get
>>
>>
>> exkmns2$centers
>> V1 V2 V3 V4 V5 V6
>> 1 0.6426035 -0.1449498 0.05843435 -0.1527968 0.62943077 0.02984948
>> 2 -0.4092382 -0.3740695 0.69597037 0.1956896 -0.05026200 -0.01453132
>> 3 0.1072127 0.5538876 -0.33117098 -0.4320920 -0.18646403 -0.08127313
>> 4 0.2255298 -0.5165964 -0.02498471 -0.2043827 -0.41224195 -0.10753886
>> 5 0.5301237 0.2815620 -0.23898532 1.0097941 -0.26123328 0.06809993
>> 6 -0.5223387 -0.1484517 -0.38982567 -0.0341488 0.06446446 0.03622056
>> 7 -0.2701703 0.5263218 0.52942311 -0.1112202 -0.03460591 0.03577287
>>
>> So only its rows 4 and 5 are centers common to both previous outputs,
>> and row 3 is in common with the exdata centers.
>>
>> Does all this make any sense?
>>
>> Thanks!
>>
>> Iago
>>
>> (1)
>> R version 4.4.1 (2024-06-14 ucrt)
>> Platform: x86_64-w64-mingw32/x64
>> Running under: Windows 10 x64 (build 19045)
>>
>> Matrix products: default
>>
>> (2)
>> R version 4.4.1 (2024-06-14)
>> Platform: x86_64-pc-linux-gnu
>> Running under: Debian GNU/Linux 12 (bookworm)
>>
>> Matrix products: default
>> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
>> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so;
>> LAPACK version 3.11.0
>>
>> (3)
>> R version 4.4.1 (2024-06-14)
>> Platform: x86_64-pc-linux-gnu
>> Running under: Ubuntu 20.04.6 LTS
>>
>> Matrix products: default
>> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so;
>> LAPACK version 3.9.0
>>