[R] fixed set.seed + kmeans output disagree on distinct platforms

Iago Giné Vázquez <iago.gine using sjd.es>
Wed Sep 4 11:18:55 CEST 2024


Thanks both Bert and Martin,

However, exkmns2$centers is the same on posit.cloud (LAPACK version 3.9.0) and on Windows (LAPACK 3.12.0), but differs on my Linux setup (LAPACK version 3.11.0). (I don't know which BLAS version R uses on Windows.) It is a bit strange...

Iago
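
For comparing setups like the ones mentioned above, the linear algebra libraries that an R session actually uses can be queried directly. This is a minimal sketch using base R only; `La_library()` and the "BLAS" entry of `extSoftVersion()` assume R >= 3.4.0, and the variable name `la_info` is purely illustrative.

```r
# Collect the BLAS/LAPACK details of the running R session.
la_info <- list(
  # LAPACK version string, for example "3.11.0"
  lapack_version = La_version(),
  # Path to the LAPACK shared library ("" when it cannot be determined)
  lapack_library = La_library(),
  # Path to the BLAS shared library ("" when it cannot be determined)
  blas_library   = unname(extSoftVersion()["BLAS"])
)
str(la_info)
```

`sessionInfo()` prints the same information where it is available, so this is mainly useful for logging the values next to the results of a run.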

________________________________
From: Martin Maechler <maechler using stat.math.ethz.ch>
Sent: Wednesday, 4 September 2024 10:41
To: Bert Gunter <bgunter.4567 using gmail.com>
Cc: Iago Giné Vázquez <iago.gine using sjd.es>; r-help using r-project.org <r-help using r-project.org>
Subject: Re: [R] fixed set.seed + kmeans output disagree on distinct platforms

>>>>> Bert Gunter
>>>>>     on Tue, 3 Sep 2024 23:32:25 -0700 writes:

    > I have no clue, but I did note that you are using different versions of
    > BLAS/LAPACK on the different platforms. Could that be (part) of the issue?

Good catch!  My gut feeling would say "yes!" that is almost surely part of
the issue.

    > Cheers,
    > Bert

Additionally, careful reading of the help page (*before* any post ..)
would have shown

   Note:

        The clusters are numbered in the returned object, but they are a
        _set_ and no ordering is implied.  (Their apparent ordering may
        differ by platform.)
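
Because the cluster numbers carry no meaning, two sets of centers are most easily compared after putting their rows into a canonical order. The following is a minimal sketch, assuming both fits use the same number of clusters; the helper names `sort_centers` and `same_centers` are hypothetical, and the tolerance is an arbitrary choice.

```r
# Order the rows of a centers matrix lexicographically by column
# (first column, ties broken by the second column, and so on).
sort_centers <- function(centers) {
  ord <- do.call(order, unname(as.data.frame(centers)))
  unname(centers[ord, , drop = FALSE])
}

# TRUE if two centers matrices agree up to row order and numerical tolerance.
same_centers <- function(a, b, tolerance = 1e-6) {
  isTRUE(all.equal(sort_centers(a), sort_centers(b), tolerance = tolerance))
}

# Toy example: the same three centers, listed in a different order.
a <- rbind(c(0.5, 1), c(-1, 2), c(3, 0))
b <- a[c(3, 1, 2), ]
same_centers(a, b)
```

With the fits from this thread the call would be `same_centers(exkmns$centers, exkmns2$centers)`. Sorting by coordinates can misalign rows when two centers are nearly tied in the first column, so this is a quick check rather than a rigorous matching.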


Martin



    > On Tue, Sep 3, 2024 at 10:24 PM Iago Giné Vázquez <iago.gine using sjd.es> wrote:

    >> Hi all,
    >>
    >> I built a dataset by processing the same data in the same way on Windows
    >> and on Linux.
    >>
    >> The output of Windows processing is:
    >> https://gitlab.com/iagogv/repdata/-/raw/main/exdata.csv?ref_type=heads
    >> The output of Linux processing is:
    >> https://gitlab.com/iagogv/repdata/-/raw/main/exdata2.csv?ref_type=heads
    >>
    >> exdata <- as.matrix(read.csv(
    >>   "https://gitlab.com/iagogv/repdata/-/raw/main/exdata.csv?ref_type=heads",
    >>   header = FALSE))
    >> exdata2 <- as.matrix(read.csv(
    >>   "https://gitlab.com/iagogv/repdata/-/raw/main/exdata2.csv?ref_type=heads",
    >>   header = FALSE))
    >>
    >> They are not identical (`identical(exdata,exdata2)` is FALSE), but they
    >> are essentially equal (`all.equal(exdata,exdata2)` is TRUE). If I run
    >>
    >> set.seed(20232260)
    >> exkmns <- kmeans(exdata, centers = 7, iter.max = 2000, nstart = 750)
    >>
    >> I get
    >>
    >> exkmns$centers
    >> V1         V2          V3          V4          V5           V6
    >> 1 -0.4910731 -0.2662055  0.57928758  0.14267293 -0.03013791  0.106472717
    >> 2  0.5301237  0.2815620 -0.23898532  1.00979412 -0.26123328  0.068099931
    >> 3  0.2255298 -0.5165964 -0.02498471 -0.20438275 -0.41224195 -0.107538855
    >> 4 -0.2616257  0.5680582  0.55387437 -0.09562789 -0.01706577 -0.028248679
    >> 5 -0.4820078 -0.1667370 -0.46533618 -0.05271446  0.05477352  0.005236259
    >> 6  0.6455994 -0.1396674  0.05988547 -0.15557399  0.62766365  0.031051986
    >> 7  0.1072127  0.5538876 -0.33117098 -0.43209203 -0.18646403 -0.081273130
    >>
    >> both on Windows (1) and on Linux (2, 3), up to row order. If I run on
    >> Linux on my computer (2)
    >>
    >> set.seed(20232260)
    >> exkmns2 <- kmeans(exdata2, centers = 7, iter.max = 2000, nstart = 750)
    >>
    >> then, I get
    >>
    >> exkmns2$centers
    >> V1         V2          V3          V4          V5          V6
    >> 1  0.64559941 -0.1396674  0.05988547 -0.15557399  0.62766365  0.03105199
    >> 2 -0.26162573  0.5680582  0.55387437 -0.09562789 -0.01706577 -0.02824868
    >> 3  0.53012369  0.2815620 -0.23898532  1.00979412 -0.26123328  0.06809993
    >> 4  0.03409765  0.3492520 -0.36910409 -0.40721418 -0.21482793  0.03073180
    >> 5 -0.58527394 -0.1790337 -0.46778956  0.03573883  0.15473589 -0.07980379
    >> 6 -0.49107314 -0.2662055  0.57928758  0.14267293 -0.03013791  0.10647272
    >> 7  0.22552984 -0.5165964 -0.02498471 -0.20438275 -0.41224195 -0.10753886
    >>
    >> so all rows are essentially equal except for rows 5 and 7 of the first
    >> dataset's centers (rows 5 and 4 of the second dataset's). In a bit more
    >> detail:
    >>
    >> * Row 0.2255298 -0.5165964 -0.02498471 -0.20438275 -0.41224195 -0.107538855
    >>   belongs to exdata (and exdata2) and is a center in both outputs.
    >> * Row 0.1072127  0.5538876 -0.33117098 -0.43209203 -0.18646403 -0.081273130
    >>   belongs to the dataset and is a center only in the exdata output.
    >> * Row -0.4820078 -0.1667370 -0.46533618 -0.05271446  0.05477352  0.005236259
    >>   does not belong to the dataset and is a center only in the exdata output.
    >> * Row -0.58527394 -0.1790337 -0.46778956  0.03573883  0.15473589 -0.07980379
    >>   belongs to the dataset and is a center only for exdata2 on Linux on my
    >>   computer.
    >> * Row 0.03409765  0.3492520 -0.36910409 -0.40721418 -0.21482793  0.03073180
    >>   does not belong to the dataset and is a center only for exdata2 on Linux
    >>   on my computer.
    >> * The other 4 rows (1, 2, 4 and 6 of the first output) do not belong to the
    >>   dataset and are common centers.
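
A center that coincides with a data row most plausibly belongs to a cluster that contains only that one observation, since a k-means center is the mean of its members. The following is a minimal sketch for checking this, run on a small made-up matrix rather than on the data from this thread; the helper name `centers_in_data` and the tolerance are hypothetical.

```r
# For each center, return the index of a data row equal to it
# (within `tolerance`), or NA if there is no such row.
centers_in_data <- function(centers, data, tolerance = 1e-6) {
  apply(centers, 1, function(ctr) {
    # Largest absolute coordinate difference between `ctr` and each data row.
    dev <- apply(abs(sweep(data, 2, ctr)), 1, max)
    hit <- which(dev < tolerance)
    if (length(hit) > 0) hit[1] else NA_integer_
  })
}

# Toy data: four points on a unit square plus one far-away outlier (row 5).
x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1), c(10, 10))
set.seed(1)
fit <- kmeans(x, centers = 2, nstart = 5)

matched <- centers_in_data(fit$centers, x)
matched    # a row index for the center that is a data point, NA otherwise
fit$size   # cluster sizes, to compare against the matched centers
```

Applied to the thread's objects, the calls would be `centers_in_data(exkmns$centers, exdata)` and `exkmns$size`.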
    >>
    >> Furthermore, if I run
    >>
    >> set.seed(20232260)
    >> exkmns <- kmeans(exdata, centers = 7, iter.max = 2000, nstart = 750)
    >>
    >> on posit.cloud (3), I get the same result as above. However, if I run
    >> (on either posit.cloud or Windows)
    >>
    >> set.seed(20232260)
    >> exkmns2 <- kmeans(exdata2, centers = 7, iter.max = 2000, nstart = 750)
    >>
    >> then I get
    >>
    >>
    >> exkmns2$centers
    >> V1         V2          V3         V4          V5          V6
    >> 1  0.6426035 -0.1449498  0.05843435 -0.1527968  0.62943077  0.02984948
    >> 2 -0.4092382 -0.3740695  0.69597037  0.1956896 -0.05026200 -0.01453132
    >> 3  0.1072127  0.5538876 -0.33117098 -0.4320920 -0.18646403 -0.08127313
    >> 4  0.2255298 -0.5165964 -0.02498471 -0.2043827 -0.41224195 -0.10753886
    >> 5  0.5301237  0.2815620 -0.23898532  1.0097941 -0.26123328  0.06809993
    >> 6 -0.5223387 -0.1484517 -0.38982567 -0.0341488  0.06446446  0.03622056
    >> 7 -0.2701703  0.5263218  0.52942311 -0.1112202 -0.03460591  0.03577287
    >>
    >> So only its rows 4 and 5 are centers common to both previous outputs,
    >> and row 3 is shared with the exdata centers.
    >>
    >> Does all this make any sense?
    >>
    >> Thanks!
    >>
    >> Iago
    >>
    >> (1)
    >> R version 4.4.1 (2024-06-14 ucrt)
    >> Platform: x86_64-w64-mingw32/x64
    >> Running under: Windows 10 x64 (build 19045)
    >>
    >> Matrix products: default
    >>
    >> (2)
    >> R version 4.4.1 (2024-06-14)
    >> Platform: x86_64-pc-linux-gnu
    >> Running under: Debian GNU/Linux 12 (bookworm)
    >>
    >> Matrix products: default
    >> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
    >> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.21.so; LAPACK version 3.11.0
    >>
    >> (3)
    >> R version 4.4.1 (2024-06-14)
    >> Platform: x86_64-pc-linux-gnu
    >> Running under: Ubuntu 20.04.6 LTS
    >>
    >> Matrix products: default
    >> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so; LAPACK version 3.9.0
    >>
