[R] What is mclust up to? Different clusters found if x and y interchanged

Tue Apr 20 19:36:12 CEST 2010

Thanks Prof. Fraley.  This is very helpful.  I had looked at the way mclust
gets its start and that looked like a possible source of the differences,
but had not had time to get up to speed and figure it out.  I was thinking
it was more of a set.seed sort of thing.  Now I know more.  Thanks so much.
Bryan

On 4/20/10 11:41 AM, "Chris Fraley" <fraley at u.washington.edu> wrote:

> 
> 
> This is a really interesting case, but it is probably not that unusual.
> 
> mod23 <- Mclust(tur[,2:3])
> mod32 <- Mclust(tur[,3:2])
> 
> mod23[c("modelName","G")]
> mod32[c("modelName","G")]
> 
> The 2:3 case is choosing the VVV model (more parameters per cluster) with 3
> clusters.
> The 3:2 case is choosing the EEV model (more parsimonious), but with 7
> clusters.
> 
> If you look at the BIC curves, you can see that mclust is indeed choosing the
> model
> corresponding to the maximum BIC in each case.
> 
> plot( mclustBIC( tur[,2:3], modelNames = c("EEV", "VVV"))
> plot( mclustBIC( tur[,3:2], modelNames = c("EEV", "VVV"))
> 
> The default initialization for Mclust is via hierarchical clustering, and
> that's
> why the results differ
> 
> hc23 <- hcVVV( tur[,2:3]) # default initialization for mod23
> hc32 <- hcVVV( tur[,3:2]) # default initialization for mod32
> 
> all.equal(hc32,hc23) # they differ!
> 
> The hierarchical clustering merges two groups at each stage.
> If there is more than one choice at any particular stage for
> which the merge criteria are equal or within roundoff error,
> a different pair could be chosen for merging if columns (or rows)
> are permuted. This will affect later merges, and could ultimately
> affect the mclust results.
> 
> If you initialize both col permutations in the same way, you'll get
> the same results.
> 
> mod23mod <- Mclust(tur[,2:3], initialization = list( hcPairs = hc32)) # uses
> mod32 initialization
> mod32mod <- Mclust(tur[,3:2], initialization = list( hcPairs = hc23)) # uses
> mod23 initialization
> 
> mod23mod[c("modelName","G")] # same as mod32
> mod32mod[c("modelName","G")] # same as mod23
> 
> Hope this helps,
> 
> Chris Fraley
> 
> 
> On Tue, 20 Apr 2010, Bryan Hanson wrote:
> 
>> Prof. Fraley, I wonder if you would have a couple of moments to respond to
>> this question I put to the R help list.  Thanks in advance for your time.
>> 
>> Bryan
>> *************
>> Bryan Hanson
>> Acting Chair
>> Professor of Chemistry & Biochemistry
>> DePauw University, Greencastle IN USA
>> 
>> 
>> ------ Forwarded Message
>> From: Bryan Hanson <hanson at depauw.edu>
>> Date: Mon, 19 Apr 2010 19:31:03 -0400
>> To: <hanson at gapps.depauw.edu>, R Help <r-help at stat.math.ethz.ch>
>> Subject: [R] What is mclust up to? Different clusters found if x and y
>> interchanged
>> 
>> Hello All...
>> 
>> I gave a task to my students that involved using mclust to look for clusters
>> in some bivariate data of isotopes vs various mining locations.  They
>> discovered something I didn¹t expect; the data (called tur) is appended
>> below.
>> 
>> p <- qplot(x = dD, y = dCu65, data = tur, color = mine)
>> print(p) # simple bivariate plot of the data; looks fine
>> 
>> mod1 <- Mclust(tur[,2:3])
>> mod1$G
>> mod2 <- Mclust(tur[,3:2])
>> mod2$G
>> 
>> One can use coordProj to see the clusters found, but the basic result is
>> that mclust found 3 clusters in mod1, but if you interchange the x and y
>> columns it finds 7 clusters (mod2).  Since this is bivariate data, I find
>> this result pretty strange.  The only thing I can think of is that it has
>> something to do with how the algorithm is seeded, but that isn't too
>> convincing.  Since I couldn't believe my eyes, I made up some bivariate data
>> with known clusters (not given here) and interchanged the x and y columns
>> and got the same clusters.  I'm at a loss.  Hopefully someone can set my
>> understanding on the right path.
>> 
>> Thanks for any insight!  Bryan
>> 
>>> sessionInfo()
>> R version 2.10.1 (2009-12-14)
>> x86_64-apple-darwin9.8.0
>> 
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>> 
>> attached base packages:
>> [1] datasets  tools     grid      graphics  grDevices utils     stats
>> methods   base
>> 
>> other attached packages:
>> [1] faraway_1.0.4      GGally_0.1         xtable_1.5-6       mvbutils_2.5.1
>> ggplot2_0.8.7
>> [6] digest_0.4.2       reshape_0.8.3      proto_0.3-8        ChemoSpec_1.43
>> R.utils_1.4.0
>> [11] R.oo_1.7.1         R.methodsS3_1.2.0  rgl_0.91           lattice_0.18-3
>> mvoutlier_1.4
>> [16] plyr_0.1.9         RColorBrewer_1.0-2 chemometrics_0.8   som_0.3-5
>> robustbase_0.5-0-1
>> [21] rpart_3.1-46       pls_2.1-0          pcaPP_1.8          mvtnorm_0.9-9
>> nnet_7.3-1
>> [26] mclust_3.4.4       MASS_7.3-5         lars_0.9-7         e1071_1.5-23
>> class_7.3-2
>> 
>> And the data:
>> 
>>> dput(tur)
>> structure(list(mine = structure(c(13L, 13L, 13L, 13L, 13L, 13L,
>> 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
>> 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
>> 6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
>> 1L, 1L, 1L, 1L, 1L, 1L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L,
>> 9L, 9L, 9L, 9L, 12L, 12L, 12L, 12L, 12L, 8L, 8L, 8L, 8L, 8L,
>> 8L, 8L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 10L, 10L, 10L, 10L, 10L,
>> 10L, 10L, 10L, 10L, 10L, 4L, 4L, 4L, 4L, 4L, 11L, 11L, 11L, 11L,
>> 11L, 11L, 11L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L,
>> 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("Cas",
>> "CH", "CR", "FM", "GU", "KG", "KM", "LV", "MC", "MZ", "N8", "OR",
>> "SB", "TF"), class = "factor"), dCu65 = c(6.1, 6.5, 5.7, 6.4,
>> 6.1, 5.9, 6.5, 5.7, 6, 6.4, 6.8, 6.9, 6.7, 6.7, 5.8, 5.3, 6.7,
>> 5.7, 6.7, 5.6, 6.4, 6.4, 6.5, 7.6, 6.7, 6.4, 6.4, 7, 7, 5.9,
>> 5.3, 6.1, 11.8, 12.1, 12.4, 1, 1.8, 1.4, 1.2, 1.2, 1.2, 1.1,
>> 1.2, 1.8, 1, 0.8, 0.7, 1.2, 1.5, 1, 0.2, 1.4, 0.3, 1.1, 0, -0.5,
>> -1.4, -0.2, 1.1, 0.9, 0.1, 1.4, -1.2, -0.1, -0.4, -0.8, 3, 3.3,
>> 3.9, 3.9, 3.7, 9.5, 9.6, 9.6, 8.3, 10.3, 8.4, 9.4, 0.8, 0.9,
>> 1.3, 0.2, 2.2, 0.7, 1.8, 4.5, 4.5, 4.6, 4.5, 4.6, 3.9, 4.2, 4.9,
>> 4.9, 4.5, 16.7, 17.4, 17.5, 17.5, 17.4, 1.9, 2.4, 1.7, 1.2, 1.5,
>> 1.1, 1.8, 3.1, 3.1, 3, 4.1, 4.2, 2.7, 4.5, 0.4, 1.2, 3, 3.5,
>> 1.2, 0.8, 0.9, 2, -0.8, 5.8, 0.6, 1, 0.9, 4.4, 2.3, -0.1), dD = c(-66L,
>> -81L, -81L, -71L, -68L, -79L, -73L, -77L, -74L, -71L, -73L, -80L,
>> -76L, -80L, -74L, -77L, -74L, -77L, -74L, -82L, -79L, -72L, -74L,
>> -71L, -76L, -72L, -76L, -73L, -75L, -78L, -81L, -77L, -77L, -73L,
>> -79L, -104L, -98L, -98L, -109L, -106L, -99L, -107L, -100L, -98L,
>> -95L, -95L, -95L, -97L, -98L, -121L, -119L, -118L, -121L, -121L,
>> -106L, -116L, -122L, -112L, -125L, -109L, -114L, -118L, -102L,
>> -109L, -101L, -101L, -74L, -78L, -63L, -67L, -75L, -65L, -60L,
>> -62L, -67L, -59L, -62L, -68L, -61L, -58L, -69L, -57L, -59L, -67L,
>> -73L, -81L, -90L, -87L, -82L, -80L, -80L, -81L, -83L, -90L, -89L,
>> -107L, -122L, -128L, -113L, -124L, -93L, -89L, -92L, -85L, -79L,
>> -84L, -83L, -148L, -142L, -127L, -130L, -123L, -131L, -85L, -87L,
>> -90L, -81L, -65L, -81L, -119L, -121L, -94L, -90L, -117L, -95L,
>> -69L, -122L, -103L, -85L, -106L)), .Names = c("mine", "dCu65",
>> "dD"), class = "data.frame", row.names = c(NA, -130L))
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>> 
>> ------ End of Forwarded Message
>> 
>> 
>> 
>>