[R] agnes clustering and NAs

Fri Jan 28 12:34:26 CET 2011

>>>>> Gavin Simpson <gavin.simpson at ucl.ac.uk>
>>>>>     on Fri, 28 Jan 2011 09:23:05 +0000 writes:

    > On Fri, 2011-01-28 at 10:00 +1100, Dario Strbenac wrote:
    >> Hello,
    >> 
    >> Yes, that's right, it is a values matrix. Not a dissimilarity matrix.
    >> 
    >> i.e.
    >> 
    >> > str(iMatrix)
    >> num [1:23371, 1:56] -0.407 0.198 NA -0.133 NA ...
    >> - attr(*, "dimnames")=List of 2
    >> ..$ : NULL
    >> ..$ : chr [1:56] "-8100" "-7900" "-7700" "-7500" ...

Ok, so in the end you really want to draw a dendrogram for 23'371
observational units?

I would not use a hierarchical clustering method for so
many units, but rather clara() or maybe pam(), or else model-based
or other methods.
But that's not the issue here; see further down ...

BTW:  The object 'iMatrix' you provided for download has only 50
      columns, not 56...
    >> 
    >> For the snippet of checking for NAs, I get all TRUEs, so I have at least one NA in each column.

    GS> Sorry, my bad. Try this:

    GS> apply(iMatrix, 1, function(x) all(is.na(x)))

    GS> will check that you have no fully `NA` rows.

    GS> Also look at str(iMatrix) for potential problems.

    GS> Finally, try:

    GS> out <- dist(iMatrix)
    GS> any(is.na(out))

    GS> should repeat what agnes is doing to compute the
    GS> dissimilarity matrix.  If that returns TRUE, go and find
    GS> which samples are giving NA dissimilarity and why.

    GS> The issue is not NA in the input data, but that your
    GS> input data is leading to NA in the computed
    GS> dissimilarities. This might be due to NA's in your input
    GS> data, where a pair of samples has no common set of data
    GS> for example.

Yes, that's spot on, thank you Gavin.

This is indeed true:
agnes() *does* allow for NA's (in the data matrix), but if the
pattern of NA's is such that the dissimilarity between two
observations becomes undefined, e.g. because they have no
common non-missing variables, then ``that's too much''.
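
To illustrate the point, here is a minimal sketch on a made-up 3 x 4
matrix (the object names x, d, m and na_pairs are only for illustration,
they are not from the original data). Rows "a" and "b" share no
non-missing column, so dist() has nothing to compute their distance from
and returns NA for that pair:

```r
# Toy data: rows "a" and "b" have no column in which both are observed
x <- rbind(a = c(1, 2, NA, NA),
           b = c(NA, NA, 3, 4),
           c = c(1, 2, 3, 4))

d <- dist(x)          # Euclidean distances, NAs excluded pairwise
any(is.na(d))         # TRUE: at least one dissimilarity is undefined

# Locate the offending pairs (upper triangle only, to list each pair once)
m <- as.matrix(d)
na_pairs <- which(is.na(m) & upper.tri(m), arr.ind = TRUE)
na_pairs              # row/col indices of pairs with NA dissimilarity
```

The same check on the real matrix should show which observations have no
overlap with each other.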

In general, I'd recommend using
  dm <- daisy(....,...) 
and trying methods that cope better with NAs, e.g. Gower's metric,
until dm has {nearly} no NAs,
then figuring out some imputation to replace all remaining NA's in dm
by "reasonable values",
and then clustering with the resulting dissimilarity "matrix" dm.
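
A rough sketch of that workflow on toy data, under some loud assumptions:
the 5 x 4 matrix is made up, Gower's metric is just one candidate, and
replacing the remaining NAs by the largest observed dissimilarity is only
one possible (and fairly crude) choice of "reasonable value", not a
recommendation:

```r
library(cluster)

# Toy data (hypothetical): rows 1 and 2 have no common non-missing column
x <- rbind(c(1,   2,   NA,  NA),
           c(NA,  NA,  3,   4),
           c(1.5, 2.5, 3.5, 4.5),
           c(0,   1,   2,   3),
           c(2,   NA,  5,   1))

dm <- daisy(x, metric = "gower")
sum(is.na(dm))                      # how many dissimilarities are still NA?

# Assumption: impute remaining NAs by the maximum observed dissimilarity
m <- as.matrix(dm)
m[is.na(m)] <- max(m, na.rm = TRUE)
dm2 <- as.dist(m)

ag <- agnes(dm2)                    # clustering on the completed dissimilarities
ag$order
```

Whether such an imputation is sensible depends entirely on the data; with
many NA dissimilarities the resulting tree should be read with caution.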

HOWEVER, in your case, dm would correspond to a
 23371 x 23371 dissimilarity matrix.
Stored as a full double precision matrix (on a 64-bit platform),
that's an object of about 4.4 GBytes, not very convenient to work
with.
As a dissimilarity object it will only be about half that size,
but that's still ``a bit large''..
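
The arithmetic behind those sizes, as a quick sketch (8 bytes per double,
1 GByte taken as 1e9 bytes; the variable names are just for illustration):

```r
n <- 23371                              # number of observations

full_gb <- n^2 * 8 / 1e9                # full n x n double matrix
dist_gb <- n * (n - 1) / 2 * 8 / 1e9    # lower triangle only, as in "dist"

round(full_gb, 2)                       # 4.37
round(dist_gb, 2)                       # 2.18
```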
As I said above, for such data, I would never do fully
hierarchical clustering,
but rather something else.
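
As one example of "something else", a minimal clara() sketch on simulated
data. Everything here is an assumption for illustration: the data are
random, the sizes are arbitrary, and k = 4 is not a suggestion for the
real matrix:

```r
library(cluster)

set.seed(42)
x <- matrix(rnorm(5000 * 10), ncol = 10)   # simulated: 5000 obs., 10 variables
x[cbind(1:100, rep(1:10, 10))] <- NA       # one NA in each of the first 100 rows

# clara() works on subsamples, so it never builds the full n x n dissimilarity
cl <- clara(x, k = 4, samples = 20)        # k = 4 is an arbitrary choice
table(cl$clustering)                       # cluster sizes
```

Note that clara() can still fail if a subsample contains two observations
without any common non-missing variable, so the NA pattern matters here
as well.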

Martin Maechler, ETH Zurich

    GS> HTH
    GS> G

    >> The part of the agnes documentation I was referring to is :
    >> 
    >> "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric.  Missing values (NAs) are allowed."
    >> 
    >> So, I'm under the impression it handles NAs on its own ?
    >> 
    >> - Dario.
    >> 
    >> ---- Original message ----
    >> >Date: Thu, 27 Jan 2011 12:53:27 +0000
    >> >From: Gavin Simpson <gavin.simpson at ucl.ac.uk>  
    >> >Subject: Re: [R] agnes clustering and NAs  
    >> >To: Uwe Ligges <ligges at statistik.tu-dortmund.de>
    >> >Cc: D.Strbenac at garvan.org.au, r-help at r-project.org
    >> >
    >> >On Thu, 2011-01-27 at 10:45 +0100, Uwe Ligges wrote:
    >> >> 
    >> >> On 27.01.2011 05:00, Dario Strbenac wrote:
    >> >> > Hello,
    >> >> >
    >> >> > In the documentation for agnes in the package 'cluster', it says that NAs are allowed, and sure enough it works for a small example like :
    >> >> >
    >> >> >> m<- matrix(c(
    >> >> > 1, 1, 1, 2,
    >> >> > 1, NA, 1, 1,
    >> >> > 1, 2, 2, 2), nrow = 3, byrow = TRUE)
    >> >> >> agnes(m)
    >> >> > Call:    agnes(x = m)
    >> >> > Agglomerative coefficient:  0.1614168
    >> >> > Order of objects:
    >> >> > [1] 1 2 3
    >> >> > Height (summary):
    >> >> >     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    >> >> >    1.155   1.247   1.339   1.339   1.431   1.524
    >> >> >
    >> >> > Available components:
    >> >> > [1] "order"  "height" "ac"     "merge"  "diss"   "call"   "method" "data"
    >> >> >
    >> >> > But I have a large matrix (23371 rows, 50 columns) with some NAs in it and it runs for about a minute, then gives an error :
    >> >> >
    >> >> >> agnes(iMatrix)
    >> >> > Error in agnes(iMatrix) :
    >> >> >    No clustering performed, NA-values in the dissimilarity matrix.
    >> >> >
    >> >> > I've also tried getting rid of rows with all NAs in them, and it still gave me the same error. Is this a bug in agnes() ? It doesn't seem to fulfil the claim made by its documentation.
    >> >> 
    >> >> 
    >> >> I haven't looked in the file, but you need to get rid of all NA, or in 
    >> >> other words, all rows that contain *any* NA values.
    >> >
    >> >If one believes the documentation, then that only applies to the case
    >> >where `x` is a dissimilarity matrix. `NA`s are allowed if x is the raw
    >> >data matrix or data frame.
    >> >
    >> >The only way the OP could have gotten that error with the call shown is
    >> >if iMatrix were not a dissimilarity matrix inheriting from class "dist",
    >> >so `NA`s should be allowed.
    >> >
    >> >My guess would be that the OP didn't get rid of all the `NA`s.
    >> >
    >> >Dario: what does:
    >> >
    >> >sapply(iMatrix, function(x) any(is.na(x)))
    >> >
    >> >or if iMatrix is a matrix:
    >> >
    >> >apply(iMatrix, 2, function(x) any(is.na(x)))
    >> >
    >> >say?
    >> >
    >> >G
    >> >
    >> >> Uwe Ligges
    >> >> 
    >> >> 
    >> >> 
    >> >> > The matrix I'm using can be obtained here :
    >> >> > http://129.94.136.7/file_dump/dario/iMatrix.obj
    >> >> >
    >> >> > --------------------------------------
    >> >> > Dario Strbenac
    >> >> > Research Assistant
    >> >> > Cancer Epigenetics
    >> >> > Garvan Institute of Medical Research
    >> >> > Darlinghurst NSW 2010
    >> >> > Australia
    >> >> >

    >> >-- 
    >> >%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
    >> > Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
    >> > ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
    >> > Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
    >> > Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
    >> > UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
    >> >%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%