[R] Cluster analysis, factor variables, large data set

Thu Mar 31 21:17:40 CEST 2011

On Thu, Mar 31, 2011 at 08:48:02PM +0200, Hans Ekbrand wrote:
> On Thu, Mar 31, 2011 at 07:06:31PM +0100, Christian Hennig wrote:
> > Dear Hans,
> > 
> > clara doesn't require a distance matrix as input (and therefore
> > doesn't require you to run daisy), it will work with the raw data
> > matrix using Euclidean distances implicitly.
> > I can't tell you whether Euclidean distances are appropriate in this
> > situation (this depends on the interpretation and variables and
> > particularly on how they are scaled), but they may be fine at least
> > after some transformation and standardisation of your variables.
> 
> The variables are unordered factors, stored as integers 1:9, where 
> 
> 1 means "Full-time employment"
> 2 means "Part-time employment"
> 3 means "Student"
> 4 means "Full-time self-employed"
> ...
> 
> Do Euclidean distances make sense on unordered factors coded as
> integers?

To be clear, here is an extract of the data:

> my.df.full[900:910, 16:19]
    PL210F.first.year PL210G.first.year PL210H.first.year PL210I.first.year
900                 2                 2                 1                 2
901                 1                 1                 1                 1
902                 1                 1                 1                 1
903                 2                 2                 2                 2
904                 1                 1                 1                 1
905                 2                 2                 2                 2
906                 7                 8                 2                 7
907                 5                 5                 5                 5
908                 1                 1                 1                 1
909                 1                 1                 1                 1
910                 1                 1                 1                 1

> class(my.df.full[,16])
[1] "integer"
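
A minimal sketch of the concern, using made-up toy values rather than
the real data: on integer codes, Euclidean distance treats category 1
as much closer to category 2 than to category 9, although the
categories are unordered. Converting the columns to factors and calling
daisy() from the cluster package with metric = "gower" is one possible
alternative; for purely nominal variables it reduces to the proportion
of mismatching variables. The names toy, toy.f, d.euc and d.gow are
hypothetical, and this is only an illustration on three rows, not a
recommendation for the full data set.

```r
library(cluster)

# Toy data (hypothetical, not taken from my.df.full):
# three cases, two variables holding unordered category codes
toy <- data.frame(a = c(1L, 2L, 9L), b = c(1L, 2L, 9L))

# Euclidean distance on the raw integer codes:
# case 1 comes out far closer to case 2 than to case 3
d.euc <- dist(toy)

# Treat the codes as unordered factors and use Gower's coefficient:
# every pair differs on both variables, so all distances are equal
toy.f <- data.frame(lapply(toy, factor))
d.gow <- daisy(toy.f, metric = "gower")

print(d.euc)
print(d.gow)
```

Note that daisy() builds the full dissimilarity object, which grows
quadratically with the number of rows, so this approach may not be
feasible on a large data set without subsampling.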