[R] Cluster analysis, factor variables, large data set
Hans Ekbrand
hans at sociologi.cjb.net
Thu Mar 31 21:17:40 CEST 2011
On Thu, Mar 31, 2011 at 08:48:02PM +0200, Hans Ekbrand wrote:
> On Thu, Mar 31, 2011 at 07:06:31PM +0100, Christian Hennig wrote:
> > Dear Hans,
> >
> > clara doesn't require a distance matrix as input (and therefore
> > doesn't require you to run daisy); it will work with the raw data
> > matrix, using Euclidean distances implicitly.
> > I can't tell you whether Euclidean distances are appropriate in this
> > situation (this depends on the interpretation and variables and
> > particularly on how they are scaled), but they may be fine at least
> > after some transformation and standardisation of your variables.
>
> The variables are unordered factors, stored as integers 1:9, where
>
> 1 means "Full-time employment"
> 2 means "Part-time employment"
> 3 means "Student"
> 4 means "Full-time self-employee"
> ...
>
> Do Euclidean distances make sense on unordered factors coded as
> integers?
To be clear, here is an extract of the data:
> my.df.full[900:910, 16:19]
    PL210F.first.year PL210G.first.year PL210H.first.year PL210I.first.year
900                 2                 2                 1                 2
901                 1                 1                 1                 1
902                 1                 1                 1                 1
903                 2                 2                 2                 2
904                 1                 1                 1                 1
905                 2                 2                 2                 2
906                 7                 8                 2                 7
907                 5                 5                 5                 5
908                 1                 1                 1                 1
909                 1                 1                 1                 1
910                 1                 1                 1                 1
> class(my.df.full[,16])
[1] "integer"
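
A minimal sketch of why the question matters, using made-up toy values rather than rows from my.df.full (the data frame toy and its short column names are assumptions for illustration only). It compares Euclidean distances on the raw integer codes with daisy() on the same columns converted to unordered factors. It then shows one possible workaround, 0/1 indicator columns, on which squared Euclidean distance is twice the number of mismatching variables.

```r
library(cluster)  # provides daisy() (and clara())

# Toy data (hypothetical values): status codes 1:9 on two variables.
toy <- data.frame(PL210F = c(1L, 2L, 9L),
                  PL210G = c(1L, 2L, 9L))

# Euclidean distance on the raw codes treats them as quantities, so
# row 1 vs row 3 (codes 1 and 9) ends up much further apart than
# row 1 vs row 2 (codes 1 and 2), although each pair simply differs.
d.euclid <- dist(toy)

# As unordered factors, daisy() scores each variable 0 (same category)
# or 1 (different category) and averages over the variables.
toy.f <- as.data.frame(lapply(toy, factor, levels = 1:9))
d.gower <- daisy(toy.f, metric = "gower")

# Indicator (0/1) coding: one column per level of each variable.
# Squared Euclidean distance then equals 2 * (number of mismatches).
onehot <- do.call(cbind, lapply(toy.f, function(f)
  outer(as.integer(f), seq_len(nlevels(f)), "==") * 1))
d.onehot <- dist(onehot)

print(d.euclid)        # unequal distances, driven by the arbitrary codes
print(d.gower)         # every pair is equally dissimilar
print(d.onehot^2 / 2)  # number of mismatching variables per pair
```

On the full data, daisy() would have to compute all n(n-1)/2 pairwise dissimilarities, which may not be feasible for a large n. The indicator matrix could in principle be passed to clara() directly instead; whether that is appropriate for these data is still an open question.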