[R] Importance of levels in a factor variable

Saeed Abu Nimeh sabunime at gmail.com
Thu Aug 26 21:40:25 CEST 2010


I have a dataset of multiple variables and a response. For example,
> str(x)
'data.frame':   3557238 obs. of  44 variables:
 $ response: Factor w/ 2 levels
 $ var2    : Factor w/ 5000 levels
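
For anyone who wants to experiment without the real data, here is a minimal sketch of a simulated stand-in with the same shape: a two-level response and one factor with many levels. The sizes, the level names ("no", "yes", "lev01", ...) and the random assignment are all assumptions chosen for illustration and are not taken from the actual dataset.

```r
# Simulated stand-in for the data described above (NOT the real data).
# n and k are deliberately much smaller than 3557238 rows / 5000 levels.
set.seed(1)
n <- 10000
k <- 50
level_names <- sprintf("lev%02d", 1:k)   # hypothetical level names

x <- data.frame(
  response = factor(sample(c("no", "yes"), n, replace = TRUE),
                    levels = c("no", "yes")),
  var2     = factor(sample(level_names, n, replace = TRUE),
                    levels = level_names)
)

str(x)
```

Only two of the 44 variables are reproduced here, since the others are not described.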


If var2, for example, is a factor with 5000 levels, what is the best
approach to determine which of these levels are the most important to
include when building the model, and which ones to discard? Assuming
there is a way to do that, is it valid to include only the
important levels for that variable and discard the rest when building
the model?
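
To make the question concrete, the following is a rough sketch of one thing "keeping only some levels" could mean: retain the levels that occur at least a given number of times and lump all remaining levels into a single "other" level. Using frequency as the measure of importance, the cutoff of 30, the simulated data and the name var2_lumped are all assumptions for illustration; this is not a recommendation that this is the right criterion.

```r
# Simulated data with skewed level frequencies (NOT the real data).
set.seed(1)
n  <- 10000
lv <- sprintf("lev%03d", 1:200)          # hypothetical level names
x <- data.frame(
  response = factor(sample(c("no", "yes"), n, replace = TRUE),
                    levels = c("no", "yes")),
  var2     = factor(sample(lv, n, replace = TRUE,
                           prob = 1 / (1:200)^1.5),
                    levels = lv)
)

# Count observations per level and keep the frequent ones.
counts    <- table(x$var2)
min_count <- 30                          # arbitrary cutoff (assumption)
keep      <- names(counts)[counts >= min_count]

# Lump every level below the cutoff into a single "other" level.
var2_lumped <- as.character(x$var2)
var2_lumped[!(var2_lumped %in% keep)] <- "other"
x$var2_lumped <- factor(var2_lumped)

nlevels(x$var2)         # number of levels before lumping
nlevels(x$var2_lumped)  # number of levels after lumping
```

The lumped factor could then be used in place of var2 in a model formula. Note that the rows with rare levels are kept, not dropped; only their level labels are merged.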
Thanks,
Saeed

---
> sessionInfo()
R version 2.10.1 (2009-12-14)
x86_64-pc-linux-gnu
32 GB RAM


