[R] stringsAsFactor global option (was "character coerced to a factor")

hadley wickham h.wickham at gmail.com
Mon Apr 23 15:30:25 CEST 2007


>   A place where factors really are a pain is when the patient id is a character
> string.  When, for instance, you subset the data to do an analysis of only
> the females, having the data set `remember' all of the male id's (the original
> levels) is non-productive in dozens of ways.  For other variables factors
> work well and have some nice properties.  In general, I've found in my work
> (medical research) that factors are beneficial for about 1/5 of the character
> variables, a PITA for 1/4, and a wash for the rest; so prefer to do any
> transformations myself.

It seems to me that the most important difference between factors and
character vectors is that factors also store the range of the
variable.  You could imagine doing something similar for continuous
variables.  This would have the interesting property that plots of
subsets would have the same range as plots of the original data.  I'd
imagine, just as with factors, this would be useful and frustrating in
equal parts.
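
To make that concrete, here is a small sketch of how a factor keeps its
full set of levels after subsetting. The data frame and the names
`patients`, `id` and `sex` are made up for illustration; they are not
from the original poster's data.

```r
# Hypothetical data; none of these names come from the thread.
patients <- data.frame(
  id  = c("p01", "p02", "p03", "p04"),
  sex = c("F", "M", "F", "M"),
  stringsAsFactors = TRUE
)

females <- patients[patients$sex == "F", ]

# The subset still remembers every original id, including the males'.
levels_before <- levels(females$id)
table(females$id)

# Re-creating the factor drops the levels that are no longer used.
females$id <- factor(females$id)
levels_after <- levels(females$id)

# The continuous analogue has to be done by hand: record the range of
# the full data and pass it along when plotting a subset.
x <- c(1.5, 3.2, 8.9, 12.4)
full_range <- range(x)
x_small <- x[x < 5]
# plot(x_small, ylim = full_range) would keep the original axis range.
```

Whether the remembered levels help or get in the way depends on the
analysis, which is really the point of the paragraph above.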

In terms of which should be the default, I can imagine two arguments:

 * keep to the original format of the data as closely as possible:
   character vectors should be the default.

 * maintain as much information about the original data as possible:
   factors should be the default.
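
Whichever default is in force, the choice can be made explicitly per
call. A minimal sketch, assuming the `stringsAsFactors` argument of
`data.frame()` (the vector `raw` is just a toy example):

```r
raw <- c("a", "b", "a")

# Keep the data in its original format: plain character vector.
as_chr <- data.frame(x = raw, stringsAsFactors = FALSE)

# Keep the extra information: factor with levels "a" and "b".
as_fac <- data.frame(x = raw, stringsAsFactors = TRUE)

class(as_chr$x)   # "character"
class(as_fac$x)   # "factor"
levels(as_fac$x)  # "a" "b"
```

Spelling the argument out avoids depending on the default, which can
differ between R versions.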

Hadley


