[R] Opinion: Why I find factors convenient to use

Nutter, Benjamin NutterB at ccf.org
Mon Aug 20 14:59:03 CEST 2012


Whether I use stringsAsFactors=FALSE or stringsAsFactors=TRUE tends to depend on where my data are coming from.  If the data are coming from our Oracle databases (well controlled data), I import them with stringsAsFactors=TRUE and everything is great.  If the data are given to me by a fellow in the form of an Excel spreadsheet, I have a good cry and then set stringsAsFactors=FALSE.  Regardless, before I get to analyzing the data, I convert all the character columns to factors.  I imagine people's preferences for the default setting are strongly tied to the quality of the data with which they tend to work.
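
A minimal sketch of the "import as character, clean, then convert" workflow described above. The data, the column names, and the cleaning steps are made up for illustration; they stand in for a hand-edited spreadsheet export and are not anyone's actual data.

```r
# Hypothetical messy data standing in for a hand-edited spreadsheet export.
csv_text <- "id,sex,site
1,M,East
2, m,east
3,F,West
4,f ,West"

# Import as character so the inconsistent labels can be cleaned first.
# The argument is given explicitly so the result does not depend on the
# default of the R version in use.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Clean: strip leading/trailing whitespace and normalise case.
strip <- function(x) gsub("^\\s+|\\s+$", "", x)
raw$sex  <- toupper(strip(raw$sex))
raw$site <- tolower(strip(raw$site))

# Convert every character column to a factor before analysis.
is_chr <- vapply(raw, is.character, logical(1))
raw[is_chr] <- lapply(raw[is_chr], factor)

str(raw)
```

Had the same file been read with stringsAsFactors=TRUE, the variants "M" and " m" would have arrived as separate levels and would need merging afterwards.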

I would prefer the default argument be left as it is, however.  Mostly because
1) I feel like it assumes you are importing data for analysis and not for data management; and more importantly
2) Changing the default would mean I have to change the way I approach data import--and I don't like to change.

  Benjamin Nutter |  Biostatistician     |  Quantitative Health Sciences
  Cleveland Clinic    |  9500 Euclid Ave.  |  Cleveland, OH 44195  | (216) 445-1365


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Rui Barradas
Sent: Monday, August 20, 2012 8:03 AM
To: S Ellison
Cc: r-help
Subject: Re: [R] Opinion: Why I find factors convenient to use

Hello,

On 20-08-2012 12:30, S Ellison wrote:
>   
>
>> -----Original Message-----
>> Over the years, many people -- including some who I would consider 
>> real expeRts -- have criticized factors and advocated the use 
>> (sometimes exclusively) of character vectors instead.
> Exclusive use of character vectors is not going to do the job.
>
> The concept of a factor is fundamental to a lot of statistics; a programming environment that does not implement factors and their associated special behaviour is probably not a statistical programming language.
>
> Special behaviours I have in mind include:
> - Level order can be arbitrarily specified for display purposes
> - A control level can be intentionally chosen for contrasts
> - the option of "ordered" factors (for example, for polr and the like)
>
> So I think the language does and will require a 'factor' type in one form or another.
>
>   _When_ you decide to convert a character input to a factor is, of course, up to the user, and for cleanup it's very often better to stick with character early and convert to factor a bit later. But personally, I think that there is sufficient control over the coding of data to allow user discretion, and on the whole it seems to me that character input is used as factor data so often, when it is used at all, that stringsAsFactors=TRUE seems the more sensible default.
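
As a rough illustration of the three special behaviours listed in the quoted message (explicit level order, a chosen control level, ordered factors), here is a short sketch in base R. The variable `dose` and its labels are hypothetical.

```r
# Hypothetical treatment variable; the labels are made up for illustration.
dose <- c("high", "placebo", "low", "low", "high", "placebo")

# 1. Level order set explicitly (the default would be alphabetical).
f <- factor(dose, levels = c("placebo", "low", "high"))
levels(f)

# 2. Choose the control (reference) level; with treatment contrasts the
#    first level is the baseline the others are compared against.
f_ref <- relevel(f, ref = "low")
levels(f_ref)

# 3. Ordered factor, the kind of response an ordinal model such as
#    MASS::polr works with. Comparisons follow the level order.
f_ord <- factor(dose, levels = c("placebo", "low", "high"), ordered = TRUE)
f_ord[1] > f_ord[3]
```

None of this is available on a plain character vector, which is the sense in which character vectors alone "do not do the job".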

I disagree with this last point. Just think of the number of questions to this list about, say, dates. When read from file using one of the forms of read.table, they usually cause problems. Unless the user is an experienced one, in which case he/she might not have a question to ask.
Besides, a default of TRUE contradicts "stick with character early and convert to factor a bit later", on both the "early" and the "later".
Changing the default behavior of a heavily used function from one version of R to the next is a different matter. What about all the code in use? Maybe it's better to leave it be.
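
To make the date example concrete, here is a small sketch with made-up data. It only shows that the column arrives as a factor rather than a Date under stringsAsFactors=TRUE, and two ways of getting a Date instead; the column names are hypothetical.

```r
# Hypothetical file contents with ISO-formatted dates.
csv_text <- "visit,date
1,2012-08-20
2,2012-01-05
3,2012-03-15"

# With stringsAsFactors = TRUE the date column arrives as a factor,
# so it has to be converted before any date arithmetic.
as_fac <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_fac$date)
d1 <- as.Date(as.character(as_fac$date))

# Reading as character keeps the raw strings, which convert directly.
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
d2 <- as.Date(as_chr$date)

# Alternatively, declare the column classes at import time.
typed <- read.csv(text = csv_text, colClasses = c("integer", "Date"))
d3 <- typed$date

range(d2)
```

The colClasses route only works directly for dates in the default ISO format; other formats still need to be read as character and parsed with an explicit format.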

Rui Barradas
>
> S Ellison
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
