[Rd] stringsAsFactors

Mon Feb 11 23:46:49 CET 2013

On Feb 11, 2013, at 18:50 , Duncan Murdoch wrote:

> 
> I do think that it's unfortunate that we don't get the same result in both cases, and I'd like to have gotten the predictions you suggested, but I don't think that's going to happen.  The reason for the difference is that the subsetting is done before the conversion to a factor, but I think that is unavoidable without really big changes.

It's logically impossible, I'd say. If you want to do conversion from character to factor on an as-needed basis, you _will_ have issues with subsetting operations affecting the set of levels.
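
A minimal sketch of the problem, using a made-up data frame (the names `raw`, `g`, `y` and `late` are purely illustrative, not taken from the thread):

```r
# Hypothetical data: a character column with three distinct values.
raw <- data.frame(g = c("a", "b", "c", "a"), y = 1:4,
                  stringsAsFactors = FALSE)

# Subset first, convert to factor only when needed:
# the value "c" is gone by the time the levels are computed.
late <- raw[raw$g != "c", ]
late_levels <- levels(factor(late$g))
print(late_levels)
```

The level set here depends on which rows happened to survive the subsetting, so anything downstream that relies on the levels (contrasts, predictions for new data) sees only "a" and "b".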

The logical way out is to define factors before subsetting. As far as possible, create them up front. Doing it automagically in read.table is far from infallible, but at least has some chance of getting it roughly right. In my view, this is actually a pretty strong argument for keeping stringsAsFactors = TRUE.
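
For comparison, a sketch of the up-front approach on the same kind of made-up data (again, `dat`, `g`, `y` and `early` are illustrative names only). Here the conversion is done by hand with factor(); read.table with stringsAsFactors = TRUE would amount to doing this step at input time:

```r
# Hypothetical data, as a character column to begin with.
dat <- data.frame(g = c("a", "b", "c", "a"), y = 1:4,
                  stringsAsFactors = FALSE)

# Define the factor before any subsetting takes place.
dat$g <- factor(dat$g)

# Subsetting a factor keeps its full level set (unless drop = TRUE is used).
early <- dat[dat$g != "c", ]
early_levels <- levels(early$g)
early_counts <- table(early$g)
print(early_levels)
print(early_counts)
```

The subset still knows about level "c", which simply has a count of zero, so the level set no longer depends on which rows were selected.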

(Praeterea censeo: The real issue is that plain-text data file formats contain insufficient metadata, so what we probably should do is to start thinking about ways to encode type and level set information in the files themselves.) 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com