[R] Data type in a data frame

William Dunlap wdunlap at tibco.com
Tue Oct 23 20:55:01 CEST 2012


> When read into a data.frame, R defaults to reading character strings as
> factors. If you don't want that, use option stringsAsFactors = FALSE.

This is somewhat tangential, but if you plan on using
  predict(fit,newdata=nd)
after fitting a model like
  fit <- lm(y~x, data=d)
be sure you have converted the character columns in both nd and d into
factors. Otherwise you are likely to get errors from predict(). Fitting
the model with character columns gives only a warning, and the fitted
coefficients are fine; the trouble starts when you call predict() on the fit.

E.g.,
> d <- data.frame(y=1:10, cGroup=rep(c("A","B","C"),c(3,4,3)), fGroup=factor(rep(c("A","B","C"),c(3,4,3))), stringsAsFactors=FALSE)
> fitChar <- lm(y ~ cGroup - 1, data=d[1:9,])
Warning message:
In model.matrix.default(mt, mf, contrasts) :
  variable 'cGroup' converted to a factor
> fitFactor <- lm(y ~ fGroup - 1, data=d[1:9,])
> coef(fitChar)
cGroupA cGroupB cGroupC 
    2.0     5.5     8.5 
> coef(fitFactor)
fGroupA fGroupB fGroupC 
    2.0     5.5     8.5
> # so far things are ok, but ...
> predict(fitChar, newdata=d[10,])
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
  variable 'cGroup' converted to a factor
> predict(fitFactor, newdata=d[10,])
 10 
8.5
> predict(fitChar, newdata=d[c(1,10),])
Error in predict.lm(fitChar, newdata = d[c(1, 10), ]) : 
  subscript out of bounds
In addition: Warning message:
In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
  variable 'cGroup' converted to a factor
> predict(fitFactor, newdata=d[c(1,10),])
  1  10 
2.0 8.5
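
One way to avoid the errors is to convert the character column to a factor
before fitting and to give the new data a factor with the same levels. This
is only a sketch under that assumption, not the only possible fix; the
names train and pred are made up for illustration, and nd plays the role of
newdata as in the text above.

```r
# Same data as in the example, with the group stored as character.
d <- data.frame(y = 1:10,
                cGroup = rep(c("A", "B", "C"), c(3, 4, 3)),
                stringsAsFactors = FALSE)

train <- d[1:9, ]        # rows used for fitting (illustrative name)
nd    <- d[c(1, 10), ]   # rows to predict for

# Convert once, before fitting, so the fit knows the levels.
train$cGroup <- factor(train$cGroup)

# Build the newdata factor from the training levels, so it carries all of
# them even when newdata contains only some of the groups.
nd$cGroup <- factor(nd$cGroup, levels = levels(train$cGroup))

fit  <- lm(y ~ cGroup - 1, data = train)
pred <- predict(fit, newdata = nd)
pred
```

Taking the levels from the training data, rather than calling factor() on
nd by itself, matters because a small newdata may contain only one group.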


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Rui Barradas
> Sent: Tuesday, October 23, 2012 11:16 AM
> To: asafwe
> Cc: r-help at r-project.org
> Subject: Re: [R] Data type in a data frame
> 
> Hello,
> 
> When read into a data.frame, R defaults to reading character strings as
> factors. If you don't want that, use option stringsAsFactors = FALSE.
> Using your dataset,
> 
> 
> dat1 <- read.table(text = "
> Observation   Gender  Dosage  Alertness
> 1             m       a               8
> 2             m       a              12
> 3             m       a              13
> 4             m       a              12
> 5             m       b               6
> 6             m       b               7
> ", header = TRUE)
> str(dat1)
> 
> dat2 <- read.table(text = "
> Observation   Gender  Dosage  Alertness
> 1             m       a               8
> 2             m       a              12
> 3             m       a              13
> 4             m       a              12
> 5             m       b               6
> 6             m       b               7
> ", header = TRUE, stringsAsFactors = FALSE)
> str(dat2)
> 
> 
> The default is controlled by the following option, which you can change:
> 
> options("stringsAsFactors")
> 
> Hope this helps,
> 
> Rui Barradas
> On 23-10-2012 15:43, asafwe wrote:
> > Hi all,
> >
> > How does R know to regard a variable as a factor and not a character?
> > For example, consider the following table:
> >
> > Observation   Gender  Dosage  Alertness
> > 1             m       a               8
> > 2             m       a              12
> > 3             m       a              13
> > 4             m       a              12
> > 5             m       b               6
> > 6             m       b               7
> >
> > When read into a dataframe, will "m", "a", "b" be regarded as a factor or as
> > a character? How does R decide?
> >
> > Thanks a lot in advance,
> >
> > Asaf
> >
> >
> >
> > --
> > View this message in context: http://r.789695.n4.nabble.com/Data-type-in-a-data-frame-tp4647161.html
> > Sent from the R help mailing list archive at Nabble.com.
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 



