[R] Data type in a data frame
William Dunlap
wdunlap at tibco.com
Tue Oct 23 20:55:01 CEST 2012
> When read into a data.frame, R defaults to reading character strings as
> factors. If you don't want that, use option stringsAsFactors = FALSE.
This is somewhat tangential, but if you plan on using
predict(fit,newdata=nd)
after fitting a model like
fit <- lm(y~x, data=d)
be sure you have converted character columns in nd and d into factors.
Otherwise you are likely to get errors from predict(). You will get
a warning when fitting the model if you use character columns, but
the results are ok until you use predict() on the result.
E.g.,
> d <- data.frame(y=1:10,
+     cGroup=rep(c("A","B","C"),c(3,4,3)),
+     fGroup=factor(rep(c("A","B","C"),c(3,4,3))),
+     stringsAsFactors=FALSE)
> fitChar <- lm(y ~ cGroup - 1, data=d[1:9,])
Warning message:
In model.matrix.default(mt, mf, contrasts) :
variable 'cGroup' converted to a factor
> fitFactor <- lm(y ~ fGroup - 1, data=d[1:9,])
> coef(fitChar)
cGroupA cGroupB cGroupC
2.0 5.5 8.5
> coef(fitFactor)
fGroupA fGroupB fGroupC
2.0 5.5 8.5
> # so far things are ok, but ...
> predict(fitChar, newdata=d[10,])
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
variable 'cGroup' converted to a factor
> predict(fitFactor, newdata=d[10,])
10
8.5
> predict(fitChar, newdata=d[c(1,10),])
Error in predict.lm(fitChar, newdata = d[c(1, 10), ]) :
subscript out of bounds
In addition: Warning message:
In model.matrix.default(Terms, m, contrasts.arg = object$contrasts) :
variable 'cGroup' converted to a factor
> predict(fitFactor, newdata=d[c(1,10),])
1 10
2.0 8.5
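One way to avoid the errors above is to convert the character column to a factor before fitting, and to give any separately built newdata the same levels. The following is only a minimal sketch under stated assumptions: the data frame name nd and its contents are hypothetical, and the levels are taken from the xlevels component that lm() stores in the fitted object.

```r
# Same data as in the example above, with cGroup kept as character
d <- data.frame(y = 1:10,
                cGroup = rep(c("A", "B", "C"), c(3, 4, 3)),
                stringsAsFactors = FALSE)

# Convert once, before fitting, so every subset of d carries all the levels
d$cGroup <- factor(d$cGroup)

fit <- lm(y ~ cGroup - 1, data = d[1:9, ])

# Hypothetical newdata built separately as character:
# reuse the factor levels recorded in the fit
nd <- data.frame(cGroup = c("A", "C"), stringsAsFactors = FALSE)
nd$cGroup <- factor(nd$cGroup, levels = fit$xlevels$cGroup)

# No conversion warning, and a single-level newdata no longer breaks contrasts
p <- predict(fit, newdata = nd)
print(p)
```

Setting the levels explicitly matters because a factor made from newdata alone would contain only the levels that happen to appear in it, which is the situation that caused the contrasts error for d[10,].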
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Rui Barradas
> Sent: Tuesday, October 23, 2012 11:16 AM
> To: asafwe
> Cc: r-help at r-project.org
> Subject: Re: [R] Data type in a data frame
>
> Hello,
>
> When read into a data.frame, R defaults to reading character strings as
> factors. If you don't want that, use option stringsAsFactors = FALSE.
> Using your dataset,
>
>
> dat1 <- read.table(text = "
> Observation Gender Dosage Alertness
> 1 m a 8
> 2 m a 12
> 3 m a 13
> 4 m a 12
> 5 m b 6
> 6 m b 7
> ", header = TRUE)
> str(dat1)
>
> dat2 <- read.table(text = "
> Observation Gender Dosage Alertness
> 1 m a 8
> 2 m a 12
> 3 m a 13
> 4 m a 12
> 5 m b 6
> 6 m b 7
> ", header = TRUE, stringsAsFactors = FALSE)
> str(dat2)
>
>
> The default is taken from the following option, which you can change:
>
> options("stringsAsFactors")
>
> Hope this helps,
>
> Rui Barradas
> On 23-10-2012 15:43, asafwe wrote:
> > Hi all,
> >
> > How does R know to regard a variable as a factor and not a character?
> > For example, consider the following table:
> >
> > Observation Gender Dosage Alertness
> > 1 m a 8
> > 2 m a 12
> > 3 m a 13
> > 4 m a 12
> > 5 m b 6
> > 6 m b 7
> >
> > When read into a dataframe, will "m", "a", "b" be regarded as a factor or as
> > a character? How does R decide?
> >
> > Thanks a lot in advance,
> >
> > Asaf
> >
> >
> >
> > --
> > View this message in context: http://r.789695.n4.nabble.com/Data-type-in-a-data-frame-tp4647161.html
> > Sent from the R help mailing list archive at Nabble.com.
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>