[R] question about categorical variables in R

Jim Lemon drjimlemon at gmail.com
Sat Sep 12 07:12:36 CEST 2015


Hi Lida,
Given that this is such a common question and the R FAQ doesn't really
answer it, perhaps a brief explanation will help. In R, a factor combines
the literal values found in the data (its levels) with a set of integer
codes beginning at 1 that index those levels. By default the levels are
sorted alphabetically. For example, suppose you read in what you think is
a set of numbers like this:

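# stringsAsFactors=TRUE was the default before R 4.0.0; from 4.0.0 on it
# has to be given explicitly to reproduce the result shown here
x<-read.table(text="1 2 3
4 5 6
7 . 9",stringsAsFactors=TRUE)
x
  V1 V2 V3
1  1  2  3
2  4  5  6
3  7  .  9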

Now look at the classes of the columns:

sapply(x,class)
       V1        V2        V3
"integer"  "factor" "integer"

Somehow that second column has become a factor. This is because "." cannot
be represented as a number and I didn't tell R that it should be regarded
as a missing value (na.strings="."). R has taken the literal values in that
column

levels(x$V2)
[1] "." "2" "5"

and attached numbers to those values in their alphabetic order.

as.numeric(x$V2)
[1] 2 3 1
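
To make the link between codes and levels concrete, here is a small sketch
on a stand-alone factor (the name f is only for illustration). Each integer
code is a position in the levels vector, so indexing the levels by the codes
gives back the literal values:

```r
# A stand-alone factor holding the same three literal values
f <- factor(c("2", "5", "."))

codes <- as.integer(f)   # the internal integer codes
levels(f)[codes]         # the literal values again: "2" "5" "."
```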

You can get the original numbers back like this:

as.numeric(as.character(x$V2))
[1]  2  5 NA
Warning message:
NAs introduced by coercion

and R helpfully tells you that it couldn't coerce "." to a number.
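
If the "." entries are meant to be missing values, one way to avoid the
conversion altogether is to say so when the data are read. A minimal sketch
with the same toy data (the name x2 is only for illustration):

```r
# Same toy data, but "." is declared as the missing-value marker
x2 <- read.table(text = "1 2 3
4 5 6
7 . 9", na.strings = ".")

sapply(x2, class)  # all three columns are now "integer"
x2$V2              # 2 5 NA
```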

In your example, the factor is created for you

mf<-factor(c("male","female"))
mf
[1] male   female
Levels: female male

but as you can see, the default order of the levels may not be what you
expect; "female" sorts before "male", so it gets code 1:

as.numeric(mf)
[1] 2 1
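
If the order matters, for instance because the first level is normally
treated as the reference category in a regression, the levels can be given
explicitly instead of recoding by hand. A short sketch, assuming the default
treatment contrasts (the name mf2 is made up for illustration, and this says
nothing about how any particular package handles factors):

```r
# Set the level order explicitly: "male" first, "female" second
mf2 <- factor(c("male", "female"), levels = c("male", "female"))
levels(mf2)      # "male" "female"
as.numeric(mf2)  # 1 2

# Modelling functions such as lm() build a 0/1 indicator column from a
# factor themselves; model.matrix() shows the coding they would use
mm <- model.matrix(~ mf2)
mm
```

Here the column mf2female is 0 for the male row and 1 for the female row,
so there is usually no need to create the 0/1 variable yourself.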

For a more complete account of factors, see "An Introduction to R" section
4 "Ordered and unordered factors".

Jim

On Sat, Sep 12, 2015 at 12:45 AM, Lida Zeighami <lid.zigh at gmail.com> wrote:

> Hi dear experts,
> I have a general question in R, about the categorical variable such as
> Gender(Male or Female)
> If I have this column in my data and wanted to do regression model or feed
> the data to  seqmeta packages (singlesnp, skat meta) , would you please let
> me know should I code them first ( male=0 and female=1) or R programming do
> it for me?
> Because when I didn't code them, R still can do the analysis without any
> error but I'm not sure it's correct or not?
> Thanks