[R] dummy variable encoding
    Richard.Cotton at hsl.gov.uk 
       
    Fri Mar  6 11:05:27 CET 2009
    
    
  
> > The best encoding depends upon which language you would like to
> > manipulate the variable in.  In R, genders are most naturally
> > represented as factors.  That means that in an external data source
> > (like a spreadsheet of data), you should ideally have the gender
> > recorded as human-understandable text ("male" and "female", or "M" and
> > "F").  Once the data is read into R, by default R will convert the
> > string to factors (keeping the human readable labels).  This way you
> > avoid having to remember that 1 means male (or whatever).
> >
> > If you were manipulating the data in a different language that didn't
> > have factors, then it might be more appropriate to use an integer.
> > Which integers you use doesn't matter, you need to have a look-up table
> > to know what each number refers to, whatever you choose.
> >
> Yes, that's what I thought. However somebody told me that it is better
> to use 1/2 rather than 0/1 for a 2 level factor such as gender, and I've
> no idea why. I told them it didn't matter, but have since seen quite a
> few examples where they use 1/2 (admittedly in SPSS).
The only benefit that I can see of using 1/2 instead of 0/1 is fairly
minor.

If you have cases where there are missing values, and you are working in a
language that doesn't support NA values for integers (or factors; I'm
thinking of something like C), then you could encode your genders as
0: not recorded
1: female
2: male
Since such a language treats 0 as false and any non-zero value as true,
you can then include logic like
if(gender)
{ 
   do something
}
The alternative encoding, using 0/1, would be something like
-1: not recorded
0: female
1: male
This makes the code slightly less pretty.
if(gender != -1)
{ 
   do something
}
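To make the comparison concrete, here is a minimal C sketch of the two
encodings.  The enum and function names are hypothetical, made up for this
illustration; each function just counts how many entries in an array have a
recorded gender.

```c
/* Encoding 1/2, with 0 reserved for "not recorded" (hypothetical names). */
enum gender12 { G12_NOT_RECORDED = 0, G12_FEMALE = 1, G12_MALE = 2 };

/* Encoding 0/1, with -1 as the "not recorded" sentinel (hypothetical names). */
enum gender01 { G01_NOT_RECORDED = -1, G01_FEMALE = 0, G01_MALE = 1 };

/* Count recorded values under the 1/2 encoding.
   Any non-zero value is "true", so the value itself is the test. */
int count_recorded12(const int *gender, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (gender[i]) {
            count++;
        }
    }
    return count;
}

/* Count recorded values under the 0/1 encoding.
   Female is 0, so the sentinel must be compared explicitly. */
int count_recorded01(const int *gender, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (gender[i] != G01_NOT_RECORDED) {
            count++;
        }
    }
    return count;
}
```

Note that with the 0/1 encoding a plain if(gender) would silently skip every
female record as well as failing to exclude the missing ones, which is why
the explicit comparison against -1 is needed there.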
Again, none of this really applies to R, since you should be using factors 
for this sort of variable.
Regards,
Richie.
Mathematical Sciences Unit
HSL
    
    