[R] Converting factors back to numbers. Trouble with SPSS import data
Thomas Lumley
tlumley at u.washington.edu
Mon Feb 20 02:16:20 CET 2006
On Sun, 19 Feb 2006, Paul Johnson wrote:
> I'm using Fedora Core 4, R-2.2.
>
> The basic question is: can one recover the numerical values used in
> SPSS after importing data into R with read.spss from the foreign
> library? Here's why I ask.
>
> My colleague sent an SPSS data set. I must replicate some results she
> calculated in SPSS and one problem is that the numbers used in SPSS
> for variable values are not easily recovered in R.
>
> I'm comparing 2 imported datasets, "eldat" (read.spss with No
> convert-to-factors) and
> "eldatfac" (read.spss with convert-to-factors)
>
> If I bring in the data without conversion to factors:
>
> library(foreign)
> eldat <- read.spss("18CitySCBSsorted.sav", use.value.labels=F,
> to.data.frame=T)
>
> I can see the variable HAPPY is coded 0, 1, 2, 3. Those are the
> numbers that SPSS
> uses as contrast values when it runs a regression with HAPPY.
So, bring in the data without conversion to factors.
Factors in R are not just labels for arbitrary numeric variables. They are a special type of variable for categorical data that happen to be implemented with the numbers 1,2,3,...
If that isn't what you want, don't use factors. read.spss will still return all the labels as attributes of the returned data frame.
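A minimal sketch of the difference, using made-up toy values in place of the real HAPPY column (the vector names here are illustrative only, not from the data set):

```r
# Toy data standing in for the SPSS variable HAPPY (codes 0-3); not the real file.
happy_codes  <- c(2, 2, 3, 2, 1, 0)
happy_labels <- c("Not happy at all", "Not very happy", "Happy", "Very happy")

# Converting to a factor replaces the SPSS codes with level indices 1..4.
happy_factor <- factor(happy_codes, levels = 0:3, labels = happy_labels)

as.integer(happy_factor)   # internal codes: 3 3 4 3 2 1
happy_codes                # the numeric column still holds 2 2 3 2 1 0
```

On a real import, calling attributes() on the data frame and on an individual column should show where read.spss has stored the labels.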
> In contrast, allow R to translate the variables with a few value
> labels into factors.
>
> library(foreign)
> eldatfac <- read.spss("18CitySCBSsorted.sav",
> max.value.labels=7,to.data.frame=T)
>
> Consider the first 50 observations on the variable HAPPY
>
>> f<- eldatfac$HAPPY[1:50]
>> f
> [1] Happy Happy Very happy Happy Very happy
> [6] Very happy Happy Very happy Happy Very happy
> [11] Happy Happy Not very happy Very happy Very happy
> [16] Happy Happy Very happy Happy Happy
> [21] Not very happy Happy Happy Very happy Happy
> [26] Happy Happy Happy Happy Happy
> [31] Happy Happy Happy Happy Happy
> [36] Happy Very happy Very happy Happy Very happy
> [41] Very happy Very happy Happy Very happy Very happy
> [46] Happy Happy Happy Very happy Very happy
> 6 Levels: Not happy at all Not very happy Happy Very happy ... Refused
>
>> levels(f)
> [1] "Not happy at all" "Not very happy" "Happy" "Very happy"
> [5] "Don't know" "Refused"
>
>
> I need the numerical values back in order to have a regression like
> SPSS. Isn't this what ?factor says one ought to do? Why are these all
> missing?
>
>> as.numeric(levels(f))[f]
> [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
> [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
No, this is not what ?factor says you should do. This is what you do if your levels are numbers (in character form) and you want those numbers. "Happy" is not a number.
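A small illustration with invented factors (f_num and f_txt are hypothetical names) of when that idiom works and when it gives NA:

```r
# Levels that are numbers stored as text: the idiom recovers them.
f_num <- factor(c("10", "20", "10", "30"))
as.numeric(levels(f_num))[f_num]                      # 10 20 10 30

# Levels that are text labels: as.numeric() on the levels gives NA
# (normally with a coercion warning, suppressed here).
f_txt <- factor(c("Happy", "Very happy", "Happy"))
suppressWarnings(as.numeric(levels(f_txt)))[f_txt]    # NA NA NA
```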
>> as.numeric(f)
> [1] 3 3 4 3 4 4 3 4 3 4 3 3 2 4 4 3 3 4 3 3 2 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 4 4
> [39] 3 4 4 4 3 4 4 3 3 3 4 4
>
> Comparing against the "as.numeric" output from the unconverted factor,
> I can see the levels are just one digit different.
Yes, because SPSS used the codes 0,1,2,3 and R uses 1,2,3,4. If you want the SPSS codes back for this variable, you could just subtract 1 from as.numeric(f).
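As a sketch, assuming the SPSS codes run consecutively from 0 and the factor levels are in the same order as those codes (toy values, not the real data):

```r
happy_levels <- c("Not happy at all", "Not very happy", "Happy", "Very happy")
f <- factor(c("Happy", "Very happy", "Not very happy"), levels = happy_levels)

as.numeric(f)        # R's level indices: 3 4 2
as.numeric(f) - 1    # shifted to the 0-based SPSS codes: 2 3 1
```

This shift only works for consecutive codes; it would not recover a scheme with gaps in it.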
>> g <- eldat$HAPPY[1:50]
>> as.numeric(g)
> [1] 2 2 3 2 3 3 2 3 2 3 2 2 1 3 3 2 2 3 2 2 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 3 3
> [39] 2 3 3 3 2 3 3 2 2 2 3 3
>
> I'm more worried about the kinds of variables that are coded
> irregularly 1, 3, 7, 11 in the SPSS scheme.
>
If you want to keep the numeric values, don't change them to factors. That's why there is an option.
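If the data have already been converted and only the factor is at hand, one possible fallback is a lookup from label to code. This is only a sketch: the labels and the value_labels vector below are invented for illustration, and in practice the mapping would have to come from the SPSS file's own value labels.

```r
# Hypothetical value labels for a variable coded 1, 3, 7, 11 (names are the labels).
value_labels <- c("Low" = 1, "Medium" = 3, "High" = 7, "Very high" = 11)

# A factor built from the labels, as an import with conversion might produce.
f <- factor(c("High", "Low", "Very high", "Medium", "High"),
            levels = names(value_labels))

# Index the lookup by label text to get the original codes back.
codes <- unname(value_labels[as.character(f)])
codes   # 7 1 11 3 7
```

Importing without conversion avoids the need for any such lookup.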
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle