[R] spss imports--trouble with to.data.frame
Paul Johnson
pauljohn32 at gmail.com
Fri Nov 13 23:29:21 CET 2009
My students are working with several SPSS datasets provided by the
European Social Survey. If you register your name, you can download it
too. This is the 2004 data, for example:
http://ess.nsd.uib.no/ess/round2/
I cannot give you the European Survey dataset, but you can download it
for free if you like, and then you could run these commands to
reproduce the weird pattern described below.
library(foreign)
d2 <- read.spss("ESS3e03_2.por")
warnings()
str(d2$HAPPY)
d2 <- as.data.frame(d2)
str(d2$HAPPY)
d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)
warnings()
str(d2$HAPPY)
Here's my info for this example:
> sessionInfo()
R version 2.10.0 (2009-10-26)
x86_64-pc-linux-gnu
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] foreign_0.8-38
The weirdness that follows is the difference between
d2 <- read.spss( ... , to.data.frame=T)
and
d2 <- read.spss ()
d2 <- as.data.frame(d2)
The former causes all data to become <NA> but the latter seems mostly OK.
> library(foreign)
> d2 <- read.spss("ESS3e03_2.por")
There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ", ... :
duplicated levels will not be allowed in factors anymore
2: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ", ... :
duplicated levels will not be allowed in factors anymore
3: In `levels<-`(`*tmp*`, value = c("Refusal", "Don't know", ... :
duplicated levels will not be allowed in factors anymore
4: In `levels<-`(`*tmp*`, value = c("No second language mentioned", ... :
duplicated levels will not be allowed in factors anymore
5: In `levels<-`(`*tmp*`, value = c("Sans dipl", "Non dipl", ... :
duplicated levels will not be allowed in factors anymore
6: In `levels<-`(`*tmp*`, value = c("\"Ej avslutad
folkskola/grundskola\"", ... :
duplicated levels will not be allowed in factors anymore
7: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers", ... :
duplicated levels will not be allowed in factors anymore
8: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers", ... :
duplicated levels will not be allowed in factors anymore
9: In `levels<-`(`*tmp*`, value = c("K", "K", "Frederiksborg Amt", ... :
duplicated levels will not be allowed in factors anymore
10: In `levels<-`(`*tmp*`, value = c("P", "L", "Kesk-Eesti", ... :
duplicated levels will not be allowed in factors anymore
11: In `levels<-`(`*tmp*`, value = c("Galicia", "Principado de Asturias", ... :
duplicated levels will not be allowed in factors anymore
12: In `levels<-`(`*tmp*`, value = c("Stockholm", "", "Sydsverige", ... :
duplicated levels will not be allowed in factors anymore
> str(d2$HAPPY)
Factor w/ 14 levels "Extremely unhappy",..: 9 7 9 11 9 6 9 4 13 8 ...
> d2 <- as.data.frame(d2)
> str(d2$HAPPY)
Factor w/ 14 levels "Extremely unhappy",..: 9 7 9 11 9 6 9 4 13 8 ...
That appears valid. On my first effort, I had tried to get the data
frame in a single shot with read.spss
> d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)
There were 15 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
longer object length is not a multiple of shorter object length
2: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
longer object length is not a multiple of shorter object length
3: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
longer object length is not a multiple of shorter object length
4: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ", ... :
duplicated levels will not be allowed in factors anymore
5: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ", ... :
duplicated levels will not be allowed in factors anymore
6: In `levels<-`(`*tmp*`, value = c("Refusal", "Don't know", ... :
duplicated levels will not be allowed in factors anymore
7: In `levels<-`(`*tmp*`, value = c("No second language mentioned", ... :
duplicated levels will not be allowed in factors anymore
8: In `levels<-`(`*tmp*`, value = c("Sans dipl", "Non dipl", ... :
duplicated levels will not be allowed in factors anymore
9: In `levels<-`(`*tmp*`, value = c("\"Ej avslutad
folkskola/grundskola\"", ... :
duplicated levels will not be allowed in factors anymore
10: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers", ... :
duplicated levels will not be allowed in factors anymore
11: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators,
senior officials and managers", ... :
duplicated levels will not be allowed in factors anymore
12: In `levels<-`(`*tmp*`, value = c("K", "K", "Frederiksborg Amt", ... :
duplicated levels will not be allowed in factors anymore
13: In `levels<-`(`*tmp*`, value = c("P", "L", "Kesk-Eesti", ... :
duplicated levels will not be allowed in factors anymore
14: In `levels<-`(`*tmp*`, value = c("Galicia", "Principado de Asturias", ... :
duplicated levels will not be allowed in factors anymore
15: In `levels<-`(`*tmp*`, value = c("Stockholm", "", "Sydsverige", ... :
duplicated levels will not be allowed in factors anymore
> str(d2$HAPPY)
Factor w/ 13 levels "Extremely unhappy",..: NA NA NA NA NA NA NA NA NA NA ...
Oh, heck, all the values are missing!! Somehow, putting
"to.data.frame" inside the read.spss causes a different outcome than
using as.data.frame after reading in the data.
The symptoms of this in R-2.9 are a little different, but the
conclusion is the same. Help?
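To see how many columns are hit, and not just HAPPY, a check along these lines might help. This is only a sketch: allNAColumns is a made-up helper name, and the two small data frames are toy stand-ins for the two import results, not the ESS data.

```r
## Hypothetical helper: names of columns that are entirely NA in 'bad'
## although the same column holds at least one value in 'good'.
allNAColumns <- function(good, bad) {
    shared <- intersect(names(good), names(bad))
    lost <- vapply(shared, function(v) {
        all(is.na(bad[[v]])) && !all(is.na(good[[v]]))
    }, logical(1))
    shared[lost]
}

## Toy stand-ins for the two import results (assumption, not ESS data)
viaList <- data.frame(HAPPY = factor(c("1", "2", "3")),
                      AGE = c(30, 41, 25))
oneShot <- data.frame(HAPPY = factor(c(NA, NA, NA),
                                     levels = c("1", "2", "3")),
                      AGE = c(30, 41, 25))
allNAColumns(viaList, oneShot)
```

With the real files, the first argument would be the result of read.spss() followed by as.data.frame(), and the second the result of read.spss(..., to.data.frame=T).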
In case you are a student who wants to work with this data, I can
share with you the large script that I have been accumulating so that
you might "play along". It turns out to be surprisingly difficult to
"recode" these factor variables that have levels like "none", "1",
"2",..."9", "total".
## Paul Johnson
## November 13, 2009
## A question arose in the lab. A student asks "I want
## to compare the answers from two different editions
## of the European Social Survey."
## I will add this to Stuff Worth Knowing later, but
## I can share this tutorial with you right now.
## From this website:
## http://ess.nsd.uib.no/ess
## Download those European Social Survey Datasets into a directory.
## In a terminal, use the unzip command:
## unzip ESS3e03_2.spss.zip
## unzip ESS2e03_1.spss.zip
## Then run the following in R.
library(foreign)
d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)
d2 <- read.spss("ESS3e03_2.por")
warnings()
### You can try to go into a data frame in one
### step, that's an option in read.spss. But
### we saw warnings, and wanted to be careful.
d2 <- as.data.frame(d2)
d2$whichSurvey <- 2
d3 <- read.spss("ESS2e03_1.por")
d3 <- as.data.frame(d3)
d3$whichSurvey <- 3
namesd2 <- names(d2)
namesd3 <- names(d3)
commonNames <- intersect( namesd3, namesd2)
combod23 <- rbind(d2[ , commonNames], d3[, commonNames])
save(combod23, file="combod23.Rda")
## Error
##Warning messages:
##1: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA, :
## invalid factor level, NAs generated
##2: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA, :
## invalid factor level, NAs generated
##3: In `[<-.factor`(`*tmp*`, ri, value = c(1, 1, 1, 1, 1, 1, 1, 1, 1, :
## invalid factor level, NAs generated
## That worries me a little bit. The warnings did too.
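## One possible source of "invalid factor level" warnings in rbind is
## a column that is a factor in one data frame and numeric in the
## other. That is a guess, not a diagnosis. A sketch of a check to run
## before the rbind; classMismatch is a made-up helper name and the
## data frames are toy stand-ins, not the ESS data.

```r
## Hypothetical helper: shared columns whose class differs between frames
classMismatch <- function(a, b) {
    shared <- intersect(names(a), names(b))
    differs <- vapply(shared, function(v) {
        !identical(class(a[[v]]), class(b[[v]]))
    }, logical(1))
    shared[differs]
}

## Toy stand-ins (assumption): HAPPY is a factor in one, numeric in the other
toyA <- data.frame(HAPPY = factor(c("1", "2")), AGE = c(30, 41))
toyB <- data.frame(HAPPY = c(1, 2), AGE = c(25, 52))
classMismatch(toyA, toyB)
```

## With the real data the call would use d2[ , commonNames] and
## d3[ , commonNames] in place of the toy frames.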
## Inspect a few lines in the result.
combod23[1:4, ]
## fix doesn't work for me, did not bother to investigate.
##> fix(combod23)
##Error in edit.data.frame(get(subx, envir = parent), title = subx, ...) :
## can only handle vector and factor elements
## That means some data from hell came into this thing.
## I suspect that combod23 is OK.
## The memory use on this exercise is huge! Try to help it
rm (d2)
rm (d3)
## But I worry. I have 2 ways that I use to try to figure this
## out. One is to open the dataset in a clone of SPSS called
## "PSPP". Actually, the executable is "psppire".
##
## The other thing I do is open the same data again in
## a numeric format, and compare the 2 combined data frames
## This is also a useful exercise because it helps you
## understand what a "factor" is in R.
dn2 <- read.spss("ESS3e03_2.por", use.value.labels = F)
dn2 <- as.data.frame(dn2)
dn2$whichSurvey <- 2
dn3 <- read.spss("ESS2e03_1.por", use.value.labels = F)
dn3 <- as.data.frame(dn3)
dn3$whichSurvey <- 3
## Might be smart to compare
# dn2$HAPPY[1:50]
# d2$HAPPY[1:50]
namesdn2 <- names(dn2)
namesdn3 <- names(dn3)
commonNNames <- intersect( namesdn3, namesdn2 )
combodn23 <- rbind(dn2[ , commonNNames], dn3[, commonNNames])
save(combodn23, file="combodn23.Rda")
table( combod23$HAPPY, combodn23$HAPPY)
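## Reading a big cross table by eye is error prone. A sketch of an
## automatic version of the same check: every label should line up
## with exactly one numeric code. The vectors lab and num are toy
## stand-ins for combod23$HAPPY and combodn23$HAPPY.

```r
## Toy stand-ins (assumption, not the ESS data)
lab <- factor(c("Extremely unhappy", "1", "2", "1", "Extremely happy"),
              levels = c("Extremely unhappy", "1", "2", "Extremely happy"))
num <- c(0, 1, 2, 1, 10)

tab <- table(lab, num)
## Number of distinct numeric codes observed for each label
codesPerLabel <- rowSums(tab > 0)
codesPerLabel
## TRUE when no label is spread over more than one numeric code
all(codesPerLabel <= 1)
```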
## In summary, whenever I want to use a variable from
## the combined data frame, I would probably compare
## against combodn23 just to feel safe.
## Note, when you come back to work on this project again, you
## might as well just reload the saved copies of combod23 and
## combodn23.
## load("combod23.Rda")
## load("combodn23.Rda")
## That will put you at the current spot, no need to redo the merge
## Now, about "recoding". If you just want numerical
## data, you might consider using combodn23.
## But if you want some factors and some numerical
## variables, then you might need to recode to reclaim
## values.
## HAPPY turns out to be an interesting example of a
## PAIN IN THE ASS because in SPSS, it is scored from
## 0 to 10, but they give value labels only for scores
## 0= Extremely unhappy
## and
## 10= Extremely happy
##
## And the SPSS column has no labels for values 1-9.
## If SPSS gave NO labels at all, then this would come
## into R as a numeric variable. BUT, because there are
## 2 levels named, then R makes a factor out of it.
## When R turns it into a factor, you
## end up with a nutty looking factor, which has
## levels you don't really appreciate.
levels(combod23$HAPPY)
# [1] "Extremely unhappy" "1" "2"
# [4] "3" "4" "5"
# [7] "6" "7" "8"
#[10] "9" "Extremely happy" "Refusal"
#[13] "Don't know" "No answer"
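## To see why the levels look like that, here is a toy imitation built
## with factor() directly. It is not how read.spss works inside, just
## a sketch of the result: labeled endpoints, bare numbers in between.
## All object names here are made up for the illustration.

```r
## Toy scores on a 0-10 scale (assumption, not the ESS data)
codes <- c(0, 3, 10, 7, 0)
allValues <- 0:10
labs <- as.character(allValues)
labs[allValues == 0] <- "Extremely unhappy"
labs[allValues == 10] <- "Extremely happy"

f <- factor(codes, levels = allValues, labels = labs)
levels(f)
## The stored integers are positions in levels(f), not the scores:
## a score of 0 is stored as 1, a score of 10 as 11.
as.integer(f)
```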
## Create a new variable to play with
combod23$HAPPY2 <- combod23$HAPPY
## Change Extremely Unhappy to text "0"
levels(combod23$HAPPY)[1] <- "0"
## Change Extremely Happy to "10"
levels(combod23$HAPPY)[11] <- "10"
HELL <- levels(combod23$HAPPY)
### Look at HELL
HELL
combod23$HAPPY2[combod23$HAPPY %in% HELL[12:14] ] <- NA
##CHECK RESULT
table(combod23$HAPPY, combod23$HAPPY2)
## Eliminate the unused levels from HAPPY2
combod23$HAPPY2 <- factor(combod23$HAPPY2)
### Same is found with
## combod23$HAPPY2 <- combod23$HAPPY2[ , drop=T]
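## Use the "factor trick" to
## reset the variable back to numeric. HAPPY2 now has 11 levels,
## in the same order as HELL[1:11], so its codes index HELL directly.
## (as.numeric(HELL) warns about the 3 non-numeric levels; they
## are never selected here.)
combod23$HAPPYN <- as.numeric(HELL)[combod23$HAPPY2]
##CHECK RESULT
table(combod23$HAPPY2, combod23$HAPPYN)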
## CHECK by comparing against numeric data from spss
table(combodn23$HAPPY, combod23$HAPPYN)
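## The "factor trick" in isolation, on a toy factor, in case the
## steps above went by too fast. The point is that as.numeric() on a
## factor returns the internal codes, so the level text has to be
## converted instead. The values are made up.

```r
## Toy factor (assumption, not the ESS data)
f <- factor(c("0", "10", "3", "Refusal"),
            levels = c("0", "3", "10", "Refusal"))
f[f == "Refusal"] <- NA
f <- factor(f)                        # drop the now-unused "Refusal" level
codesOnly <- as.numeric(f)            # internal codes, not the scores
scores <- as.numeric(levels(f))[f]    # level text converted, then looked up
codesOnly
scores
```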
## Next, a student asks "how can I make that same recode
## on a lot of variables?" I'm going to have to leave
## that one unanswered. I think the answer will probably
## be to get a list of variables, then use "lapply" to
## do the same thing to each variable in turn. But
## I have not written up a simple, understandable example
## yet
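## Here is a rough sketch of what the lapply approach might look
## like. It is untested on the ESS files. recodeScale is a made-up
## helper, Q1 and Q2 are made-up column names, and the toy data
## assumes both variables use the same endpoint and missing labels.

```r
## Hypothetical helper: rename labeled levels to their numeric text,
## turn the listed levels into NA, and return numeric scores.
recodeScale <- function(f, relabel,
                        naLevels = c("Refusal", "Don't know", "No answer")) {
    lev <- levels(f)
    hit <- lev %in% names(relabel)
    lev[hit] <- relabel[lev[hit]]
    levels(f) <- lev
    f[f %in% naLevels] <- NA
    f <- factor(f)                 # drop unused levels
    as.numeric(levels(f))[f]       # the "factor trick"
}

## Toy data (assumption, not the ESS data)
lev <- c("Extremely unhappy", as.character(1:9), "Extremely happy",
         "Refusal", "Don't know", "No answer")
toy <- data.frame(
    Q1 = factor(c("Extremely unhappy", "5", "Extremely happy", "Refusal"),
                levels = lev),
    Q2 = factor(c("3", "No answer", "9", "Extremely unhappy"),
                levels = lev))

vars <- c("Q1", "Q2")
relabel <- c("Extremely unhappy" = "0", "Extremely happy" = "10")
## Apply the same recode to each listed column; results go in Q1N, Q2N
toy[paste0(vars, "N")] <- lapply(toy[vars], recodeScale, relabel = relabel)
toy
```

## Variables with different endpoint labels would need their own
## relabel vector, or one vector that lists all of the labels.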
## After the data is all recoded and homogenized, then we
## could run any analysis we want, and throw in the variable
## "whichSurvey" to see if there is a difference between the
## two models.
## Example, choose your y and x1 and x2, then
## mod <- lm(y~ (x1+x2)*whichSurvey, data=combod23)
## or if you think the difference is just in the intercept:
## mod <- lm(y~ x1+x2 + whichSurvey, data=combod23)
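## A runnable toy version of those two models, with simulated data
## standing in for the recoded survey variables. The numbers are
## invented, and whichSurvey is made a factor here, which is a choice
## of this sketch and not something done in the script above.

```r
set.seed(1)
n <- 200
## Simulated stand-in data (assumption, not the ESS data)
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  whichSurvey = factor(rep(c(2, 3), each = n / 2)))
toy$y <- 1 + 0.5 * toy$x1 - 0.3 * toy$x2 +
    0.4 * (toy$whichSurvey == "3") + rnorm(n)

## Slopes and intercept may differ between surveys
modBoth <- lm(y ~ (x1 + x2) * whichSurvey, data = toy)
## Only the intercept may differ between surveys
modShift <- lm(y ~ x1 + x2 + whichSurvey, data = toy)
## F test of the simpler model against the fuller one
anova(modShift, modBoth)
```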
--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas