[R] SPSS data import: problems & work arounds for GSS surveys

Paul Johnson pauljohn32 at gmail.com
Tue Mar 3 04:57:33 CET 2009


I'm using R 2.8.1 on Ubuntu 8.10.  I'm writing partly to ask what's
wrong, and partly to tell other users who search the archive that
there is a workaround.

The General Social Survey is a long-running series of surveys
provided by NORC (National Opinion Research Center).  I have
downloaded some years of the survey data in SPSS format (here's the
site: http://www.norc.org/GSS+Website/Download/SPSS+Format/).  When I
try to import a file with read.spss from the foreign package, I get
an error like so:

> library(foreign)
> dat <- read.spss("gss2006.sav", to.data.frame=T, trim.factor.names=T)
Error in inherits(x, "factor") : object "cp" not found
In addition: Warning messages:
1: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable TVRELIG
2: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable SEI
3: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable FIRSTSEI
4: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable PASEI
5: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable MASEI
6: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 99.9 for variable SPSEI
7: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File contains duplicate label for value 0.75 for variable YEARSJOB
8: In read.spss("gss2006.sav", to.data.frame = T, trim.factor.names = T) :
  gss2006.sav: File-indicated character representation code (1252) looks like a Windows codepage

No dat object is created from this.


I have found a workaround.  I installed PSPP version 0.6.0, used it
to open the sav file, and then re-saved it in SPSS sav format.
That creates an SPSS file that foreign's read.spss can open.
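
For anyone who would rather script that re-save step than click through
the GUI, here is a minimal sketch. It rests on assumptions not stated
above: that the command-line program is installed as "pspp", and that
PSPP's GET FILE and SAVE OUTFILE commands reproduce what the GUI does.
The helper name pspp_resave_syntax is made up for this example, and the
sketch has not been checked against PSPP 0.6.0.

```r
# Build a two-line PSPP syntax script: read one .sav file, write another.
# pspp_resave_syntax() is a hypothetical helper, not part of any package.
pspp_resave_syntax <- function(infile, outfile) {
  c(sprintf("GET FILE='%s'.", infile),
    sprintf("SAVE OUTFILE='%s'.", outfile))
}

# File names are the ones used in this message.
syntax <- pspp_resave_syntax("gss2006.sav", "gss-pspp.sav")
script <- tempfile(fileext = ".sps")
writeLines(syntax, script)

# Assumption: the PSPP command-line program is called "pspp".
# Only attempt the conversion if it is found on the PATH.
if (nzchar(Sys.which("pspp"))) {
  system2("pspp", script)
}
```

If the conversion works, gss-pspp.sav should appear in the working
directory and can be passed to read.spss as shown next.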

I still see the warnings about redundant value labels, but as far as I
can see these are harmless.  A working object is obtained like so:

> dat <- read.spss("gss-pspp.sav")
Warning messages:
1: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable TVRELIG
2: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 0.75 for variable YEARSJOB
3: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable SEI
4: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable FIRSTSEI
5: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable PASEI
6: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable MASEI
7: In read.spss("gss-pspp.sav") :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable SPSEI


There is still some trouble importing this SPSS file, however.  The
symptoms suggest a non-rectangular data array, I think.  What do you
make of these warnings?

> dat <- read.spss("gss-pspp.sav",to.data.frame=T)
There were 22 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable TVRELIG
2: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 0.75 for variable YEARSJOB
3: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable SEI
4: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable FIRSTSEI
5: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable PASEI
6: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable MASEI
7: In read.spss("gss-pspp.sav", to.data.frame = T) :
  gss-pspp.sav: File contains duplicate label for value 99.9 for variable SPSEI
8: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
9: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
10: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
11: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
12: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
13: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
14: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
15: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
16: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
17: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
18: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
19: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
20: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
21: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
22: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
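
A note on what warnings 8 through 22 mean mechanically. The expression
quoted in them combines xi with xi[xi == z[3L]], and the latter keeps
only the matching elements, so it is usually shorter than xi. The toy
example below reproduces the same warning. The values of xi and z are
invented, and reading z as two range bounds plus one discrete
missing-value code is a guess, not something confirmed from the foreign
source.

```r
# Invented stand-ins for one data column (xi) and a missing-value spec (z).
# Assumption: z[1] and z[2] are range bounds, z[3] is a single missing code.
xi <- c(1, 5, 99, 3, 99, 2, 7)
z  <- c(90, 0, 99)

# Subsetting keeps only the matches, so the result is shorter than xi.
length(xi)               # 7
length(xi[xi == z[3L]])  # 2

# Combining a length-7 and a length-2 vector with | forces recycling;
# 7 is not a multiple of 2, so R warns. Capture the warning text.
msg <- tryCatch(
  { xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]]; "no warning" },
  warning = function(w) conditionMessage(w)
)
print(msg)

# Comparing element by element keeps the full length and does not warn.
ok <- xi >= z[1L] | xi <= z[2L] | xi == z[3L]
print(ok)
```

So the warnings need not mean the data are non-rectangular; they may
only reflect how user-defined missing values are being matched. Since
the call without to.data.frame produced no such warnings, it may be
worth trying use.missings = FALSE in read.spss, though that has not
been tested on this file and would leave the missing codes in the data
as ordinary values.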


While puzzling over this, I have tested the SPSS functions in the
package memisc. This has some truly handy features!  Read ?importer
and you'll see it can generate a list of variables as well as a
codebook. It can also handle an SPSS portable file.
An importer object works a little bit like SPSS, actually: the
metadata is read right away, but the data is not loaded until later
(as far as I can tell, one must run either subset or as.data.set to
force the actual data read). One can generate the description and
codebook without accessing the data.

> idat <- spss.system.file("gss2006.sav")
> show(idat)

SPSS system file 'gss2006.sav'
	with 5137 variables and 4510 observations
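
The variable list and codebook mentioned above can be requested along
these lines. This is only a sketch: it assumes memisc is installed and
that gss2006.sav sits in the working directory, and it is guarded so
that it does nothing otherwise. names, description and codebook are
the functions listed in ?importer; gunlaw is simply the variable used
later in this message.

```r
sav <- "gss2006.sav"  # file name from this message

# Run only when memisc and the data file are both available.
if (requireNamespace("memisc", quietly = TRUE) && file.exists(sav)) {
  library(memisc)

  # Opening the file gives an importer object; the data are not loaded yet.
  idat <- spss.system.file(sav)

  # First few variable names.
  print(head(names(idat)))

  # Variable names together with their labels.
  print(description(idat))

  # Codebook for one variable; subset() reads just that column.
  print(codebook(subset(idat, select = c(gunlaw))))
}
```

With more than 5000 variables, asking for the codebook of one or a few
variables at a time is likely to be more readable than printing it for
the whole file.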

A subset function can access the particular variables from the data.


> idat2 <- subset(idat,  select=c(gunlaw))
> idat2

Data set with 4510 observations and 1 variables

   gunlaw
1  OPPOSE
2    *NAP
3    *NAP
4   FAVOR
5   FAVOR
6    *NAP
7   FAVOR
8    *NAP
9   FAVOR
10  FAVOR
11  FAVOR
12  FAVOR
13  FAVOR
14   *NAP
15   *NAP
16   *NAP
17  FAVOR
18   *NAP
19  FAVOR
20   *NAP
21   *NAP
22 OPPOSE
23   *NAP
24   *NAP
25   *NAP
.. ......
(25 of 4510 observations shown)

The function "as.data.set" forces a full read of all the data columns:


> idat3 <- as.data.set(idat)
>

> table(idat3$gunlaw, idat2$gunlaw)

       0    1    2    8    9
  0 2507    0    0    0    0
  1    0 1568    0    0    0
  2    0    0  395    0    0
  8    0    0    0   35    0
  9    0    0    0    0    5


So, in conclusion, I've found troubles with read.spss in foreign, but
have been able to work around them by re-saving the data with PSPP or
by using the functions from the memisc package.   The only advantage
of using the PSPP program (its GUI is psppire) is that you can see the
data in a rectangular spreadsheet that is more-or-less searchable.  It
has that same hard-to-use interface pioneered at SPSS (it hides
variable names and displays descriptions in choosers). But the
rectangular display in PSPP is nice.

pj

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas



