[Rd] read.spss issues
Jeroen Ooms
jeroen.ooms at stat.ucla.edu
Wed Feb 15 07:05:29 CET 2012
Someone supplied me with a small SPSS data file that causes a buffer
overflow and then a crash when read into R. This seems like a pretty
serious issue to me. Unfortunately I can't share the dataset, and I am
having a hard time reproducing the crash with a toy example. However,
I found at least two issues that might be related.
The first is that when the SPSS dataset has a 'string' variable
that is longer than 200 characters, read.spss generates a bunch of
warnings and creates additional variables in the dataset. E.g.:
library(foreign)
x <- read.spss("http://www.stat.ucla.edu/~jeroen/spss/longstring.sav")
str(x)
The second problem is that the SPSS data format allows 'duplicate
labels' (the same label attached to more than one value), whereas
factor levels must be unique. read.spss does not deal with this and
creates a bad factor
x <- read.spss("http://www.stat.ucla.edu/~jeroen/spss/duplicate_labels.sav",
               use.value.labels = TRUE)
levels(x$opinion)
which causes issues downstream. I am not sure whether this is an issue
in read.spss() or in as.factor(), but I guess it would be wise to
detect duplicate levels and map them all to one and the same integer
code when converting to a factor.
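
A minimal sketch of that idea, assuming the value labels are available
as a named vector whose names are the labels and whose elements are the
SPSS codes (which I believe is how read.spss stores them in the
"value.labels" attribute). The helper name labelled_to_factor is
hypothetical and not part of foreign:

```r
# Hypothetical helper: collapse duplicate labels into a single factor level.
# values:       vector of raw SPSS codes
# value_labels: named vector; names are labels, elements are codes
labelled_to_factor <- function(values, value_labels) {
  labs <- names(value_labels)
  # Look up the label belonging to each observed code
  # (codes without a label become NA in this sketch).
  obs_labels <- labs[match(values, value_labels)]
  # Use each distinct label once, so the levels are guaranteed unique.
  factor(obs_labels, levels = unique(labs))
}

codes <- c(1, 2, 3, 2, 1)
value_labels <- c(agree = 1, neutral = 2, agree = 3)  # "agree" is duplicated
f <- labelled_to_factor(codes, value_labels)
levels(f)      # "agree" "neutral"
as.integer(f)  # 1 2 1 2 1
```

Codes 1 and 3 both carry the label "agree", so both end up with the
same integer code in the resulting factor.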
Thank you,
Jeroen