[R] Study Missing Data: differentiate between a percent within the valid answers and within the different missing answers
Ericka Lundström
e at it.dk
Mon Mar 3 08:21:32 CET 2008
Hi R experts
I'm trying to migrate from SPSS to R, though I have some problems
getting R to distinguish between the different kinds of missing values.
I want to distinguish between data that are missing because a
respondent refused to answer and data that are missing because the
question didn't apply to that respondent. In other words, I want to
create data values where I control which are valid and which are
missing observations, so I can study both the valid and the missing
observations.
SPSS does this quite smoothly. It looks something like this in SPSS:
Get paid appropriately, considering efforts and achievements

N  Valid    947
   Missing  558

                                      Frequency  Percent  Valid Percent  Cumulative Percent
Valid    Agree strongly                      98      6,5           10,3                10,3
         Agree                              408     27,1           43,1                53,4
         Neither agree nor disagree         126      8,4           13,3                66,7
         Disagree                           259     17,2           27,3                94,1
         Disagree strongly                   56      3,7            5,9               100,0
         Total                              947     62,9          100,0
Missing  Not applicable                     534     35,5
         Don't know                           1       ,1
         No answer                           23      1,5
         Total                              558     37,1
Total                                      1505    100,0

(If the table gets messy and you can’t read it in your email program,
there is a nicely formatted SPSS table here:
https://stat.ethz.ch/pipermail/r-help/1998-October/002942.html
where K. Mueller asks an almost identical question in 1998!)
SPSS metacategorizes each value of a variable as either Missing or
Valid. This means that, besides differentiating between missing
and valid, the categories within missing are treated separately.
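
To make the difference between the two percentage columns concrete, here is a minimal sketch in base R that recomputes them from the counts in the SPSS table above. Only the counts come from that table; the variable names are my own. "Percent" divides by all 1505 respondents, while "Valid Percent" divides by the 947 valid answers only.

```r
# Counts copied from the SPSS table above
valid   <- c("Agree strongly" = 98, "Agree" = 408,
             "Neither agree nor disagree" = 126,
             "Disagree" = 259, "Disagree strongly" = 56)
missing <- c("Not applicable" = 534, "Don't know" = 1, "No answer" = 23)

n_total <- sum(valid) + sum(missing)   # 1505

# "Percent": all respondents are the denominator
percent_valid   <- round(100 * valid / n_total, 1)
percent_missing <- round(100 * missing / n_total, 1)

# "Valid Percent": only the valid answers are the denominator
valid_percent <- round(100 * valid / sum(valid), 1)
cumulative    <- round(cumsum(100 * valid / sum(valid)), 1)

print(cbind(Frequency = valid, Percent = percent_valid,
            "Valid Percent" = valid_percent,
            "Cumulative Percent" = cumulative))
print(cbind(Frequency = missing, Percent = percent_missing))
```
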
# At the moment I'm only able to get this information from R:
> describe(ess3dk$PDAPRP)
ess3dk$PDAPRP : Get paid appropriately, considering efforts and
achievements
n missing unique
1505 0 8
Agree strongly (98, 7%),
Agree (408, 27%)
Neither agree nor disagree (126, 8%),
Disagree (259, 17%)
Disagree strongly (56, 4%),
Not applicable (534, 35%)
Don't know (1, 0%),
No answer (23, 2%)
# Then I can recode 'Not applicable', 'Don't know' and 'No answer' as missing:
> ess3dk[ess3dk$PDAPRP=="Not applicable" | ess3dk$PDAPRP=="Don't know" |
+   ess3dk$PDAPRP=="No answer", "PDAPRP"] <- NA
# But that just piles 'Not applicable', 'Don't know' and 'No answer'
# together in 'missing':
> describe(ess3dk$PDAPRP)
ess3dk$PDAPRP : Get paid appropriately, considering efforts and
achievements
n missing unique
947 558 5
Agree strongly (98, 10%),
Agree (408, 43%)
Neither agree nor disagree (126, 13%),
Disagree (259, 27%)
Disagree strongly (56, 6%)
Is there a smart way in R to differentiate between missing and valid
and at the same time treat both the categories within missing and
valid as answers (like SPSS did above)?
I'm using an SPSS data set (.sav/.por) from the European Social Survey
(the ESS):
http://ess.nsd.uib.no/index.jsp?module=download&year=2007&country=&download=%5CDirect+Data+download%5C2007%5C01%23ESS3+-+integrated+file%2C+edition+2.0%5C.%5CESS3e02.spss.zip
which I import via spss.get (from the Hmisc package) like this:
> ess3dk <- spss.get("filename.sav", lowernames=FALSE, datevars=NULL,
+   use.value.labels=TRUE, to.data.frame=TRUE, max.value.labels=Inf,
+   force.single=TRUE, allow=NULL, charfactor=FALSE)
I have read the help files for spss.get and read.spss to see if this
subject is mentioned, and I have looked around this mailing list. I
have found one question that is almost identical, here:
https://stat.ethz.ch/pipermail/r-help/1998-October/002942.html
(from October 1998!), but nobody answered it.
Here is some self-contained reproducible code:

library(Hmisc)  # provides describe()

# I create a simple data frame
dataFrame <- data.frame(ONE = c(2, 1, 3, 2, NA, 4, 2),
                        TWO = c("yes", "?", "No", "X", "No", "?", "X"),
                        AGE = c(42, 18, 49, 62, NA, 19, 82))

# Then I have a look at the "TWO" column. Here I can see every answer.
describe(dataFrame$TWO)

# Now I classify the answers "X" and "?" as missing, because I want to know
# the valid percent (yes and no), but I don't want to delete the "X" and "?"
# answers.
dataFrame[dataFrame$TWO == "?" | dataFrame$TWO == "X", "TWO"] <- NA

# Then I have another look at the "TWO" column. Now I can't see how many
# answered "X" and how many answered "?".
describe(dataFrame$TWO)

My question is whether it is possible in R to work with a metacategory of
valid and not-valid answers, as described above. In other words, I
want to distinguish, as is possible in SPSS, between a percent within
the valid answers and a percent overall.
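
To illustrate the kind of result I am after, here is a rough sketch in base R. It assumes the original codes are kept as they are and that the codes counting as missing are declared in a separate vector instead of being overwritten with NA. The helper name freq_with_missing and its missing_codes argument are hypothetical and do not come from any package.

```r
# Hypothetical helper: tabulate a vector, treating the codes listed in
# `missing_codes` as "missing" without overwriting them with NA.
freq_with_missing <- function(x, missing_codes) {
  x <- as.character(x)
  x[is.na(x)] <- "NA"                    # true NAs get their own row
  counts <- table(x)
  n <- as.vector(counts)
  is_missing <- names(counts) %in% c(missing_codes, "NA")
  # Valid percent: denominator is the number of valid answers only
  valid_pct <- ifelse(is_missing, NA, 100 * n / sum(n[!is_missing]))
  out <- data.frame(
    status        = ifelse(is_missing, "Missing", "Valid"),
    answer        = names(counts),
    frequency     = n,
    percent       = round(100 * n / sum(n), 1),  # denominator: everyone
    valid_percent = round(valid_pct, 1),
    stringsAsFactors = FALSE
  )
  out[order(out$status == "Missing"), ]  # valid rows first
}

# Same toy data as in the example above, before any recoding to NA
dataFrame <- data.frame(ONE = c(2, 1, 3, 2, NA, 4, 2),
                        TWO = c("yes", "?", "No", "X", "No", "?", "X"),
                        AGE = c(42, 18, 49, 62, NA, 19, 82))

result <- freq_with_missing(dataFrame$TWO, missing_codes = c("?", "X"))
print(result)
```

Because "X" and "?" stay in the data, their individual counts and overall percentages remain visible, while the valid percent is computed over "yes" and "No" only.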
I normally use this method to quickly get an overview of missing and
valid answers and the percentage distribution within the
missing and valid answers, so I would like to find some smart
solution to this problem. I would really appreciate an answer or some
help to point me in the right direction.
Thanks in advance
Regards
Ericka Lundström