[R] problem with svyby and NAs (survey package)
Thomas Lumley
tlumley at uw.edu
Sat Apr 14 23:28:23 CEST 2012
On Sat, Apr 14, 2012 at 5:44 AM, <A.F.Fenton at lse.ac.uk> wrote:
> Hello
>
> I'm trying to get the proportion "true" for dichotomous variable for
> various subgroups in a survey.
>
> This works fine, but obviously doesn't give proportions directly:
> svytable(~SurvYear+problem.vandal, seh.dsn, round=TRUE)
>         problem.vandal
> SurvYear FALSE  TRUE
>     1995  8906   786
>     1997 17164  2494
>     1998 17890  1921
>     1999 18322  1669
>     2001 17623  2122
> ...
>
> Note some years are missing - they are part of the dataset, but all
> responses are NA (the question wasn't asked).
>
> However, this gives an error, and I'd like to understand why - it works
> for variables without missing years:
>
> svyby(~problem.vandal, ~SurvYear, seh.dsn, svymean, na.rm=TRUE)
> Error in tapply(1:NROW(x), list(factor(strata)), function(index) { :
> arguments must have same length
>
> The error only occurs when na.rm=TRUE and there are no observations in
> one year.
The error occurs because you are asking for the mean of a vector of
all NAs. svyby() just calls svymean() on each subset of the data.

In your reproducible example,
    svymean(~problem, subset(foo.dsn, year==2004), na.rm=TRUE)
will give the same error, and a work-around is to use
    subset(foo.dsn, year!=2004)
as the design in the call to svyby().
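
The work-around above can be sketched roughly as follows. The data frame here is a made-up stand-in for the poster's example (the names foo, wt and the values are assumptions, not the original data), with every response for 2004 set to NA:

```r
library(survey)

# Hypothetical toy data: the question was not asked in 2004,
# so every response for that year is NA.
foo <- data.frame(
  year    = rep(c(2002, 2003, 2004), each = 50),
  problem = c(rep(c(TRUE, FALSE, FALSE, FALSE, FALSE), 20), rep(NA, 50)),
  wt      = 1
)
foo.dsn <- svydesign(ids = ~1, weights = ~wt, data = foo)

# Drop the all-NA year from the design before handing it to svyby(),
# so svymean() is never asked for the mean of an empty subset.
res <- svyby(~problem, ~year, subset(foo.dsn, year != 2004),
             svymean, na.rm = TRUE)
print(res)
```

The result has one row per remaining year. Whether the unsubsetted call still fails depends on the version of the survey package installed, since the behaviour for empty data was due to change.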
Now, svymean() is entitled to be a bit upset: you asked for the mean
of all the non-missing values, but you didn't give it any
non-missing values. What should it do? It obviously can't return a
sensible proportion, because it was given no data.

It could just return NaN as the answer, as mean() does, but that
wouldn't help you here since svyby() is expecting a vector of two
proportions and a covariance matrix for them.
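
For comparison, this is how base R's mean() behaves when nothing is left after removing NAs. It is only an illustration of the NaN behaviour mentioned above; the vector is made up:

```r
# Every response is missing, as in a year where the question was not asked
all_missing <- c(NA, NA, NA)

# With na.rm = TRUE there are no values left, so the mean is 0/0
m <- mean(all_missing, na.rm = TRUE)
print(m)          # NaN
print(is.nan(m))  # TRUE
```

A single NaN is a usable answer for mean(), but it is not the two estimates plus covariance matrix that svyby() needs to assemble its output.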
Obviously it would be possible to rewrite svymean() to handle empty
data, and I'll do that, but that doesn't solve the general problem of
what happens when svyby() asks for something impossible. It would
also be possible for svyby() to trap errors and treat them as empty
results, but that would have the disadvantage of making debugging a
lot harder.
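
To make the trade-off concrete, here is a base-R sketch of the "trap errors and treat them as empty results" idea. It does not use svyby() or show how svyby() is implemented; strict_mean() is a hypothetical stand-in for an estimator that refuses empty input, and the data are made up:

```r
# Hypothetical estimator that fails when there is nothing to average
strict_mean <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) stop("no non-missing values")
  mean(x)
}

year    <- rep(c(2002, 2003, 2004), each = 4)
problem <- c(TRUE, FALSE, FALSE, FALSE,
             TRUE, TRUE,  FALSE, FALSE,
             rep(NA, 4))

# Trap the error for each group and report NA instead of stopping
by_year <- sapply(split(problem, year), function(x) {
  tryCatch(strict_mean(x), error = function(e) NA_real_)
})
print(by_year)
```

The 2004 entry comes back as NA with no message, which is convenient here but would hide a genuine mistake (a misspelled variable, say) in exactly the same way.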
-thomas
--
Thomas Lumley
Professor of Biostatistics
University of Auckland