[R] A comment about R:

Peter Muhlberger pmuhl1848 at gmail.com
Wed Jan 4 20:43:08 CET 2006


I'm someone who from time to time comes to R to do applied stats for social
science research.  I think the R language is excellent--much better than
Stata for writing complex statistical programs.  I am thrilled that I can do
complex stats readily in R--sem, maximum likelihood, bootstrapping, some
Bayesian analysis.  I wish I could make R my main statistical package, but
find that a few stats that are important to my work are difficult to find or
produce in R.  Before I list some examples, I recognize that people view R
not as a statistical package but rather as a statistical programming
environment.  That said, however, it seems, from my admittedly limited
perspective, that it would be fairly easy to make a few adjustments to R
that would make it a lot more practical and friendly for a broader range of
people--including people like me who from time to time want to do
statistical programming but more often need to run canned procedures.  I'm
not a statistician, so I don't want to have to learn everything there is to
know about common procedures I use, including how to write them from
scratch.  I want to be able to focus my efforts on more novel problems w/o
reinventing the wheel.  I would also prefer not to have to work through a
couple books on R or S+ to learn how to meet common needs in R.  If R were
extended a bit in the direction of helping people like me, I wonder whether
it would not acquire a much broader audience.  Then again, these may just be
the rantings of someone not sufficiently familiar w/ R or the community of
stat package users--so take my comments w/ a grain of salt.

Some examples of statistics I typically use that are difficult to find, to
produce, or to produce in a usefully formatted way in R--

Ex. 1)  Wald tests of linear hypotheses after max. likelihood or even after
a regression.  "Wald" does not even turn up in a help search of my base R
installation.  There's no comment in the lm or optim help about what function
to use for hypothesis tests.  I know that statisticians prefer likelihood
ratio tests, but Wald tests are still useful and indeed crucial for
first-pass analysis.  After searching with Google for some time, I found
several Wald functions in various contributed R packages I did not have
installed.  One confusion was which one would be relevant to my needs.  This
took some time to resolve.  I concluded, perhaps on insufficient evidence,
that package car's Wald test would be most helpful.  To use it, however, one
has to put together a matrix for the hypotheses, which can be arduous for a
many-term regression or a complex hypothesis.  In comparison, in Stata one
simply states the hypothesis in symbolic terms.  I also don't know for
certain that this function in car will work or work properly w/ various
kinds of output, say from lm or from optim.  To be sure, I'd need to run
time-consuming tests comparing it with Stata output or examine the
function's code.  In Stata the test is easy to find, and there's no
uncertainty about where it can be run or its accuracy.  Simply having a
comment or "see also" in lm help or mle or optim help pointing the user to
the right Wald function would be of enormous help.
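
For illustration, here is a minimal sketch of the sort of thing I mean,
assuming package car is installed; the model, data set, and variable names
(mydata, y, x1, x2) are invented for this example, so treat it as a sketch
rather than tested code:

    # Wald test of a joint linear hypothesis after lm(), using car
    library(car)

    fit <- lm(y ~ x1 + x2, data = mydata)

    # Hypothesis matrix: one row per restriction, one column per
    # coefficient (intercept, x1, x2); here H0: x1 = 0 and x2 = 0 jointly.
    # By default the restrictions are tested against zero.
    L <- rbind(c(0, 1, 0),
               c(0, 0, 1))
    linear.hypothesis(fit, L)

    # Newer versions of car call this linearHypothesis() and also accept
    # the hypotheses symbolically, closer to the Stata style:
    # linearHypothesis(fit, c("x1 = 0", "x2 = 0"))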

Ex. 2) Getting neat output of a regression with Huberized variance matrix.
I frequently have to run regressions w/ robust variances.  In Stata, one
simply adds the word "robust" to the end of the command or
"cluster(cluster.variable)" for a cluster-robust error.  In R, there are two
functions, robcov and hccm.  I had to run tests to figure out what the
relationship is between them and between them and Stata (robcov w/o cluster
gives hccm's hc0; hccm's hc1 is equivalent to Stata's 'robust' w/o cluster;
etc.).  A single sentence in hccm's help saying something to the effect that
statisticians prefer hc3 for most types of data might save me from having to
scramble through the statistical literature to try to figure out which of
these I should be using.  A few sentences on what the differences are
between these methods would be even better.  Then, there's the problem of
output.  Given that hc1 or hc3 is preferred for non-clustered data, I'd need
to be able to get regression output of the form summary(lm) out of hccm for
it to be of any practical use.  Getting this, however, would require
programming my own function.  Huberized t-statistics for regressions are a
commonplace need; an R oriented a little more toward everyday use would not
require users to program them themselves.  Also, I'm not sure yet how well
any of the existing functions handle missing data.
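
To give a concrete sense of the workaround I have in mind, here is a minimal
sketch of one way to get summary()-style output with Huberized standard
errors, assuming the car and lmtest packages are installed; the model and
variable names are again invented:

    library(car)     # for hccm()
    library(lmtest)  # for coeftest()

    fit <- lm(y ~ x1 + x2, data = mydata)

    # Coefficient table (estimates, robust SEs, t-statistics, p-values)
    # using the HC3 heteroskedasticity-consistent covariance matrix.
    coeftest(fit, vcov = hccm(fit, type = "hc3"))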

Ex. 3)  I need to do bootstrapping w/ clustered data, again a common
statistical need.  I wasted a good deal of time reading the help contents of
the boot and bootstrap packages, only to conclude that I'd need to write my
own, probably inefficient, function to bootstrap clustered data if I were to
use boot.
It's odd that boot can't handle this more directly.  After more digging, I
learned that bootcov in package Design would handle the cluster bootstrap
and save the parameters.  I wouldn't have found this if I had not needed
bootcov for another purpose.  Again, maybe a few words in the boot help
saying that 'for clustered data, you could use bootcov or program a function
in boot' would be very helpful.  I still don't know whether I can feed the
results of bootcov back into functions in the boot package for further
analysis.
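
For what it's worth, the roll-your-own approach with boot seems to come down
to resampling whole clusters with replacement; a minimal sketch, with the
data set, variable, and cluster names invented for illustration:

    library(boot)

    cluster.ids <- unique(mydata$cluster.var)

    # Statistic for boot(): i is a resample (with replacement) of cluster
    # IDs; rebuild the data from the selected clusters and refit the model.
    cluster.coef <- function(ids, i) {
      d <- do.call(rbind, lapply(ids[i], function(id)
        mydata[mydata$cluster.var == id, ]))
      coef(lm(y ~ x1 + x2, data = d))
    }

    b <- boot(cluster.ids, cluster.coef, R = 999)
    b   # bootstrap distribution of the coefficients, clusters resampled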

My 2 bits for what they're worth,

Peter



