[R] Re: Thanks Frank, setting graph parameters, and why social scientists don't use R

Tue Aug 17 18:20:23 CEST 2004

A few comments:

First, your remarks are interesting and, I would say, mainly well founded. However, I think they are in many respects irrelevant, although they do point to the much bigger underlying issue, which Roger Peng also hinted at in his reply.

I think they are sensible because R IS difficult; the documentation is often challenging, which is not surprising given (a) the inherent complexity of R; (b) the difficulty of writing good documentation, especially when many of the functions being documented are inherently technical, so subject matter knowledge (CS, statistics, numerical analysis, ...) must be assumed; (c) the fact that the documentation has been written by a variety of mostly statistical types as a sidelight of their main professional activities -- none of these writers are **professional documenters** (whatever that may mean), and some of them even speak English as a second or third language. My own take is that the documentation for Core R and many of the packages is remarkably well done given these realities, and my hat is off to those who have produced it. Nevertheless, I agree, it is challenging -- it MUST be.

But they are irrelevant because the fundamental issue **is** that there is an inherent tension between ease of use and power/flexibility. Writing good GUIs for anything is hard, very hard. For a project such as R as a whole, it doesn't make sense, although it may make sense to write GUIs for small subsets of R targeted at specific audiences (as in Bioconductor, RCommander, etc.). But even this is hard to do well and takes a lot of time and effort. So, IMHO, there never will be, nor ever should or could be, an overall GUI for R: it is too complex and needs to be too extensible and flexible to be constrained in that way.

However, I believe the larger question that both you and Roger Peng hint at is more important: not "How does a social scientist learn to use R?" but how does any scientist/technologist for whom experimental design and data analysis form a large component of their work gain the necessary technical background in statistics and related disciplines (linear algebra, numerical analysis, ...) to **know how to use the statistical tools they need that R provides**? Software like SPSS must assume a limited collection of methods to present to its customers in an effective GUI. Its strategy **must** be (this is NOT a criticism) to "dumb it down" so that it can provide coherent albeit limited data analysis strategies. As you have explicitly stated, users who wish to venture outside those narrow paradigms are simply out of luck. R was designed from the outset not to be so constrained, but the cost is that you must know a good deal to use it effectively. It is obvious from the questions posted to this list that even something as "simple" as lm() often demands from users technical statistical understanding far beyond what they have. So we fairly frequently see indications of misunderstanding and confusion in using R. But the problem isn't R -- it's that users don't know enough statistics.

I wish I could say I had an answer for this, but I don't have a clue. I do not think it's fair to expect a mechanical engineer or psychologist or biologist to have the numerous math and statistical courses and experience in their training that would provide the base they need. For one thing, they don't have the time in their studies for this; for another, they may not have the background or interest -- they are, after all, mechanical engineers or biologists, not statisticians. Unfortunately, they could do their jobs as engineers and scientists a lot better if they did know more statistics. To me, it's a fundamental conundrum, and no one is to blame. It's just the reality, but it is the source of all kinds of frustrations on both sides of the statistical divide, which both you and Roger expressed in your own ways.

Obviously, all of this is just personal ranting, so I would love to hear alternative views. And thanks again for your clear and interesting comments.

Cheers,
Bert

david_foreman at doctors.org.uk wrote:

> First, many thanks to Frank Harrell for once again helping me out. This actually relates to the next point, which is my contribution to the 'why don't social scientists use R' discussion. I am a hybrid social scientist (child psychiatrist) who trained on SPSS. Many of my difficulties in coming to terms with R have been to do with trying to apply the logic underlying SPSS, with dire results. You do not want to know how long I spent looking for a 'recode' command in R, to change factor names and classes...
>
> I think the solution is to combine a graphical interface that encourages command line use (such as Rcommander) with the analyse(this) paradigm suggested, but also explaining a) how one can display the code on a separate window ('page' is only an obvious command once you know it), and b) how one can then save one's modification, make it generally available, and not overwrite the unmodified version (again, thanks, Frank). Finally, one would need to change the emphasis in basic statistical teaching from 'the right test' to 'the right model'. That should get people used to R's logic.
>
> If a rabbit starts to use R, s/he is likely to head for the help files associated with each function, which can assume that the reader can make sense of gnomic utterances like "Omit 'var' to impute all variables, creating new variables in 'search' position 'where'".  I still don't know what that one means (as I don't understand search positions, or why they're important).  This can be very offputting, and could lead the rabbit to return to familiar SPSS territory.
>
> Finally, friendlier error messages would also help. It took me 3 days, and opening every function I could, to work out that '...cannot find function xxx.data.frame...' meant that MICE was unable to make a polychotomous logistic imputation model converge for the variable immediately preceding it.
>
> I am now off to the help files and FAQs to find out how to change graph parameters, as the plot.mids function in MICE a) doesn't allow one to select a subset of variables, and b) tells me that the graph it wants to produce on the whole of my 26 variable dataset is too big to fit on the (windows) plotting device.  Unless anyone wants to tell me how/where? (which of course is why, in the end, R is EASIER to use than SPSS)

--

Bert Gunter

Non-Clinical Biostatistics
Genentech
MS: 240B
Phone: 650-467-7374

"The business of the statistician is to catalyze the scientific learning process."

 -- George E.P. Box