[R] I have a dream of creating a program on statistical analyses.

Bill Venables venables at acland.qld.cmis.csiro.au
Sat Jul 1 04:08:12 CEST 2000

Nathapong Samlamjiag writes:

> My name is Nathapong Samlamjiag.  I am a student from Bangkok,
> Thailand. I am studying Political Science, of which a few
> courses concern statistics and research methodology. In
> Thailand, it is believed that SPSS is the most popular program
> used to conduct statistical analyses.  Although I admit that
> SPSS is an excellent program capable of performing
> numerous techniques, I personally believe that SPSS is not
> researcher-friendly.  

In a sense you are right: many statistical packages attempt to be
user-friendly by making it dead easy for people to conduct
standard stock analyses, like regression analyses, at the expense
of making it tedious, difficult or even impossible to adapt the
analysis to suit the realities of the situation under study.

> For example, before researchers conduct a regression analysis,
> they must first check certain assumptions of regression.  Some
> assumptions are very likely to be violated but can usually be
> corrected by data transformations.  Thus, a regression analysis
> should be carried out in particular steps: First, relevant
> assumptions are tested; then, data transformations are
> conducted; and finally the regression analysis is carried out.

Actually, no, it's much more complicated than that.  The
important distributional properties are of the residuals after
the regression, not of the marginal distribution of the response
variable.  This is what makes checking the assumptions very
tricky: you cannot know beforehand whether an apparent
non-normality of the response variable, for example, is due to
ignored covariates (at that stage) or to a real violation of
assumptions.  It's a chicken-and-egg situation requiring
sometimes very subtle judgments by the analyst using as much
ancillary information about the situation as possible.
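
A minimal sketch in R of the point above, using simulated data.
The variable names, group effect and sample size are invented for
illustration, and the Shapiro-Wilk test stands in for whatever
diagnostic the analyst would actually prefer (a residual plot is
usually more informative).

```r
# Simulated data: two groups with different means, normal errors.
# All names and numbers here are assumptions made for illustration.
set.seed(1)
n <- 200
group <- factor(rep(c("a", "b"), each = n / 2))
y <- ifelse(group == "a", 0, 6) + rnorm(n)

# The marginal distribution of y is bimodal, so a normality test on
# the raw response rejects strongly.
p_marginal <- shapiro.test(y)$p.value

# Once the covariate is in the model, it is the residuals that are
# checked, and these come from genuinely normal errors.
fit <- lm(y ~ group)
p_resid <- shapiro.test(residuals(fit))$p.value

cat("Shapiro-Wilk p-value, raw response:", format.pval(p_marginal), "\n")
cat("Shapiro-Wilk p-value, residuals:   ", format.pval(p_resid), "\n")
```

Testing the raw response first would suggest a transformation
that the fitted model shows to be unnecessary.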

Transformation of the response (or predictors) is not a universal
panacea, either.  It might be more appropriate to shift to a
different kind of model, such as a generalized linear model, to
handle some kinds of distributional properties, particularly if
ultimately you need an analysis in the original scale.
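
As a rough sketch of that alternative, assuming count data and
simulated values (the coefficients 0.5 and 0.8, the variable names
and the log(y + 1) device are all invented for illustration), the
two routes might be compared in R like this:

```r
# Simulated counts whose mean depends on x through a log link.
set.seed(2)
n <- 300
x <- runif(n, 0, 2)
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))

# Route A: transform the response and fit an ordinary regression.
# The "+ 1" is an ad hoc device to cope with zero counts.
fit_lm <- lm(log(y + 1) ~ x)

# Route B: keep the response as it is and change the model instead.
fit_glm <- glm(y ~ x, family = poisson(link = "log"))

new <- data.frame(x = c(0.5, 1.5))

# The GLM gives estimated means directly in the original scale.
pred_glm <- predict(fit_glm, newdata = new, type = "response")

# Back-transforming the lm fit does not estimate the mean count;
# it is shown only for comparison.
pred_lm_back <- exp(predict(fit_lm, newdata = new)) - 1

print(cbind(new, pred_glm, pred_lm_back))
```

The generalized linear model describes the mean of the counts
themselves, so no back-transformation is needed to report results
in the original scale.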

> Because it is usually difficult and inconvenient to test and
> correct assumptions in SPSS, many (Thai) researchers tend to
> ignore assumptions of a statistical technique.  

In that they are certainly not alone!

> Ignoring assumptions that may be untrue leads to research
> conclusions that may be unsound.  As far as I know, at present
> there are no statistical programs that are easy to not only
> conduct statistical techniques but also check (and, if
> necessary, correct) the techniques' assumptions.  With this in
> mind, I have a dream of creating a statistical program capable
> of helping political-science researchers conduct statistical
> analysis that will yield valid conclusions.

Much as I regret having to put a curb on such enthusiasm, I have to
suggest you think very carefully about this before sinking too
much energy into it.  This amounts to a statistical expert
system, something that has been tried several times before,
always with disappointing results.  The consensus seems to be
that the statistical contribution to a piece of research is a
genuine contribution requiring just as much judgment and
creativity as any other part of the work and not something that
can be automated (at least not yet).  What you are suggesting
sounds dangerously like just a different kind of SPSS where
instead of the preferences built into that package the researcher
gets your prejudices and preferences, which may be a little more
elaborate but are in the end just as inflexible.

The closest we have come to providing an optimal support system
for the data analyst seems in fact to be software environments
like R which provide coherent suites of tools that can be used as
they are as components or easily extended.  The researcher has a
number of choices: do a proper course in applied statistics (not
just a quickie on methods) and become familiar enough with the
real data analysis issues to handle it with the help of an
environment like R (or even SPSS for that matter), collaborate
with a statistician, employ a consultant or just wing it and hope
for the best when it comes to referees.

> Therefore, I have searched the Internet to find information
> about computer programming. I have found that Visual Studio 6
> can be used to easily create a desktop application and that R
> language is a programming language for statistical computation
> and graphics.  

As I point out above, it is that but much more as well.  It is a
complete software environment for data analysis and graphics that
offers about as much support for the analyst as can reasonably be
offered without inhibiting the creativity or detracting from the
responsibility of the analysis process.

> Accordingly, I have begun dreaming of using Visual Studio 6 and
> R language to develop the desired program for statistical
> analyses.  However, I feel doubtful about some issues.  Thus, I
> would like to ask you a few questions and request your
> invaluable suggestions:

> 1. I intend to develop my statistical stand-alone program by
> first using Visual Basic to create a user-friendly graphical
> user interface and Visual C++ to create a database
> component and then using R language to conduct statistical
> computation.  Thus, I would like to know whether R-language
> functions can be called by an application developed by the
> BASIC and C++ languages?

This is a matter of some interest quite apart from your project.
It is (tangentially) aligned with the work of the Omegahat
project.  You might like to look at http://www.omegahat.org/

> 2. I know that R language is available as Free Software under
> the terms of the Free Software Foundation's GNU General Public
> License in source code form. I have read GNU General Public
> License Version 2, June 1991, but I still do not understand it
> completely. Thus, I would like to ask you whether my
> statistical program which calls R-language functions must also
> be freeware. Must my statistical program be distributed
> freely?  Can I sell my program? More specifically, I am not
> sure about what I can do and what I cannot do without
> infringing on the copyrights.  Would you please clarify this
> concern of mine?  Thank you very much for your R language. I
> will look forward to your suggestions.  

Sorry, I'm no lawyer either...  One thing that is clear, though,
is that if you issue your code under the GPL as well and make it
freely available, you will not violate the conditions.

Bill Venables,      Statistician,     CMIS Environmetrics Project
CSIRO Marine Labs, PO Box 120, Cleveland, Qld,  AUSTRALIA.   4163
Tel: +61 7 3826 7251           Email: Bill.Venables at cmis.csiro.au    
Fax: +61 7 3826 7304      http://www.cmis.csiro.au/bill.venables/
