SUMMARY: [R] Comparison of SAS & R/Splus

Paul, David A paulda at BATTELLE.ORG
Fri Sep 5 01:06:02 CEST 2003


My thanks to Drs. Armstrong, Bates, Harrell, Liaw, Lumley, 
Prager, Schwartz, and Mr. Wang for their replies.  I have
pasted my original message and their replies below.

After viewing http://www.itl.nist.gov/div898/strd/ as suggested
by Dr. Schwartz, it occurred to me that it might be educational
to search for some data repositories on Google. I was able to find 
some, though I'm sure many of the R listserv readers are already 
aware of them:

http://kdd.ics.uci.edu/
http://www.ics.uci.edu/~mlearn/MLOther.html
http://www.ldeo.columbia.edu/datarep/
http://data.geocomm.com/
http://libraries.mit.edu/gis/data/repository.html
http://nssdc.gsfc.nasa.gov/


  -david paul



-----Original Message-----
I am one of only 5 or 6 people in my organization making the 
effort to include R/Splus as an analysis tool in everyday 
work - the rest of my colleagues use SAS exclusively.

Today, one of them asserted that the numerical algorithms in SAS 
are superior to those in Splus and R -- i.e., optimization routines 
are faster in SAS, the SAS Institute has teams of excellent 
numerical analysts who ensure its superiority to anything freely 
available, PROC NLMIXED is more flexible than nlme() in the sense 
that it allows a much wider array of error structures than can be 
used in R/Splus, etc.

I obviously do not subscribe to these views and would like 
to refute them, but I am not a numerical analyst and am still 
a novice at R/Splus.  Are there refereed papers comparing the 
numerical capabilities of these platforms?  If not, are there 
other resources I might look up and pass along to my colleagues?
---------------------------

This link might give you some insight, but SAS is not one of the 
packages benchmarked here.

http://www.sciviews.org/other/benchmark.htm

   [Whit Armstrong]

---------------------------

I don't have papers comparing the numerical capabilities, but I say 
bunk to your colleagues.  The last time I looked, SAS still relied 
on the out-of-date Gauss-Jordan sweep operator in many key places, 
in place of the QR decomposition that R and S-Plus use for regression.  
And because SAS is closed source, it is impossible to see how it 
really does its calculations in some cases.
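A quick way to see the QR point from the R prompt -- a minimal 
illustration using the built-in cars data, nothing else assumed:

    ## least-squares coefficients via the QR decomposition, as lm() does
    X <- cbind(1, cars$speed)             # design matrix with intercept
    qr.coef(qr(X), cars$dist)
    coef(lm(dist ~ speed, data = cars))   # same answer, QR under the hood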

See http://hesweb1.med.virginia.edu/biostat/s/doc/splus.pdf, Section 
1.6, for a comparison of S and SAS (though this doesn't address 
numerical reliability).  In my estimation, SAS is about 11 years 
behind R and S-Plus in statistical capabilities overall (last year 
it was about 10 years behind).

Frank Harrell
SAS User, 1969-1991
---
Frank E Harrell Jr    Professor and Chair            School of Medicine
                      Department of Biostatistics    Vanderbilt University

---------------------------

Too bad your colleagues weren't at the "State of Statistical 
Software" session at JSM.  I was there.  It was so packed that 
people ran out of standing room.  The three speakers were all R 
advocates (Jan De Leeuw, Luke Tierney and Duncan Temple Lang).  
The most interesting thing (to me) about the session was that the 
discussant was a person from SAS (first name Wolfgang).  I just 
had to hear what he'd say.

The SAS person essentially said that the numerical accuracy of 
R (probability functions, especially) is unmatched because the 
routines were written by authority figures in the area.  (That's 
one advantage he said R has.  He also said that because the code 
is open, even SAS is looking at the R source, and that, to him, 
is a disadvantage.  He obviously missed the point of open source.)  
One of his criticisms of R, compared to SAS, is that R may not 
have undergone extensive QA tests.  He said that SAS now probably 
has only a handful of PROC developers (not exactly the "team" your 
colleague imagined), but 5-6 times more software testers.

I think hearing it from the horse's mouth beats reading journal 
articles for this sort of thing.  There was a recent article in 
The American Statistician criticizing the numerical instability and 
poor quality of the RNG in JMP (a SAS product).  SAS posted a "white 
paper" on their web site refuting some of those claims (though they 
did change the RNG to the Mersenne Twister in JMP 5), comparing JMP 
with Excel and SAS.  I must say that comparison isn't convincing, as 
neither Excel nor SAS can really be trusted as a gold standard.
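For what it's worth, the Mersenne Twister has been R's default 
generator for some time, which you can verify from the prompt:

    RNGkind()       # reports c("Mersenne-Twister", "Inversion") by default
    set.seed(123)   # reproducible stream from the default generator
    runif(3)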

Andy [Liaw]

---------------------------

In follow up to Frank's reply, allow me to point you to some 
additional papers and articles on numerical accuracy issues. I 
have not reviewed these in some time and they may be a bit dated 
relative to current versions. These do not cover R specifically, 
but do address S-Plus and SAS. This is not an exhaustive list by 
any means, but many of the papers do have other references that 
may be of value.


1. http://www.stat.uni-muenchen.de/~knuesel/elv/accuracy.html

2. http://www.amstat.org/publications/tas/mccull-1.pdf

3. http://www.amstat.org/publications/tas/mccull.pdf

4. http://www.npl.co.uk/ssfm/download/documents/cmsc06_00.pdf


Another option: NIST has reference datasets available for 
comparison at:

http://www.itl.nist.gov/div898/strd/

These would allow you to conduct your own comparisons if you desire.
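
A minimal sketch of such a comparison in R, assuming you have saved 
a local copy of one of the StRD linear-regression datasets (e.g. 
Norris) with columns y and x -- the file name here is hypothetical:

    norris <- read.table("norris.dat", header = TRUE)  # hypothetical file
    options(digits = 15)              # show enough digits to compare
    coef(lm(y ~ x, data = norris))    # check against the certified values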

HTH,

Marc Schwartz
(Also a former SAS user)

---------------------------

I can't say for the optimisation routines, but I have found this...

When I was doing my MSc thesis, using tree-based models and neural 
networks for classifications, I discovered something interesting.

Using SAS Enterprise Miner (SAS EM), its Tree node is far more 
efficient than the rpart package.  Using the same (or at least very 
similar) parameter settings, SAS EM can produce a tree in about 1 
minute, while it would take rpart 5-6 minutes (same data, same 
machine).  Having said that, I still prefer rpart, as it can draw a 
beautiful tree, whereas it is very difficult to fit the graphical 
tree produced by SAS EM onto one A4 page -- in the end I had to use 
the text tree.

However, the Neural Network node in SAS EM is less efficient than 
nnet: fitting a neural network in R with nnet takes much less time.
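
A minimal sketch of this sort of timing comparison on the R side, 
assuming a data frame d with a factor response y (both hypothetical):

    library(rpart)
    library(nnet)
    system.time(fit.tree <- rpart(y ~ ., data = d, method = "class"))
    system.time(fit.net  <- nnet(y ~ ., data = d, size = 5, maxit = 200))
    plot(fit.tree); text(fit.tree)   # the tree plot mentioned above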

Cheers,

  Kevin [Wang]

---------------------------

I suspect it will be difficult to find the answer to your colleagues' 
assertions without doing your own studies.  How important is it to you 
to settle this disagreement?

One could always name the many leading statisticians who contribute to 
R, but I don't think that name-dropping settles anything.

Nonetheless, even if SAS were faster, that would be only part of the 
issue.  As you know, R offers vastly better exploratory graphics, better 
graphics overall, far more flexible programming, user extensibility, and 
more natural programming access to the results of previous 
computations.  So even if your colleagues were right in their 
assertions, they would be overlooking many capabilities of the S 
language that are not readily available in SAS.

IMO, SAS shines in its ability to read files in almost any format, to 
handle gigantic data sets without burping, and to produce formatted 
cross-tabulations and other highly structured text reports.  However, if 
your colleagues work at all in data exploration, they are ignoring 
important tools by not exploring R or S-Plus.

 
Michael Prager, Ph.D.
NOAA Center for Coastal Fisheries and Habitat Research Beaufort, 
North Carolina  28516 http://shrimp.ccfhrb.noaa.gov/~mprager/
DISCLAIMER: Opinions expressed are personal, not official. No 
government endorsement of any commercial product is made or implied.

---------------------------

Although they are out of date, there are some comparisons of accuracy in

 McCullough, B. D. (1998), "Assessing the reliability of statistical
 software: Part I", The American Statistician, 52, 358-366.

 McCullough, B. D. (1999), "Assessing the reliability of statistical
 software: Part II", The American Statistician, 53, 149-159.

Regarding PROC NLMIXED versus nlme, there are many differences 
between them.  I don't think PROC NLMIXED handles nested random 
effects, while nlme does.  On the other hand, nlme assumes the 
underlying noise is Gaussian, while PROC NLMIXED allows Gaussian, 
binomial, or Poisson responses.  PROC NLMIXED uses adaptive Gaussian 
quadrature to evaluate the marginal log-likelihood, whereas nlme uses 
a less accurate evaluation but better parameterizations of the 
variance of the random effects.  I think it would be difficult to 
declare one superior to the other.
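
To illustrate the nesting point, here is the classic linear-case 
example with lme() -- the Pixel data ship with the nlme package, 
with Side nested within Dog:

    library(nlme)
    ## random effects for Dog, and for Side nested within Dog
    fm <- lme(pixel ~ day + I(day^2), data = Pixel,
              random = list(Dog = ~ day, Side = ~ 1))
    summary(fm)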

   [Douglas Bates]

---------------------------

While I don't subscribe to the general theory, they have a point about 
PROC NLMIXED.  It does more accurate calculations for generalised linear 
mixed models than are currently available in R/S-PLUS, and for logistic 
random effects models the difference can sometimes be large enough to 
matter.
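
One common route in R at present is penalized quasi-likelihood, e.g. 
glmmPQL() from MASS -- sketched here with the bacteria data that ship 
with MASS; the PQL approximation is where the accuracy gap versus 
NLMIXED's adaptive quadrature can arise:

    library(MASS)
    ## logistic random-intercept model fitted by penalized quasi-likelihood
    fm <- glmmPQL(y ~ trt + week, random = ~ 1 | ID,
                  family = binomial, data = bacteria)
    summary(fm)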


	-Thomas [Lumley]



