[R] Reasons to Use R

Thu Apr 12 01:04:52 CEST 2007

On 4/11/07, Robert Duval <rduval at gmail.com> wrote:
> So I guess my question is...
>
> Is there any hope of R being modified at its core in order to handle
> large datasets more graciously? (You've mentioned SAS and SPSS, I'd
> add Stata to the list).
>
> Or should we (the users of large datasets) expect to keep on working
> with the present tools for the time to come?

We're certainly aware of the desire of many users to be able to handle
large data sets.  I have just spent a couple of days working with a
student from another department who wanted to work with a very large
data set that was poorly structured.  Most of my time was spent trying
to convince her about the limitations in the structure of her data and
what could realistically be expected to be computed with it.

If your purpose is to perform data manipulation and extraction on
large data sets then I think it is not unreasonable to expect that
you learn to use SQL. I find it convenient to use R to do data
manipulation because I know the language and the support tools well
but I don't expect to do data cleaning on millions of records with it.
I am probably too conservative in what I will ask R to handle for me
because I started using S on a VAX-11/750 that had 2 megabytes of
memory and it's hard to break old habits.

I think the trend in working with large data sets in R will be toward
a hybrid approach of using a database for data storage and retrieval
plus R for the model definition and computation.  Miguel Manese's
SQLiteDF package and some of the work in Bioconductor are steps in
this direction.
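
A minimal sketch of that hybrid idea, under stated assumptions: it uses
Python's standard sqlite3 module purely as a self-contained stand-in (it
is not SQLiteDF's interface), and the table name "obs" and the function
name fit_line_from_db are hypothetical. The database does the storage
and the pass over the rows; the analysis side only sees five aggregates
and computes a least-squares line from them.

```python
import sqlite3


def fit_line_from_db(conn, table="obs"):
    """Fit y = intercept + slope * x from aggregates computed by the database.

    `table` is a hypothetical, trusted internal name (not user input).
    """
    # The database scans the rows; only five numbers come back.
    n, sx, sy, sxy, sxx = conn.execute(
        f"SELECT COUNT(*), SUM(x), SUM(y), SUM(x * y), SUM(x * x) FROM {table}"
    ).fetchone()
    # Closed-form simple linear regression from the sufficient statistics.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Illustrative data only: 1000 points lying exactly on y = 1 + 2x.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    ((float(i), 1.0 + 2.0 * i) for i in range(1000)),
)
intercept, slope = fit_line_from_db(conn)
print(intercept, slope)
```

The same division of labour applies when the fit cannot be reduced to a
handful of sums: the database would then hand over a filtered or sampled
subset rather than aggregates, and the modelling code would work on that.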

However, as was mentioned earlier in this thread, there is an
underlying assumption with R that the user is thinking about the
analysis as he/she is doing it. We sometimes see questions like "I
have a data set with (some large number of) records on several hundred
or several thousand variables and I want to fit a generalized linear
model to it."

I would be hard pressed to think of a situation where I would want
hundreds of variables in a statistical model unless they were generated
from one or more factors that have many levels.  And, in that case, I
would want to use random effects rather than fixed effects in the model.
So just saying that the big challenge is to fit some kind of model
with lots of coefficients to a very large number of observations may
be missing the point.  Defining the model better may be the point.
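
To make the contrast concrete, consider a single factor with k levels
and observations j within level i. The notation below is an illustrative
assumption, not something taken from the thread. With fixed effects each
level gets its own coefficient; with random effects the level effects
are treated as draws from one distribution, so the factor contributes a
single variance parameter however many levels it has.

```latex
% Fixed effects: k - 1 free coefficients for the factor (sum-to-zero constraint)
y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad
\sum_{i=1}^{k} \alpha_i = 0, \qquad
\varepsilon_{ij} \sim N(0, \sigma^2)

% Random effects: one variance parameter \sigma_b^2 for the factor
y_{ij} = \mu + b_i + \varepsilon_{ij}, \qquad
b_i \sim N(0, \sigma_b^2), \qquad
\varepsilon_{ij} \sim N(0, \sigma^2)
```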

Let me conclude by saying that these are general observations and not
directed to you personally, Robert.  I don't know what you want R to
do graciously to large data sets so my response is more to the general
point that there should always be a balance between thinking about the
structure of the data and the model and brute force computation.  One
can do data analysis by using the computer as a blunt instrument with
which to bludgeon the problem to death but one can't do elegant data
analysis like that.

>
> robert
>
> On 4/11/07, Marc Schwartz <marc_schwartz at comcast.net> wrote:
> > On Wed, 2007-04-11 at 11:26 -0500, Marc Schwartz wrote:
> > > On Wed, 2007-04-11 at 17:56 +0200, Bi-Info
> > > (http://members.home.nl/bi-info) wrote:
> > > > I certainly have that idea too. SPSS works in much the same way,
> > > > although it specialises in PC applications. Adding memory to a PC is
> > > > not very expensive these days. On my first AT some extra memory
> > > > cost 300 dollars or more. These days you get extra memory with a package
> > > > of marshmallows or chocolate bars if you need it.
> > > > All computations on a computer are discrete steps in a way, but I've
> > > > heard that SAS computations are split up into strictly divided steps. That
> > > > also makes procedures "attachable", I've been told, and interchangeable.
> > > > Different procedures can use the same code, which in turn is
> > > > cheaper in memory usage or disk usage (the old days...). That makes SAS,
> > > > by the way, a complicated machine to build, because procedures that are
> > > > split up into numerous fragments make for complicated bookkeeping. If
> > > > you do it that way, I've been told, you can do a lot of computations
> > > > with very little memory. One guy actually computed quite complicated
> > > > models with "only 32MB or less", which wasn't very much for "his type of
> > > > calculations". Which means that SAS is efficient in memory handling, I
> > > > think. It's not very efficient in dollar handling... I estimate.
> > > >
> > > > Wilfred
> > >
> > > <snip>
> > >
> > > Oh....SAS is quite efficient in dollar handling, at least when it comes
> > > to the annual commercial licenses...along the same lines as the
> > > purported efficiency of the U.S. income tax system:
> > >
> > >   "How much money do you have?  Send it in..."
> > >
> > > There is a reason why SAS is the largest privately held software company
> > > in the world and it is not due to the academic licensing structure,
> > > which constitutes only about 12% of their revenue, based upon their
> > > public figures.
> >
> > Hmmm......here is a classic example of the problems of reading pie
> > charts.
> >
> > The figure I quoted above, which comes from reading the 2005 SAS Annual
> > Report on their web site (such as it is for a private company), is taken
> > from a 3D exploded pie chart (ick...).
> >
> > The pie chart uses 3 shades of grey and 5 shades of blue to
> > differentiate 8 market segments and their percentages of total worldwide
> > revenue.
> >
> > I mis-read the 'shade of grey' allocated to Education as being 12%
> > (actually 11.7%).
> >
> > A re-read of the chart, zooming in close on the pie in a PDF reader,
> > appears to actually show that Education is but 1.8% of their annual
> > worldwide revenue.
> >
> > Government-based installations, presumably the other notable
> > market segment in which substantially discounted licenses are provided,
> > account for 14.6%.
> >
> > The report is available here for anyone else curious:
> >
> >   http://www.sas.com/corporate/report05/annualreport05.pdf
> >
> > Somebody needs to send SAS a copy of Tufte or Cleveland.
> >
> > I have to go and rest my eyes now...  ;-)
> >
> > Regards,
> >
> > Marc
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >