[R] Newbie questions

Tue Oct 31 07:58:44 CET 2000

On Mon, 30 Oct 2000, Zsombor Cseres-Gergely wrote:

To chip in a few points not already answered:

> I am new to R, but a fairly `old' user of Stata. I read posts asking about
> survey methods and large datasets in the archive, so I will not ask those
> questions again. But some still remain:

> - If not, is it a design goal of the developers to do speed/memory
>   optimization (apart from dynamic memory allocation, which, as I understand
>   orthogonal to this problem)

I think dynamic memory allocation is very pertinent. At present you need to
allocate to R at startup the maximum memory it will need, and if that is
large it can hit performance badly on some systems.  Under the memory
system being tested for 1.2.0 you only get large memory usage if you need
it, and (hopefully, once the tuning is finished) not when you don't.

> - Since sometimes I need to use modestly really large datasets (60000*300
>   matrix), I wonder if I can do that in R at all? More adequately: is R
>   scalable without limits by brute force (adding more CPU/RAM)?

`Really large' is relative.  That's a 144Mb dataset and it should run
happily in 512Mb or so (at least on Linux).  We are starting to get
datasets 10x that.  As I understand it Stata is on Windows, and there
seem to be some problems with scaling on Windows (which was not designed
with very large processes in mind).
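
As a rough check of the 144Mb figure, here is a minimal sketch of the
arithmetic. It assumes every cell is stored as an 8-byte double and uses
decimal megabytes; the variable names are only illustrative.

```r
# Back-of-envelope size of a 60000 x 300 numeric matrix.
# Assumption: each cell is an 8-byte double.
n_rows <- 60000
n_cols <- 300
bytes_per_double <- 8

size_bytes <- n_rows * n_cols * bytes_per_double
size_mb <- size_bytes / 1e6   # decimal megabytes

cat(sprintf("%.0f x %.0f doubles: about %.0f Mb\n", n_rows, n_cols, size_mb))
```

The gap between that figure and the suggested 512Mb is presumably room
for the temporary copies made while fitting models, though that reading
is an assumption rather than something stated above.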

> - I noted, that R can use SQL datasources. Since it is really the case that
>   one have to use both huge amount of records _and_ variables, an SQL+R
>   combination might be one for me. Is it right? How fast would this be?

That's certainly what we are looking at, as well as auxiliary awk scripts
(I would have used Perl, but the student knows awk) to extract things
from the dataset before reading into R.
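
The same idea (keep only the variables you need before the data occupy
memory) can also be sketched within R itself. This is not the awk or SQL
route described above, just an illustration using the colClasses argument
of read.table; the file and its column names are made up for the example.

```r
# Write a small stand-in file (hypothetical data) so the sketch is runnable.
path <- tempfile(fileext = ".txt")
full <- data.frame(id     = 1:5,
                   income = c(10.5, 20, 31.2, 8, 15),
                   region = c("a", "b", "a", "c", "b"),
                   age    = c(30, 41, 25, 58, 37))
write.table(full, path, row.names = FALSE, quote = FALSE)

# A colClasses entry of "NULL" tells read.table to skip that column,
# so only id and income are kept in the resulting data frame.
wanted <- read.table(path, header = TRUE,
                     colClasses = c("integer", "numeric", "NULL", "NULL"))
unlink(path)

print(names(wanted))
```

For a file with 300 variables one would build the colClasses vector
programmatically rather than type it out.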

> - Browsing the package lists, I have not seen a library for hypothesis
>   testing. Everybody builds it from primitives or serious people do not do
>   this at all?

There is a package, ctest, that ships with R, but there is also quite a
lot in your last point: we do find we hardly ever use it.  With large problems,
the multiple-testing problems get to be quite serious.  In a recent
paper, we are adjusting for 50,000 simultaneous tests.
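
To give a feel for the scale of the problem, here is a minimal sketch
using a Bonferroni adjustment via p.adjust. Bonferroni is chosen only
because it is the simplest correction; it is not necessarily the
adjustment used in the paper mentioned above, and the p-values are
invented for illustration.

```r
m <- 50000      # number of simultaneous tests, as in the text
alpha <- 0.05   # assumed family-wise error level

# Testing each hypothesis at alpha with no adjustment would be expected
# to give m * alpha false positives if every null hypothesis were true.
expected_false_pos <- m * alpha

# Bonferroni: each individual test must reach alpha / m.
per_test_threshold <- alpha / m

# Hypothetical raw p-values; n = m tells p.adjust the full number of tests.
p_raw <- c(1e-8, 1e-5, 0.01)
p_bonf <- p.adjust(p_raw, method = "bonferroni", n = m)

print(expected_false_pos)
print(per_test_threshold)
print(p_bonf)
```

Only the first of the three illustrative p-values stays below alpha
after adjustment, even though all three would pass an unadjusted test.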

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._