[R] data manipulation function descriptions
Luke Tierney
luke at stat.uiowa.edu
Fri Feb 14 15:39:02 CET 2003
On Fri, 14 Feb 2003 ripley at stats.ox.ac.uk wrote:
> On Thu, 13 Feb 2003, kjetil brinchmann halvorsen wrote:
>
> > On 13 Feb 2003 at 17:09, Jason Bond wrote:
>
> > > case switch
> > [R-core : switch should be better announced. It is for instance not
> > mentioned in "An introduction to R"]
>
> Well, that is an *introduction*, not a programmer's guide. You will find
> switch() is rarely used in R: it is a bit peculiar in its semantics, and
> something definitely not to be considered introductory.
>
> On the original question, I think it would be a mistake to translate what
> you know. R is a vector language, not a pairlist language, and I see
> quite a bit of evidence of convoluted solutions in its internals dating
> from when R was the second. Chapter 2 of Venables & Ripley (2002) (as in
> the R FAQ) is devoted to using S/R for data manipulation.
As someone reasonably familiar with both languages I have to disagree
with several points here. First and foremost, despite differences in
surface syntax, as languages xlispstat and R are much more alike than
they are different. xlispstat is much closer to R than S-plus because
both xlispstat and R use lexical scope, a feature of R that is still
not used as much as it could be. The main language differences are
the limited form of lazy evaluation used in R, which you can usually
ignore, and the fact that R does not provide mutable data structures,
which is also rarely an issue. There are other differences, but these
are the main ones that affect coding practices, I think.
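
To make the two language features named above concrete, here is a minimal sketch in R. The names make_counter and ignore_second are hypothetical and chosen only for illustration; they do not come from the original post.

```r
# Lexical scope: the inner function keeps access to the environment
# in which it was created, so 'count' persists between calls.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # assigns in the enclosing environment
    count
  }
}

counter <- make_counter()
first  <- counter()   # 1
second <- counter()   # 2

# Lazy evaluation: an argument is only evaluated when it is used.
# 'y' is never touched, so the stop() call is never run.
ignore_second <- function(x, y) x * 2
result <- ignore_second(21, stop("never evaluated"))   # 42
```

The counter also shows that a closure's environment can serve as a small piece of mutable state when one is needed.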
The basic xlispstat data handling functions mentioned in the original
post are quite similar to corresponding basic functions in R. This is
not by accident: the choice of functions included in xlispstat was
heavily influenced by what was then called the "New S" language. As a
result, if you want to create an R version of an xlispstat function
you can often do far worse than start with a fairly direct
transliteration. In my view at least, good coding practices in
xlispstat are good coding practices for any high level mostly
functional language and carry over quite well to R.
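
As a rough illustration of such a transliteration, the sketch below pairs a few basic xlispstat expressions (shown in comments, written from memory and so best treated as assumptions) with R counterparts. The data in x are made up. One detail to watch is that xlispstat indexing is zero-based while R indexing is one-based.

```r
x <- c(2.5, 7.1, 3.3, 9.8)

# (iseq 1 4)                  integers from 1 to 4
idx <- 1:4

# (repeat 0 3)                three copies of 0
zeros <- rep(0, 3)

# (select x 0)                first element (zero-based in xlispstat)
first_elem <- x[1]

# (select x (which (> x 3)))  elements larger than 3
large <- x[which(x > 3)]      # 7.1 3.3 9.8

# (mean x)
avg <- mean(x)
```

In R the which() call can usually be dropped, since x[x > 3] gives the same elements when there are no missing values.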
I am sorry if the following seems a bit harsh, but I, and many others
who have worked with lisp, find it extremely frustrating to read
statements about lisp like the one above that suggest that lisp is a
pairlist language only, especially when these statements come from
people I thought knew better. Lisp dates back to the 1950's. The
only other language of any consequence still in use from that era is
FORTRAN. No one would now claim that a major flaw in FORTRAN is the
lack of an if-then-else construct. That was true in the early days
but has not been for several decades. But for some reason many people
seem quite happy to make very authoritative statements about lisp
that, if they were ever true at all, have not been so for a very long
time indeed. Pairlists are a very useful data structure for
expressing many algorithms in a functional style. That is why they
were one of the first data structures in Lisp, and that is why they
are available in virtually all other high level functional languages
(ML, Haskell, Miranda, Clean, ...). Pairlists are NOT the only data
structure in Lisp. For many years Lisp has also supported vectors and
arrays, both generic and typed (and other data structures). Vectors
and pairlists are collectively referred to as sequences, and, if I
remember correctly, all the functions listed in the original post
except mapcar are designed to work on all kinds of sequences (the
sequence version of mapcar is map). Code written in xlispstat in
terms of sequence functions can often be translated quite easily to R,
and the resulting code will be quite consistent with good R coding
practices.
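
A small sketch of how sequence-style code might carry over: the Lisp forms in the comments are Common Lisp style sequence calls, assumed here to be available in xlispstat, and the R lines use sapply, Filter and Reduce from base R. The data are invented for the example.

```r
x <- c(3, 1, 4, 1, 5)

# (map 'vector #'(lambda (v) (* v v)) x)
squares <- sapply(x, function(v) v * v)      # 9 1 16 1 25

# (remove-if-not #'(lambda (v) (> v 2)) x)
big <- Filter(function(v) v > 2, x)          # 3 4 5

# (reduce #'+ x)
total <- Reduce(`+`, x)                      # 14

# The same calls also accept a generic vector (an R list),
# much as Lisp sequence functions accept both lists and vectors.
big_from_list <- Filter(function(v) v > 2, as.list(x))
```

Filter returns the same kind of object it was given, so big is a numeric vector while big_from_list is a list.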
R does not provide a pairlist data structure. This creates a dilemma
when translating some list-based xlispstat code, or, more importantly,
when implementing an algorithm for which pairlists are the natural
data structure to use. There are two choices: use a vector based
algorithm that may be a bit less natural but fits better with the
basic R data structures, or build your own pairlist abstraction for
this particular problem and write the algorithm the more natural way.
I have used both approaches on different occasions. I usually prefer
to write an algorithm in the most natural way for the algorithm, since
that usually maximizes the probability that my code is actually
correct. If this approach requires some additional abstract data
types, be they pairlists or anything else, then I develop and test
them separately and write the main code in terms of these
abstractions. Occasionally, but not all that often, this results in
code that is slower than I like; then I may profile and optimize the
critical bits by using more efficient data structures if that turns
out to be the issue.
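
The second choice, building a small pairlist abstraction and writing the algorithm on top of it, could look something like the sketch below. All of the names (cons, car, cdr, is_empty, from_vector, reverse_list, to_vector) are hypothetical helpers defined here, not existing R functions, and the representation (a two-element R list per cell, with NULL as the empty list) is only one of several reasonable options.

```r
# A cons cell is a two-element list; NULL stands for the empty list.
cons     <- function(head, tail) list(head = head, tail = tail)
car      <- function(cell) cell$head
cdr      <- function(cell) cell$tail
is_empty <- function(cell) is.null(cell)

# Build a linked list from a vector, consing from the back.
from_vector <- function(v) {
  out <- NULL
  for (i in rev(seq_along(v))) out <- cons(v[[i]], out)
  out
}

# An algorithm written in the "natural" list style:
# reverse by moving heads onto an accumulator.
reverse_list <- function(cell, acc = NULL) {
  if (is_empty(cell)) acc
  else reverse_list(cdr(cell), cons(car(cell), acc))
}

# Convert back to an ordinary vector for inspection.
to_vector <- function(cell) {
  out <- c()
  while (!is_empty(cell)) {
    out  <- c(out, car(cell))
    cell <- cdr(cell)
  }
  out
}

lst      <- from_vector(c(1, 2, 3))
reversed <- to_vector(reverse_list(lst))   # 3 2 1
```

Because the main code only goes through cons, car, cdr and is_empty, the representation can be swapped later if profiling shows it to be a bottleneck. R limits recursion depth, so for long lists a loop version of reverse_list would be safer.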
I really don't think it is reasonable to say R was ever
pairlist-based. At one point in time generic vectors (the things
returned by list(...)) were pairlists rather than the vectors
they are now, but numeric data have always been true vectors.
Pairlists were and still are used internally for many things. In some
cases other data structures would have advantages, and I suspect we
will slowly move in that direction. But for some things the pairlist
really is a good fit. I also have seen some convoluted code in the R
internals--I'm sure I have written some of it. I can't speak for
others, but in my case I am reluctant to blame pairlists or any other
aspect of the R internals for convolutions in my code. [One of the
things anyone releasing open source code has to come to terms with is
that you have to release all your code: the bits of which you may
feel justifiably proud as well as ...]
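
The distinction between generic vectors and internally used pairlists can be inspected from the R prompt. The sketch below reflects the behavior of current versions of R, which is an assumption as far as the 2003 release is concerned; the function f is a made-up example.

```r
# What list(...) returns is a generic vector, not a pairlist.
generic <- list(1, "a", TRUE)
type_of_list   <- typeof(generic)       # "list"
list_is_vector <- is.vector(generic)    # TRUE

# Numeric data are stored as true vectors.
type_of_numeric <- typeof(c(1, 2, 3))   # "double"

# Pairlists still appear internally, for example as the
# formal argument list of a function.
f <- function(x, y = 2) x + y
type_of_formals <- typeof(formals(f))   # "pairlist"
```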
luke
--
Luke Tierney
University of Iowa Phone: 319-335-3386
Department of Statistics and Fax: 319-335-3017
Actuarial Science
241 Schaeffer Hall email: luke at stat.uiowa.edu
Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu