[R] Help with R

Gabor Grothendieck ggrothendieck at gmail.com
Thu May 5 16:36:27 CEST 2005

On 5/5/05, Ted Harding <Ted.Harding at nessie.mcc.ac.uk> wrote:
> On 05-May-05 Peter Dalgaard wrote:
> > [...]
> > Both systems are victims of the curse of the rectangular data set to
> > some extent. Prototypically, you record the sex of a rat along with
> > every single measurement on it, as if the rat could change sex at
> > millisecond resolution. This probably applies to all current
> > statistical systems, but there is some hope that R's more flexible
> > data structures can be leveraged to better handle multilevel data.
> > (Cue Probabilistic Relational Models à la Getoor et al., which Peter
> > Green brought up at the recent gR meeting.)
> I would agree with this hope. Indeed I was reminded of the issue
> by Alessandro Carletti's recent query about extracting features
> from the data at different marine sampling stations.
> My involvement goes back to the days (around 1980) when, with
> Jan Boëtius, I was examining Johannes Schmidt's data on eel larvae
> obtained during his Atlantic cruises to investigate the "spawning
> question" of the European eel (funded by the Carlsberg Foundation,
> Peter!).
> Each Cruise consisted of a series of Stations by a given Ship
> at different Geographic positions, at each of which a number of Hauls
> would be made in different Years and different Months on different
> Days at different Times of day, using different Equipments and at
> different Depths or ranges of Depth, and of different Durations,
> and at different Speeds, resulting in capture of none or several
> specimens each of which would be examined for length, numbers of
> myomeres (muscle segments), and other features, along with hydrographic
> measurements.
> This could have been embodied in a huge "rectangular table" with of
> course much repetition of all the information that remains constant
> for each specimen in a haul. The specimen-specific data consisted of
> only 2-4 items, while the "constant" data consisted of 12-15
> items. There were nearly 20,000 larvae, so the "rectangular table"
> could have occupied well over a Megabyte.
> The alternative is a "list" representation, like:
> Investigation = list(Cruises)
> Cruise = list(Ship,list(Stations))
> Station = list(Position,list(Hauls))
> Haul = list(Year,Month,Day,Time,Duration,(Equipment data),(Depths),
>            Speed,list(Specimens))
> Specimen=list(Length,Myomeres,...)
> In the end, the "list-like" view was the one adopted (I was limited
> to CP/M BASIC in some 48K of free RAM, with 256KB floppies, in those
> days), though it was not fully formally programmed: some of the "list
> parsing" was done by hand, i.e. by replacing one floppy with another.
> The BASIC program did, however, retain the previously read data
> for a given Station when reading in new Haul data, and the Haul
> data when reading in Specimen data.
> Later, when I began to study C, I realised that the language
> was well adapted to implementing such structures in a program,
> though by then following this up would have been motivated by
> curiosity rather than needing to get the job done (it already
> was done).
> Now, in R, I see that in principle such data representations
> are well integrated into the language, and I've been yet again
> tempted to look at the question!
> However, while representing the raw data in such a form is
> well supported by R, it seems to me that extracting data
> in a way adapted to different analyses requires users to
> create their own methods, using the list-access primitives.
> For example, to study the changes in the distribution of
> lengths of specimens in relation to Position and Date
> (which was one of the important issues in that investigation),
> I don't think there are any "list processing" functions
> available in R which, given the list-based structure described
> above, would allow a simple query of the form
>  means( Length , ~ Position:Date , data=Cruise )
> It's quite feasible to write one's own; but I think Peter's
> hope (expressed in excerpt above) looks like a first call
> for thinking about general methods for this sort of thing.
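
Pending a general method, one way to get at a query like the means()
call above is to flatten the nested list to one row per specimen only
at analysis time, and then use the ordinary formula tools. What follows
is only a sketch under assumptions: the toy cruise, its field names
(Stations, Hauls, Specimens, Date) and the helper flatten_cruise() are
all invented for illustration and are not part of the original data or
of any package.

```r
# Hypothetical toy data: one cruise, two stations, three hauls.
# Every name and value here is made up for illustration.
cruise <- list(
  Ship = "ShipA",
  Stations = list(
    list(Position = "P1",
         Hauls = list(
           list(Date = "D1",
                Specimens = list(list(Length = 10), list(Length = 14))),
           list(Date = "D2",
                Specimens = list(list(Length = 20))))),
    list(Position = "P2",
         Hauls = list(
           list(Date = "D1",
                Specimens = list(list(Length = 30), list(Length = 34)))))
  )
)

# Walk Station -> Haul -> Specimen and emit one row per specimen.
# Station- and haul-level fields are copied down only here, at query
# time, so the stored structure stays free of repetition.
flatten_cruise <- function(cruise) {
  rows <- list()
  for (st in cruise$Stations) {
    for (h in st$Hauls) {
      for (sp in h$Specimens) {
        rows[[length(rows) + 1]] <- data.frame(
          Position = st$Position,
          Date     = h$Date,
          Length   = sp$Length,
          stringsAsFactors = FALSE)
      }
    }
  }
  do.call(rbind, rows)
}

flat  <- flatten_cruise(cruise)

# Mean Length within each Position-by-Date cell.
means <- aggregate(Length ~ Position + Date, data = flat, FUN = mean)
```

The rectangular table still appears, but only as a temporary view built
for one analysis; a haul with no specimens simply contributes no rows.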

The Green Book defines a recursive apply function, rapply,
that provides a general means of traversing that
sort of structure.
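
As a rough illustration of the idea, here is a sketch using the
rapply() found in current versions of base R. I am assuming its
arguments (classes, how) for this example; the Green Book definition
may differ in detail. The toy station below and its field names are
invented.

```r
# Hypothetical toy station; names and values are for illustration only.
station <- list(
  Position = "P1",
  Hauls = list(
    list(Equipment = "net-A",
         Specimens = list(list(Length = 10), list(Length = 14))),
    list(Equipment = "net-B",
         Specimens = list(list(Length = 20)))
  )
)

# Visit every leaf of the nested list. Only numeric leaves are passed
# to the function; with how = "unlist" the other leaves are dropped
# and the results come back as one flat (named) vector.
lens <- rapply(station, function(x) x, classes = "numeric", how = "unlist")
mean_length <- mean(lens)

# how = "replace" keeps the nested shape and transforms the numeric
# leaves in place, leaving character leaves such as Position untouched.
scaled <- rapply(station, function(x) x / 10,
                 classes = "numeric", how = "replace")
```

Note that this selects leaves by class only; picking out Length by name
when other numeric fields are present would still need a small
hand-written traversal.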
