[R] Help with R

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Thu May 5 15:17:00 CEST 2005


On 05-May-05 Peter Dalgaard wrote:
> [...]
> Both systems are victims of the curse of the rectangular data set to
> some extent. Prototypically, you record the sex of a rat along with
> every single measurement on it, as if the rat could change sex at
> millisecond resolution. This probably applies to all current
> statistical systems, but there is some hope that R's more flexible
> data structures can be leveraged to better handle multilevel data.
> (Cue Probabilistic Relational Models a.m. Getoor et al., which Peter
> Green brought up at the recent gR meeting.)

I would agree with this hope. Indeed I was reminded of the issue
by Alessandro Carletti's recent query about extracting features
from the data at different marine sampling stations.

My involvement goes back to the days (around 1980) when, with
Jan Boëtius, I was examining Johannes Schmidt's data on eel larvae
obtained during his Atlantic cruises to investigate the "spawning
question" of the European eel (funded by the Carlsberg Foundation,
Peter!).

Each Cruise consisted of a series of Stations by a given Ship
at different Geographic positions, at each of which a number of Hauls
would be made in different Years and different Months on different
Days at different Times of day, using different Equipments and at
different Depths or ranges of Depth, and of different Durations,
and at different Speeds, resulting in capture of none or several
specimens each of which would be examined for length, numbers of
myomeres (muscle segments), and other features, along with hydrographic
measurements.

This could have been embodied in a huge "rectangular table" with of
course much repetition of all the information that remains constant
for each specimen in a haul. The specimen-specific data consisted of
only 2-4 items, while the "constant" data consisted of 12-15
items. There were nearly 20,000 larvae, so the "rectangular table"
could have occupied well over a Megabyte.

The alternative is a "list" representation, like:

Investigation = list(Cruises)
Cruise = list(Ship,list(Stations))
Station = list(Position,list(Hauls))
Haul = list(Year,Month,Day,Time,Duration,(Equipment data),(Depths),
            Speed,list(Specimens))
Specimen = list(Length,Myomeres,...)
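
A minimal sketch of how this schema might be written as nested R lists.
The constructor names (specimen, haul, station), the reduced set of
fields, and every value are invented for illustration; none of it comes
from the Schmidt data.

```r
# Hypothetical illustration: all names and values below are invented.
specimen <- function(length_mm, myomeres) {
  list(Length = length_mm, Myomeres = myomeres)
}

haul <- function(year, month, day, depth_m, specimens) {
  list(Year = year, Month = month, Day = day, Depth = depth_m,
       Specimens = specimens)
}

station <- function(lat, lon, hauls) {
  list(Position = c(lat = lat, lon = lon), Hauls = hauls)
}

cruise <- list(
  Ship = "Example Ship",
  Stations = list(
    station(30, -60, list(
      haul(1921, 6, 1, 50, list(specimen(10, 115), specimen(12, 114))),
      haul(1921, 6, 2, 100, list())          # a haul with no catch
    )),
    station(35, -50, list(
      haul(1921, 7, 10, 50, list(specimen(20, 116)))
    ))
  )
)

# Haul-level facts are stored once, however many specimens the haul holds.
first_haul   <- cruise$Stations[[1]]$Hauls[[1]]
n_specimens  <- length(first_haul$Specimens)
first_length <- first_haul$Specimens[[1]]$Length
```

An empty haul is simply a Haul whose Specimens list has length zero, so
nothing has to be padded with missing values.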

In the end, the "list-like" view was the one adopted (I was limited
to CP/M BASIC in some 48K of free RAM, with 256KB floppies, in those
days), though it was not fully formally programmed: some of the "list
parsing" was done by hand, i.e. by replacing one floppy with another.
The BASIC program did, however, retain the previously read data
for a given Station when reading in new Haul data, and the Haul
data when reading in Specimen data.

Later, when I began to study C, I realised that the language
was well adapted to implementing such structures in a program,
though by then following this up would have been motivated by
curiosity rather than needing to get the job done (it already
was done).

Now, in R, I see that in principle such data representations
are well integrated into the language, and I've been yet again
tempted to look at the question!

However, while representing the raw data in such a form is
well supported by R, it seems to me that extracting data
in a way adapted to different analyses requires users to
create their own methods, using the list-access primitives.

For example, to study the changes in the distribution of
lengths of specimens in relation to Position and Date
(which was one of the important issues in that investigation),
I don't think there are any "list processing" functions
available in R which, given the list-based structure described
above, would allow a simple query of the form

  means( Length , ~ Position:Date , data=Cruise )

It's quite feasible to write one's own; but I think Peter's
hope (expressed in the excerpt above) looks like a first call
for thinking about general methods for this sort of thing.
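
As a rough sketch of what "writing one's own" could look like, the code
below walks a toy nested list and computes mean Length for each
Position:Date combination. The helper names flatten_cruise and
list_means are hypothetical (they are not existing R functions), the
Date field stands in for Year/Month/Day, and the data are invented.

```r
# Hypothetical toy data; values are invented for illustration only.
cruise <- list(
  Ship = "Example Ship",
  Stations = list(
    list(Position = "30N 60W", Hauls = list(
      list(Date = "1921-06-01",
           Specimens = list(list(Length = 10), list(Length = 12))),
      list(Date = "1921-06-02", Specimens = list())   # no catch
    )),
    list(Position = "35N 50W", Hauls = list(
      list(Date = "1921-07-10", Specimens = list(list(Length = 20)))
    ))
  )
)

# Walk Stations -> Hauls -> Specimens, producing one row per specimen.
flatten_cruise <- function(cruise) {
  rows <- list()
  for (st in cruise$Stations) {
    for (h in st$Hauls) {
      for (sp in h$Specimens) {
        rows[[length(rows) + 1]] <- data.frame(
          Position = st$Position, Date = h$Date, Length = sp$Length,
          stringsAsFactors = FALSE)
      }
    }
  }
  do.call(rbind, rows)
}

# Mean Length for each Position:Date combination present in the data.
list_means <- function(cruise) {
  flat <- flatten_cruise(cruise)
  aggregate(Length ~ Position + Date, data = flat, FUN = mean)
}

result <- list_means(cruise)
```

Note that this sketch rebuilds a small rectangular table on the fly,
holding only the columns the query needs; it is specific to one list
layout and is not a general method of the kind hoped for above.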

Best wishes to all,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 05-May-05                                       Time: 13:28:29
------------------------------ XFMail ------------------------------



