[R] Hierarchical data sets: which software to use?

Douglas Bates bates at stat.wisc.edu
Fri Feb 5 17:22:28 CET 2010


On Sun, Jan 31, 2010 at 10:24 PM, Anton du Toit <atdutoitrhelp at gmail.com> wrote:
> Dear R-helpers,
>
> I’m writing for advice on whether I should use R or a different package or
> language. I’ve looked through the R-help archives, some manuals, and some
> other sites as well, and I haven’t done too well finding relevant info,
> hence my question here.
>
> I’m working with hierarchical data (in SPSS lingo). That is, for each case
> (person) I read in three types of (medical) record:
>
> 1. demographic data: name, age, sex, address, etc
>
> 2. ‘admissions’ data: this generally repeats, so I will have 20 or so
> variables relating to their first hospital admission, then the same 20 again
> for their second admission, and so on
>
> 3. ‘collections’ data, about 100 variables containing the results of a
> battery of standard tests. These are administered at intervals and so this
> is repeating data as well.
>
> The number of repetitions varies between cases, so in its one case per line
> format the data is non-rectangular.
>
> At present I have shoehorned all of this into SPSS, with each case on one
> line. My test database has 2,500 variables and 1,500 cases (or persons), and
> in SPSS’s *.SAV format is ~4MB. The one I finally work with will be larger
> again, though likely within one order of magnitude. Down the track, funding
> permitting, I hope to be working with tens of thousands of cases.

Although this may not be helpful for your immediate goal, storing and
manipulating data of this size and complexity (and, I expect, cost for
collection) really calls for tools like relational databases.  A
single flat file of 2500 variables by 1500 cases is almost never the
best way to organize such data.  A normalized representation as a
collection of interlinked tables in a relational database is much
more effective and less error-prone.  The widespread use of
spreadsheets or SPSS data sets or SAS data sets which encourage the
"single table with a gargantuan number of columns, most of which are
missing data in most cases" approach to organization of longitudinal
data is regrettable.
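
To make the idea concrete, here is a minimal sketch of a normalized
layout using plain R data frames as stand-ins for database tables.  The
table names, column names and values (patients, admissions, patient_id
and so on) are invented for illustration and are not taken from the
actual data.

```r
# Hypothetical normalized layout: one table per record type,
# linked by patient_id.  Demographics are stored exactly once.
patients <- data.frame(
  patient_id = c(1L, 2L, 3L),
  sex        = c("F", "M", "F"),
  birth_year = c(1950L, 1972L, 1988L),
  stringsAsFactors = FALSE
)

# One row per admission; a patient may have any number of rows,
# including none, so no columns need to be padded with NA.
admissions <- data.frame(
  patient_id = c(1L, 1L, 1L, 2L),
  admit_date = as.Date(c("2009-01-10", "2009-03-02",
                         "2009-11-20", "2009-06-15")),
  ward       = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# Every admission must refer to a known patient
# (the equivalent of a foreign-key constraint in a database).
stopifnot(all(admissions$patient_id %in% patients$patient_id))

# Admissions per patient, including patients with zero admissions.
n_adm <- table(factor(admissions$patient_id,
                      levels = patients$patient_id))
print(n_adm)
```

A third table for the 'collections' data would follow the same pattern,
with one row per patient per test occasion.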

For later analysis in R it is better to start with the "long" form of the
data, as opposed to the "wide" form, even if it means repeating
demographic information over several occasions.  Using a relational
database allows for a long view to be generated without the
possibility of inconsistency in the demographics.  I am using the
descriptions "long" and "wide" in the sense that they are used in the
reshape help page.  See

?reshape

in R.  The long view is also called the subject/occasion view in the
sense that each row corresponds to one subject on one occasion.
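
As a small illustration of the two forms, the sketch below converts a
made-up "wide" data frame to the "long" form with reshape().  The column
names (id, sex, score.1, ...) and the values are assumptions chosen for
the example.

```r
# Hypothetical wide data: one row per subject, one column per occasion.
# Subject 2 was measured only once, so the remaining columns are NA.
wide <- data.frame(
  id      = c(1L, 2L),
  sex     = c("F", "M"),
  score.1 = c(10, 20),
  score.2 = c(12, NA),
  score.3 = c(15, NA),
  stringsAsFactors = FALSE
)

# Stack the score columns into a single column, recording the occasion.
long <- reshape(wide,
                direction = "long",
                varying   = c("score.1", "score.2", "score.3"),
                v.names   = "score",
                timevar   = "occasion",
                times     = 1:3,
                idvar     = "id")

# Drop occasions that never happened and sort by subject and occasion.
long <- long[!is.na(long$score), ]
long <- long[order(long$id, long$occasion), ]
rownames(long) <- NULL
print(long)
```

Each remaining row is one subject on one occasion, and the demographic
column (sex) is repeated on every row for that subject.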

Robert Gentleman's book "R Programming for Bioinformatics" provides
background on linking R to relational databases.
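
One possible way to do this, and not necessarily the approach taken in
the book, is the DBI interface with an SQLite back end.  The sketch
below assumes the DBI and RSQLite packages are installed and does
nothing if they are not.  The table and column names are hypothetical.

```r
# Runs only if the DBI and RSQLite packages are available (an assumption).
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {

  # A temporary in-memory database, for illustration only.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Demographics are stored once per patient.
  DBI::dbWriteTable(con, "patients",
                    data.frame(patient_id = 1:2,
                               sex = c("F", "M"),
                               stringsAsFactors = FALSE))

  # One row per patient per admission.
  DBI::dbWriteTable(con, "admissions",
                    data.frame(patient_id     = c(1L, 1L, 2L),
                               admission_no   = c(1L, 2L, 1L),
                               length_of_stay = c(3, 7, 2)))

  # The join generates the long (subject/occasion) view on demand, so the
  # repeated demographics always come from the single patients row.
  long_view <- DBI::dbGetQuery(con, "
    SELECT a.patient_id, p.sex, a.admission_no, a.length_of_stay
    FROM admissions AS a
    JOIN patients AS p ON p.patient_id = a.patient_id
    ORDER BY a.patient_id, a.admission_no")

  DBI::dbDisconnect(con)
  print(long_view)
}
```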


As I said at the beginning, you may not want to undertake the
necessary study and effort to reorganize your data for this specific
project, but if you work with data like this often it is worth considering.

> I am wondering if I should keep using SPSS, or try something else.
>
> The types of analysis I’ll typically have to do will involve comparing
> measurements at different times, e.g. before/after treatment. I’ll also
> need to compare groups of people, e.g. treatment / no treatment. Regression
> and factor analyses will doubtless come into it at some point too.
>
> So:
>
> 1. should I use R or try something else?
>
> 2. can anyone advise me on using R with the type of data I’ve described?
>
>
> Many thanks,
>
> Anton du Toit