[R] Hierarchical data sets: which software to use?

kMan kchamberln at gmail.com
Sun Feb 14 02:09:28 CET 2010


Dear Anton,

4 MB is not a lot of data. Even a gigabyte wouldn't be that troublesome in a
flat file. Your data can be migrated to a relational database at a future point.
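
As a rough illustration of how small that later step can be, here is a minimal sketch that copies a data frame into a database from R. It assumes the DBI and RSQLite packages are installed; SQLite is used only because it needs no server, and the table name, column names and values are made up.

```r
library(DBI)

# Hypothetical flat data standing in for an imported data set.
flat <- data.frame(
  person_id = c(1, 2, 3),
  age       = c(34, 51, 47)
)

# In-memory SQLite database (assumption: a file path here would
# make the database persistent instead).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy the data frame into a database table.
dbWriteTable(con, "demographics", flat)

# Read part of it back with SQL.
over40 <- dbGetQuery(
  con,
  "SELECT person_id, age FROM demographics WHERE age > 40 ORDER BY person_id"
)

dbDisconnect(con)
print(over40)
```

The same DBI calls should work against PostgreSQL or MySQL with a different driver in dbConnect(), though I have not shown that here.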

Sincerely,
KeithC.

-----Original Message-----
From: Anton du Toit [mailto:atdutoitrhelp at gmail.com] 
Sent: Saturday, February 13, 2010 3:54 AM
To: r-help
Subject: Re: [R] Hierarchical data sets: which software to use?

Hi Douglas,

Thanks for your helpful response. I've commented on some of the points you
raised below:

> Although this may not be helpful for your immediate goal, storing and
> manipulating data of this size and complexity (and, I expect, cost for
> collection) really calls for tools like relational databases.  A single flat
> file of 2500 variables by 1500 cases is almost never the best way to
> organize such data.  A normalized representation as a collection of
> interlinked tables in a relational data base is much more effective and less
> error prone.  The widespread use of spreadsheets or SPSS data sets or SAS
> data sets which encourage the "single table with a gargantuan number of
> columns, most of which are missing data in most cases" approach to
> organization of longitudinal data is regrettable.

I'm both relieved and daunted by this. Daunted because it means I'll need to
learn another package (probably PostgreSQL or MySQL?), but relieved because
constructing a 2500 by 1500 file seemed intuitively wrong and needlessly
error-prone--surely it makes more sense to leave the data as is.
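
To check my understanding of the "interlinked tables" idea, here is a minimal sketch in base R, with data frames standing in for database tables and merge() standing in for a join. The table names, column names and values are all made up, and a real database would do the join in SQL instead.

```r
# Hypothetical demographics table: one row per person.
demographics <- data.frame(
  person_id  = c(1, 2),
  sex        = c("F", "M"),
  birth_year = c(1950, 1962)
)

# Hypothetical admissions table: one row per admission, however many
# admissions each person has, so no empty columns are needed.
admissions <- data.frame(
  person_id    = c(1, 1, 2),
  admission_no = c(1, 2, 1),
  ward         = c("A", "B", "A")
)

# Join on the shared key. Demographics are repeated on every admission
# row, but they come from a single source so they cannot disagree.
admissions_long <- merge(demographics, admissions, by = "person_id")
print(admissions_long)
```

Each person's demographics are stored once, and the repeating records live in their own table with as many rows as there are admissions.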

As far as immediate goals go--I am at the beginning of a thesis, and I have
more research planned after that, so I want to get things right from the
start.


> For later analysis in R it is better to start with "long" form of the
> data, as opposed to the "wide" form, even if it means repeating demographic
> information over several occasions.  Using a relational database allows for
> a long view to be generated without the possibility of inconsistency in the
> demographics.  I am using the descriptions "long" and "wide" in the sense
> that they are used in the reshape help page.  See
>
> ?reshape
>
> in R.  The long view is also called the subject/occasion view in the
> sense that each row corresponds to one subject on one occasion.
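
For my own reference, a minimal sketch of the wide-to-long conversion with base R's reshape(). The data and the column names (id, sex, score.1, score.2) are made up; the only thing relied on is reshape()'s default behaviour of splitting names like "score.1" at the "." into a variable name and an occasion number.

```r
# Hypothetical wide data: one row per subject, one score column per occasion.
wide <- data.frame(
  id      = 1:3,
  sex     = c("F", "M", "F"),
  score.1 = c(10, 12, 9),
  score.2 = c(11, NA, 8)
)

# Convert to the subject/occasion ("long") view.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("score.1", "score.2"),
  idvar     = "id",
  timevar   = "occasion"
)

# Drop occasions that never happened (subject 2 has no second score)
# and sort by subject, then occasion.
long <- long[!is.na(long$score), ]
long <- long[order(long$id, long$occasion), ]
print(long)
```

The demographic column (sex) is repeated on every row for a subject, which is the repetition mentioned above.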

> Robert Gentleman's book "R Programming for Bioinformatics" provides
> background on linking R to relational databases.

Thanks--I'll look this one up.


> As I said at the beginning, you may not want to undertake the
> necessary study and effort to reorganize your data for this specific project
> but if you do this a lot you may want to consider it.

As above: a stitch in time, I suppose.

Thanks again.

Anton

On Sat, Feb 6, 2010 at 3:22 AM, Douglas Bates <bates at stat.wisc.edu> wrote:
> On Sun, Jan 31, 2010 at 10:24 PM, Anton du Toit <atdutoitrhelp at gmail.com> wrote:
>> Dear R-helpers,
>>
>> I’m writing for advice on whether I should use R or a different 
>> package or language. I’ve looked through the R-help archives, some 
>> manuals, and some other sites as well, and I haven’t done too well 
>> finding relevant info, hence my question here.
>>
>> I’m working with hierarchical data (in SPSS lingo). That is, for each 
>> case
>> (person) I read in three types of (medical) record:
>>
>> 1. demographic data: name, age, sex, address, etc
>>
>> 2. ‘admissions’ data: this generally repeats, so I will have 20 or so 
>> variables relating to their first hospital admission, then the same 
>> 20 again for their second admission, and so on
>>
>> 3. ‘collections’ data, about 100 variables containing the results of 
>> a battery of standard tests. These are administered at intervals and 
>> so this is repeating data as well.
>>
>> The number of repetitions varies between cases, so in its
>> one-case-per-line format the data is non-rectangular.
>>
>> At present I have shoehorned all of this into SPSS, with each case on 
>> one line. My test database has 2,500 variables and 1,500 cases (or 
>> persons), and in SPSS’s *.SAV format is ~4MB. The one I finally work 
>> with will be larger again, though likely within one order of 
>> magnitude. Down the track, funding permitting, I hope to be working with
>> tens of thousands of cases.
>
> Although this may not be helpful for your immediate goal, storing and 
> manipulating data of this size and complexity (and, I expect, cost for
> collection) really calls for tools like relational databases.  A 
> single flat file of 2500 variables by 1500 cases is almost never the 
> best way to organize such data.  A normalized representation as a 
> collection of interlinked tables in a relational data base is much 
> more effective and less error prone.  The widespread use of 
> spreadsheets or SPSS data sets or SAS data sets which encourage the 
> "single table with a gargantuan number of columns, most of which are 
> missing data in most cases" approach to organization of longitudinal 
> data is regrettable.
>
> For later analysis in R it is better to start with "long" form of the 
> data, as opposed to the "wide" form, even if it means repeating 
> demographic information over several occasions.  Using a relational 
> database allows for a long view to be generated without the 
> possibility of inconsistency in the demographics.  I am using the 
> descriptions "long" and "wide" in the sense that they are used in the 
> reshape help page.  See
>
> ?reshape
>
> in R.  The long view is also called the subject/occasion view in the 
> sense that each row corresponds to one subject on one occasion.
>
> Robert Gentleman's book "R Programming for Bioinformatics" provides 
> background on linking R to relational databases.
>
>
> As I said at the beginning, you may not want to undertake the 
> necessary study and effort to reorganize your data for this specific 
> project but if you do this a lot you may want to consider it.
>
>> I am wondering if I should keep using SPSS, or try something else.
>>
>> The types of analysis I’ll typically have to do will involve
>> comparing measurements at different times, e.g. before/after
>> treatment. I’ll also need to compare groups of people, e.g.
>> treatment/no treatment. Regression and factor analyses will
>> doubtless come into it at some point too.
>>
>> So:
>>
>> 1. should I use R or try something else?
>>
>> 2. can anyone advise me on using R with the type of data I’ve described?
>>
>>
>> Many thanks,
>>
>> Anton du Toit


