Anton du Toit
atdutoitrhelp at gmail.com
Sat Feb 13 11:54:07 CET 2010
Hi Douglas,
Thanks for your helpful response. I've commented on some of the points
you raised below:
> Although this may not be helpful for your immediate goal, storing and
> manipulating data of this size and complexity (and, I expect, cost for
> collection) really calls for tools like relational databases. A
> single flat file of 2500 variables by 1500 cases is almost never the
> best way to organize such data. A normalized representation as a
> collection of interlinked tables in a relational database is much
> more effective and less error-prone. The widespread use of
> spreadsheets or SPSS data sets or SAS data sets which encourage the
> "single table with a gargantuan number of columns, most of which are
> missing data in most cases" approach to organization of longitudinal
> data is regrettable.
I'm both relieved and daunted by this. Daunted because it means I'll
need to learn another package (probably PostgreSQL or MySQL?), but
relieved because constructing a 2500 by 1500 file seemed intuitively
wrong, and it would needlessly introduce the possibility of
errors--surely it makes more sense to leave the data as is.
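To make the "interlinked tables" idea concrete, here is a minimal sketch of what a normalized layout might look like. Everything in it is an assumption for illustration: the table names (subject, measurement), the columns, and the toy records are hypothetical and have nothing to do with the real data. Python's built-in sqlite3 module is used only so that the example is self-contained; the SQL itself does not depend on the client language.

```python
import sqlite3

# Hypothetical wide records: one dict per subject, None marks a missing value.
wide_rows = [
    {"subject_id": 1, "sex": "F", "score_t1": 12.0, "score_t2": None, "score_t3": 15.5},
    {"subject_id": 2, "sex": "M", "score_t1": None, "score_t2": 9.0, "score_t3": None},
]

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the link between the tables
conn.executescript("""
    CREATE TABLE subject (
        subject_id INTEGER PRIMARY KEY,
        sex        TEXT NOT NULL
    );
    CREATE TABLE measurement (
        subject_id INTEGER NOT NULL REFERENCES subject(subject_id),
        occasion   INTEGER NOT NULL,
        score      REAL NOT NULL,
        PRIMARY KEY (subject_id, occasion)
    );
""")

for row in wide_rows:
    # Demographics are stored exactly once per subject.
    conn.execute("INSERT INTO subject VALUES (?, ?)",
                 (row["subject_id"], row["sex"]))
    for occasion in (1, 2, 3):
        score = row[f"score_t{occasion}"]
        if score is not None:
            # A missing observation is simply an absent row, not an empty cell.
            conn.execute("INSERT INTO measurement VALUES (?, ?, ?)",
                         (row["subject_id"], occasion, score))
conn.commit()

n_subjects = conn.execute("SELECT COUNT(*) FROM subject").fetchone()[0]
n_measurements = conn.execute("SELECT COUNT(*) FROM measurement").fetchone()[0]
print(n_subjects, n_measurements)
```

In this toy case the six score cells of the wide layout, three of them empty, become three measurement rows, and each subject's demographics live in a single place.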
As far as immediate goals go--I am at the beginning of a thesis, and I
have more research planned after that, so I want to get things right
from the start.
> For later analysis in R it is better to start with the "long" form of
> the data, as opposed to the "wide" form, even if it means repeating
> demographic information over several occasions. Using a relational
> database allows for a long view to be generated without the
> possibility of inconsistency in the demographics. I am using the
> descriptions "long" and "wide" in the sense that they are used in the
> reshape help page. See
> ?reshape
> in R. The long view is also called the subject/occasion view in the
> sense that each row corresponds to one subject on one occasion.
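As I understand the subject/occasion view, it could be generated by a join along the following lines. This is only a sketch under stated assumptions: the subject and measurement tables, their columns, and the values are hypothetical, and Python's built-in sqlite3 module is used purely to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subject (
        subject_id INTEGER PRIMARY KEY,
        sex        TEXT,
        birth_year INTEGER
    );
    CREATE TABLE measurement (
        subject_id INTEGER REFERENCES subject(subject_id),
        occasion   INTEGER,
        score      REAL,
        PRIMARY KEY (subject_id, occasion)
    );
""")

# Hypothetical toy data: demographics once per subject, scores once per occasion.
conn.executemany("INSERT INTO subject VALUES (?, ?, ?)",
                 [(1, "F", 1980), (2, "M", 1975)])
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?)",
                 [(1, 1, 12.0), (1, 2, 15.5), (2, 1, 9.0)])
conn.commit()

# The long view: one row per subject per occasion. The demographics are
# repeated in the output but are read from a single stored copy, so they
# cannot disagree between occasions.
long_view = conn.execute("""
    SELECT m.subject_id, m.occasion, s.sex, s.birth_year, m.score
    FROM measurement AS m
    JOIN subject AS s ON s.subject_id = m.subject_id
    ORDER BY m.subject_id, m.occasion
""").fetchall()

for row in long_view:
    print(row)
```

The repetition of sex and birth_year exists only in the query result; correcting a subject's demographics means changing one row in the subject table.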
> Robert Gentleman's book "R Programming for Bioinformatics" provides
> background on linking R to relational databases.
Thanks--I'll look this one up.
> As I said at the beginning, you may not want to undertake the
> necessary study and effort to reorganize your data for this specific
> project, but if you do this a lot you may want to consider it.
As above: a stitch in time, I suppose.
Thanks again.
Anton
On Sat, Feb 6, 2010 at 3:22 AM, Douglas Bates <bates at stat.wisc.edu> wrote:
