[R] Data views (Re: (Another) Bates fortune?)

Mon Feb 8 17:11:32 CET 2010

On Sun, Feb 7, 2010 at 2:40 PM, Emmanuel Charpentier
<charpent at bacbuc.dyndns.org> wrote:
> Note : this post has been motivated more by the "hierarchical data"
> subject than the aside joke of Douglas Bates, but might be of interest
> to its respondents.
>
> Le vendredi 05 février 2010 à 21:56 +0100, Peter Dalgaard a écrit :
>> Peter Ehlers wrote:
>> > I vote to 'fortunize' Doug Bates on
>> >
>> >  Hierarchical data sets: which software to use?
>> >
>> > "The widespread use of spreadsheets or SPSS data sets or SAS data sets
>> > which encourage the "single table with a gargantuan number of columns,
>> > most of which are missing data in most cases" approach to organization
>> > of longitudinal data is regrettable."
>> >
>> > http://n4.nabble.com/Hierarchical-data-sets-which-software-to-use-td1458477.html#a1470430
>> >
>> >
>>
>> Hmm, well, it's not like "long format" data frames (which I actually
>> think are more common in connection with SAS's PROC MIXED) are much
>> better. Those tend to replicate base data unnecessarily - "as if rats
>> change sex with millisecond resolution".

> [ Note to Achim Zeilis : the "rats changing sex with millisecond
> resolution" quote is well worth a nomination to "fortune" fame ; it
> seems it is not one already... ]
>
>>                                           The correct data structure
>> would be a relational database with multiple levels of tables, but, to
>> my knowledge, no statistical software, including R, is prepared to deal
>> with data in that form.

I think if you go back to my original reply you will see that my first
suggestion was to use an SQL data base.  I didn't mention views (in
the SQL sense) explicitly but those are a natural construction for
organizing longitudinal data.  The data can be stored as a set of
normalized tables in a data base but extracted as a data frame in the
long format.

> Well, I can think of two exceptions :
>
> - BUGS, in its various incarnations (WinBUGS, OpenBUGS, JAGS), does not
> require its data to come from the same source. For example, while
> programming a hierarchical model (a. k. a. mixed-effect model),
> individual level variables may come from one source and various group
> level variables may come from other sources. Quite handy : no previous
> merge() required. Now, writing (and debugging !) such models in BUGS
> is another story...
>
> - SAS has had this concept of "data view" for a long time, its most
> useful incarnation being a "data view" of an SQL view. Again, this
> avoids the need to actually merge the datasets (which, AFAICR, is a
> serious piece of pain in the @$$ in SAS (maybe that's the *real*
> etymology of the name ?)).
>
> This problem has bugged me for a while. I think that the concept of a
> "data view" is right (after all, that's one of the core concepts of SQL
> for a reason...), but that implementing it *cleanly* in R is probably
> hard work. Using a DBMS for maintaining tables and views and querying
> them "just at the right time" does help, but the ability of using these
> DBMS data without importing them in R is, AFAIK, currently lacking.

I think the issue is more than that.  Most model-fitting functions in
R incorporate a formula/data -> model.frame -> model.matrix sequence.
The symbolic analysis to create the model frame can, I think, be
applied to a view and the result stored back in an SQL data base.  (I
haven't looked at the code for model.frame in a long time but I think
it can be applied a row at a time or to chunks of rows.)   Some
auxiliary information, such as unused factor levels, could be
accumulated. The model.matrix function
can be a bit more problematic but it too could be applied to chunks
when generating dense model matrices, with the summary information
from the chunks being accumulated.  Updating sparse model matrices and
accumulating summary information is potentially more time-consuming
because you may need to update the form of the summary matrices as
well as the numerical values.

Of course in SAS these changes are easier because their least squares
calculations are based on accumulating summary matrices from the data
a row at a time.

I think there are three different cases for fitting models based on
large model matrices.  If you have very large n (number of rows) but
small to moderate p (number of columns) in the model matrix then you
can work on chunks of rows using dense matrices to accumulate summary
results.  Large n and large p producing a sparse model matrix could be
handled with the sparse.model.matrix function in the Matrix package
because in these cases the model frame and the sparse model matrix
tend to use smaller amounts of storage.  When you have a large n and a
large p creating a dense model matrix I think the best course is to
buy a machine with a 64-bit processor and a 64-bit operating system
and stuff it full of memory.  It depends on how large p^2 is compared
to np whether you are better off working in chunks or rows or not.

> One upon a time, a very old version of RPgSQL (a Bioconductor package),
> aimed to such a representation : it created objects inheriting from
> data.frame to represent Postgres-based data, allowing to use these data
> "transparently". This package dropped into oblivon when his creator and
> sole maintainer became unable to maintain it further.
>
> As far as I understand it, the DBI specification *might* allow the
> creation of such objects, but I am not aware of any driver actually
> implementing that.
>
> In fact, there are two elements of solution to this problem :
> a) creation of (abstract) objects representing data collections as data
> frames, with the same properties, but not requesting the creation of an
> actual data frame. As far as my (very poor) object-oriented knowledge
> goes, these objects should be, in C++/Python parlance, inherit from
> data.frame.
> b) creation of objects implementing various realizations of the objects
> created in a) : DBMS querying, actual data.frame querying (here I'm
> thinking of sqldf, which does this on the reverse direction, allowing
> querying R data frames to be queried in SQL. Quite handy...), etc ...
>
> I tried my hand once at building such a representation (for
> DBMS-deposited data), with partial success (read-only was OK, read-write
> was seriously buggy). But my S3 object-oriented code stinks, my Python
> is pytiful, and, as a public health measure,  I won't even try to
> qualify my C++... So I leave implementation to better programmers as an
> exercise (a term project, or even a master's thesis subject is probably
> closer to truth...).
>
> A third, much larger, (implementation) element, is lacking in this
> picture : the algorithms used on these data. SAS is notoriously good (in
> some simple cases, such as ordinary regression) at handling datasets
> larger than available memory because the algorithms have been written
> with punched cards (maybe even paper tape) in mind : *one* *sequential*
> read of the data was the only *practical* way to go back in those days.
> So all the matrices and vectors necessary to the computation
> (notionally, X'X and X'Y) were built in memory in *one* step.
>
> Such an organization is probably impossible with most "modern"
> algorithms : see Douglas Bates' description of the lmer() algorithms for
> a nice, big counter-example, or consider MCMC... But coming closer to
> such an organization *seems* possible : see for example biglm.

Actually the algorithms for linear mixed models in lmer could be
adapted to work with only the summary matrices.  The actual
calculations only reference the sparse model matrix Z for the random
effects in one place and that could be replaced with another
calculation involving Z'Z.  The model matrix X for the fixed effects
can be dense or sparse and all the calculations are based on
pre-computed products like Z'X, X'X and X'y.   In the extensions,
generalized linear mixed models or nonlinear mixed models, you can't
pre-compute the products because the products of interest involve
weights that vary at each iteration.

> So I think that data views are a a worthy but not-so-easy possible goal
> aimed at various data structure problems (including hierarchical data),
> but not *the* solution to data-representation problem in R.
>
> Any thoughts ?
>
>                                        Emmanuel Charpentier
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>