[R] Translating lm.object to SQL, C, etc function
j+rhelp at howard.fm
Sat Feb 15 00:07:02 CET 2003
On Fri, 14 Feb 2003 08:06:45 +0000 (GMT), ripley at stats.ox.ac.uk said:
> The issue here is that coef() tells you the coefficients in R's
> internal parametrization of the model, and that is of no use to you
> unless you have a means of creating a model matrix in C, SQL or (heaven
> forbid) Perl. The information needed to re-create a model matrix is
> stored in the lm fit, but in ways that are going to be hard to use
> anywhere else (since they include R functions). This is not perverse:
> what R does is very general, *far* more so than SPSS. Formulae in lm
> can include poly() and ns() terms, for example.
I understand that. And indeed a perfectly general function export is a
very big job. However, once we can export the model into a reasonably
generic textual form, simply including the text name of any R functions
in the export, then users can create special-case translators for the
parts that they need. We try to make this as easy for ourselves as
possible, for instance by doing all required transformations in SQL
(where possible) before importing to R, which means that all the terms in
the linear model are often untransformed variables. The only thing we
don't do in SQL normally is creating the contrasts, since this is
something that SQL is not well suited for.
> The only practical solution it seems to us is to ask R to create the
> model matrix for new data. Then the things you are talking about are
> just the colnames of that matrix, and don't need to be interpreted.
Yes, that makes things pretty easy, but it's not an option in all
cases. We need to embed our models into C code. Previously we had a
routine to take the SPSS output, convert it into C code, and then
recompile the C code into our simulation. The linear model is utilised in
the inner loop of the simulation so needs to be very fast; CORBA or SOAP
calls to uncompiled code in the inner loop slow things down a great deal.
In addition, the simulation is accessed by many people - requiring all of
them to install R would make the roll-out procedure much more complex.
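To make the target concrete, here is a minimal sketch of the kind of C a translator might emit. It assumes a hypothetical model y ~ x + g + x:g, where x is numeric and g is a factor with levels a, b, c coded with R's default treatment contrasts (first level as baseline). The coefficient values, the enum and the function name predict_y are all invented for illustration and do not come from any real fit.

```c
#include <stddef.h>

/* Hypothetical factor g with levels a, b, c; level a is the baseline. */
enum g_level { G_A = 0, G_B = 1, G_C = 2 };

/* Made-up coefficients, in the order coef() would list them for
   y ~ x + g + x:g : (Intercept), x, gb, gc, x:gb, x:gc */
static const double coef[6] = { 1.0, 0.5, 2.0, -1.0, 0.25, 0.75 };

/* Evaluate the linear predictor for one observation. */
double predict_y(double x, enum g_level g)
{
    /* Treatment contrasts: one 0/1 indicator per non-baseline level. */
    double gb = (g == G_B) ? 1.0 : 0.0;
    double gc = (g == G_C) ? 1.0 : 0.0;

    return coef[0]
         + coef[1] * x
         + coef[2] * gb
         + coef[3] * gc
         + coef[4] * x * gb
         + coef[5] * x * gc;
}
```

Because everything is a compile-time constant plus a handful of multiplies, a function like this can sit in the inner loop without any call out to R.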
> You may want to read the sources to find out how R does it: that area
> is one of the most complex parts of the internals, and one in which
> bugs continue to emerge.
I'm glad to hear it is considered complex! ;-) I've actually been reading
that bit of the code quite a bit over the last two days and haven't been
getting that far. My lack of familiarity with the language, combined with
the lack of comments in that section of code, and the very
concise/non-descriptive variable names often used in the code, make this
even harder. Still, it's a useful exercise for learning more about the
> > The difficulty I am having is that the output of coef() is not really
> > parsable, since there is no marker in the name of a coefficient to
> > separate out the components. For instance, in SPSS the name of a
> > coefficient might be:
> > var1=[a]*var2=[b]*var3
> > ...which is easy to write a little script to pull that apart and
> > turn it into a line of SQL, C, or whatever. In S however the name
> > looks like:
> > var1avar2bvar3
> > ...which provides no way to pull the bits apart.
> I find that impossible to understand anyway, but doubt that it
> corresponds to SPSS. For a variable V, label Va does not mean V=[a]
> except in unusual special cases.
I should firstly mention that I got this slightly wrong - I showed above
the SPLUS output, not the R output. R actually looks like this:
The ':'s certainly help a lot, but still there's the problem of handling
factor levels, which are concatenated with the variable name without a
delimiter (at least, in all the linear models I've run so far, this is
the case).
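One possible way around the missing delimiter, sketched here under assumptions rather than tested against real lm output: if the translator is given the list of variable names used in the model, it can split each coefficient name on ':' and then strip the longest known variable name from the front of each piece, treating whatever remains as the factor level. The function names split_term and delimit_coef, and the sample name "var1a:var2b:var3", are hypothetical; the output format simply mimics the SPSS-style name quoted earlier.

```c
#include <stdio.h>
#include <string.h>

/* Find the longest known variable name that is a prefix of `term`.
   Returns its index in vars[], or -1 if none matches. On success,
   *level points at the rest of `term` (the factor level, or "" for a
   numeric variable). */
int split_term(const char *term, const char *const vars[], size_t nvars,
               const char **level)
{
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < nvars; i++) {
        size_t len = strlen(vars[i]);
        if ((best < 0 || len > best_len) &&
            strncmp(term, vars[i], len) == 0) {
            best = (int)i;
            best_len = len;
        }
    }
    if (best >= 0)
        *level = term + best_len;
    return best;
}

/* Rewrite a name such as "var1a:var2b:var3" as "var1=[a]*var2=[b]*var3".
   Returns 0 on success, -1 if a term is unknown or a buffer is too small. */
int delimit_coef(const char *name, const char *const vars[], size_t nvars,
                 char *out, size_t outsz)
{
    char buf[256];
    size_t used = 0;

    if (outsz == 0 || strlen(name) >= sizeof buf)
        return -1;
    strcpy(buf, name);
    out[0] = '\0';

    /* strtok is not reentrant; fine for a single-threaded sketch. */
    for (char *term = strtok(buf, ":"); term; term = strtok(NULL, ":")) {
        const char *level;
        const char *sep = used ? "*" : "";
        int n;
        int v = split_term(term, vars, nvars, &level);
        if (v < 0)
            return -1;
        if (*level)
            n = snprintf(out + used, outsz - used, "%s%s=[%s]",
                         sep, vars[v], level);
        else
            n = snprintf(out + used, outsz - used, "%s%s", sep, vars[v]);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```

The longest-prefix rule is only a heuristic: it guesses wrong if one variable's name plus a level happens to spell another variable's name. Names that match no variable, such as "(Intercept)", return -1 and would need their own special case.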
I think with all the great feedback and ideas I've got so far on the list
and in private mail (thanks everyone!) I have enough information to make
a start. If I create anything that might be more generally useful I'll
post back of course.
jhoward at fastmail.fm