[R-sig-ME] Time series and (or?) repeated measures with many, random time points.

Fri May 20 21:45:05 CEST 2011

Here is how just about every dataset I need to analyze in the course
of my research looks:

|ID|F1|F2|N1|N2|T1|R1|T2|R2|T3|R3|
|S1|A |X |3.0|4.2|10|1.4|27|4.5|31|4.6|
|S2|A |Y |2.2|5.1|11|1.5|23|5.0|32|5.1|
|S3|B |X |3.1|4.8|10|1.3|20|4.8|31|4.7|
|S4|B |Y |2.6|3.9|12|1.8|22|4.9|34|5.2|
... etc. until, say, ...
|S25|B |X |2.1|3.3|10|1.6|21|4.4|33|4.8|

... with one line per test subject (with a unique ID), multiple
subjects represented in every combination of F1 and F2 categorical
(and usually fixed) variables. N1 and N2 are continuous numerical
results from tests performed on the subjects before the experiment and
might be useful to include in the model to see if they account for any
of the variation in the response variable. T1 ... TN and R1 ... RN are
timestamps and their corresponding numerical observations. Usually
there are many more than 3, and usually the timestamps for different
subjects don't match up precisely, i.e. time is a numeric variable
here.
I think part of the difficulty I'm having in getting a handle on what
methods to use for analyzing this is that I don't even know the right
terminology to describe this problem. From my online and offline
reading I conclude that this type of problem is called repeated
measures ANCOVA, or time series ANCOVA, or maybe both. What I really
need to do is find a textbook on the appropriate topic within the
field of linear models, preferably geared toward R. So, if anybody has
suggestions for books, or the correct terms for describing this
problem precisely and unambiguously so that I can find the right book,
that would be much appreciated.
So far I've tried the following:
# 1. The MANOVA approach (Anova() is from the car package):
library(car)
Anova(lm(as.matrix(data[,c('R1','R2','R3')])~F1*F2*N1*N2, data),
      idata=data.frame(DAY=factor(1:3)), idesign=~DAY)
# Problem: this treats time as a factor, which for my real data would
# saturate the model even if I somehow binned the timestamps so that all
# IDs were present for all DAYs
# 2. The mixed-model approach (lme() is from the nlme package):
# converting the data to 'long' format with the timestamps in variable
# TIME and response variable RE, then...
library(nlme)
# 'long' is a placeholder name for the long-format data frame
lme(RE~F1*F2*N1*N2*TIME, random=~TIME|ID, data=long)
# Problem: I have no idea how to test sphericity assumptions, nor how
# to proceed if they are violated. Also, I'm not sure what the
# implications are of specifying random as ~TIME|ID versus ~1|ID or
# ~1+TIME|ID, or of omitting TIME from the main model.
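
For concreteness, here is a minimal sketch of the wide-to-long conversion mentioned in approach 2, using base R's reshape(). The toy data frame 'wide', all of its values, and the names 'long' and 'OCCASION' are made up for illustration; only the column layout comes from the table above.

```r
# Toy wide-format data with the same column layout as the table above.
# All values are simulated; nothing here is real data.
set.seed(1)
n <- 24
wide <- data.frame(
  ID = paste0("S", seq_len(n)),
  F1 = factor(rep(c("A", "B"), each = n / 2)),
  F2 = factor(rep(c("X", "Y"), times = n / 2)),
  N1 = round(runif(n, 2, 3.5), 1),
  N2 = round(runif(n, 3, 5.5), 1)
)
# Timestamps differ between subjects, as in the real data.
wide$T1 <- sample(10:12, n, replace = TRUE)
wide$T2 <- sample(20:27, n, replace = TRUE)
wide$T3 <- sample(31:34, n, replace = TRUE)
for (k in 1:3) {
  wide[[paste0("R", k)]] <- 1 + 0.1 * wide[[paste0("T", k)]] + rnorm(n, sd = 0.2)
}

# One row per subject and occasion; each (Tk, Rk) pair becomes (TIME, RE).
long <- reshape(wide, direction = "long",
                varying = list(c("T1", "T2", "T3"), c("R1", "R2", "R3")),
                v.names = c("TIME", "RE"),
                timevar = "OCCASION", times = 1:3,
                idvar = "ID")
long <- long[order(long$ID, long$OCCASION), ]
head(long)

# R formulas include an intercept unless it is removed explicitly, so
# ~TIME and ~1+TIME describe the same terms (intercept plus TIME slope).
attr(terms(~ TIME), "intercept")
attr(terms(~ 1 + TIME), "intercept")
```

TIME stays numeric in 'long', so it can enter a model as a continuous covariate rather than as a factor.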
# 3. Simply fitting a linear model to each subject and then doing an
# ordinary ANCOVA once with intercepts (or perhaps means) and once with
# slopes as the response variables.
# Problem: Am I losing sensitivity by doing this? Am I introducing
# bias? If this were the right way to do this, why would people bother
# with the above two ways?
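
To make approach 3 concrete, here is a rough sketch of the two-stage procedure on simulated data. The data frames 'subj', 'long', 'stage1' and 'stage2', the simulated values, and the reduced model formula F1*F2 + N1 + N2 are all assumptions for illustration, not my real data or final model.

```r
# Simulated subject-level covariates (values are made up).
set.seed(2)
n <- 24
subj <- data.frame(
  ID = paste0("S", seq_len(n)),
  F1 = factor(rep(c("A", "B"), each = n / 2)),
  F2 = factor(rep(c("X", "Y"), times = n / 2)),
  N1 = runif(n, 2, 3.5),
  N2 = runif(n, 3, 5.5)
)
# Long format: 5 distinct, irregular timestamps per subject.
times <- data.frame(
  ID = rep(subj$ID, each = 5),
  TIME = as.vector(replicate(n, sort(sample(10:40, 5))))
)
long <- merge(subj, times, by = "ID")
long$RE <- 1 + 0.5 * (long$F1 == "B") + 0.1 * long$TIME +
  rnorm(nrow(long), sd = 0.2)

# Stage 1: a separate straight-line fit for every subject.
stage1 <- do.call(rbind, lapply(split(long, long$ID), function(d) {
  cf <- coef(lm(RE ~ TIME, data = d))
  data.frame(ID = d$ID[1],
             INTERCEPT = cf[["(Intercept)"]],
             SLOPE = cf[["TIME"]])
}))

# Stage 2: ordinary ANCOVA on the per-subject summaries
# (anova() on an lm gives sequential, type I tests).
stage2 <- merge(subj, stage1, by = "ID")
anova(lm(SLOPE ~ F1 * F2 + N1 + N2, data = stage2))
anova(lm(INTERCEPT ~ F1 * F2 + N1 + N2, data = stage2))
```

Note that INTERCEPT here is the fitted value at TIME = 0, which lies outside the observed time range in this toy example; centering TIME before stage 1 would change what the intercept means.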
Does anybody have any better suggestions? Thanks for your time.

PS: I know that F1*F2*N1*N2 explodes into a huge number of terms that
could saturate the model. The issue of variable selection is one I'm
addressing separately. The only reason I put such a messy model here
is to emphasize that the solution I'm looking for needs to be capable
of handling models containing both categorical and numeric covariates.