[R] fda modeling

Mon May 28 19:14:05 CEST 2012

Hi, Troels:

       I'm still trying to understand the structure of your data.  
Please check the discussion below.  If what I suggest is correct, it 
should make the analysis much more routine and therefore easier 
requiring less time to analyze.

On 5/21/2012 1:33 PM, Troels Ring wrote:
> Dear friends - We have 25 rats, 14 of these subjected to partial 
> removal of kidney tissue, 11 to sham operation, and then followed for 
> 6 weeks. So far we have data on 26 urine metabolites measured by NMR 7 
> times during the observation. 

       So you collected urine samples at 7 different times on each rat 
throughout the experiment, separated out 26 different metabolites and 
measured each of those 7 using Nuclear Magnetic Resonance (NMR)?  What 
were the ages of the rats at the time of the operation and at the times 
that each of the 7 urine samples were collected?  In particular, were 
the 7 urine samples equally spaced?  If yes, that could simplify the 
analysis.  The greater the time differences between samples and between 
rats, the more difficult the analysis potentially.

       What were the ages of the 25 rats?  Were they all from the same 
litter?  If no, how were they related?  The worst possible case is that 
you have 14 from one litter and 11 from another.  If that's the case, 
then any difference you see between the two groups could be a litter 
effect.  If they are one rat from each of 25 litters, that would 
simplify the statistical analyses.  Scientifically, the best might be to 
have at 4 or 5 rats from each of 6 litters, assigned with at least 2 
rats to experiment and 2 to control from each litter.  You probably 
don't have that, but the litter effect is likely to be important and 
that needs to be part of the analysis, I think.

> I have smoothed the measurements by b.splines in fda including a 
> roughness penalty, and inspecting the mean curves for nephrectomized 
> and sham animals indicate differences for several of the metabolites. 
> Now the real idea is to use the NMR measurements to understand what 
> goes on in the kidneys since we know the partial removal of kidney 
> tissue will result in progressive damage in the kidneys - the nature 
> of that is what we want to understand. We have a blood sample from the 
> rats just prior to sacrifice, and the creatinine concentration there 
> is a good proxy for "renal function". 

       So you have one measure of creatinine for each rat measured just 
prior to sacrifice?

> So the course of concentrations of the metabolites are thought to be 
> valuable in understanding the physiology. Some of these are thought to 
> be correlated. We have two groups where sham animals have better renal 
> function than partially nephrectomized, but there is variation in both 
> groups which is also interesting - some animals progress more rapidly 
> after "the same operation"  than others - we would like to know why.
> The data are available (eventually - the resulting blood tests still 
> are missing) if anyone would like to have a look but the main issue is 
> if it is at all feasible to make fda work on such a problem.

       I suggest you forget about fda at least initially and start with 
simpler, more traditional tools.  Later, you may or may not want to 
return to fda.  I suggest you proceed as follows:

             I.  DATA CLEANING:  Make normal probability plots of 
everything:  I'd start with making one normal probability plot for each 
of the 26 metabolites.  Normally distributed data with approximately the 
same mean and standard deviation will look approximately like a straight 
line.  The scientist's dream with this is the image of two lines with a 
gap in the middle, with the two lines corresponding exactly to the two 
groups (nephrectomized vs. controls).  It's more likely that you will 
see mostly one distribution with a few observations away from a 
moderately straight line in the middle.  If you see this, you should 
check the records and samples for the deviant observations to see if you 
can find, e.g., a data entry error or a problem with mishandling a 
sample.  If you can't fix any observation that way, you should replace 
the numbers with NA (not available = missing).  Another possibility is 
you see several little clusters corresponding to the litters.  Or you 
might see curvature to the line;  with curvature, if all the numbers are 
positive, you should try normal plots of the logarithms.  If that helps 
straighten out the lines, you should analyze the logarithms not the raw 
numbers.  I usually do this with something like qqnorm(x, datax=TRUE).  
The use of "datax" means that with one or more outliers, the slope of 
the center portion will be closer to 45 degrees and therefore more 
easily processed with the naked eye.

             II.  UNIVARIATE ANALYSES:  After data cleaning, I'd then 
use something like lme{nlme} to analyze each response variable 
(metabolite or creatinine) separately.  I recommend lme, because it is 
exactly what is needed for this kind of thing AND there is a great book 
available to describe how to do it:  Pinheiro and Bates (2000) 
Mixed-Effects Models with S and S-PLUS (Springer).  This book has 
companion script files in the nmle package (similar to those with fda), 
which are quite valuable for understanding the book, because there were 
some changes in nlme, so in a very few cases, the code in the book 
doesn't work in R, but the code in the companion script file does.  
There is better software available today and there may be better books, 
but for what you have, I would probably not mess with anything else.  
The techniques described early in this book should help you analyze 
between treatment and between litter effects for all the different 
variables AND the impact of one or more variables on others.

             III.  MULTIVARIATE ANALYSES:

                   (a) Analyzing the variations over time of each 
metabolite for each rat in each litter and treatment group can be 
challenging.  In essence, you have 26 time series of length 7 for each 
rat.  This is very much the problem that pushed Doug Bates into studying 
mixed-effects models:  Earlier, he had studied nonlinear modeling, as 
documented in Bates and Watts (1988) Nonlinear Regression & Its 
Applications (Wiley).  Many of his datasets were metabolites collected 
over time like the data you have.  The second half of Pinheiro and Bates 
describes how to model mixed effects within nonlinear models like you 
have.  If you can NOT get simple linear or nonlinear models for each 
metabolite over time and the fda models provide something useful, you 
might look at the fdaMixed package available from CRAN.  I haven't used 
it, but the name makes it sound like it might help you.

                   (b) You probably will also want to do either 
principle components or factor analysis of the 26 different 
metabolites:  The first few principle components or factors will likely 
represent the major modes of behavior among the metabolites.  This could 
reduce the analysis from 26 different matabolites to 2 or 5 different 
primary modes of variation in the biochemistry -- possibly clustering 
the metabolites to simplify the analysis and strengthen the 
interpretation.  Then take the principle components or factors back to 
nlme to complete the analysis.

       I apologize for encouraging you too much to study fda 
techniques.  The above describes a standard analysis protocol that has 
been used with great success by many people.  Many of my data analysis 
failures involved jumping straight to a multivariate analysis before 
doing the simple things first ;-)

        Hope this helps.
       Spencer Graves

> Best wishes
> Troels Ring,
> Nephrology
> Aalborg,  Denmark
>