[R] multi-response regression with random forest

Tue Aug 30 17:39:14 CEST 2011

Dear list,

I performed a multivariate analysis on freshwater invertebrate data and
obtained the coordinates of my samples on the axes defining the first
factorial plane (F1 and F2).

I would like to see whether the positions on my factorial plane can be
linked to levels of impairment ('low' vs 'significant') for several
water quality pressure categories, and which pressure categories are the
most important for explaining my data.

I first used random forests (package randomForest) to regress the F1 and
F2 coordinates independently against my pressure levels. These models
explained around 13% of the variability for the first axis and 1.5% for
the second axis.
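
A minimal sketch of this first step, assuming a toy table built from the
first eight rows of the extract at the end of this message (the real
table has more samples and more pressure columns). The object names
toy, pressures, rf_F1 and rf_F2 are illustrative only; the "% Var
explained" figure is taken from the rsq component that randomForest
documents for regression forests.

```r
library(randomForest)

# Toy subset: first eight rows of the extract (ID column left out)
toy <- data.frame(
  F1  = c(-0.181720936, -0.013823243, -0.062171083,  0.165199490,
          -0.086522253, -0.094519496, -0.030547960, -0.381074618),
  F2  = c(-0.031683254, -0.044562244,  0.095592402, -0.006247771,
          -0.001156491,  0.058552236,  0.041201208, -0.108488149),
  WQ1 = c("Impaired", "Good", "Good", "Impaired",
          "Good", "Good", "Good", "Good"),
  WQ2 = c("Impaired", "Good", "Impaired", "Good",
          "Good", "Good", "Good", "Good"),
  WQ3 = c("Impaired", "Impaired", "Good", "Impaired",
          "Impaired", "Impaired", "Impaired", "Good"),
  WQ4 = c("Impaired", "Good", "Impaired", "Good",
          "Good", "Good", "Good", "Good"),
  stringsAsFactors = TRUE
)
pressures <- c("WQ1", "WQ2", "WQ3", "WQ4")

set.seed(42)  # forests are random; fix the seed for repeatability

# One regression forest per axis, same predictors for both
rf_F1 <- randomForest(x = toy[pressures], y = toy$F1, ntree = 500)
rf_F2 <- randomForest(x = toy[pressures], y = toy$F2, ntree = 500)

# Out-of-bag pseudo R-squared after the last tree, as a percentage
pct_F1 <- 100 * tail(rf_F1$rsq, 1)
pct_F2 <- 100 * tail(rf_F2$rsq, 1)
print(c(F1 = pct_F1, F2 = pct_F2))
```

With so few rows the percentages from this toy fit are not meaningful;
the sketch only shows the shape of the two separate calls.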

I heard about multi-response modelling and tried to model the bivariate
response F1+F2 from the same set of pressure levels. This time, the
model explained around 37% of the variability, which was great.

But I don't understand what precisely is modelled in such multi-response
regressions with random forest: when I used the predict() function on my
data, I obtained only one value for each sample. What does this
prediction correspond to? F1, F2, or some combination of both?

Any advice and links to helpful literature would be appreciated.

Thanks,

Cédric

___________________________________________________________________

Here is a small extract of my input data:

    ID          F1           F2      WQ1      WQ2      WQ3      WQ4
423007 -0.181720936 -0.031683254 Impaired Impaired Impaired Impaired
423432 -0.013823243 -0.044562244     Good     Good Impaired     Good
382886 -0.062171083  0.095592402     Good Impaired     Good Impaired
349067  0.165199490 -0.006247771 Impaired     Good Impaired     Good
350787 -0.086522253 -0.001156491     Good     Good Impaired     Good
423700 -0.094519496  0.058552236     Good     Good Impaired     Good
1473   -0.030547960  0.041201208     Good     Good Impaired     Good
422893 -0.381074618 -0.108488149     Good     Good     Good     Good
424323 -0.200710868  0.008960769     Good Impaired Impaired Impaired
351117 -0.026336697 -0.011788642     Good     Good Impaired     Good
423356 -0.095307898  0.032821813     Good     Good Impaired     Good
52      0.181933163 -0.070008234     Good     Good     Good     Good
529     0.201013553 -0.039925550     Good     Good     Good     Good
123     0.049202307 -0.255373209     Good     Good     Good     Good
424332 -0.201756587 -0.007161893     Good     Good Impaired     Good
423925  0.182053115 -0.163286598     Good     Good     Good     Good
422967  0.009489423  0.078132841     Good     Good Impaired     Good
423899  0.042904501  0.022193773     Good     Good     Good     Good
350912  0.031308796  0.066608196     Good     Good     Good     Good
422988 -0.049664431  0.063449869     Good     Good Impaired     Good

This is the formula I used for my model:

mod <- randomForest((F1 + F2) ~ ., data = data, ntree = 500,
                    mtry = sqrt(ncol(data) - 1))
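
One way to inspect what response such a formula actually produces is to
build the model frame by hand with base R. This is only a sketch: it
assumes randomForest() constructs its response through the usual
model.frame() machinery, it uses a four-row toy table copied from the
extract above, and it lists two predictors explicitly instead of using
the dot. The names toy, mf and resp are illustrative.

```r
# Toy table: first four rows of the extract, two pressure columns only
toy <- data.frame(
  F1  = c(-0.181720936, -0.013823243, -0.062171083,  0.165199490),
  F2  = c(-0.031683254, -0.044562244,  0.095592402, -0.006247771),
  WQ1 = factor(c("Impaired", "Good", "Good", "Impaired")),
  WQ3 = factor(c("Impaired", "Impaired", "Good", "Impaired"))
)

# Evaluate the formula the same way a formula interface would
mf   <- model.frame((F1 + F2) ~ WQ1 + WQ3, data = toy)
resp <- model.response(mf)

# Is the response a plain vector (no dim) rather than a two-column matrix?
is_single_vector <- is.null(dim(resp))

# Does it coincide with the element-wise sum of the two axes?
same_as_sum <- isTRUE(all.equal(as.numeric(resp), toy$F1 + toy$F2))

print(c(single_vector = is_single_vector, equals_F1_plus_F2 = same_as_sum))
```

If both checks come back TRUE, the left-hand side is being treated as
ordinary arithmetic, which would be consistent with predict() returning
a single value per sample.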

The model summary:

Call:
 randomForest(formula = (F1 + F2) ~ ., data = data, ntree = 500,
mtry = sqrt(ncol(data) - 1)) 

               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 0.01772612
                    % Var explained: 37.98
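
For comparing this figure with the 13% and 1.5% quoted earlier, it may
help to recall that the randomForest documentation describes its pseudo
R-squared as 1 - MSE / Var(y), where y is the single response the forest
was fit on. A small sketch, under that assumption, that back-computes
the response variance implied by the two printed numbers (the variable
names are illustrative):

```r
# Values copied from the printed model summary
mse     <- 0.01772612   # Mean of squared residuals
pct_var <- 37.98        # % Var explained

# Rearranging pct_var / 100 = 1 - mse / var_y gives the implied variance
implied_var <- mse / (1 - pct_var / 100)
print(implied_var)
```

Each percentage is relative to the variance of its own response, so
figures from forests fit on different responses are not directly
comparable.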

And finally the predictions:

                          prediction
423007                  -0.256445319
423432                  -0.078636802
382886                  -0.088890538
349067                  -0.118654211
350787                  -0.112655013
423700                   0.018815905
1473                    -0.032085983
422893                  -0.303123232
424323                  -0.226793376
351117                   0.008599632
423356                  -0.038947801
52                       0.120712909
529                      0.043381647
123                     -0.087297539
424332                  -0.180140229
423925                   0.078654535
422967                  -0.012138644
423899                   0.078367004
350912                   0.078654535
422988                   0.014915818