[R] multi-response regression with random forest
Cédric Mondy
cedric.mondy at gmail.com
Tue Aug 30 17:39:14 CEST 2011
Dear list,
I performed a multivariate analysis on freshwater invertebrate data and
obtained the coordinates of my samples on the axes defining the first
factorial plane (F1 and F2).
I would like to see whether the positions on my factorial plane can be
linked to levels of impairment ('low' vs 'significant') for several
water quality pressure categories, and which pressure categories are the
most important to explain my data.
I first used random forests (package randomForest) to regress the F1
and F2 coordinates independently against my pressure levels. These
models explained around 13% of the variability for the first axis and
1.5% for the second axis.
I heard about multi-response modelling and tried to model the
bivariate response F1+F2 from the same set of pressure levels. This
time the model explained around 37% of the variability, which was great.
But I don't understand what precisely is modelled in such multi-response
regressions with random forest: when I used the predict() function on my
data I obtained only one value for each sample. What does this
prediction correspond to? F1, F2, or some combination of both?
Any advice and links to helpful literature would be appreciated.
Thanks,
Cédric
___________________________________________________________________
Here is a small extract of my input data:
ID                F1            F2  WQ1       WQ2       WQ3       WQ4
423007  -0.181720936  -0.031683254  Impaired  Impaired  Impaired  Impaired
423432  -0.013823243  -0.044562244  Good      Good      Impaired  Good
382886  -0.062171083   0.095592402  Good      Impaired  Good      Impaired
349067   0.165199490  -0.006247771  Impaired  Good      Impaired  Good
350787  -0.086522253  -0.001156491  Good      Good      Impaired  Good
423700  -0.094519496   0.058552236  Good      Good      Impaired  Good
1473    -0.030547960   0.041201208  Good      Good      Impaired  Good
422893  -0.381074618  -0.108488149  Good      Good      Good      Good
424323  -0.200710868   0.008960769  Good      Impaired  Impaired  Impaired
351117  -0.026336697  -0.011788642  Good      Good      Impaired  Good
423356  -0.095307898   0.032821813  Good      Good      Impaired  Good
52       0.181933163  -0.070008234  Good      Good      Good      Good
529      0.201013553  -0.039925550  Good      Good      Good      Good
123      0.049202307  -0.255373209  Good      Good      Good      Good
424332  -0.201756587  -0.007161893  Good      Good      Impaired  Good
423925   0.182053115  -0.163286598  Good      Good      Good      Good
422967   0.009489423   0.078132841  Good      Good      Impaired  Good
423899   0.042904501   0.022193773  Good      Good      Good      Good
350912   0.031308796   0.066608196  Good      Good      Good      Good
422988  -0.049664431   0.063449869  Good      Good      Impaired  Good
This is the formula I used for my model:
mod <- randomForest((F1 + F2) ~ ., data = data, ntree = 500,
                    mtry = sqrt(ncol(data) - 1))
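A note on what the left-hand side of that formula does: in an R formula, arithmetic on the response side is evaluated before fitting, so (F1+F2) should be read as the single numeric vector F1 + F2 rather than as two separate responses. The base-R sketch below illustrates this with model.frame(). The data frame `toy` and its values are hypothetical, made up only to mirror the layout of the extract, and the sketch does not call randomForest itself.

```r
# Hypothetical toy data in the same layout as the extract (values are made up).
toy <- data.frame(
  F1  = c(-0.18, -0.01, 0.17, 0.20),
  F2  = c(-0.03, -0.04, -0.01, -0.04),
  WQ1 = factor(c("Impaired", "Good", "Impaired", "Good")),
  WQ2 = factor(c("Impaired", "Good", "Good", "Good"))
)

# Build the model frame as a formula interface would.
mf <- model.frame((F1 + F2) ~ ., data = toy)

# The response is one numeric vector: the element-wise sum of F1 and F2.
y <- model.response(mf)
is_sum <- isTRUE(all.equal(unname(y), toy$F1 + toy$F2))
print(is_sum)

# For contrast, cbind() on the left-hand side keeps two columns.
y2 <- model.response(model.frame(cbind(F1, F2) ~ ., data = toy))
print(dim(y2))
```

If the same holds for the real data, predict() returns one value per sample because the forest was fitted to the sum of the two coordinates. The cbind() lines only show what a two-column response looks like in a model frame; whether a given fitting function accepts such a response is not checked here.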
The model summary:
Call:
randomForest(formula = (F1 + F2) ~ ., data = data, ntree = 500,
mtry = sqrt(ncol(data) - 1))
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 0.01772612
% Var explained: 37.98
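One possible cross-check of which response this summary refers to, assuming that "% Var explained" is derived as 100 * (1 - MSE / variance of the response). That relationship is an assumption about how the figure is computed, not something stated in the output, and the variable names below are illustrative.

```r
# Values copied from the model summary.
mse <- 0.01772612   # "Mean of squared residuals"
pve <- 37.98        # "% Var explained"

# Assumption: pve = 100 * (1 - mse / var_y), where var_y is the variance
# of whatever response the forest was fitted to.
implied_var <- mse / (1 - pve / 100)
print(implied_var)
```

On the full data set, this value could be compared with var(data$F1 + data$F2), var(data$F1) and var(data$F2). Agreement with the first (up to a small-sample factor) would suggest that the 37.98% refers to the summed response, in which case it is not directly comparable with the per-axis figures of 13% and 1.5%.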
And finally the predictions:
prediction
423007 -0.256445319
423432 -0.078636802
382886 -0.088890538
349067 -0.118654211
350787 -0.112655013
423700 0.018815905
1473 -0.032085983
422893 -0.303123232
424323 -0.226793376
351117 0.008599632
423356 -0.038947801
52 0.120712909
529 0.043381647
123 -0.087297539
424332 -0.180140229
423925 0.078654535
422967 -0.012138644
423899 0.078367004
350912 0.078654535
422988 0.014915818