[R] Repeated measures in Classification and Regression Trees

Bert Gunter gunter.berton at gene.com
Fri Feb 23 17:46:00 CET 2007


Andrew:

Good question! AFAIK most of the so-called "machine learning" machinery --
regression and classification trees, SVMs, neural nets, random forests,
and other more chic methods (I make no attempt to keep up with all of them)
-- ignore error structure; that is, they assume the data are at least
independent (not necessarily identically distributed). I don't think merely
exchangeable is good enough either, though I may be wrong about this.
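
To make the independence assumption concrete for repeated measures, here
is a minimal sketch in R. Everything in it is an assumption made for
illustration: the data are simulated, the names (dat, cv_mse, fold_rows,
fold_subj) are hypothetical, and rpart stands in for whichever tree
method is of interest (it is not mvpart). The response has a strong
subject effect and no real relationship to the covariates, and the same
tree is cross-validated twice: once with rows assigned to folds at
random, once with all rows of a subject kept in the same fold.

```r
# Sketch only: simulated repeated measures, not data from this thread.
library(rpart)  # recommended package shipped with standard R installs

set.seed(1)
n_subj <- 40   # number of subjects
n_rep  <- 10   # repeated measurements per subject

# Subject-level covariates: constant within a subject, unrelated to y.
x1 <- rep(rnorm(n_subj), each = n_rep)
x2 <- rep(rnorm(n_subj), each = n_rep)

# Response = subject random effect + measurement noise.
y <- rep(rnorm(n_subj, sd = 2), each = n_rep) +
  rnorm(n_subj * n_rep, sd = 0.5)
dat <- data.frame(y, x1, x2)

# Mean squared prediction error of a deep regression tree, averaged
# over the folds given by the integer vector 'fold'.
cv_mse <- function(fold) {
  errs <- sapply(unique(fold), function(k) {
    test <- fold == k
    fit <- rpart(y ~ x1 + x2, data = dat[!test, ],
                 control = rpart.control(cp = 0, minsplit = 5, xval = 0))
    mean((dat$y[test] - predict(fit, dat[test, ]))^2)
  })
  mean(errs)
}

# (a) folds assigned row by row, ignoring which subject a row came from
fold_rows <- sample(rep(1:10, length.out = nrow(dat)))
# (b) folds assigned subject by subject, so a subject is never split
fold_subj <- rep(sample(rep(1:10, length.out = n_subj)), each = n_rep)

mse_rows <- cv_mse(fold_rows)
mse_subj <- cv_mse(fold_subj)
print(c(row_wise = mse_rows, subject_wise = mse_subj))
```

Under these assumptions the row-wise estimate should come out much
smaller than the subject-wise one, because the tree can use the
covariates to recognize a subject it has already seen. Which of the two
is the relevant error depends on whether predictions are wanted for new
measurements on known subjects or for new subjects.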

But I believe you have put your finger on a key issue: although all this
"cool" methodology is usually not terribly concerned with inference
(x-validation and bootstrapping being the usual methodology rather than,
say, asymptotics), one wonders how biased the estimators are when there are
various correlations in the data. I suspect a lot, depending on the nature
of the correlations and the methods. I think the moral is: thermodynamics
still rules -- there's no free lunch. You are just as likely to produce
nonsense using all this "nonparametric" methodology as you are using
parametric methods if you ignore the error structure of the data.
Incidentally, I should point out that George Box fulminated on this very
issue about 50 years ago. In his statistics classes he always used to say
that all the fuss (then) about using non-parametric rank-based methods (e.g.
Mann-Whitney-Wilcoxon) rather than parametric t-statistics was silly since
the t-statistics were relatively insensitive to departures from normality
anyway and it was lack of independence, not exact normality, that was the
key practical issue, and both approaches were sensitive to that. He
published several papers to this effect, of course.
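
The point attributed to Box can be checked with a small simulation. The
sketch below is an illustration under stated assumptions, not a
reconstruction of anything Box published: the sample size, the
exponential distribution, the AR(1) coefficient of 0.5 and the function
name reject_rates are all choices made here. Both groups are always
drawn from the same distribution, so every rejection at the 5% level is
a false positive.

```r
# Sketch only: false-positive rates of the two-sample t test and the
# Mann-Whitney-Wilcoxon test when the null hypothesis is true.
set.seed(2)
n     <- 30     # observations per group
n_sim <- 1000   # simulated data sets per scenario

reject_rates <- function(gen) {
  p <- replicate(n_sim, {
    x <- gen()
    y <- gen()
    c(t = t.test(x, y)$p.value, w = wilcox.test(x, y)$p.value)
  })
  rowMeans(p < 0.05)   # proportion of p-values below 0.05, per test
}

# (a) independent but clearly non-normal observations
skewed <- function() rexp(n)
# (b) normal but serially correlated observations: AR(1), coefficient 0.5
ar1 <- function() as.numeric(arima.sim(list(ar = 0.5), n = n))

rates_skewed <- reject_rates(skewed)
rates_ar1    <- reject_rates(ar1)
print(rbind(skewed = rates_skewed, ar1 = rates_ar1))
```

If the claim holds in this setting, both tests should stay close to the
nominal 5% for the skewed but independent data, and both should reject
far too often for the correlated data.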

Needless to say, I would welcome other -- especially better informed and
contrary -- views on these issues, either on or off list.

Cheers,

Bert Gunter
Genentech Nonclinical Statistics
South San Francisco, CA 94404
650-467-7374


-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Andrew Park
Sent: Friday, February 23, 2007 7:51 AM
To: r-help at stat.math.ethz.ch
Subject: [R] Repeated measures in Classification and Regression Trees

Dear R members,

I have been trying to find out whether one can use multivariate
regression trees (for example mvpart) to analyze repeated measures data.
As a non-parametric technique, CART is insensitive to most of the
assumptions of parametric regression, but repeated measures data raise
the issue of the independence of several data points measured on the
same subject, or from the same plot over time.

Any perspectives will be welcome,



Andy Park (Assistant Professor)

Centre for Forest Interdisciplinary Research (CFIR),
Department of Biology,
University of Winnipeg,
515 Portage Avenue,
Winnipeg, Manitoba, R3B 2E9,
Canada

Phone: (204) 786-9407



