[R] scientific (statistical) foundation for Y-RANDOMIZATION in regression analysis

Sat Mar 6 06:51:24 CET 2010

In the stats literature these are more often called permutation tests.  Looking up that term should give you some results.  If not, I have some references, but they are at work and I am not; I could probably get them for you on Monday if you have not found anything before then.
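
To make the connection concrete, here is a minimal sketch of a permutation test on R2, written in Python with only the standard library. It is an illustration under stated assumptions, not a reference implementation: it uses a single predictor and an ordinary least-squares line, the function names (r_squared, permutation_test_r2) are hypothetical, and the "+1" in the p-value is one common convention for Monte Carlo permutation tests. The shuffled-Y loop is the same step that the QSAR literature calls Y-randomization.

```python
import random


def r_squared(x, y):
    """R^2 of the least-squares line y ~ a + b*x (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate case: no variation in x or in y
    return sxy ** 2 / (sxx * syy)


def permutation_test_r2(x, y, n_perm=999, seed=0):
    """Compare the observed R^2 with R^2 values obtained after shuffling y.

    Hypothetical helper for illustration only (single predictor assumed).
    Returns (observed R^2, list of permuted R^2 values, p-value).
    """
    rng = random.Random(seed)
    observed = r_squared(x, y)
    y_perm = list(y)  # work on a copy so the caller's y is left untouched
    null_r2 = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any real link between x and y
        null_r2.append(r_squared(x, y_perm))
    # Share of fits (shuffled ones plus the observed one) at least as good
    # as the observed fit.
    exceed = sum(1 for r2 in null_r2 if r2 >= observed)
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, null_r2, p_value


if __name__ == "__main__":
    # Synthetic data, made up for this sketch: a noisy linear relationship.
    gen = random.Random(1)
    x = [float(i) for i in range(30)]
    y = [2.0 * xi + gen.gauss(0.0, 5.0) for xi in x]
    observed, null_r2, p_value = permutation_test_r2(x, y)
    print("observed R2:", round(observed, 3))
    print("largest permuted R2:", round(max(null_r2), 3))
    print("permutation p-value:", p_value)
```

If the model selection step (descriptor selection, tuning) was part of how the "best" model was found, the same selection would need to be repeated inside the loop for each shuffled Y; the sketch above leaves that out.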

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> On Behalf Of Damjan Krstajic
> Sent: Friday, March 05, 2010 5:39 PM
> To: r-help at r-project.org
> Subject: [R] scientific (statistical) foundation for Y-RANDOMIZATION in
> regression analysis
> 
> 
> Dear all,
> 
> I am a statistician doing research in QSAR, building regression models
> where the dependent variable is a numerical expression of some chemical
> activity and input variables are chemical descriptors, e.g. molecular
> weight, number of carbon atoms, etc.
> 
> I am building regression models and I am confronted with a widely used
> technique called Y-RANDOMIZATION for which I have difficulties in
> finding references in general statistical literature regarding
> regression analysis. I would be grateful if someone could point me to
> papers/literature in statistical regression analysis which give
> scientific (statistical) foundation for using Y-RANDOMIZATION.
> 
> Y-RANDOMIZATION is a widely used technique in the QSAR community to
> ensure the robustness of a QSPR (regression) model. It is used after
> the "best" regression model has been selected, to make sure that there
> are no chance correlations. Here is a short description. The dependent
> variable vector (Y-vector) is randomly shuffled and a new QSPR
> (regression) model is fitted using the original independent variable
> matrix. By repeating this a number of times, say 100 times, one obtains
> 100 R2 and q2 (leave-one-out cross-validation R2) values, one pair for
> each shuffled Y. It is expected that the resulting regression models
> should generally have low R2 and low q2 values. However, if the
> majority of the 100 regression models obtained in the Y-randomization
> have relatively high R2 and high q2, then it implies that an acceptable
> regression model cannot be obtained for the given data set by the
> current modelling method.
> 
> I cannot find any references to Y-randomization or Y-scrambling
> anywhere in the literature outside chemometrics/QSAR. Any links or
> references would be much appreciated.
> 
> Thanks in advance.
> 
> DK
> ----------------------------------------------
> Damjan Krstajic
> Director
> Research Centre for Cheminformatics
> Belgrade, Serbia
> 
> ----------------------------------------------
> 
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.