[R] scientific (statistical) foundation for Y-RANDOMIZATION in regression analysis

Max Kuhn mxkuhn at gmail.com
Sun Mar 7 20:38:27 CET 2010


It is worth adding that, for evaluating the quality of a model,
randomization is fairly useless.

Basically, as long as your model is slightly better than noise, it can
show a significant difference from the average randomized model. In
the QSAR studies that we do, the sample sizes can range from the hundreds
to the hundreds of thousands. In those cases, the randomization method is
completely "over-powered" and will call mediocre models better than
random.
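
To make the "over-powered" point concrete, here is a minimal sketch in
Python (standard library only). It is an illustration under stated
assumptions, not code from the thread: the data are simulated with a
deliberately weak signal, the model is a one-predictor least-squares fit,
and the names r_squared and y_randomization are hypothetical helpers.

```python
import random

def r_squared(x, y):
    """R^2 of a one-predictor least-squares fit with intercept.

    For this simple case R^2 equals the squared Pearson correlation.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def y_randomization(x, y, n_shuffles, rng):
    """Shuffle y repeatedly and count shuffled fits that match or beat
    the observed R^2. Returns (observed R^2, permutation p-value)."""
    r2_obs = r_squared(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(y_perm)
        if r_squared(x, y_perm) >= r2_obs:
            hits += 1
    # The +1 terms count the observed arrangement as one permutation.
    p_value = (hits + 1) / (n_shuffles + 1)
    return r2_obs, p_value

# Assumed toy data: a large sample with a very weak linear signal.
rng = random.Random(2010)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.1 * xi + rng.gauss(0, 1) for xi in x]

r2_obs, p_value = y_randomization(x, y, 200, rng)
print("observed R2:", round(r2_obs, 4), "permutation p-value:", round(p_value, 4))
```

With a sample this large, the shuffled fits should almost never reach
even a very small observed R2, so the test reports "better than random"
for a model that explains only around one percent of the variance.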

Cross-validation and bootstrap methods are probably a much better
way of estimating the performance (and its uncertainty) of these
models.
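
As a rough sketch of the bootstrap idea (again Python, standard library
only, and not code from the thread), the example below refits a
one-predictor model on resampled rows and scores each fit on the rows
left out of that resample. The simulated data, the out-of-bag scoring
choice, the simple percentile interval, and the helper names fit_line,
holdout_r2 and bootstrap_r2 are all assumptions made for illustration.

```python
import random

def fit_line(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def holdout_r2(slope, intercept, x, y):
    """R^2 of a fixed line on held-out data (can be negative)."""
    my = sum(y) / len(y)
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1.0 - sse / sst

def bootstrap_r2(x, y, n_boot, rng):
    """Out-of-bag bootstrap: fit on a resample, score on the rows the
    resample missed. Returns (mean, lower, upper, sorted scores)."""
    n = len(x)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        drawn = set(idx)
        oob = [i for i in range(n) if i not in drawn]
        slope, intercept = fit_line([x[i] for i in idx], [y[i] for i in idx])
        scores.append(holdout_r2(slope, intercept,
                                 [x[i] for i in oob], [y[i] for i in oob]))
    scores.sort()
    mean = sum(scores) / n_boot
    # Rough 95% percentile interval over the resampled scores.
    lo = scores[n_boot * 25 // 1000]
    hi = scores[n_boot * 975 // 1000 - 1]
    return mean, lo, hi, scores

# Assumed toy data with the same kind of weak signal as a mediocre model.
rng = random.Random(7)
n = 1000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.1 * xi + rng.gauss(0, 1) for xi in x]

mean, lo, hi, scores = bootstrap_r2(x, y, 200, rng)
print("out-of-bag R2: mean", round(mean, 4), "interval", round(lo, 4), round(hi, 4))
```

Unlike a permutation p-value, this reports how large the performance is
and how much it varies across resamples, which is the quantity of
interest when judging whether a model is any good.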

My $ 0.02,

Max

On Sat, Mar 6, 2010 at 12:51 AM, Greg Snow <Greg.Snow at imail.org> wrote:
> In the stats literature these are more often called permutation tests. Looking up that term should give you some results (if not, I have some references, but they are at work and I am not; I could probably get them for you on Monday if you have not found anything before then).
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at
>> r-project.org] On Behalf Of Damjan Krstajic
>> Sent: Friday, March 05, 2010 5:39 PM
>> To: r-help at r-project.org
>> Subject: [R] scientific (statistical) foundation for Y-RANDOMIZATION in
>> regression analysis
>>
>>
>> Dear all,
>>
>> I am a statistician doing research in QSAR, building regression models
>> where the dependent variable is a numerical expression of some chemical
>> activity and input variables are chemical descriptors, e.g. molecular
>> weight, number of carbon atoms, etc.
>>
>> I am building regression models and I am confronted with a widely used
>> technique called Y-RANDOMIZATION for which I have difficulties in
>> finding references in general statistical literature regarding
>> regression analysis. I would be grateful if someone could point me to
>> papers/literature in statistical regression analysis which give
>> scientific (statistical) foundation for using Y-RANDOMIZATION.
>>
>> Y-RANDOMIZATION is a widely used technique in the QSAR community to ensure
>> the robustness of a QSPR (regression) model. It is used after the
>> "best" regression model is selected and to make sure that there are no
>> chance correlations. Here is a short description. The dependent
>> variable vector (Y-vector) is randomly shuffled and a new QSPR
>> (regression) model is fitted using the original independent variable
>> matrix. By repeating this a number of times, say 100 times, one will
>> get a hundred R2 and q2 (leave-one-out cross-validation R2) values based
>> on a hundred shuffled Y vectors. It is expected that the resulting
>> regression models should generally have low R2 and low q2 values.
>> However, if the majority of the hundred regression models obtained in
>> the Y-randomization have relatively high R2 and high q2, then it implies
>> that an acceptable regression model cannot be obtained for the given
>> data set by the current modelling method.
>>
>> I cannot find any references to Y-randomization or Y-scrambling
>> anywhere in the literature outside chemometrics/QSAR. Any links or
>> references would be much appreciated.
>>
>> Thanks in advance.
>>
>> DK
>> ----------------------------------------------
>> Damjan Krstajic
>> Director
>> Research Centre for Cheminformatics
>> Belgrade, Serbia
>>
>> ----------------------------------------------
>>
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>



-- 

Max


