[R] bootstrapped eigenvector method following prcomp

Axel Strauß a.strauss at tu-bs.de
Tue Jan 20 00:25:34 CET 2009


@ Stas

Thanks for the extensive answer! I squeezed my data into your function but 
still need to mull over it and your comments for some time.

Thanks,
Axel




Stas Kolenikov schrieb:
> I don't know if there are bugs in the code, but step 4) does not
> compute significance... at least not the way statisticians understand
> it. The fractions above or below 0 are not significance levels. I don't
> even know what to call those... probably the CDF of the bootstrap
> distribution evaluated at zero.
>
> Let's put the flies and the sausages separately, as a Russian saying
> goes. If you bootstrap from your original data, you can get the
> confidence intervals, but not the test statistics. What you can do
> with your bootstrap from the raw data is accumulate the distribution
> of the eigenvectors, along the lines of (assuming you are testing for
> the significance of variables in your first component):
>
> function (x, permutations=1000, ...)
> {
>   pcnull <- princomp(x, ... )
>   res <- pcnull$loadings[,1]
>   bsresults <- matrix(NA, nrow=permutations, ncol=NROW(res))
>   N <- nrow(x)
>   for (i in 1:permutations) {
>       pc <- princomp(x[sample(N, replace=TRUE), ], ... )
>       pred <- predict(pc, newdata = x)
>       r <-  cor(pcnull$scores, pred)
>       k <- apply(abs(r), 2, which.max)
>       reve <- sign(diag(r[k,]))
>       sol <- pc$loadings[ ,k]
>       sol <- sweep(sol, 2, reve, "*")
>       bsresults[i,] <- t(sol[,1])
>   }
>   apply( bsresults, 2, quantile, c(0.05, 0.95) )
> }
>
> if I am not messing up the dimensions and other stuff too much.
> However, as a result you will get an approximately degenerate
> distribution sitting on the unit sphere, since the eigenvectors are
> always normalized to length 1. You can still form marginal
> confidence intervals with that, though, and see whether 0 is covered.
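The unit-sphere point can be seen directly in base R; a minimal standalone illustration on synthetic data (not part of the bootstrap code above):

```r
# Each column of a princomp loadings matrix is an eigenvector of the
# sample covariance, normalized to length 1 -- so every bootstrap
# replicate of a loading vector is constrained to the unit sphere.
set.seed(1)
x <- matrix(rnorm(200), nrow = 50, ncol = 4)  # synthetic data
pc <- princomp(x)
sq_norms <- colSums(unclass(pc$loadings)^2)   # squared length of each eigenvector
print(sq_norms)                               # each is (numerically) 1
```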
>
> The main problem here is I am not entirely sure the bootstrap is
> applicable for the problem at hand. In other words, it is not clear
> whether the bootstrap estimates are consistent for the true
> variability. Eigenproblems are quite difficult and prone to
> non-regularities (the above mentioned degeneracy is just one of them,
> and probably not the worst one). There are different asymptotics
> (Anderson's, with p fixed and n \to \infty,
> http://www.citeulike.org/user/ctacmo/article/3908837; and Johnstone's,
> with p and n \to \infty jointly and p/n \to const,
> http://www.citeulike.org/user/ctacmo/article/3908846). The
> distributions are often non-standard, while the bootstrap works well
> when the distributions of the estimates are normal. When you have
> something weird, bootstrap may easily break down, and in a lot of
> other situations, you need to come up with special schemes. See my
> oh-so-favorite paper on the bootstrap,
> http://www.citeulike.org/user/ctacmo/article/575126.
>
> One of those special schemes (back to our muttons, or rather flies and
> sausages): to set up the bootstrap for hypothesis testing and get
> the p-values, you need to bootstrap from the distribution that
> corresponds to the null. Beran and Srivastava (1985 Annals,
> http://www.citeulike.org/user/ctacmo/article/3015345) discuss how to
> rotate your data to conform to the null hypothesis of interest for
> inference with covariance matrices and their functions (such as
> eigenvalues, for instance). Whether you need to go into all this
> trouble, I don't really know.
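A rough sketch of the rotation idea (my own illustration, not the Beran-Srivastava construction itself; it assumes a fully specified null covariance Sigma0): linearly transform the centered data so its sample covariance equals Sigma0 exactly, then resample rows of the transformed data to bootstrap under the null.

```r
# Map centered data X to Y = X %*% S^{-1/2} %*% Sigma0^{1/2}, where S is
# the sample covariance of X; then cov(Y) equals Sigma0 exactly, so
# resampling rows of Y bootstraps from data conforming to the null.
rotate_to_null <- function(X, Sigma0) {
  Xc <- scale(X, center = TRUE, scale = FALSE)  # center the columns
  S  <- cov(Xc)
  msqrt <- function(M, pow) {                   # symmetric matrix power
    e <- eigen(M, symmetric = TRUE)
    e$vectors %*% diag(e$values^pow, nrow = length(e$values)) %*% t(e$vectors)
  }
  Xc %*% msqrt(S, -0.5) %*% msqrt(Sigma0, 0.5)
}
```

Bootstrap statistics computed from resamples of the rotated data are then referenced to the null, which is what a p-value requires.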
>
> If you have an inferential problem of testing whether a particular
> variable contributes to the overall index, and have a pretty good idea
> of where each variable goes, maybe you need to shift your paradigm
> and look at confirmatory factor analysis models instead, estimable in
> R with John Fox's sem package.
>
> On 1/19/09, Axel Strauß <a.strauss at tu-bs.de> wrote:
>   
>> G'Day R users!
>>
>>  Following an ordination using prcomp, I'd like to test which variables
>> significantly contribute to a principal component. There is a method
>> suggested by Peres-Neto et al. 2003, Ecology 84:2347-2363, called
>> "bootstrapped eigenvector". It was asked for that in this forum in January
>> 2005 by Jérôme Lemaître:
>>  "1) Resample 1000 times with replacement entire rows from the original data
>> set []
>>  2) Conduct a PCA on each bootstrapped sample
>>  3) To prevent axis reflection and/or axis reordering in the bootstrap, here
>> are two more steps for each bootstrapped sample
>>  3a) calculate correlation matrix between the PCA scores of the original and
>> those of the bootstrapped sample
>>  3b) Examine whether the highest absolute correlation is between the
>> corresponding axis for the original and bootstrapped samples. When it is not
>> the case, reorder the eigenvectors. This means that if the highest
>> correlation is between the first original axis and the second bootstrapped
>> axis, the loadings for the second bootstrapped axis are used to estimate the
>> confidence interval for the original first PC axis.
>>  4) Determine the p value for each loading, obtained as follows: the number
>> of loadings >= 0 for loadings that were positive in the original matrix,
>> divided by the number of bootstrap samples (1000), and/or the number of
>> loadings <= 0 for loadings that were negative in the original matrix,
>> divided by the number of bootstrap samples (1000)."
>>
>>  (see
>> https://stat.ethz.ch/pipermail/r-help/2005-January/065139.html
>> ).
>>
>>  The suggested solution (by Jari Oksanen) was
>>
>>
>>  function (x, permutations=1000, ...)
>>  {
>>    pcnull <- princomp(x, ...)
>>    res <- pcnull$loadings
>>    out <- matrix(0, nrow=nrow(res), ncol=ncol(res))
>>    N <- nrow(x)
>>    for (i in 1:permutations) {
>>        pc <- princomp(x[sample(N, replace=TRUE), ], ...)
>>        pred <- predict(pc, newdata = x)
>>        r <-  cor(pcnull$scores, pred)
>>        k <- apply(abs(r), 2, which.max)
>>        reve <- sign(diag(r[k,]))
>>        sol <- pc$loadings[ ,k]
>>        sol <- sweep(sol, 2, reve, "*")
>>        out <- out + ifelse(res > 0, sol <=  0, sol >= 0)
>>    }
>>    out/permutations
>>  }
>>
>>  However, in a post from March 2005 (
>> http://r-help.com/msg/6125.html ) Jari himself mentioned
>> that there is a bug in this method.
>>
>>  I was wondering whether someone could tell me where the bug is or whether
>> there is a better method in R to test for significance of loadings (not the
>> significance of the PCs).
>>  Maybe it is not a good idea to do it at all, but I would prefer to have
>> some guideline for interpretation rather than making decisions arbitrarily. I
>> tried to look everywhere before posting here.
>>
>>  I would be very thankful for any help,
>>
>>     
>
>   


-- 
Gravity is a habit that is hard to shake off.
Terry Pratchett



