[Rd] summary( prcomp(*, tol = .) ) -- and 'rank.'

Fri Mar 25 00:02:28 CET 2016

I agree with Kasper, this is a 'big' issue. Does your method of taking only
n PCs reduce the load on memory?
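
To make the memory question concrete, here is a rough sketch. It assumes a rank-limited prcomp() would simply ask svd() for fewer right singular vectors; that is my guess, not necessarily what the patch does, and the matrix size and `k` are made up for illustration.

```r
# Sketch only: assumes the rank-limited fit just requests fewer right
# singular vectors from svd(); this is not the actual prcomp() patch.
set.seed(1)
X <- scale(matrix(rnorm(200 * 50), 200, 50), center = TRUE, scale = FALSE)

k <- 5                               # number of PCs wanted (illustrative)
s_full <- svd(X, nu = 0)             # all 50 right singular vectors
s_few  <- svd(X, nu = 0, nv = k)     # only the first k

dim(s_full$v)                        # 50 x 50 rotation matrix
dim(s_few$v)                         # 50 x 5 rotation matrix
length(s_few$d)                      # all 50 singular values are still returned

# Storage for rotation plus scores, full versus rank-limited
size_full <- as.numeric(object.size(s_full$v)) +
             as.numeric(object.size(X %*% s_full$v))
size_few  <- as.numeric(object.size(s_few$v)) +
             as.numeric(object.size(X %*% s_few$v))
size_few < size_full
```

This only measures the size of the returned objects; whether the working memory inside the decomposition itself goes down is a separate question that the sketch does not answer.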

The new addition to the summary looks like a good idea, but Proportion of
Variance as you describe it may be confusing to new users. Am I correct in
saying that Proportion of Variance is computed relative to only the
components the user chooses to show? So if I choose only one PC, it will
appear to explain 100% of the variance? If that is the case, I think showing
a 'Total Proportion of Variance' is important.
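
A minimal sketch of the two quantities I have in mind, reusing the simulated data from Martin's message quoted below. The names `prop_of_kept` and `prop_of_total` are mine and purely illustrative; they are not part of summary.prcomp().

```r
# Same simulated data as in the quoted example
set.seed(17)
S <- toeplitz(.9 ^ (0:31))
Z <- matrix(rnorm(32000), 1000, 32) %*% chol(S)

sdev_all <- prcomp(Z)$sdev           # all 32 standard deviations
k <- 1                               # keep a single PC (illustrative)
var_kept <- sdev_all[seq_len(k)]^2

# Proportion relative to the retained components only
prop_of_kept  <- var_kept / sum(var_kept)
# Proportion relative to the variance of all 32 components
prop_of_total <- var_kept / sum(sdev_all^2)

prop_of_kept     # 1: a single retained PC appears to explain everything
prop_of_total    # strictly less than 1
```

With one retained component the first number is 1 by construction, which is the case I am worried about; the second is the figure most users would expect to see.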

Regards,

Steve Bronder
Website: stevebronder.com
Phone: 412-719-1282
Email: sbronder at stevebronder.com

On Thu, Mar 24, 2016 at 2:58 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:

> Martin, I fully agree.  This becomes an issue when you have big matrices.
>
> (Note that there are awesome methods for actually only computing a small
> number of PCs (unlike your code which uses svd which gets all of them);
> these are available in various CRAN packages).
>
> Best,
> Kasper
>
> On Thu, Mar 24, 2016 at 1:09 PM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>
> > Following from the R-help thread of March 22 on "Memory usage in prcomp",
> >
> > I've started looking into adding an optional   'rank.'  argument
> > to prcomp  allowing to more efficiently get only a few PCs
> > instead of the full p PCs, say when p = 1000 and you know you
> > only want 5 PCs.
> >
> >  (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)
> >
> > As it was mentioned, we already have an optional 'tol' argument
> > which allows *not* to choose all PCs.
> >
> > When I do that,
> > say
> >
> >      C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov.matrix and its root
> >      all.equal(S, crossprod(C))
> >      set.seed(17)
> >      X <- matrix(rnorm(32000), 1000, 32)
> >      Z <- X %*% C  ## ==>  cov(Z) ~=  C'C = S
> >      all.equal(cov(Z), S, tolerance = 0.08)
> >      pZ <- prcomp(Z, tol = 0.1)
> >      summary(pZ) # only ~14 PCs (out of 32)
> >
> > I get for the last line, the   summary.prcomp(.) call :
> >
> > > summary(pZ) # only ~14 PCs (out of 32)
> > Importance of components:
> >                           PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
> > Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
> > Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
> > Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
> >                            PC9    PC10    PC11    PC12    PC13   PC14
> > Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
> > Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
> > Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
> > >
> >
> > which computes the *proportions* as if there were only 14 PCs in
> > total (but there were 32 originally).
> >
> > I would think that the summary should  or could in addition show
> > the usual  "proportion of variance explained"  like result which
> > does involve all 32  variances or std.dev.s ... which are
> > returned from the svd() anyway, even in the case when I use my
> > new 'rank.' argument which only returns a "few" PCs instead of
> > all.
> >
> > Would you think the current  summary() output is good enough or
> > rather misleading?
> >
> > I think I would want to see (possibly in addition) proportions
> > with respect to the full variance and not just to the variance
> > of those few components selected.
> >
> > Opinions?
> >
> > Martin Maechler
> > ETH Zurich
> >
> > ______________________________________________
> > R-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
