[Rd] summary( prcomp(*, tol = .) ) -- and 'rank.'

Fri Mar 25 10:08:38 CET 2016

> On 25 Mar 2016, at 10:41 am, peter dalgaard <pdalgd at gmail.com> wrote:
> 
> As I see it, the display showing the first p << n PCs adding up to 100% of the variance is plainly wrong. 
> 
> I suspect it comes about via a mental short-circuit: If we try to control p using a tolerance, then that amounts to saying that the remaining PCs are effectively zero-variance, but that is (usually) not the intention at all. 
> 
> The common case is that the remainder terms have a roughly _constant_, small-ish variance and are interpreted as noise. Of course the magnitude of the noise is important information.  
> 
But then you should use Factor Analysis, which has that concept of “noise” (unlike PCA).

Cheers, Jari Oksanen

>> On 25 Mar 2016, at 00:02 , Steve Bronder <sbronder at stevebronder.com> wrote:
>> 
>> I agree with Kasper, this is a 'big' issue. Does your method of taking only
>> n PCs reduce the load on memory?
>> 
>> The new addition to the summary looks like a good idea, but Proportion of
>> Variance as you describe it may be confusing to new users. Am I correct in
>> saying Proportion of variance describes the amount of variance with respect
>> to the number of components the user chooses to show? So if I only choose
>> one I will explain 100% of the variance? I think showing 'Total Proportion
>> of Variance' is important if that is the case.
>> 
>> 
>> Regards,
>> 
>> Steve Bronder
>> Website: stevebronder.com
>> Phone: 412-719-1282
>> Email: sbronder at stevebronder.com
>> 
>> 
>> On Thu, Mar 24, 2016 at 2:58 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
>> 
>>> Martin, I fully agree.  This becomes an issue when you have big matrices.
>>> 
>>> (Note that there are awesome methods for actually only computing a small
>>> number of PCs, unlike your code, which uses svd and so gets all of them;
>>> these are available in various CRAN packages.)
>>> 
>>> Best,
>>> Kasper
>>> 
>>> On Thu, Mar 24, 2016 at 1:09 PM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>>> 
>>>> Following from the R-help thread of March 22 on "Memory usage in prcomp",
>>>> 
>>>> I've started looking into adding an optional 'rank.' argument
>>>> to prcomp, which would make it possible to get only a few PCs
>>>> more efficiently instead of the full p PCs, say when p = 1000
>>>> and you know you only want 5 PCs.
>>>> 
>>>> (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)
>>>> 
>>>> As was mentioned there, we already have an optional 'tol' argument
>>>> which allows *not* keeping all PCs.
>>>> 
>>>> When I do that,
>>>> say
>>>> 
>>>>    C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov.matrix and its root
>>>>    all.equal(S, crossprod(C))
>>>>    set.seed(17)
>>>>    X <- matrix(rnorm(32000), 1000, 32)
>>>>    Z <- X %*% C  ## ==>  cov(Z) ~=  C'C = S
>>>>    all.equal(cov(Z), S, tolerance = 0.08)
>>>>    pZ <- prcomp(Z, tol = 0.1)
>>>>    summary(pZ) # only ~14 PCs (out of 32)
>>>> 
>>>> I get for the last line, the   summary.prcomp(.) call :
>>>> 
>>>>> summary(pZ) # only ~14 PCs (out of 32)
>>>> Importance of components:
>>>>                           PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
>>>> Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
>>>> Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
>>>> Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
>>>>                          PC9    PC10    PC11    PC12    PC13   PC14
>>>> Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
>>>> Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
>>>> Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
>>>>> 
>>>> 
>>>> which computes the *proportions* as if there were only 14 PCs in
>>>> total (but there were 32 originally).
>>>> 
>>>> I would think that the summary should, or at least could, additionally
>>>> show the usual "proportion of variance explained", which
>>>> involves all 32 variances or std.dev.s. Those are
>>>> returned from svd() anyway, even when I use my
>>>> new 'rank.' argument, which only returns a "few" PCs instead of
>>>> all.
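The remark that svd() returns all singular values anyway can be checked in base R. What follows is only a minimal sketch, not the proposed prcomp() change: it rebuilds the simulated Z from the example above and assumes a hypothetical target of k = 5 components. Asking for fewer right singular vectors through nv shrinks the loadings matrix, while d keeps its full length.

```r
## Rebuild the simulated data from the example above
C <- chol(toeplitz(.9 ^ (0:31)))
set.seed(17)
Z <- matrix(rnorm(32000), 1000, 32) %*% C

k  <- 5                                       # hypothetical number of PCs wanted
Zc <- scale(Z, center = TRUE, scale = FALSE)  # center only, as prcomp() does by default
s  <- svd(Zc, nu = 0, nv = k)                 # request only k right singular vectors

dim(s$v)      # loadings: 32 rows, k columns
length(s$d)   # singular values: still all min(n, p) = 32 of them

## Standard deviations of all 32 PCs, scaled the way prcomp() scales them
sdev_all <- s$d / sqrt(nrow(Zc) - 1)
```

Since sdev_all covers every component, sum(sdev_all^2) equals the total variance of the columns of Z, which is the denominator a "proportion of total variance" would need.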
>>>> 
>>>> Would you think the current  summary() output is good enough or
>>>> rather misleading?
>>>> 
>>>> I think I would want to see (possibly in addition) proportions
>>>> with respect to the full variance and not just to the variance
>>>> of those few components selected.
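To make the two kinds of proportion concrete, here is a hedged sketch that computes both for the tol = 0.1 fit from the example above. It is not the summary.prcomp() implementation; the names v_kept, total_var, prop_total and prop_kept are made up for illustration, and taking the total variance from the column variances of Z assumes a centered, unscaled PCA (the prcomp() defaults).

```r
## Rebuild the simulated data and the truncated fit from the example above
C <- chol(toeplitz(.9 ^ (0:31)))
set.seed(17)
Z <- matrix(rnorm(32000), 1000, 32) %*% C
pZ <- prcomp(Z, tol = 0.1)

k      <- ncol(pZ$rotation)           # number of PCs actually kept
v_kept <- pZ$sdev[seq_len(k)]^2       # variances of the kept PCs

## Total variance of the data; assumes center = TRUE, scale. = FALSE
total_var <- sum(apply(Z, 2, var))

prop_total <- v_kept / total_var      # relative to all 32 components
prop_kept  <- v_kept / sum(v_kept)    # relative to the kept components only

round(rbind("Prop. of total variance" = prop_total,
            "Cumulative (total)"      = cumsum(prop_total),
            "Prop. of kept variance"  = prop_kept), 4)
```

The "kept" proportions sum to 1 by construction, whereas the cumulative proportion of total variance stays below 1 whenever some components have been dropped.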
>>>> 
>>>> Opinions?
>>>> 
>>>> Martin Maechler
>>>> ETH Zurich
>>>> 
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>> 
> 
> -- 
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com