[Rd] summary( prcomp(*, tol = .) ) -- and 'rank.'
peter dalgaard
pdalgd at gmail.com
Fri Mar 25 09:41:00 CET 2016
As I see it, the display showing the first p << n PCs adding up to 100% of the variance is plainly wrong.
I suspect it comes about via a mental short-circuit: If we try to control p using a tolerance, then that amounts to saying that the remaining PCs are effectively zero-variance, but that is (usually) not the intention at all.
The common case is that the remainder terms have a roughly _constant_, small-ish variance and are interpreted as noise. Of course the magnitude of the noise is important information.
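To make the point concrete, here is a minimal sketch that reuses the example data quoted below and measures the retained PCs against the total variance of the data rather than against each other. The names k, kept_var, total_var, prop_total and noise_var are illustrative assumptions, not part of prcomp or of any proposed patch; the total is taken directly from the column variances of Z, so the sketch does not depend on how many standard deviations prcomp chooses to return.

```r
# Rebuild the example data from the quoted message.
C <- chol(toeplitz(.9 ^ (0:31)))
set.seed(17)
Z <- matrix(rnorm(32000), 1000, 32) %*% C

pZ <- prcomp(Z, tol = 0.1)

# Number of components actually kept (columns of the rotation matrix)
# and their variances.
k <- ncol(pZ$rotation)
kept_var <- pZ$sdev[seq_len(k)]^2

# Total variance of the centred data: the sum of all 32 column variances,
# which equals the sum of the variances of all 32 PCs.
total_var <- sum(apply(Z, 2, var))

# Proportions relative to the full variance; these no longer sum to 1.
prop_total <- kept_var / total_var
round(cumsum(prop_total), 4)

# Average variance of the discarded components, i.e. the "noise" level.
noise_var <- (total_var - sum(kept_var)) / (ncol(Z) - k)
noise_var
```

The gap between the last cumulative proportion and 1 is exactly the variance of the dropped components, which is the information the current display hides.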
-pd
> On 25 Mar 2016, at 00:02 , Steve Bronder <sbronder at stevebronder.com> wrote:
>
> I agree with Kasper, this is a 'big' issue. Does your method of taking only
> n PCs reduce the load on memory?
>
> The new addition to the summary looks like a good idea, but Proportion of
> Variance as you describe it may be confusing to new users. Am I correct in
> saying Proportion of variance describes the amount of variance with respect
> to the number of components the user chooses to show? So if I only choose
> one I will explain 100% of the variance? I think showing 'Total Proportion
> of Variance' is important if that is the case.
>
>
> Regards,
>
> Steve Bronder
> Website: stevebronder.com
> Phone: 412-719-1282
> Email: sbronder at stevebronder.com
>
>
> On Thu, Mar 24, 2016 at 2:58 PM, Kasper Daniel Hansen <
> kasperdanielhansen at gmail.com> wrote:
>
>> Martin, I fully agree. This becomes an issue when you have big matrices.
>>
>> (Note that there are awesome methods for actually only computing a small
>> number of PCs (unlike your code which uses svd which gets all of them);
>> these are available in various CRAN packages).
>>
>> Best,
>> Kasper
>>
>> On Thu, Mar 24, 2016 at 1:09 PM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>>
>>> Following from the R-help thread of March 22 on "Memory usage in prcomp",
>>>
>>> I've started looking into adding an optional 'rank.' argument
>>> to prcomp that makes it more efficient to get only a few PCs
>>> instead of the full p PCs, say when p = 1000 and you know you
>>> only want 5 PCs.
>>>
>>> (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)
>>>
>>> As was mentioned there, we already have an optional 'tol' argument
>>> which allows *not* keeping all PCs.
>>>
>>> When I do that,
>>> say
>>>
>>> C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov.matrix and its root
>>> all.equal(S, crossprod(C))
>>> set.seed(17)
>>> X <- matrix(rnorm(32000), 1000, 32)
>>> Z <- X %*% C ## ==> cov(Z) ~= C'C = S
>>> all.equal(cov(Z), S, tol = 0.08)
>>> pZ <- prcomp(Z, tol = 0.1)
>>> summary(pZ) # only ~14 PCs (out of 32)
>>>
>>> I get for the last line, the summary.prcomp(.) call :
>>>
>>>> summary(pZ) # only ~14 PCs (out of 32)
>>> Importance of components:
>>>                           PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
>>> Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
>>> Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
>>> Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
>>>                            PC9    PC10    PC11    PC12    PC13   PC14
>>> Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
>>> Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
>>> Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
>>>>
>>>
>>> which computes the *proportions* as if there were only 14 PCs in
>>> total (but there were 32 originally).
>>>
>>> I would think that the summary should or could in addition show
>>> the usual "proportion of variance explained" like result which
>>> does involve all 32 variances or std.dev.s ... which are
>>> returned from the svd() anyway, even in the case when I use my
>>> new 'rank.' argument which only returns a "few" PCs instead of
>>> all.
>>>
>>> Would you think the current summary() output is good enough or
>>> rather misleading?
>>>
>>> I think I would want to see (possibly in addition) proportions
>>> with respect to the full variance and not just to the variance
>>> of those few components selected.
>>>
>>> Opinions?
>>>
>>> Martin Maechler
>>> ETH Zurich
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com