[R-sig-Epi] How to interpret the results of PCA with sampling weights

Tue Apr 26 21:17:56 CEST 2016

Hello everyone,

I have a dataset from a household survey, with sampling weights, and I
wanted to create an assets-based indicator of economic level for the
households. The idea was to run PCA (with sampling weights), and the
first principal component would be the economic level indicator. In
this email, I am calling "surveydesignobj" the object returned by
svydesign(), and "svyprcompobj" the object returned by
svyprcomp(formula, surveydesignobj, center = TRUE, scale. = TRUE,
scores = TRUE).

What I can't understand is why the first principal component stored in
svyprcompobj$x is so different from what I get from
predict(svyprcompobj, surveydesignobj). By "different" I mean
different distributions and only moderate correlation (~ 0.5) between
them. I also tried recreating the first principal component "by hand",
by summing the (centered or not) variables after multiplying them by the
loadings (svyprcompobj$rotation) and dividing them by their scales (from
svyprcompobj$scale); the resulting vector was highly correlated (> 0.99)
with the first principal component obtained from predict().
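
A minimal sketch of the "by hand" calculation, using base prcomp() on
simulated data. The matrix X and its column names are hypothetical and
not taken from the survey, and no sampling weights are involved here:

```r
set.seed(42)
# Hypothetical data: three correlated "asset" variables (not the survey data)
X <- matrix(rnorm(300), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
X[, "b"] <- X[, "a"] + 0.5 * X[, "b"]

pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Centre, divide by the scales, multiply by the loadings, sum across variables
Z <- sweep(sweep(X, 2, pc$center, "-"), 2, pc$scale, "/")
pc1_by_hand <- drop(Z %*% pc$rotation[, 1])

# Same calculation without centring
pc1_uncentred <- drop(sweep(X, 2, pc$scale, "/") %*% pc$rotation[, 1])

pc1_predict <- predict(pc, newdata = X)[, 1]
cor(pc1_by_hand, pc1_predict)
cor(pc1_uncentred, pc1_predict)
```

Skipping the centring only shifts every score by the same constant,
which is why it does not affect the correlation with the output of
predict().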

Peeking at the svyprcomp() code, I saw that the function runs PCA on the
data after multiplying it by
sqrt(samplingweights/mean(samplingweights)), and later divides
svyprcompobj$x by sqrt(samplingweights/mean(samplingweights)) before
returning it. I also noticed that there is no predict.svyprcomp(), only
predict.prcomp().
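
The two routes described above can be imitated in base R. This is only
a sketch under assumptions: the data X and the weights w are simulated,
and the steps follow the description in this email rather than the
actual svyprcomp() source:

```r
set.seed(1)
n <- 200
# Hypothetical data and sampling weights (not the survey data)
X <- matrix(rnorm(n * 3), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
X[, "b"] <- X[, "a"] + 0.5 * X[, "b"]
w <- runif(n, min = 1, max = 10)
s <- sqrt(w / mean(w))

# PCA on the data after multiplying each row by sqrt(w / mean(w))
pc <- prcomp(X * s, center = TRUE, scale. = TRUE)

# Scores divided by the same factor, as described for svyprcompobj$x
x_stored <- pc$x / s

# predict.prcomp() centres, scales and rotates whatever data it is given;
# it has no knowledge of the weights
x_pred <- predict(pc, newdata = X)

cor(x_stored[, 1], x_pred[, 1])
```

In this sketch the centre and scale held in pc were estimated from the
rescaled data, while predict() applies them to the unscaled data, so the
two sets of scores are not computed from the same quantities.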

Given that the different methods give different values, I'd like to know
whether only one of them is correct (and which), or whether it's a matter
of interpreting the results of each method differently.

Thanks in advance,

Leonardo Ferreira Fontenelle[1]

Links:

  1. http://lattes.cnpq.br/9234772336296638
