[R] SEM validation: Cross-Validation vs. Bootstrapping

Joshua Wiley jwiley.psych at gmail.com
Thu Nov 1 19:01:36 CET 2012


Hi Paul,

So you have described bootstrapping in SEM, but that does not address
cross-validation.  See my comments inline.

On Thu, Nov 1, 2012 at 10:24 AM, Paul Miller <pjmiller_57 at yahoo.com> wrote:
> Hello All,
>
> Recently, I was asked to help out with an SEM cross-validation analysis. Initially, the project was based on "sample-splitting" where half of cases were randomly assigned to a training sample and half to a testing sample. Attempts to replicate a model developed in the training sample using the testing sample were not entirely successful. A number of parameter estimates were substantially different and were subsequently shown to be significantly different in multiple group analyses using cross-group constraints and a difference in chi-square test.
>
> There is a discussion that starts on page 90 in Frank Harrell's book Regression Modeling Strategies that seems to shed light on why this might be the case. In essence, the results are largely a matter of the luck of the draw. Choose one random seed in splitting the sample and the results cross-validate. Choose another and they might not.

Yes and no.  With a large enough sample, the results will be
stable---the problem is that with 19 variables, the sample size
required to get a stable variance-covariance matrix becomes quite
large.  Ironically, as each sample tends to infinity, you will be able
to reliably detect statistically significant but infinitesimally small
differences.  I do not know the N you are working with, but my usual
suggestion for cross-validation with SEM is to fit the training model
on the majority of your data (not 50/50), say 2/3.  Then
cross-validate on the remaining 1/3 by constraining the parameter
estimates to be identical.  For a small sample, the chi-square test is
probably reasonable; if even the 1/3 holdout is large (say 1000), I
would tend to ignore the chi-square test and focus on the fit indices
(CFI, TLI, SRMR, RMSEA, etc.).
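The 2/3 train, 1/3 holdout split described above can be sketched in base R as follows (the data frame here is a random stand-in, and the commented lavaan call is only illustrative of where your own model would go):

```r
## Sketch of a 2/3 train, 1/3 holdout split in base R.  `dat` is a
## stand-in data frame; swap in your own data and lavaan model.
set.seed(1234)                                       # reproducible split
dat <- data.frame(matrix(rnorm(300 * 5), ncol = 5))  # 300 illustrative cases

n <- nrow(dat)
train_idx <- sample(n, size = floor(2 * n / 3))      # 2/3 for training
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]                           # remaining 1/3

## fit_train <- lavaan::sem(model, data = train)
## Then refit on `test` with the parameters constrained to the training
## estimates and compare fit indices (CFI, TLI, SRMR, RMSEA).
```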

> The book then goes on to suggest some improvements on data splitting. The most promising of these appears to be bootstrapping. In the book, this typically involves fitting, say, a regression model in one’s entire dataset, fitting the model in a series of bootstrap datasets, and then applying the results of each bootstrap model to the original data, in order to derive a measure of optimism in something like R2 or MSE.

I would highly recommend Hastie, Tibshirani, & Friedman's book:
http://www-stat.stanford.edu/~tibs/ElemStatLearn/

They have considerable discussion of how to validate data-driven
results (various forms of cross-validation, etc.).

>
> Our SEM would likely require something slightly different. That is, we would need to develop a model based on the entire sample, run the same model on a series of bootstrap datasets, obtain the average (as well as the SD and 95% CI) for each of the model parameters across the bootstrap samples, and then compare that with what we got running the model on the original sample. Some of my other books show something like this for regression (e.g., An R Companion to Applied Regression, page 187; The R Book, page 418).

For a sufficiently large number of bootstrap replicates, the average
of the bootstrap estimates converges to the original sample estimates.
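This convergence is easy to see in a tiny base-R illustration using a sample mean (the same logic applies to SEM parameter estimates):

```r
## As the number of bootstrap replicates grows, the mean of the
## bootstrap estimates approaches the sample estimate.
set.seed(42)
x <- rnorm(200, mean = 5)                   # illustrative sample
boot_means <- replicate(5000,
  mean(sample(x, replace = TRUE)))          # one bootstrap estimate each
mean(boot_means) - mean(x)                  # close to zero
```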

>
> So now having provided quite a bit of background, let me ask a few questions:
>
> 1. Is there any general agreement that the approach I've suggested is the way to go? Are there others besides Dr. Harrell that I could cite in pursuing this approach?

Although I generally like bootstrapping for testing mediated effects,
if you did model modification based on the same data that you now want
to validate, I do not think it will help.  Again, see The Elements of
Statistical Learning for a good reference.

>
> 2. Does anyone know of some substantial published applications of this approach using SEM?

Bootstrapping?  Plenty.  Cross-validation?  Far fewer.

>
> 3. Would any of the available R packages for SEM (e.g., lavaan, sem, OpenMx) be particularly straightforward to use in doing the bootstrapping? Thus far, the SEM has been done using MPLUS. I've not tried SEM in R yet, but would be interested in giving it a shot. The SEM itself is relatively straightforward. Four latent variables, one with 7 indicators and the others with 4 indicators each. A couple of indirect paths involving mediation. Some pretty non-normal data though.  Lots of missingness too that might need to be dealt with using Multiple Imputation.

Bootstrapping is pretty trivial in any of those packages.  They all
take data from R, so all you have to do is sample the data with
replacement, pass the bootstrapped data to the fitting functions, and
store the results.  There's a bit more work to combine the results,
get CIs, etc., but not much.
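The resample-refit-store loop looks roughly like this; `lm()` stands in for the SEM fit so the sketch is self-contained, but in practice you would call, e.g., `lavaan::sem(model, data = boot_dat)` and extract the coefficients with `coef()`:

```r
## Generic bootstrap loop: resample rows, refit, store an estimate.
set.seed(2012)
dat <- data.frame(x = rnorm(100))
dat$y <- 0.5 * dat$x + rnorm(100)           # illustrative data

B <- 500                                    # number of bootstrap replicates
boot_est <- replicate(B, {
  boot_dat <- dat[sample(nrow(dat), replace = TRUE), ]
  coef(lm(y ~ x, data = boot_dat))["x"]     # stand-in for an SEM parameter
})

## Bootstrap SD and percentile 95% CI for the stored estimate
sd(boot_est)
quantile(boot_est, c(0.025, 0.975))
```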

If you build multiple imputation into this, your life may become quite
painful.  You would then need to pool results across the multiply
imputed datasets as well as across the bootstrap replicates, and I
know of no canned tools for this, so you would be doing a lot on your
own.  I would suggest instead that you use maximum likelihood
estimation where the likelihood function allows missingness (sometimes
called full information maximum likelihood, or FIML).  Both the lavaan
and OpenMx packages can handle this.  If you have additional variables
you think may predict missingness, you should condition on them by
adding them to the model as auxiliary variables.  This complicates the
assessment of fit somewhat, because the auxiliary variables contribute
to the likelihood but their contribution is not of substantive
interest and has to be parcelled out; still, it is a good way to deal
with missingness.  Craig Enders has a relatively nice book on missing
data approaches that includes SEM.
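In lavaan, FIML is requested with the `missing = "ml"` argument. A minimal sketch, with a purely illustrative model (not the poster's actual four-factor model); the `semTools::sem.auxiliary()` call mentioned in the comment is my suggestion for the auxiliary-variable approach, not something from the original post:

```r
## Illustrative lavaan model syntax; y1-y7 are placeholder indicators.
model <- '
  f1 =~ y1 + y2 + y3 + y4
  f2 =~ y5 + y6 + y7
  f2 ~ f1
'

## FIML estimation: the likelihood accommodates missing values directly.
## fit <- lavaan::sem(model, data = dat, missing = "ml")

## Auxiliary variables: semTools::sem.auxiliary() adds them as saturated
## correlates so they inform the missing-data handling without entering
## the structural model.
## fit_aux <- semTools::sem.auxiliary(model, data = dat, aux = c("z1", "z2"))
```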

If you stay in Mplus, it can natively bootstrap most things; where it
cannot, I have written R functions to help it out.  I have been
writing a mediation tutorial that is available here:
http://joshuawiley.com/statistics/medanal.html

You can find the source code for that page here:
https://github.com/jwiley as well as the source for the semutils
package, which contains many of the interface functions to link R to
Mplus.  I should point out that semutils is not on CRAN and will not
be, as I have teamed up with Michael Hallquist, who wrote the
MplusAutomation package, and will be folding the additional
functionality of semutils into MplusAutomation.  However, that may
take some time, so in the meantime the source is available from github.

Cheers,

Josh


>
> Thanks,
>
> Paul
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/



