[R-sig-ME] Questions - the stepwise selection issue.

Mon Nov 8 01:17:12 CET 2010

The stepwise model reduction issue is an interesting one.

My view is that:
1) One should always begin by looking at the t-statistics for
the coeffs in the full model (assuming that this is a situation
where they are more or less believable!).  If there is a clear
division into those that are significant and those that are
clearly not significant (p-value > 0.1, maybe), then drop
those that are not significant.  Check what difference this
makes to the residual SE, and to the coefficients (any large
changes may matter if there is an interest in interpreting 
coefficients).  There are other issues to consider; are some
variables of such scientific consequence that they should
be retained regardless?

2) Backward stepwise selection or (better) exhaustive
subset selection, if the problem is not too large, is a resort
of desperation if some variant of (1) fails to give a useful
result.  The approach (1) is most likely to fail to give a useful
result in cases where there is quite a large number of 
explanatory variables, exactly the situation where variable 
selection bias becomes a serious issue.  The function
bestsetNoise() in the DAAG package is designed to make
it easy to experiment with variable selection effects for
data that are purely noise.  For getting realistic SEs, some
alternatives are:
  a) repeat the whole analysis, selection and all, with repeated
  bootstrap samples;
  b) get SEs by repeated bootstrap sampling from 'test' data;
  c) simulate, with selection and all, from the fitted model.
For both a) and c) one has to deal with getting somewhat
different selections each time round (that is itself instructive).

3) If the approach (2) has been used, and there is interest
in the interpretation of coefficients, check the coefficients
against those from the full model.  Any large changes will
ring a warning bell for attempts to interpret them.

4) One possibility, following stepwise or other selection,
is that one or more p-values may be so small that they are
very unlikely to be an artefact of the selection process.
In general, a simulation will be required, in order to be
sure.
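
A minimal sketch of the kind of simulation meant in (2) and (4): apply
backward selection to data that are purely noise, and see how often the
selected model still shows an apparently significant coefficient.  This
uses step() from base R (AIC-based elimination) rather than bestsetNoise(),
so it is a stand-in for that function, not a reproduction of it.  The
sample size, number of variables, number of simulations and the helper
name min_p_after_selection are all assumptions made for illustration.

```r
# Sketch only: backward selection applied to pure noise.
set.seed(1)                      # arbitrary seed, for reproducibility
n <- 100; p <- 20; nsim <- 50    # assumed sizes; change to taste

# Hypothetical helper: smallest coefficient p-value after selection
min_p_after_selection <- function(n, p) {
  # Response and all predictors are independent N(0,1) noise
  dat <- as.data.frame(matrix(rnorm(n * (p + 1)), nrow = n))
  names(dat) <- c("y", paste0("x", seq_len(p)))
  full <- lm(y ~ ., data = dat)
  sel <- step(full, direction = "backward", trace = 0)
  cf <- summary(sel)$coefficients
  pvals <- cf[rownames(cf) != "(Intercept)", "Pr(>|t|)"]
  # NA if selection dropped every explanatory variable
  if (length(pvals) == 0) NA_real_ else min(pvals)
}

min_p <- replicate(nsim, min_p_after_selection(n, p))

# Proportion of noise-only data sets in which the selected model
# has at least one coefficient with p < 0.05
mean(min_p < 0.05, na.rm = TRUE)
```

Comparing the observed p-values from a real analysis with the distribution
of min_p gives a rough idea of whether they could be an artefact of the
selection.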

Somewhat casual approaches to the use of backward (or
other) stepwise selection may be a holdover from hand
calculator days, or times when computers grunted somewhat
to handle even modest sized calculations.  If the inclusion and
exclusion criteria are suitably chosen (but who really knows
what is suitable?), I suspect that in some contexts they do
more or less work to give believable answers, without undue
selection bias.  But how, without checks such as I have noted,
can one be sure?  This may be one of the murky dark alleys
of statistical practice, where magic incantations and hope too 
often prevail over hard evidence.
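
The checks noted under (2a) and (3) can be sketched as follows, again
with base R's step() standing in for whatever selection method was
actually used.  The data are simulated, with two real effects and six
noise variables; that setup, the number of bootstrap samples B, and all
object names are assumptions made for illustration only.

```r
# Sketch only: compare full and selected models, then bootstrap the
# whole analysis, selection included.
set.seed(2)                      # arbitrary seed, for reproducibility
n <- 120
# Assumed setup: x1 and x2 have real effects, x3..x8 are noise
X <- as.data.frame(matrix(rnorm(n * 8), nrow = n))
names(X) <- paste0("x", 1:8)
dat <- cbind(y = 1 + 0.8 * X$x1 + 0.4 * X$x2 + rnorm(n), X)

full <- lm(y ~ ., data = dat)
sel  <- step(full, direction = "backward", trace = 0)

# Point (3): retained coefficients beside their full-model values
kept <- names(coef(sel))
cbind(full = coef(full)[kept], selected = coef(sel))

# Alternative (a): redo the selection on each bootstrap resample
B <- 200
terms_all <- names(coef(full))
boot_coef <- matrix(NA_real_, nrow = B, ncol = length(terms_all),
                    dimnames = list(NULL, terms_all))
for (b in seq_len(B)) {
  d_b <- dat[sample(n, replace = TRUE), ]
  s_b <- step(lm(y ~ ., data = d_b), direction = "backward", trace = 0)
  boot_coef[b, names(coef(s_b))] <- coef(s_b)   # dropped terms stay NA
}

inclusion <- colMeans(!is.na(boot_coef))            # how often each term survives
boot_se   <- apply(boot_coef, 2, sd, na.rm = TRUE)  # SD among fits that kept the term
round(cbind(inclusion, boot_se), 3)
```

The inclusion column shows directly how much the selection changes from
one resample to the next.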

It would be useful to find a review paper that covers this 
ground systematically and incisively, without undue reliance 
on specific examples.

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm

On 08/11/2010, at 5:42 AM, Ben Bolker wrote:

> On 10-11-07 11:14 AM, Shawn McCracken wrote:
>> Shawn McCracken <smccracken at ...> writes:
>> 
>>> 
>>> Dear Mixed-models group,
>>> 
>>> I am working with a dataset containing fixed and nested random effects. I
>>> have one fixed effect that I am most interested in and the others were
>>> collected to see if they also might have an influence. I apologize for the
>>> novel but hopefully discussion of this will help others in the future who
>>> are as intimidated as I was/am.
>>> 
>>> The data consist of total counts of anuran individuals from a particular
>>> species of epiphytic phytotelm plant found in tree canopies at two sites.
>>> The site difference (if any) is my main interest. At each site all trees
>>> with suitable #'s of this epiphyte species for sampling were located within
>>> a predetermined size area. 16 trees were randomly selected from those
>>> available at each site and 5 epiphytes were then randomly sampled for all
>>> anurans within them. So, 2 sites -> 16 trees (at each site) -> 5 epiphytes
>>> (in each tree), which equals 80 samples from each site for a total of
>>> 160........................
>>> 
>>> 
>> 
>> Update: The install problem with glmmADMB has been fixed on my Mac. Thanks to
>> Dave. Details below. I could still use some feedback on what I have done so
>> far.
>> 
> 
>  Update: I have been working on the glmmADMB package a bit.  The
> current version on R-forge installs OK on my MacOS X.6 machine. It
> contains 32-bit binaries which it automatically puts in the correct
> location, so that you shouldn't have to mess around with doing this
> stuff manually.  Dave F. has sent me compiled 64-bit OS X binaries, but
> I haven't gotten around to incorporating them yet (the 32-bit binaries
> do work on my system, although presumably the 64-bit ones would be
> faster in general).
>  So
>  install.packages("glmmADMB",repos="http://r-forge.r-project.org")
> should work on MacOS.
>  It would be helpful to get reports of trouble from list members who
> try it.
> 
>  To follow up on some of your other questions with my own opinions:
> * as I recommend on <http://glmm.wikidot.com/faq> (I have just added a
> few words to make my personal opinions clearer), I would recommend
> glmm.admb or glmer with individual-level random effects over the various
> quasi- options.
> * glmm.admb currently only works with a single random effect, so you
> can't do nested random effects that way.  You could build a more
> complete model in AD Model Builder, or revert to glmer.
> * Your model specification
> 
> m1po<-lmer(count~treat+treedbh+treehgt+numepi+elevepi+hgtepi+leafepi+
> (1|tree/epi),family=poisson,data=ecpad2)
> 
>  looks reasonable.  If you say
> ecpad2$indiv <- 1:nrow(ecpad2)
> and add +(1|indiv) to your model specification you will have an
> individual-level random effect.
> 
> * Is 'treat' your site variable?  In any case, if you are trying to do
> a statistical comparison between only two sites you have a major
> pseudo-replication problem (Hurlbert 1984).
> 
>  * The p-values that you get from summary(lmer) are based on Wald Z
> statistics; they assume large data sets and are possibly unreliable for
> moderate-sized data sets ...
> 
> * Opinions differ on the value of backward stepwise model reduction. It
> is standard practice in many ecological contexts and is suggested for
> moderate model complexity by many respected practicing
> (eco)statisticians (Bates, Wood, Zuur ...) but is vehemently decried by
> others (Harrell).  I would probably base inference on your full model
> rather than doing backward elimination.
> 
> 
>> The solution that worked for me:
>> 
>> I used the binaries Dave sent in admbfiles.zip over in the post in the
>> admb-users group: http://groups.google.com/group/admb-users/t/df5779586e45b9b
>> 
>> First, I copied them to my desktop and unzipped.
>> Opened Terminal and typed the following to direct it to run nbmm in the expanded
>> folder and confirm it would run:
>> 
>> ShawnMBP:$ /Users/Shawn/Desktop/admbfiles/nbmm     #of course you will need to
>> change this to navigate to where it is on your computer#
>> Error trying to open data input file /users/shawn/desktop/admbfiles/nbmm.dat
>> Error trying to read in model data
>> This is usual caused by a missing DAT file
>> 
>> Dave said the error message comes from nbmm looking for the data file to use,
>> but it is running.
>> 
>> I then located where glmmADMB had originally placed these same named files when
>> I did the install of glmmADMB. I can’t remember how I found where they were and
>> spotlight won’t show them either. I think I did a search just for “R” and found
>> an R.framework folder in the Library folder. I looked through there and found
>> them in a folder called admb here:
>> 
>> MBP_SFM>Library>Frameworks>R.framework>Versions>2.11>Resources>library>glmmADMB>
>> admb
>> 
>> I then replaced the nbmm and bvprobit files that were there with the ones
>> provided by Dave.
>> Started up R, loaded glmmADMB, viewed the epil2 dataset, and then ran the
>> example model and it ran fine!
>> 
>> My system: Macbook Pro, Mac OSX 10.6.4, R 64-bit
>> 
>> Hope this helps.
>> 
>> Shawn
>> 
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
> 