[R-sig-eco] SAMs parameter selection

Mon Jan 25 02:55:06 CET 2016

Dear Marika,

I'm really glad that more people are starting to use SAMs, and other model-based methods.

Here are some answers to your questions (as best I can).

The model-selection procedure outlined in Leaper et al. (2014) is just a variant of the much-loved, much-hated, and often-used backwards elimination 
method.  The variation is that the number of species-archetypes is selected first (so that the number of potential models is not unmanageably huge).

I presume that you get two different BIC values for two different instances of maximising the likelihood?  That is, from two different calls to
SpeciesMix()?  This can occur, and often does, as the process of maximising the (log-)likelihood can get stuck in local maxima -- this is a 'feature'
of any type of model that has random/latent factors/variables but seems to be quite acute in mixture models.  The remedy for this, as outlined in the
estimation section of Dunstan et al. (2011) and buried in the application section of Dunstan et al. (2013), is to perform multiple starts -- the more
the merrier.  The model that you should use for comparisons is the one with the highest likelihood that you observe (for a fixed set of predictors
and number of archetypes, this is also the lowest BIC), as it is your best approximation to the global maximum.  Performing multiple starts will
increase computation time, but it will also reduce the possibility of making inference from a sub-optimal model.  You should perform multiple starts.
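
To make the multiple-starts idea concrete, here is a minimal, self-contained sketch in Python.  It is NOT SpeciesMix and not the estimation code
from any of the papers: it fits a toy mixture of binomials (number of sites at which each species is present) by EM from several random starting
points and keeps the run with the highest log-likelihood.  All names (em_fit, multi_start, counts, and so on) and the parameter count used in the
BIC are assumptions made for the illustration.

```python
import math
import random


def log_binom(k, n, p):
    """Log binomial pmf; p is clamped so log() never sees 0."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def e_step(counts, n_sites, pi, p):
    """Return (tau, loglik): posterior memberships and observed log-likelihood."""
    tau, loglik = [], 0.0
    for k in counts:
        logw = [math.log(pi[g]) + log_binom(k, n_sites, p[g])
                for g in range(len(pi))]
        m = max(logw)                      # log-sum-exp for numerical stability
        w = [math.exp(v - m) for v in logw]
        s = sum(w)
        tau.append([v / s for v in w])
        loglik += m + math.log(s)
    return tau, loglik


def em_fit(counts, n_sites, n_groups, rng, n_iter=200):
    """One EM run from a random starting point (toy model, hypothetical)."""
    pi = [1.0 / n_groups] * n_groups
    p = [rng.random() for _ in range(n_groups)]   # random start
    for _ in range(n_iter):
        tau = e_step(counts, n_sites, pi, p)[0]
        for g in range(n_groups):
            weight = sum(t[g] for t in tau)
            pi[g] = max(weight / len(counts), 1e-12)
            p[g] = (sum(t[g] * k for t, k in zip(tau, counts))
                    / max(weight * n_sites, 1e-12))
    tau, loglik = e_step(counts, n_sites, pi, p)
    # Assumed parameter count for this toy model: (G - 1) mixing weights + G rates.
    n_par = (n_groups - 1) + n_groups
    bic = -2.0 * loglik + n_par * math.log(len(counts))
    return {"loglik": loglik, "bic": bic, "pi": pi, "p": p, "tau": tau}


def multi_start(counts, n_sites, n_groups, n_starts=20, seed=1):
    """Run EM from n_starts random starts; return (best fit, all fits)."""
    rng = random.Random(seed)
    fits = [em_fit(counts, n_sites, n_groups, rng) for _ in range(n_starts)]
    return max(fits, key=lambda f: f["loglik"]), fits


# Made-up data: presences out of 20 sites for 11 species.
counts = [1, 2, 2, 3, 9, 10, 11, 18, 19, 19, 20]
best, fits = multi_start(counts, n_sites=20, n_groups=3, n_starts=20, seed=1)
print(best["bic"], sorted(round(f["bic"], 2) for f in fits))
```

Printing the sorted BIC values from all starts is a cheap diagnostic: if several starts agree on the smallest value, that is some reassurance
(though no guarantee) that the best run is not itself a local maximum.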

I would also recommend, as you allude to, that you have a look at the values of tau -- the (posterior) probability of each species belonging to each
archetype group -- and pi, the probability of any new species (with no data) belonging to each group.  This can help remove a certain type of
'mis-fit', which is likely to be a singularity in the (log-)likelihood surface.
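
As a rough illustration of that kind of check, the Python sketch below takes a matrix of tau values (rows are species, columns are archetype
groups) and flags groups that look degenerate.  The function name, the 0.01 threshold and the two flagging rules (a near-zero pi, or no species
having the group as its most probable one) are my assumptions for the example, not rules taken from SpeciesMix or the papers.  Here pi is
computed as the column mean of tau, which is the usual EM update for mixing proportions; with a fitted model you would use the estimated pi
directly.

```python
def check_groups(tau, min_pi=0.01):
    """Summarise group membership and flag suspect (near-empty) groups.

    tau    : list of rows, one per species; each row holds the posterior
             membership probabilities for the groups and sums to 1.
    min_pi : hypothetical threshold below which a group is flagged.
    """
    n_species = len(tau)
    n_groups = len(tau[0])
    # Mixing proportions as the average posterior membership per group.
    pi = [sum(row[g] for row in tau) / n_species for g in range(n_groups)]
    # Hard assignment: the most probable group for each species.
    hard = [max(range(n_groups), key=lambda g: row[g]) for row in tau]
    sizes = [hard.count(g) for g in range(n_groups)]
    suspect = [g for g in range(n_groups)
               if pi[g] < min_pi or sizes[g] == 0]
    return pi, sizes, suspect


# Made-up tau values: the third group attracts no species at all.
tau = [
    [0.98, 0.02, 0.00],
    [0.95, 0.05, 0.00],
    [0.10, 0.90, 0.00],
    [0.03, 0.97, 0.00],
]
pi, sizes, suspect = check_groups(tau)
print(pi, sizes, suspect)
```

A flagged group is only a prompt to look more closely (and perhaps to discard that particular start), not an automatic reason to reject the model.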

A colleague has been busy trying to make model selection more robust.  In particular, he has looked at alternatives to using BIC in SAMs and related
models (Hui et al. 2015a), with the aim of removing one of the question marks in SAMs by getting a decent criterion for choosing between models.  He
has also looked at automated methods based on regularisation (shrinkage) (Hui et al. 2015b).  Both are great additions to the arsenal for SAMs.
However, neither is (yet?) incorporated in the SpeciesMix R-package.

Lastly (and I hope that this is not just my opinion), model selection for any analysis is difficult irrespective of how complex the modelling
framework is.  Automating the process can make it seem objective, but in truth there will always be assumptions made and personal preferences
will come through.  To my mind, the modelling process is enhanced by context-specific information that is only available from experts (generally the
people who obtained the data).  Such things as "polychaete assemblages are highly likely to vary with sediment size" are just the beginning...
However, formalising this process is difficult and it is even more difficult to convince editors/reviewers/readers that you have done an excellent job
without resorting to well-established algorithms for model selection.

I hope that this answers your questions.  Let me know if you have any more.

Cheers,

Scott (SpeciesMix contributor)

Dunstan, Foster and Darnell (2011) Model based grouping of species across environmental gradients. Ecological Modelling. 222: 955-963.
DOI: 10.1016/j.ecolmodel.2010.11.030
Dunstan, Foster, Hui and Warton (2013) Finite Mixture of Regression Modeling for High-Dimensional Count and Biomass Data in Ecology. Journal of
Agricultural, Biological and Environmental Statistics. 18: 357-375. DOI: 10.1007/s13253-013-0146-x
Hui, Warton and Foster (2015a) Order selection in finite mixture models: complete or observed likelihood information criteria? Biometrika. 102:
724-730. DOI: 10.1093/biomet/asv027
Hui, Warton and Foster (2015b) Multi-species distribution modeling using penalized mixture of regressions. Annals of Applied Statistics. 9: 866-882.
DOI: 10.1214/15-AOAS813
Leaper, Dunstan, Foster, Barrett and Edgar (2014) Do communities exist? Complex patterns of overlapping marine species distributions. Ecology. 95:
2016-2025. DOI: 10.1890/13-0789.1

On 22/01/16 22:34, Marika Galanidi wrote:
> Dear all,
>
> I have been using Species Archetype Models (Dunstan et al 2011) to model
> the distribution of benthic polychaete assemblages (presence/absence data).
> I perform model selection as described in Leaper et al. (2014). However,
> when using a particular sub-set of predictor variables, SpeciesMix returns
> two different BIC values accompanied by different model parameters for the
> exact same model (same predictor variables) at various stages in the
> parameter selection process. Looking at the pi and tau values and the SEs
> of the coefficients, one can draw certain conclusions but is there a more
> rigorous way to proceed with model selection in this case?
>
> Many thanks
>
>
>
>
> Marika Galanidi
> Post-Doctoral Researcher
> Institute of Marine Science and Technology
> Dokuz Eylul University, Izmir

-- 
Scott Foster
CSIRO
E scott.foster at csiro.au T +61 3 6232 5178
Postal address: CSIRO Marine Laboratories, GPO Box 1538, Hobart TAS 7001
Street Address: CSIRO, Castray Esplanade, Hobart Tas 7001, Australia
www.csiro.au