[R-sig-ME] How to supply random intercepts for new data, blme

Tue Dec 12 07:45:25 CET 2023

Hello folks,

I have a mixed model that predicts the presence or absence of a species,
given a set of predictors describing habitat. It is based on tracking data
from tagged individual animals, together with pseudoabsences sampled from
the surrounding area. The model assumes all individuals have the same
habitat preferences (i.e. the slope coefficients are represented as fixed
effects). By specifying random intercepts per individual, the model
accommodates the fact that some individuals were tracked for much longer
than others, so that the number of presences relative to pseudoabsences is
much higher for some individuals.

As part of a cross-validation exercise to assess model accuracy, I would
like to make predictions on new data, involving different individuals from
other geographic areas. As I already know how long these individuals were
tracked for, I would like to supply that information (in the form of fitted
random intercepts from a 'global' model fitted to all individuals) to the
model for that particular cross-validation fold. Otherwise, it seems like
the estimate of model accuracy will be unnecessarily pessimistic, since the
predictions for each new individual won't take into account whether they
were extensively tracked or not.
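
To make the idea concrete, here is a minimal sketch of the arithmetic
involved, written in plain Python purely for illustration. It assumes the
fold model yields a population-level probability for each point, and that a
borrowed intercept would be added on the logit scale. The function names and
all numbers are hypothetical; this is not the BiMM forest or blme interface,
and the real model may combine the forest output and random effects
differently.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def expit(x):
    """Inverse of logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def shift_by_intercept(p_population, intercept):
    """Shift a population-level probability by an individual's
    random intercept, applied on the logit scale (an assumption)."""
    return expit(logit(p_population) + intercept)

# Hypothetical intercepts taken from a 'global' model fitted to all animals.
global_intercepts = {"animal_A": 1.2, "animal_B": -0.8}

# Hypothetical population-level prediction from the fold model for one point.
p_fold = 0.30

# Individual-specific predictions for held-out animals.
adjusted = {
    animal: shift_by_intercept(p_fold, b)
    for animal, b in global_intercepts.items()
}
```

A positive intercept (an extensively tracked animal) raises the predicted
probability of presence and a negative one lowers it, while an intercept of
zero leaves the population-level prediction unchanged.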

I am using an implementation of binary random forests with mixed effects
due to Speiser and colleagues (see refs below). The random effects part of
the model is produced using blme.

My questions are:
1) Is this a bad idea? It involves taking intercepts from one model and
inserting them into another.
2) If it is a bad idea, is there an alternative?
3) If it is a good idea, how is it best executed using a model derived from
blme? I could find all the parts of the model object containing
individual-specific terms and replace them (so far I've counted eight
vectors in the model object that contain either the individual IDs, or
values fitted to those IDs), but is there a more elegant solution?

A different approach could be to draw pseudoabsences for each individual
equal to the number of presences recorded for that animal. That would mean
that the proportion of presences was fixed at 0.5 (a 1:1 ratio of presences
to pseudoabsences) for all individuals. I avoided this in the first
instance, because it would mean fitting models with few points for some
individuals: as few as 5 or 10 presences, so n = presences + pseudoabsences
= 10–20.
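
A minimal sketch of that balanced sampling, again in plain Python for
illustration only. It assumes a single shared pool of candidate background
points that is at least as large as any animal's presence count; the function
name, the data layout, and the values are hypothetical.

```python
import random

def draw_balanced_pseudoabsences(presences_by_animal, candidate_points, seed=0):
    """For each animal, sample (without replacement) as many
    pseudoabsences as that animal has presences."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {
        animal: rng.sample(candidate_points, len(presences))
        for animal, presences in presences_by_animal.items()
    }

# Hypothetical data: two animals with 5 and 10 recorded presences.
presences_by_animal = {
    "animal_A": [(0.0, float(i)) for i in range(5)],
    "animal_B": [(1.0, float(i)) for i in range(10)],
}
# Hypothetical pool of background locations from the surrounding area.
candidate_points = [(float(i), float(i)) for i in range(100)]

pseudoabsences = draw_balanced_pseudoabsences(presences_by_animal, candidate_points)
```

Each animal then contributes equal numbers of presences and pseudoabsences,
which is what gives totals as small as 10 for the least-tracked individuals.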

Many thanks for any help you can offer, Fiona

References
Speiser, J. L., et al. (2019). "BiMM forest: A random forest method for
modeling clustered and longitudinal binary outcomes." Chemometrics and
Intelligent Laboratory Systems *185*: 122-134.

Speiser, J. L. (2021). "A random forest method with feature selection for
developing medical prediction models with clustered and longitudinal
data." Journal of Biomedical Informatics *117*: 103763.

*Fiona Scarff*
*Murdoch University*
