[R-sig-ME] spaMM::fitme() - a glmm for longitudinal data that accounts for spatial autocorrelation

Thu Jul 16 10:19:07 CEST 2020

On 15/07/2020 at 16:48, Sarah Chisholm wrote:
> Thanks Francois. I hadn't considered that the number of unique 
> locations could be the source of the problem, rather than the size of 
> the entire data set. It is a possibility for me to simply remove 
> observations for a number of locations to bring the total sample size 
> (of unique coordinates) down. I'll also test a lattice model using the 
> IMRF() notation to describe the random spatial effect - I believe this 
> is what you referred to in your previous email?

Yes, use the IMRF formula term for this purpose.
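
A minimal sketch of what such a fit might look like, assuming the spaMM and INLA packages are both installed (the fit is skipped otherwise). The data set (spaMM's bundled 'Loaloa' example), the predictors, and the mesh settings 'max.edge' and 'cutoff' are illustrative assumptions, not the model discussed in this thread.

```r
# Formula with an IMRF random effect defined on the two coordinates.
# 'spde' is only looked up when the model is actually fitted.
imrf_formula <- cbind(npos, ntot - npos) ~ maxNDVI1 + seNDVI +
  IMRF(1 | longitude + latitude, model = spde)

if (requireNamespace("spaMM", quietly = TRUE) &&
    requireNamespace("INLA", quietly = TRUE)) {
  library(spaMM)
  data("Loaloa", package = "spaMM")
  # 'cutoff' merges nearby points, so the mesh can have fewer nodes than
  # the data have distinct locations. The values are arbitrary placeholders.
  mesh <- INLA::inla.mesh.2d(
    loc = as.matrix(Loaloa[, c("longitude", "latitude")]),
    max.edge = c(3, 20), cutoff = 0.5
  )
  spde <- INLA::inla.spde2.matern(mesh)
  fit <- fitme(imrf_formula, family = binomial(), data = Loaloa)
  print(fit)
}
```

For a real data set, the mesh resolution would need to be chosen with the spatial scale of the data in mind; a coarser mesh is faster but approximates the spatial process less finely.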

F.
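
On the distinction between rows and distinct locations raised in the quoted message, a rough base-R sketch may help. The data frame below is hypothetical (14 000 rows from 2 000 sites), and the memory figure counts a single dense matrix of doubles only; an actual fit would need several such matrices plus their factorizations, so treat it as a lower bound.

```r
# Hypothetical longitudinal data: 2000 sites on a grid, 7 visits each.
sites <- expand.grid(x = 1:50, y = 1:40)
dat <- sites[rep(seq_len(nrow(sites)), each = 7), ]

# Number of distinct coordinate pairs: this, not nrow(dat), sets the
# dimension of the spatial correlation matrix.
n_loc <- nrow(unique(dat[, c("x", "y")]))

# Size in GiB of one dense n x n matrix of 8-byte doubles.
dense_gib <- function(n) n^2 * 8 / 1024^3

nrow(dat)          # rows in the data
n_loc              # distinct locations
dense_gib(n_loc)   # one dense correlation matrix for these locations
dense_gib(1e4)     # roughly 0.75 GiB for 10K locations
dense_gib(3e4)     # roughly 6.7 GiB for 30K locations
```

The quadratic growth is why thinning the set of distinct locations, or switching to a sparse lattice approximation, matters more than reducing the number of rows.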
>
> Sarah
>
> On Wed, Jul 15, 2020 at 10:01 AM Francois Rousset
> <francois.rousset using umontpellier.fr> wrote:
>
>     Dear Thierry,
>
>     thanks. So (as expected) this is a different issue. spaMM can fit
>     some correlation models described by objects produced by
>     INLA::inla.spde2.matern(). In my past experiments with such
>     models, the computation times were close to those of INLA, and
>     the memory requirements were much smaller than those I quoted
>     previously. But this is not what I meant by "Matern" in my
>     earlier message.
>
>     Beyond general features that contribute to these computational
>     differences (the use of sparse matrix methods and, to a lesser
>     extent, the constraint on the smoothness parameter of the
>     approximated Matern model), the 'cutoff' argument in your call to
>     inla.mesh.2d() appears important: it reduces the number of
>     locations actually considered in the most costly computations
>     to below the number of locations in the data (to 8804 rather than
>     30K, if I get it right). This would also allow a faster fit by
>     spaMM when called on the resulting inla.spde2 object.
>
>     Best,
>
>     F.
>
>     On 15/07/2020 at 12:50, Thierry Onkelinx wrote:
>>     Dear François,
>>
>>     Here you go:
>>     https://drive.google.com/drive/folders/1Ocq88Yq9u_lM-loayRQlMyBS2HLy_Tio
>>     Almost 30K locations. The fit took a little over 7 min on my
>>     laptop with 16 GB RAM.
>>
>>     Best regards,
>>
>>     ir. Thierry Onkelinx
>>     Statisticus / Statistician
>>
>>     Vlaamse Overheid / Government of Flanders
>>     INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR
>>     NATURE AND FOREST
>>     Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality
>>     Assurance
>>     thierry.onkelinx using inbo.be <mailto:thierry.onkelinx using inbo.be>
>>     Havenlaan 88 bus 73, 1000 Brussel
>>     www.inbo.be <http://www.inbo.be>
>>
>>     ///////////////////////////////////////////////////////////////////////////////////////////
>>     To call in the statistician after the experiment is done may be
>>     no more than asking him to perform a post-mortem examination: he
>>     may be able to say what the experiment died of. ~ Sir Ronald
>>     Aylmer Fisher
>>     The plural of anecdote is not data. ~ Roger Brinner
>>     The combination of some data and an aching desire for an answer
>>     does not ensure that a reasonable answer can be extracted from a
>>     given body of data. ~ John Tukey
>>     ///////////////////////////////////////////////////////////////////////////////////////////
>>
>>     <https://www.inbo.be>
>>
>>
>>     On Wed 15 Jul 2020 at 00:10, Francois Rousset
>>     <francois.rousset using umontpellier.fr> wrote:
>>
>>         Dear Thierry,
>>
>>         please provide a reproducible example so that we know what
>>         you have actually done.
>>
>>         Best,
>>
>>         F.
>>
>>         On 14/07/2020 at 20:00, Thierry Onkelinx wrote:
>>>         Dear François and Sarah,
>>>
>>>         INLA seems more efficient. I ran a model with a Matern
>>>         correlation structure on 13K locations (1 observation per
>>>         location) in under 10 minutes on a laptop with 16 GB RAM.
>>>
>>>         Best regards,
>>>
>>>         ir. Thierry Onkelinx
>>>
>>>         On Tue 14 Jul 2020 at 18:22, Francois Rousset
>>>         <francois.rousset using umontpellier.fr> wrote:
>>>
>>>             Dear Sarah,
>>>
>>>             On 14/07/2020 at 16:55, Sarah Chisholm wrote:
>>>             > Hi Mollie, thank you for your suggestion. glmmTMB seems like a
>>>             > good option for my needs as well. In your sample code above, can
>>>             > you explain what the term 'group' does in matern(pos+0|group)?
>>>             > Does this allow the spatial correlation structure to be applied
>>>             > to specific groupings in the data (in my case, for example, by
>>>             > 'continent')?
>>>             >
>>>             > Francois, thank you for this very clear answer. This is a very
>>>             > convenient feature of the function! May I ask you a couple of
>>>             > other questions about some issues that I've had with
>>>             > spaMM::fitme()?
>>>             >
>>>             > In particular, when I try fitting this model to a large data set
>>>             > (~14 000 rows x 7 columns, ~2 MB), the model will run for an
>>>             > extended period of time, to the point where I've had to terminate
>>>             > the computation. I've tried applying the suggestions that are
>>>             > mentioned in the user guide, i.e. setting init=list(lambda=0.1)
>>>             > and init=list(lambda=NaN). Implementing init=list(lambda=0.1)
>>>             > returned an error suggesting that there was a lack of memory,
>>>             > while running the model with init=list(lambda=NaN) also ran for
>>>             > an extended period of time without completing. Is there something
>>>             > else I can do to speed up the fit of these models?
>>>             >
>>>             > I've had a similar problem with an even larger data set
>>>             > (~185 000 rows x 8 columns, ~21 MB), where, when I try running
>>>             > the model, this error is returned immediately:
>>>             >
>>>             > Error in ZA %*% xmatrix : Cholmod error 'problem too large' at
>>>             > file ../Core/cholmod_dense.c, line 105
>>>             >
>>>             > I've tried running this model on two devices, both with a 64-bit
>>>             > OS with Windows 10, one with 32 GB of RAM and the other with
>>>             > 64 GB. I've gotten the same error from both devices. Is there a
>>>             > way that fitme() can accommodate these large data sets?
>>>
>>>             spaMM can handle large data sets, but the first issue to consider
>>>             here is the number of distinct locations for the spatial random
>>>             effect. The large correlation matrices of geostatistical models
>>>             will always be a problem, in terms of both memory requirements
>>>             and potentially huge computation times. My guess from past
>>>             experiments is that one should still be able to fit models with
>>>             ~10K locations within a few days on a computer with <60 GB of RAM
>>>             (given perhaps some tinkering with the arguments), so at least
>>>             the data set of 14 000 rows should be feasible, particularly if
>>>             the number of locations is smaller than the number of rows.
>>>
>>>             Anyone planning to analyze large spatial data sets should
>>>             anticipate these problems and check for themselves whether there
>>>             is any practical alternative suitable for their particular
>>>             problem. The discussion in section 6.2 of the "gentle
>>>             introduction" to spaMM may then be useful.
>>>
>>>             Best,
>>>
>>>             F.
>>>
>>>             >
>>>             > Thank you,
>>>             >
>>>             > Sarah
>>>
>
>
> -- 
> Sarah Chisholm
> MSc Candidate
> Department of Biology
> University of Ottawa
> Linkedin <http://www.linkedin.com/in/sarah-chisholm-422a5785>