[R-sig-Geo] Seeking guidance for best application of spdep localG_perm

Cunningham, Angela cunninghamar using ornl.gov
Mon Jul 8 15:19:20 CEST 2024


Thank you both for the detailed responses!

-----Original Message-----
From: Roger Bivand <Roger.Bivand using nhh.no> 
Sent: Thursday, July 4, 2024 4:47 AM
To: Josiah Parry <josiah.parry using gmail.com>; Cunningham, Angela <cunninghamar using ornl.gov>
Cc: r-sig-geo using r-project.org
Subject: [EXTERNAL] Re: [R-sig-Geo] Seeking guidance for best application of spdep localG_perm

Thanks!

The issue Josiah referred to for pysal/esda is: https://github.com/pysal/esda/issues/199 ; as you can see, things take time. However, the problem was only found after Bivand & Wong (2018, https://openaccess.nhh.no/nhh-xmlui/handle/11250/2565494?show=full , https://doi.org/10.1007/s11749-018-0599-x ). So yes, for non-bell-shaped variables and using conditional permutations, "Pr(z != E(Gi)) Sim" should be preferred to "Pr(folded) Sim" and "Pr(z != E(Gi))". "Pr(z != E(Gi)) Sim" is computed with punif() on the unit interval [0, 1], given the number of conditional simulations and the chosen alternative (default "two.sided"). If the variable is bell-shaped, conditional permutation is probably unnecessary, as the analytical values will be very close to "Pr(z != E(Gi))" for the same alternative.
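
For anyone wanting to try this, a minimal sketch (assuming the Columbus data shipped with spdep as data(oldcol); the exact column labels of the "internals" attribute may differ between spdep versions, so check colnames() first):

library(spdep)

data(oldcol)                         # Columbus OH data shipped with spdep: COL.OLD, COL.nb
lw <- nb2listw(COL.nb, style = "B")  # binary weights, common for Getis-Ord Gi

set.seed(1)
res <- localG_perm(COL.OLD$CRIME, lw, nsim = 999)

internals <- attr(res, "internals")  # per-observation statistics and p-values
colnames(internals)                  # confirm the exact column labels in your spdep version
p_sim <- internals[, "Pr(z != E(Gi)) Sim"]
head(p_sim)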

The general direction of https://r-spatial.org/book/15-Measures.html and https://doi.org/10.1111/gean.12319 may be summarised as:

1) ESDA is ESDA: p-values are just a measure, in no sense implying anything inferential. Use of the term "interesting" as proposed in Anselin (2019, https://doi.org/10.1111/gean.12164) is judicious. In all cases, adjustment for multiple comparisons is also judicious, to avoid everything being seen as interesting; FDR-adjustment is a reasonable compromise (see the sketch after this list).

2) ESDA is also prone to mixing up global and local spatial dependence, and local spatial dependence with local spatial heterogeneity; this makes any statement about results conditional on their not being affected by those omissions.

3) ESDA should arguably be carried out on the residuals of fitted models, because covariates and/or a modelled global spatial process could account for global dependence and heterogeneity.
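
Points 1) and 3) might look like this in practice, continuing the Columbus sketch above (the covariates are purely illustrative; spdep's hotspot() wraps a comparable adjustment step and, as noted in the original question, defaults to FDR):

fit <- lm(CRIME ~ INC + HOVAL, data = COL.OLD)   # illustrative model only
res_G <- localG_perm(residuals(fit), lw, nsim = 999)
p_sim <- attr(res_G, "internals")[, "Pr(z != E(Gi)) Sim"]

p_fdr <- p.adjust(p_sim, method = "fdr")         # adjust for multiple comparisons
sum(p_fdr < 0.05)                                # how many observations remain "interesting"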

All one gets from increasing the number of simulations is more digits in the p-value; the value, as a measure rather than an inferential test, will be more precise. This is not p-hacking; it just increases the number of digits. With 99 draws, and adding the observed Gi, if it scores rank 1 its one-sided "greater" pseudo-p will be 0.01. If we go to 999 draws and the observed Gi still has rank 1, we get 0.001, which is a more precise estimate of the same value; the observed Gi has not changed.

> spdep:::probs_lut(nsim=99, alternative="greater")[1]
[1] 0.01
> spdep:::probs_lut(nsim=999, alternative="greater")[1]
[1] 0.001
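
The same numbers can be reproduced by hand from the usual rank-based pseudo-p definition (a sketch of the arithmetic, not spdep's internal code path):

nsim <- 99
rank_of_observed <- 1            # observed Gi is the most extreme of the nsim + 1 values
rank_of_observed / (nsim + 1)    # 0.01; with nsim = 999 the same rank gives 0.001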

As permutations are random, results will vary between successive uses of the same function and arguments unless the RNG seed is re-set to the same value, thereby yielding the same stream(s) of random numbers.
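
A small reproducibility sketch, assuming the default serial permutation backend (newer spdep versions also expose an iseed= argument for the same purpose; see ?localG_perm):

set.seed(42)
run1 <- localG_perm(COL.OLD$CRIME, lw, nsim = 999)
set.seed(42)
run2 <- localG_perm(COL.OLD$CRIME, lw, nsim = 999)
all.equal(attr(run1, "internals"), attr(run2, "internals"))   # TRUE when the seed is reset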

Probably this should be written up properly; a broader community effort to provide guidance would be most welcome!

Please do comment if you feel that the above is too categorical!

Thanks for an interesting set of questions!

Roger

--
Roger Bivand
Emeritus Professor
Norwegian School of Economics
Postboks 3490 Ytre Sandviken, 5045 Bergen, Norway
Roger.Bivand using nhh.no

________________________________________
From: R-sig-Geo <r-sig-geo-bounces using r-project.org> on behalf of Josiah Parry <josiah.parry using gmail.com>
Sent: 03 July 2024 13:34
To: Cunningham, Angela
Cc: r-sig-geo using r-project.org
Subject: Re: [R-sig-Geo] Seeking guidance for best application of spdep localG_perm

This is all very well said! I would recommend using the percentile-based approach that Roger implemented. The Pysal folks are in the process of adopting it (with a slight adjustment). I think it is the most "accurate" p-value you will get from the functions today.

I don't have a recommendation for the upper bound. But you do bring up a good point about the classification of them. I don't think I'm as qualified to answer that!

On Wed, Jul 3, 2024 at 06:25 Cunningham, Angela via R-sig-Geo < r-sig-geo using r-project.org> wrote:

> Hello all,
>
> I am using spdep (via sfdep) for a cluster analysis of the rate of 
> rare events.  I am hoping you can provide some advice on how to apply 
> these functions most appropriately. Specifically I am interested in 
> any guidance about which significance calculation might be best in 
> these circumstances, and which (if any) adjustment for multiple 
> testing and spatial dependence (Bonferroni, FDR, etc) should be paired 
> with the different p value calculations.
>
> When running localG_perm(), three Pr values are returned: Pr(z != 
> E(Gi)), Pr(z != E(Gi)) Sim, and Pr(folded) Sim. My understanding is 
> that the first value is based on the mean and should only be used for 
> normally distributed data, that the second uses a rank-percentile 
> approach and is more robust, and that the last uses a Pysal-based 
> calculation and may be quite sensitive. Is this correct? The second, 
> Pr(z != E(Gi)) Sim, appears to be the most appropriate for my data situation; would you suggest otherwise?
>
> The documentation for localG_perm states that "for inference, a 
> Bonferroni-type test is suggested"; thus any adjustments for e.g. 
> multiple testing must be made in a second step, such as with the 
> p.adjust arguments in the hotspot() function, correct? Further, while 
> fdr is the default for hotspot(), are there situations like having 
> small numbers, a large number of simulations, or employing a 
> particular Prname which would recommend a different p.adjust method?
>
> Also, if I can bother you all with a very basic question: given that 
> significance is determined through conditional permutation simulation, 
> increasing the number of simulations should refine the results and 
> make them more reliable, but unless a seed is set, I assume that it is 
> still always possible that results will change slightly across 
> separate runs of a model, perhaps shifting an observation to either 
> side of a threshold. Aside from computation time, are there other 
> reasons to avoid increasing the number of simulations beyond a certain 
> point? (It feels a bit like "p-hacking" to increase the nsim ad 
> infinitum.) Are slight discrepancies in hot spot assignment between 
> runs to be expected even with a large number of permutations? Is this particularly the case when working with small numbers?
>
> Thank you for your time and consideration.
>
>
> Angela R Cunningham, PhD
> Spatial Demographer (R&D Associate)
> Human Geography Group | Human Dynamics Section
>
> Oak Ridge National Laboratory
> Computational Sciences Building (5600), O401-29
> 1 Bethel Valley Road, Oak Ridge, TN 37830
> cunninghamar using ornl.gov
>
>


_______________________________________________
R-sig-Geo mailing list
R-sig-Geo using r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-geo


