[R-sig-Geo] Re: DCluster

Wed Apr 13 19:08:55 CEST 2005

Hi,

This message is a reply to an e-mail about DCluster that Jakob Petersen
sent two weeks ago. I am very sorry for the delay.

> I am applying DCluster (ver. 0.1-3, windows) and would be grateful for 
> advice on a few topics.

> 1. GAM
> 
> A) As I understand it the default setting is based on a poisson 
> distribution. This creates some not implausible clusters, but I wonder 
> whether I could set the opgam, so that it uses a negative binomial 
> distribution (for which I have the parameters for the ‘disease’ 
> variable; size and mu) or to use a bootstrap procedure instead. Some of 
> the internal functions, like opgam.iscluster.negbin, seem to support 
> this, but I am uncertain about how to incorporate them.

You can use the 'iscluster' argument to pass the function that tests
whether the current ball is a cluster or not (for example,
'iscluster=opgam.iscluster.negbin'). In fact, opgam is nothing more than
a general function that you can use to try different scan statistics
(such as Kulldorff's statistic).

> B) To reduce the multiple testing problem (Waller & Gotway 2004, 
> “Applied Spatial Statistics for Public Health Data”, Wiley, p.208) I 
> wonder whether to set radius to <50% of step size, e.g. 100m radius in a 
> 300m grid, so that the smallest circles won't touch?

It seems a good idea, but clusters at other scales may remain
undetected. The key here is that, for a given radius, small variations
in the 'cluster' centre may produce different configurations of areas.
This means that, especially when small areas are present, changing the
step of the grid by a small amount (e.g., 100 meters) lets you test many
different configurations of areas.

So, if you force the grid to have at least a 300-meter step and you work
at the postcode level, you may not be able to detect some clusters.

> 2. Besag-Newell
> 
> I am getting results with ‘poisson’ (almost everything becomes a cluster 
> - possibly because the sites are clumped and not randomly distributed) 
> and with ‘permutation’, but wonders how the ‘negbin’ is used? Not like 

I think that if you have extra-Poisson variation (overdispersion) in
your data, the observed number of cases in many areas will be higher
than expected under the Poisson model. That might be the reason...
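A rough way to see this effect is a small simulation. The sketch below
uses base R only, it is not DCluster code, and the number of areas, the
expected counts and the 'size' value are all made up for illustration.
It draws overdispersed counts and tests each area against the Poisson
upper tail:

```r
# Minimal simulation (not DCluster code): overdispersed counts tested
# against Poisson critical values. All values below are hypothetical.
set.seed(1)
n_areas  <- 2000                 # number of areas
expected <- rep(5, n_areas)      # expected count in each area
size     <- 2                    # negative binomial size parameter

# Counts with mean 'expected' but variance expected + expected^2 / size
observed <- rnbinom(n_areas, size = size, mu = expected)

# Upper-tail Poisson p-value for each area: P(X >= observed)
p_pois <- ppois(observed - 1, lambda = expected, lower.tail = FALSE)

# Share of areas flagged at the 5% level under the Poisson model
flagged <- mean(p_pois < 0.05)
print(flagged)
```

With these settings the flagged share comes out well above 5%, even
though no clustering was simulated.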

I wouldn't use the 'permutation' model unless I had very homogeneous
areas (i.e., areas with the same population). In addition, the
'permutation' model is better suited to checking the homogeneity of the
data than to detecting clusters.

> this:
> 
> > bnresults<-opgam(pcpoor, thegrid=pcpoor[,c("x","y")], alpha=.05,
> + iscluster=bn.iscluster, set.idxorder=TRUE, k=20, model="negbin",
> + R=100, mle=calculate.mle(pcpoor) )
> Error in rnbinom(n, size, prob) : invalid arguments
>
Try adding 'model="negbin"' to the call to 'calculate.mle', i.e.,
'mle=calculate.mle(pcpoor, model="negbin")'. As you have written it, you
are estimating the parameters for the Poisson model, not the Negative
Binomial. The number and names of the parameters differ between these
two models.

> 3. Kulldorff & Nagarwalla
> 
> Again I struggle with the parameters. Not like this:
> 
> > #K&N's method over the centroids
> > mle<-calculate.mle(pcpoor, model="negbin")
> Error in while (((abs(m - m0) > tol * (m + m0)) || (abs(v - v0) > tol * :
>   missing value where TRUE/FALSE needed

The error occurs in the iterative procedure to estimate the parameters
but I can't guess why. Could you run 'empbaysmooth(pcpoor$Observed,
pcpoor$Expected)' to see if it crashes?

> 
> > knresults<-opgam(data=pcpoor, thegrid=pcpoor[,c("x","y")], alpha=.05,
> + iscluster=kn.iscluster, fractpop=.5, R=100, model="negbin", mle=mle)
> Error in rnbinom(n, size, prob) : invalid arguments

'mle' has not been properly estimated. That's why you get an error.

> 4. Turnbull. Is Turnbull analysis possible in DCluster yet?. Some 
> references in the manual, but haven’t been able to locate it.

Hmm. It was originally in the package, but I decided to remove it
(together with all the references to it in the documentation, although
it seems that some remain). If you really want to use it, I can send you
what I have (I think I still have it).

> 5. General
> 
> A) I am considering increasing the study area (p.t. working with 1262 
> postcode points) and wonder what the limits might be for a desktop pc. I 
> gather that the distance matrices (created by tripack or spdep) could be 
> a limiting factor? Would it be an idea to run this step first and once 
> the table is created run the cluster detection algorithm?

Well... :D I really like this question because I have been thinking
about that and I think the solution is parallel computing. :P

I am serious. Most of the computations in the scan methods can be done
in parallel if the total study region is split into smaller regions. The
only tricky thing is to be careful with the boundaries and consider all
the areas surrounding every smaller study area.
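To make the point about boundaries concrete, here is a minimal sketch in
base R. It is not DCluster code, and the point pattern, radius and strip
limits are made up. The region is cut into strips, each strip is
extended by a buffer as wide as the radius, and the counts obtained
strip by strip are compared with those computed on the whole region:

```r
# Minimal sketch (not DCluster code): count, for every centre, the
# points within 'radius', working strip by strip. All values are
# hypothetical.
set.seed(1)
pts    <- data.frame(x = runif(500, 0, 1000), y = runif(500, 0, 1000))
radius <- 100                        # ball radius
breaks <- seq(0, 1000, by = 250)     # limits of the strips along x

count_within <- function(centres, candidates, radius) {
  vapply(seq_len(nrow(centres)), function(i) {
    d2 <- (candidates$x - centres$x[i])^2 + (candidates$y - centres$y[i])^2
    sum(d2 <= radius^2)
  }, integer(1))
}

# Assign each point to a strip by its x coordinate
strip <- cut(pts$x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

process_strip <- function(s) {
  idx  <- which(strip == s)
  # Candidates: the strip plus a buffer of width 'radius' on both sides
  near <- pts$x >= breaks[s] - radius & pts$x <= breaks[s + 1] + radius
  data.frame(id = idx, n = count_within(pts[idx, ], pts[near, ], radius))
}

# lapply() runs sequentially; parallel::parLapply() could replace it
parts    <- do.call(rbind, lapply(seq_len(length(breaks) - 1), process_strip))
by_strip <- parts$n[order(parts$id)]

# Reference: the same counts computed on the whole region at once
global <- count_within(pts, pts, radius)
print(identical(by_strip, global))
```

Without the buffer, the balls centred near a strip limit would miss
points that fall in the neighbouring strip.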

But if you want to check the limits of a desktop PC, computing the
distances once is a good idea. Compute them and save them to a file. I
am sure loading them will take less time than recomputing them, and you
can select the distances you need later.
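Something along these lines would do. This is only a sketch with base R
functions ('dist', 'saveRDS', 'readRDS'); the random coordinates stand
in for the postcode points and the file name is made up:

```r
# Minimal sketch with base R only; 'coords' stands in for the postcode
# coordinates and the file name is hypothetical.
set.seed(1)
coords <- cbind(x = runif(1262, 0, 10000), y = runif(1262, 0, 10000))

# Compute all pairwise Euclidean distances once
d <- dist(coords)

# Save them to disk and reload them in a later session
dist_file <- file.path(tempdir(), "postcode_dist.rds")
saveRDS(d, dist_file)
d2 <- readRDS(dist_file)

# Select only the distances needed, e.g. the points within 300 m of point 1
m <- as.matrix(d2)
neighbours_1 <- which(m[1, ] <= 300)
print(length(neighbours_1))

# Memory used by the stored distances (n * (n - 1) / 2 values)
print(format(object.size(d), units = "MB"))
```

Note that the number of stored distances grows with the square of the
number of points, so this is the object to watch when the study area is
enlarged.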

> B) I wonder whether permutations always are superior to standard stats. 
> Distributions, and if not, then why not?

Permutations only make sense when you have very homogeneous areas. Note
that if you have non-homogeneous populations the areas can't be compared
directly and the permutation test is not useful (it will always give
significant results).

By using standard distributions, each area has a number of parameters
that describe how data behave in that particular area. Then, the
observed number of cases in each area is compared to the critical values
of that specific distribution.

When the distribution is more complex (for example, a negative
binomial), the bootstrap gives you an approximation to the critical
values.

Note that when you use a multinomial/Poisson model, areas can be
aggregated (i.e., you construct a new variable which is the sum of the
current variables) and the observed number of cases in the new set of
areas still follows a multinomial/Poisson distribution (with different
parameters).

This is not the case when you use a Negative Binomial distribution,
because the sum of random variables following this distribution is not,
in general, Negative Binomial. It is only so if the 'prob' parameter is
the same for all of them, which is highly unlikely in an epidemiological
problem.
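This can be checked numerically. The sketch below uses base R only and
made-up parameter values. It computes the exact distribution of a sum of
two counts by convolution and compares it with the candidate
distribution in three cases: Poisson, Negative Binomial with equal
'prob', and Negative Binomial with different 'prob':

```r
# Exact check by convolution (base R only; all parameter values are
# hypothetical).
k <- 0:60

# pmf of X1 + X2 on 0..60 from the pmfs of X1 and X2 on the same support
convolve_pmf <- function(p1, p2) {
  vapply(seq_along(p1), function(i) sum(p1[1:i] * p2[i:1]), numeric(1))
}

# Poisson: the sum is again Poisson, with rate 2 + 5
pois_sum  <- convolve_pmf(dpois(k, 2), dpois(k, 5))
pois_diff <- max(abs(pois_sum - dpois(k, 7)))

# Negative Binomial, same prob: the sum is Negative Binomial, size 2 + 3
nb_same      <- convolve_pmf(dnbinom(k, size = 2, prob = 0.4),
                             dnbinom(k, size = 3, prob = 0.4))
nb_same_diff <- max(abs(nb_same - dnbinom(k, size = 5, prob = 0.4)))

# Negative Binomial, different prob: compare with the Negative Binomial
# that has the same mean and variance as the sum
mu <- 2 * 0.8 / 0.2   + 2 * 0.2 / 0.8      # means add
v  <- 2 * 0.8 / 0.2^2 + 2 * 0.2 / 0.8^2    # variances add
nb_mixed      <- convolve_pmf(dnbinom(k, size = 2, prob = 0.2),
                              dnbinom(k, size = 2, prob = 0.8))
nb_mixed_diff <- max(abs(nb_mixed -
                         dnbinom(k, size = mu^2 / (v - mu), mu = mu)))

print(c(pois_diff, nb_same_diff, nb_mixed_diff))
```

The first two differences are at the level of rounding error, while the
third is not. Since a Negative Binomial is fixed by its mean and
variance, no other choice of parameters can match the sum in the third
case either.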

I hope this answers all your questions. I am afraid this e-mail is
difficult to digest... but don't hesitate to write if you have further
questions.

Best regards,

Virgilio