[R-sig-Geo] error message when running errorsarlm
Roger.Bivand at nhh.no
Fri May 23 08:42:56 CEST 2008
On Thu, 22 May 2008, Roger Bivand wrote:
> On Thu, 22 May 2008, evans324 at umn.edu wrote:
>> On May 22 2008, Roger Bivand wrote:
>> > > > Does that mean that you get a sensible lambda for your model now -
>> > > > the line search leads somewhere other than a boundary of the
>> > > > interval?
>> > >
>> > > I apologize for being unclear. I actually upgraded R and updated
>> > > packages, then ran errorsarlm with method="Matrix" and got the same
>> > > error messages I'd had previously (i.e., the search led to the
>> > > boundary of the interval). I then tried your other suggestion and
>> > > used method="spam" and got a result with no error messages.
>> > But we do not know why the two are not the same (they should be), so I
>> > would still not trust the outcome. I would be interested in off-list
>> > access to the data being used - I think that there is some issue with
>> > the scaling of the variable values. Do you see the same difference using
>> > spautolm(), which is effectively the same as errorsarlm(), but with a
>> > different internal structure?
>> I do see the same difference using spautolm() and get no error messages
>> using it. I'll send you the data separately and would appreciate your
>> opinion on them.
To close the part of the thread about differences between the spam and
Matrix methods, I can report that on Linux (both 2GB and 1GB 32-bit),
there is no difference for the original model using the 2500m distance
criterion, and both line searches reach a lambda of 0.9165158. The same
applies to Windows 32-bit. R version 2.7.0 using R-generic BLAS, spdep
0.4-21, Matrix 0.999375-9, spam 0.13-3, in all cases.
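
For readers who want to see what the line search for lambda is doing, here is a minimal base-R sketch on simulated data. It is not the spdep code: it uses a dense log-determinant instead of the sparse Matrix or spam methods, and the data, the search interval of (-0.99, 0.99) and all object names are assumptions made up for illustration.

```r
# Illustration only: dense, base-R line search for lambda in a SAR error
# model. All data below are simulated; nothing here is the thread's data.
set.seed(1)
n <- 60
k <- 4
xy <- cbind(runif(n), runif(n))            # hypothetical point locations
D <- as.matrix(dist(xy))

# k nearest neighbours, then coerced to symmetric (no isolated points)
A <- matrix(0, n, n)
for (i in seq_len(n)) A[i, order(D[i, ])[2:(k + 1)]] <- 1
A <- pmax(A, t(A))
W <- A / rowSums(A)                        # row-standardised weights

# Simulate y = X beta + u, with u = (I - 0.5 W)^{-1} e
X <- cbind(1, rnorm(n))
u <- solve(diag(n) - 0.5 * W, rnorm(n))
y <- X %*% c(1, 2) + u

# Negative concentrated log-likelihood in lambda (constants dropped)
negloglik <- function(lambda) {
  B <- diag(n) - lambda * W
  e <- qr.resid(qr(B %*% X), B %*% y)      # residuals of the filtered model
  ldet <- as.numeric(determinant(B, logarithm = TRUE)$modulus)
  -(ldet - n / 2 * log(sum(e^2) / n))
}

interval <- c(-0.99, 0.99)                 # assumed search interval
opt <- optimize(negloglik, interval = interval)
lambda_hat <- opt$minimum

# A value sitting on an end of the interval is the warning sign
at_boundary <- min(abs(lambda_hat - interval)) < 1e-3
```

If at_boundary is TRUE, the search has run into an end of the interval and has not found an interior maximum, which is the symptom reported earlier in this thread.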
The sparser weights cases described below run with adequate speed. The
semi-dense weights (average # neighbours 280) run more slowly, with most of
the time spent making sure that the weights are exactly symmetric. This is
now done in the R functions similar.listw() and listw2U(), both of which
will be re-written to hand off the time-consuming parts to compiled C.
The data were 10055 house prices, the objective to fit a hedonic
regression. Conclusion: using a more sparse neighbour representation is
advisable; the computational problems could not be reproduced, but looked
initially like a package version issue - Matrix is moving fast, and spdep
is trying to keep up with it.
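
To make the point about sparse but symmetric neighbours concrete, here is a small base-R sketch on simulated points. It only mimics the idea of taking k nearest neighbours and coercing the relation to symmetric; it does not call spdep, and the point pattern, n and k are assumptions chosen for illustration.

```r
# Illustration only: k-nearest-neighbour relation coerced to symmetric,
# using simulated points and base R (no spdep objects involved).
set.seed(42)
n <- 200
k <- 5
xy <- cbind(runif(n), runif(n))        # hypothetical point locations
D <- as.matrix(dist(xy))

# Asymmetric relation: each point gets exactly its k nearest neighbours
A <- matrix(0L, n, n)
for (i in seq_len(n)) {
  A[i, order(D[i, ])[2:(k + 1)]] <- 1L # first entry of order() is i itself
}

# Coerce to symmetric: i and j are neighbours if either chose the other
S <- pmax(A, t(A))

card <- rowSums(S)                     # neighbour counts after symmetrising
range(card)                            # minimum stays at k or above
mean(S)                                # share of non-zero entries, still sparse
isSymmetric(S)
```

After symmetrising, no point has fewer than k neighbours, some have more, and the relation stays far sparser than a wide distance band would give.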
> OK, thanks. On first inspection, the choice of a distance criterion for
> neighbours seems to be part of the problem. Using:
> nb_k5 <- knn2nb(knearneigh(coordinates(rd), k=5))
> nb_k5s <- make.sym.nb(nb_k5)
> where rd is the SpatialPointsDataFrame object, with many fewer neighbours
> than your 2500m or 3000m criteria, gives results from "Matrix" and "spam"
> that are identical, and most likely what you are after. These weights are the
> 5 nearest neighbours coerced to symmetric, so all have at least 5 neighbours and the
> largest number of neighbours is 12 (your 2500m criterion had a mean number of
> neighbours of 280, maximum 804). If you can live without your choice of
> neighbours (which in some settings may be getting pretty close to your market
> segment dummies), I'd advise using something much sparser (but symmetric).
> The sparser weights matrices also increase the speed dramatically.
> If you look at the bottom of ?bptest.sarlm, you'll see a cheap and totally
> untested way of adjusting the output SEs, but please don't believe what it
> does, because it is treating the lambda value as known, not estimated. A
> guess at the remaining heterogeneity would be age by maintenance interaction,
> older houses will vary in value by maintenance, probably also by
> Hope this helps,
>> > > > There are different traditions. Econometricians and some others in
>> > > > social science try to trick the standard errors by "magic", while
>> > > > epidemiologists (and crime people) typically use case weights -
>> > > > that is model the heteroscedasticity directly. spautolm() can
>> > > > include such case weights. I don't think that there is any
>> > > > substantive and reliable theory for adjusting the SE, that is
>> > > > theory that doesn't appeal to assumptions we already know don't
>> > > > hold. Sampling from the posterior gives a handle on this, but is
>> > > > not simple, and doesn't really suit 10K observations.
>> > > >
>> > > Can you explain "magic" a little further? I'm running this for a
>> > > professor who is a bit nervous about black box techniques and I'd
>> > > like to be able to offer him a good explanation. I think he'll just
>> > > have me calculate White's standard errors and ignore spatial
>> > > autocorrelation if I can't be clearer.
>> > >
>> > If this is all your "professor" can manage, please replace/educate! The
>> > model is fundamentally misspecified, and neither "magicing" the standard
>> > errors, nor just fitting a simultaneous autoregressive error model will
>> > let you make fair decisions on the "significance" or otherwise of the
>> > right-hand side variables, which I suppose is the object of the
>> > exercise?
>> I agree here, but haven't been able to get much advice on this. I
>> appreciate your input.
>> > (Looking at Johnston & DiNardo (1997), pp. 164-166, it looks as if
>> > White's SE only help asymptotically (in Prof. Ripley's well-known
>> > remark, asymptotics are a foreign country with spatial data), and not in
>> > finite samples, and their performance is unknown if the residuals are
>> > autocorrelated, which is the case here).
>> > The vast number of observations is no help either, because they
>> > certainly introduce heterogeneity that has not been controlled for. Is
>> > this a grid of global species occurrence data, by any chance? Which RHS
>> > variables are covering for differences in environmental drivers? Or is
>> > there a better reason for using many observations (instead of careful
>> > data collection) than just their being available?
>> This is a hedonic regression with a goal of eliciting economic values for
>> different percentages of tree cover on parcels and in the local
>> neighborhood as capitalized in home sale prices. We're using all 2005
>> residential sales from Ramsey and Dakota counties in Minnesota, USA as our
>> observations. This gives us sales from most study area regions and for all
>> months. I'll send you a description of the RHS variables with the dataset.
>> > More observations do not mean more information if meaningful differences
>> > across the observations are not captured by included variables (with the
>> > correct functional form). Have you tried GAM with flexible functional
>> > forms on the RHS variables and s(x,y) on the (point) locations of the
>> > observations?
>> I haven't tried this, but will look into it.
>> > You are not alone in your plight, but if the inferences matter, then
>> > it's better to be cautious, irrespective of the "professor".
>> Thanks very much for your help.
>> --- Heather Sander
>> Ph.D. Candidate: Conservation Biology
>> Office: 305 Ecology & 420 Blegen
>> Mail: University of Minnesota
>> Dept. of Geography
>> 414 Social Science Bldg.
>> 267 19th Ave. S.
>> Minneapolis, MN 55455
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no