[R-sig-Geo] Advice on GLS on residual variogram?
Lyndon Estes
lestes at princeton.edu
Wed Jun 16 02:55:03 CEST 2010
Dear Thierry,
Many thanks for your help.
> Have a look at the normalised residuals when creating a variogram. Those
> take the correlation structure into account.
Using the normalized residuals for the variogram produces more
sensible looking results (see bottom image here:
http://sites.google.com/site/ldemisc/variogram):
# OLS and GLS
gls.2<-gls(log(CTOP)~PCB1+PCB2+PCB3+PCB4+PCB5+PCB6+PCB8,data=oc_2)
gls.2.sph<-update(gls.2,correlation=corSpher(form=~X+Y,nugget=TRUE))
# OLS variogram
v.gls<-variogram(residuals(gls.2)~1,data=oc_2)
v.gls.pl<-plot(v.gls)
# GLS variogram
v.gls.sph<-variogram(residuals(gls.2.sph,type="normalized")~1,data=oc_2)
v.gls.sph.pl<-plot(v.gls.sph)
# Side by side variogram plot
v.gls.pl$x.limits<-v.gls.sph.pl$x.limits
v.gls.pl$y.limits<-v.gls.sph.pl$y.limits
print(v.gls.pl,split=c(1,1,2,1),more=TRUE)
print(v.gls.sph.pl,split=c(2,1,2,1),more=FALSE)
This looks like much more plausible GLS output, no?
> And have a look at the
> parameters of the correlation structure. Are they from a similar
> magnitude as you would expect from the OLS variogram? If not, rerun the
> GLS with starting values for range and sill based on the OLS variogram.
> I had some strange results in the past with range parameters being
> smaller than the smallest distance between two points. Supplying
> reasonable starting values yielded better results.
Looking at the variogram parameter found from the OLS residuals, I see
a fair bit of difference between those found by GLS for the
correlation structure:
vgm.ols.norm<-variogram(residuals(gls.2,type="normalized")~1,data=oc_2)
ols.vgm.parms1<-vgm(nugget=0.27,model="Sph",psill=0.6,range=200000)
ols.fit.sph.var<-fit.variogram(vgm.ols2,model=ols.vgm.parms1)
> ols.fit.sph.var
model psill range
1 Nug 0.26197480 0.0
2 Sph 0.09186248 220528.5
#Partial output from summary of GLS fit
>summary(gls.2.sph)
Correlation Structure: Spherical spatial correlation
Formula: ~X + Y
Parameter estimate(s):
range nugget
1509259 0.1357911
So, I will retry the GLS using the fitted variogram values overnight
(the GLS with spherical correlation model takes ~ 6 hours to run).
I had one more issue that came up related to this process, which is
rather alarming--using R on two different installations, with the same
code and same data, I am getting different results for the GLS! I was
hoping someone could help me figure out what's wrong.
Installation 1: MacBook Pro with OS X 10.5.8, with the following R :
R version 2.11.1 (2010-05-31)
[R.app GUI 1.34 (5589) x86_64-apple-darwin9.8.0]
Installation 2: 8 node Linux cluster, with custom version of RHEL5. I
built R 2.11.1 from source here and installed locally.
As I said, with my dataset, I am getting different results.
Everything is the same with the initial portions of the code:
Mac R >
oc<-read.dbf('~/Desktop/ARC_profs_OC+PCA+PREDs.dbf')
oc_2<-as.data.frame(oc[,c(2,5:6,21:27,207:225)])
gls.2<-gls(log(CTOP)~PCB1+PCB2+PCB3+PCB4+PCB5+PCB6+PCB8,data=oc_2)
summary(gls.2)
Generalized least squares fit by REML
Model: log(CTOP) ~ PCB1 + PCB2 + PCB3 + PCB4 + PCB5 + PCB6 + PCB8
Data: oc_2
AIC BIC logLik
6320.516 6375.617 -3151.258
Coefficients:
Value Std.Error t-value p-value
(Intercept) 1.6608899 0.4280775 3.87988 1e-04
PCB1 -0.0002037 0.0000156 -13.03177 0e+00
PCB2 0.0004070 0.0000230 17.68030 0e+00
PCB3 0.0026349 0.0000460 57.28741 0e+00
PCB4 0.0133557 0.0011803 11.31585 0e+00
PCB5 -0.0085951 0.0012186 -7.05294 0e+00
PCB6 0.0315716 0.0012912 24.45119 0e+00
PCB8 -0.0990330 0.0064676 -15.31206 0e+00
[snipped for space]
Residual standard error: 0.6061033
Degrees of freedom: 3377 total; 3369 residual
Linux R >
Generalized least squares fit by REML
Model: log(CTOP) ~ PCB1 + PCB2 + PCB3 + PCB4 + PCB5 + PCB6 + PCB8
Data: oc_2
AIC BIC logLik
6320.516 6375.617 -3151.258
Coefficients:
Value Std.Error t-value p-value
(Intercept) 1.6608899 0.4280775 3.87988 1e-04
PCB1 -0.0002037 0.0000156 -13.03177 0e+00
PCB2 0.0004070 0.0000230 17.68030 0e+00
PCB3 0.0026349 0.0000460 57.28741 0e+00
PCB4 0.0133557 0.0011803 11.31585 0e+00
PCB5 -0.0085951 0.0012186 -7.05294 0e+00
PCB6 0.0315716 0.0012912 24.45119 0e+00
PCB8 -0.0990330 0.0064676 -15.31206 0e+00
[snipped for space]
Residual standard error: 0.6061033
Degrees of freedom: 3377 total; 3369 residual
So, this suggests that the input datasets are not somehow different
from one another. The problem comes here:
Mac R >
gls.2.sph<-update(gls.2,correlation=corSpher(form=~X+Y,nugget=TRUE))
summary(gls.2.sph)
Generalized least squares fit by REML
Model: log(CTOP) ~ PCB1 + PCB2 + PCB3 + PCB4 + PCB5 + PCB6 + PCB8
Data: oc_2
AIC BIC logLik
5742.647 5809.993 -2860.323
Correlation Structure: Spherical spatial correlation
Formula: ~X + Y
Parameter estimate(s):
range nugget
1.509259e+06 1.357911e-01
Coefficients:
Value Std.Error t-value p-value
(Intercept) -1.6911740 0.9610770 -1.759665 0.0786
PCB1 -0.0001095 0.0000166 -6.612440 0.0000
PCB2 0.0012919 0.0000677 19.081299 0.0000
PCB3 0.0014827 0.0001903 7.791159 0.0000
PCB4 0.0026119 0.0014134 1.847974 0.0647
PCB5 -0.0119822 0.0012409 -9.655755 0.0000
PCB6 0.0196123 0.0017884 10.966588 0.0000
PCB8 0.0290101 0.0193420 1.499847 0.1337
Correlation:
(Intr) PCB1 PCB2 PCB3 PCB4 PCB5 PCB6
PCB1 -0.496
PCB2 -0.202 0.229
PCB3 0.260 -0.427 -0.161
PCB4 0.254 -0.439 -0.403 0.280
PCB5 -0.007 -0.256 -0.123 0.023 0.455
PCB6 0.326 -0.405 -0.246 0.428 0.487 0.161
PCB8 -0.618 0.473 0.339 -0.583 -0.535 -0.091 -0.622
Standardized residuals:
Min Q1 Med Q3 Max
-2.8618482 -1.1854998 -0.7948018 -0.3657106 1.3957297
Residual standard error: 1.368062
Degrees of freedom: 3377 total; 3369 residual
Linux R >
gls.2.sph<-update(gls.2,correlation=corSpher(form=~X+Y,nugget=TRUE))
summary(gls.2.sph)
Generalized least squares fit by REML
Model: log(CTOP) ~ PCB1 + PCB2 + PCB3 + PCB4 + PCB5 + PCB6 + PCB8
Data: oc_2
AIC BIC logLik
5742.867 5810.213 -2860.433
Correlation Structure: Spherical spatial correlation
Formula: ~X + Y
Parameter estimate(s):
range nugget
1.625556e+06 1.272891e-01
Coefficients:
Value Std.Error t-value p-value
(Intercept) -1.5625757 0.9952075 -1.570100 0.1165
PCB1 -0.0001095 0.0000166 -6.609627 0.0000
PCB2 0.0012916 0.0000677 19.078453 0.0000
PCB3 0.0014815 0.0001904 7.779445 0.0000
PCB4 0.0026110 0.0014141 1.846390 0.0649
PCB5 -0.0119806 0.0012410 -9.654330 0.0000
PCB6 0.0196109 0.0017893 10.959835 0.0000
PCB8 0.0291060 0.0193667 1.502889 0.1330
Correlation:
(Intr) PCB1 PCB2 PCB3 PCB4 PCB5 PCB6
PCB1 -0.477
PCB2 -0.189 0.229
PCB3 0.247 -0.428 -0.162
PCB4 0.242 -0.439 -0.404 0.281
PCB5 -0.008 -0.256 -0.123 0.023 0.455
PCB6 0.312 -0.405 -0.246 0.429 0.487 0.161
PCB8 -0.592 0.474 0.340 -0.583 -0.535 -0.091 -0.623
Standardized residuals:
Min Q1 Med Q3 Max
-2.8627169 -1.2401476 -0.8617219 -0.4464233 1.2590098
Residual standard error: 1.413038
Degrees of freedom: 3377 total; 3369 residual
Just about everything is different in this, even if just slightly, and
I am not sure why. To check to see if there was a problem with another
dataset, I used meuse:
data(meuse)
meu.ols<-gls(log(zinc)~dist.m+ffreq,meuse)
meu.gls<-update(meu.ols,correlation=corExp(form=~x+y,nugget=TRUE))
summary(meu.gls)
But the results are the same from both, e.g.
Mac >
Parameter estimate(s):
range nugget
236.94597158 0.02719958
Coefficients:
Value Std.Error t-value p-value
(Intercept) 6.759524 0.12988675 52.04168 0
dist.m -0.001910 0.00028717 -6.64969 0
ffreq2 -0.562034 0.06642560 -8.46110 0
ffreq3 -0.557639 0.10166715 -5.48495 0
Linux >
Parameter estimate(s):
range nugget
236.94597172 0.02719958
Coefficients:
Value Std.Error t-value p-value
(Intercept) 6.759524 0.12988675 52.04168 0
dist.m -0.001910 0.00028717 -6.64969 0
ffreq2 -0.562034 0.06642560 -8.46110 0
ffreq3 -0.557639 0.10166715 -5.48495 0
So, I am really not sure what is going on here. As I said, my datasets
seem fine, and to further confirm this I exported the values from oc_2
from linux, imported into my Mac R installation, and did a few tests:
Linux R >
write.csv(oc_2[,c(1:3,9,14:21)],oc2out.txt)
Mac R >
oc_2_ad<-read.csv("oc2out.txt")
str(oc_2_ad)
str(oc_2)
#Compare coordinates
xdiffs<-round(oc_2$X-oc_2_ad$X,5)
ydiffs<-round(oc_2$Y-oc_2_ad$Y,5)
ctopdiffs<-oc_2$CTOP-oc_2_ad$CTOP
pc1diffs<-oc_2$PCB1-oc_2_ad$PCB1
pc2diffs<-oc_2$PCB2-oc_2_ad$PCB2
pc3diffs<-oc_2$PCB3-oc_2_ad$PCB3
pc4diffs<-oc_2$PCB4-oc_2_ad$PCB4
pc5diffs<-oc_2$PCB5-oc_2_ad$PCB5
pc6diffs<-oc_2$PCB6-oc_2_ad$PCB6
pc8diffs<-oc_2$PCB8-oc_2_ad$PCB8
min(xdiffs);max(xdiffs);mean(xdiffs)
min(ydiffs);max(ydiffs);mean(ydiffs)
min(ctopdiffs);max(ctopdiffs);mean(ctopdiffs)
min(pc1diffs);max(pc1diffs);mean(pc1diffs)
min(pc2diffs);max(pc2diffs);mean(pc2diffs)
min(pc3diffs);max(pc3diffs);mean(pc3diffs)
min(pc4diffs);max(pc4diffs);mean(pc4diffs)
min(pc5diffs);max(pc5diffs);mean(pc5diffs)
min(pc6diffs);max(pc6diffs);mean(pc6diffs)
min(pc8diffs);max(pc8diffs);mean(pc8diffs)
The results were all zeros, so the problem doesn't seem to be
different values in the dataset, or misaligned rows. So, I am at a
loss here.
Could this have something to do with a faulty build on the linux cluster?
Thanks in advance for any help and advice!
Cheers, Lyndon
More information about the R-sig-Geo
mailing list