[R-sig-Geo] Advice on GLS on residual variogram?

Wed Jun 16 02:55:03 CEST 2010

Dear Thierry,

Many thanks for your help.

> Have a look at the normalised residuals when creating a variogram. Those
> take the correlation structure into account.

Using the normalized residuals for the variogram produces more
sensible looking results (see bottom image here:
http://sites.google.com/site/ldemisc/variogram):

# OLS and GLS
gls.2<-gls(log(CTOP)~PCB1+PCB2+PCB3+PCB4+PCB5+PCB6+PCB8,data=oc_2)
gls.2.sph<-update(gls.2,correlation=corSpher(form=~X+Y,nugget=TRUE))

# OLS variogram
v.gls<-variogram(residuals(gls.2)~1,data=oc_2)
v.gls.pl<-plot(v.gls)

# GLS variogram
v.gls.sph<-variogram(residuals(gls.2.sph,type="normalized")~1,data=oc_2)
v.gls.sph.pl<-plot(v.gls.sph)

# Side by side variogram plot
v.gls.pl$x.limits<-v.gls.sph.pl$x.limits
v.gls.pl$y.limits<-v.gls.sph.pl$y.limits
print(v.gls.pl,split=c(1,1,2,1),more=TRUE)
print(v.gls.sph.pl,split=c(2,1,2,1),more=FALSE)    			

This looks like much more plausible GLS output, no?

> And have a look at the
> parameters of the correlation structure. Are they from a similar
> magnitude as you would expect from the OLS variogram? If not, rerun the
> GLS with starting values for range and sill based on the OLS variogram.
> I had some strange results in the past with range parameters being
> smaller than the smallest distance between two points. Supplying
> reasonable starting values yielded better results.

Looking at the variogram parameter found from the OLS residuals, I see
a fair bit of difference between those found by GLS for the
correlation structure:

vgm.ols.norm<-variogram(residuals(gls.2,type="normalized")~1,data=oc_2)
ols.vgm.parms1<-vgm(nugget=0.27,model="Sph",psill=0.6,range=200000)
ols.fit.sph.var<-fit.variogram(vgm.ols2,model=ols.vgm.parms1)

> ols.fit.sph.var
  model      psill    range
1   Nug 0.26197480      0.0
2   Sph 0.09186248 220528.5

#Partial output from summary of GLS fit

>summary(gls.2.sph)

Correlation Structure: Spherical spatial correlation
 Formula: ~X + Y
 Parameter estimate(s):
       range       nugget
1509259    0.1357911

So, I will retry the GLS using the fitted variogram values overnight
(the GLS with spherical correlation model takes ~ 6 hours to run).

I had one more issue that came up related to this process, which is
rather alarming--using R on two different installations, with the same
code and same data, I am getting different results for the GLS!  I was
hoping someone could help me figure out what's wrong.

Installation 1: MacBook Pro with OS X 10.5.8, with the following R :

R version 2.11.1 (2010-05-31)
[R.app GUI 1.34 (5589) x86_64-apple-darwin9.8.0]

Installation 2: 8 node Linux cluster, with custom version of RHEL5. I
built R 2.11.1 from source here and installed locally.

As I said, with my dataset, I am getting different results.

Everything is the same with the initial portions of the code:

Mac R >

oc<-read.dbf('~/Desktop/ARC_profs_OC+PCA+PREDs.dbf')		
oc_2<-as.data.frame(oc[,c(2,5:6,21:27,207:225)])
gls.2<-gls(log(CTOP)~PCB1+PCB2+PCB3+PCB4+PCB5+PCB6+PCB8,data=oc_2)
summary(gls.2)

Generalized least squares fit by REML
  Model: log(CTOP) ~ PCB1 + PCB2 + PCB3 + PCB4 + PCB5 + PCB6 + PCB8
  Data: oc_2
       AIC      BIC    logLik
  6320.516 6375.617 -3151.258

Coefficients:
                 Value Std.Error   t-value p-value
(Intercept)  1.6608899 0.4280775   3.87988   1e-04
PCB1        -0.0002037 0.0000156 -13.03177   0e+00
PCB2         0.0004070 0.0000230  17.68030   0e+00
PCB3         0.0026349 0.0000460  57.28741   0e+00
PCB4         0.0133557 0.0011803  11.31585   0e+00
PCB5        -0.0085951 0.0012186  -7.05294   0e+00
PCB6         0.0315716 0.0012912  24.45119   0e+00
PCB8        -0.0990330 0.0064676 -15.31206   0e+00

[snipped for space]

Residual standard error: 0.6061033
Degrees of freedom: 3377 total; 3369 residual

Linux R >

Generalized least squares fit by REML
  Model: log(CTOP) ~ PCB1 + PCB2 + PCB3 + PCB4 + PCB5 + PCB6 + PCB8
  Data: oc_2
       AIC      BIC    logLik
  6320.516 6375.617 -3151.258

Coefficients:
                 Value Std.Error   t-value p-value
(Intercept)  1.6608899 0.4280775   3.87988   1e-04
PCB1        -0.0002037 0.0000156 -13.03177   0e+00
PCB2         0.0004070 0.0000230  17.68030   0e+00
PCB3         0.0026349 0.0000460  57.28741   0e+00
PCB4         0.0133557 0.0011803  11.31585   0e+00
PCB5        -0.0085951 0.0012186  -7.05294   0e+00
PCB6         0.0315716 0.0012912  24.45119   0e+00
PCB8        -0.0990330 0.0064676 -15.31206   0e+00

[snipped for space]

Residual standard error: 0.6061033
Degrees of freedom: 3377 total; 3369 residual

So, this suggests that the input datasets are not somehow different
from one another. The problem comes here:

Mac R >

gls.2.sph<-update(gls.2,correlation=corSpher(form=~X+Y,nugget=TRUE))
summary(gls.2.sph)

Generalized least squares fit by REML
  Model: log(CTOP) ~ PCB1 + PCB2 + PCB3 + PCB4 + PCB5 + PCB6 + PCB8
  Data: oc_2
       AIC      BIC    logLik
  5742.647 5809.993 -2860.323

Correlation Structure: Spherical spatial correlation
 Formula: ~X + Y
 Parameter estimate(s):
       range       nugget
1.509259e+06 1.357911e-01

Coefficients:
                 Value Std.Error   t-value p-value
(Intercept) -1.6911740 0.9610770 -1.759665  0.0786
PCB1        -0.0001095 0.0000166 -6.612440  0.0000
PCB2         0.0012919 0.0000677 19.081299  0.0000
PCB3         0.0014827 0.0001903  7.791159  0.0000
PCB4         0.0026119 0.0014134  1.847974  0.0647
PCB5        -0.0119822 0.0012409 -9.655755  0.0000
PCB6         0.0196123 0.0017884 10.966588  0.0000
PCB8         0.0290101 0.0193420  1.499847  0.1337

 Correlation:
     (Intr) PCB1   PCB2   PCB3   PCB4   PCB5   PCB6
PCB1 -0.496
PCB2 -0.202  0.229
PCB3  0.260 -0.427 -0.161
PCB4  0.254 -0.439 -0.403  0.280
PCB5 -0.007 -0.256 -0.123  0.023  0.455
PCB6  0.326 -0.405 -0.246  0.428  0.487  0.161
PCB8 -0.618  0.473  0.339 -0.583 -0.535 -0.091 -0.622

Standardized residuals:
       Min         Q1        Med         Q3        Max
-2.8618482 -1.1854998 -0.7948018 -0.3657106  1.3957297

Residual standard error: 1.368062
Degrees of freedom: 3377 total; 3369 residual

Linux R >

gls.2.sph<-update(gls.2,correlation=corSpher(form=~X+Y,nugget=TRUE))
summary(gls.2.sph)

Generalized least squares fit by REML
  Model: log(CTOP) ~ PCB1 + PCB2 + PCB3 + PCB4 + PCB5 + PCB6 + PCB8
  Data: oc_2
       AIC      BIC    logLik
  5742.867 5810.213 -2860.433

Correlation Structure: Spherical spatial correlation
 Formula: ~X + Y
 Parameter estimate(s):
       range       nugget
1.625556e+06 1.272891e-01

Coefficients:
                 Value Std.Error   t-value p-value
(Intercept) -1.5625757 0.9952075 -1.570100  0.1165
PCB1        -0.0001095 0.0000166 -6.609627  0.0000
PCB2         0.0012916 0.0000677 19.078453  0.0000
PCB3         0.0014815 0.0001904  7.779445  0.0000
PCB4         0.0026110 0.0014141  1.846390  0.0649
PCB5        -0.0119806 0.0012410 -9.654330  0.0000
PCB6         0.0196109 0.0017893 10.959835  0.0000
PCB8         0.0291060 0.0193667  1.502889  0.1330

 Correlation:
     (Intr) PCB1   PCB2   PCB3   PCB4   PCB5   PCB6
PCB1 -0.477
PCB2 -0.189  0.229
PCB3  0.247 -0.428 -0.162
PCB4  0.242 -0.439 -0.404  0.281
PCB5 -0.008 -0.256 -0.123  0.023  0.455
PCB6  0.312 -0.405 -0.246  0.429  0.487  0.161
PCB8 -0.592  0.474  0.340 -0.583 -0.535 -0.091 -0.623

Standardized residuals:
       Min         Q1        Med         Q3        Max
-2.8627169 -1.2401476 -0.8617219 -0.4464233  1.2590098

Residual standard error: 1.413038
Degrees of freedom: 3377 total; 3369 residual

Just about everything is different in this, even if just slightly, and
I am not sure why. To check to see if there was a problem with another
dataset, I used meuse:

data(meuse)
meu.ols<-gls(log(zinc)~dist.m+ffreq,meuse)
meu.gls<-update(meu.ols,correlation=corExp(form=~x+y,nugget=TRUE))
summary(meu.gls)

But the results are the same from both, e.g.

Mac >

Parameter estimate(s):
       range       nugget
236.94597158   0.02719958

Coefficients:
                Value  Std.Error  t-value p-value
(Intercept)  6.759524 0.12988675 52.04168       0
dist.m      -0.001910 0.00028717 -6.64969       0
ffreq2      -0.562034 0.06642560 -8.46110       0
ffreq3      -0.557639 0.10166715 -5.48495       0

Linux >

 Parameter estimate(s):
       range       nugget
236.94597172   0.02719958

Coefficients:
                Value  Std.Error  t-value p-value
(Intercept)  6.759524 0.12988675 52.04168       0
dist.m      -0.001910 0.00028717 -6.64969       0
ffreq2      -0.562034 0.06642560 -8.46110       0
ffreq3      -0.557639 0.10166715 -5.48495       0

So, I am really not sure what is going on here. As I said, my datasets
seem fine, and to further confirm this I exported the values from oc_2
from linux, imported into my Mac R installation, and did a few tests:

Linux R >

write.csv(oc_2[,c(1:3,9,14:21)],oc2out.txt)

Mac R >

oc_2_ad<-read.csv("oc2out.txt")
str(oc_2_ad)
str(oc_2)

#Compare coordinates

xdiffs<-round(oc_2$X-oc_2_ad$X,5)
ydiffs<-round(oc_2$Y-oc_2_ad$Y,5)
ctopdiffs<-oc_2$CTOP-oc_2_ad$CTOP
pc1diffs<-oc_2$PCB1-oc_2_ad$PCB1
pc2diffs<-oc_2$PCB2-oc_2_ad$PCB2
pc3diffs<-oc_2$PCB3-oc_2_ad$PCB3
pc4diffs<-oc_2$PCB4-oc_2_ad$PCB4
pc5diffs<-oc_2$PCB5-oc_2_ad$PCB5
pc6diffs<-oc_2$PCB6-oc_2_ad$PCB6
pc8diffs<-oc_2$PCB8-oc_2_ad$PCB8

min(xdiffs);max(xdiffs);mean(xdiffs)							
min(ydiffs);max(ydiffs);mean(ydiffs)							
min(ctopdiffs);max(ctopdiffs);mean(ctopdiffs)
min(pc1diffs);max(pc1diffs);mean(pc1diffs)
min(pc2diffs);max(pc2diffs);mean(pc2diffs)
min(pc3diffs);max(pc3diffs);mean(pc3diffs)
min(pc4diffs);max(pc4diffs);mean(pc4diffs)
min(pc5diffs);max(pc5diffs);mean(pc5diffs)
min(pc6diffs);max(pc6diffs);mean(pc6diffs)
min(pc8diffs);max(pc8diffs);mean(pc8diffs)			

The results were all zeros, so the problem doesn't seem to be
different values in the dataset, or misaligned rows.  So, I am at a
loss here.

Could this have something to do with a faulty build on the linux cluster?

Thanks in advance for any help and advice!

Cheers, Lyndon