[R] confusion over "names" of lm.influence()$hat

Wed Apr 15 04:18:10 CEST 2009

I am fitting a locally weighted regression model to housing data,
including only observations within a certain distance of the house in
question. To cross-validate the bandwidth I am collecting elements of
the "hat matrix" H (where y-hat = H y). I was convinced I could get
the diagonal elements of the hat matrix from lm.influence()$hat. In
particular, I want the one element that corresponds to the observation
at which I am running the locally weighted regression. When I looked
more closely at the lm.influence()$hat output, I realized that the
observations used in my regression do not appear to be the same
observations for which hat values are returned. I had assumed that the
"names" attached to lm.influence()$hat were the observation numbers in
the regression data. Am I wrong?
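
For reference, here is a minimal sketch of the relationship described
above, on made-up data rather than the housing data. The data frame d,
its columns, and the chosen row positions are all hypothetical. With
0/1 weights, the values in lm.influence()$hat should match the diagonal
of X (X'X)^-1 X' computed from the positively weighted rows alone:

```r
# Made-up data standing in for housedata (all names here are hypothetical)
set.seed(1)
d <- data.frame(x = rnorm(20), z = rnorm(20))
d$y <- 1 + 2 * d$x - d$z + rnorm(20)

# 0/1 weights, as with a fixed distance threshold
d$w <- 0
d$w[c(3, 5, 8, 9, 12, 15, 18)] <- 1

fit <- lm(y ~ x + z, data = d, weights = w)
h <- lm.influence(fit)$hat   # zero-weight cases are dropped here

# Hat matrix built by hand from the positively weighted rows only
X <- model.matrix(~ x + z, data = d[d$w > 0, ])
H <- X %*% solve(crossprod(X)) %*% t(X)

# Compare the two sets of diagonal elements, ignoring names
print(all.equal(unname(h), unname(diag(H))))
```

Note that h has one entry per positively weighted row, not one per row
of d.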

I've included a code snippet and its output. I am confused as to why  
the observations for which I give positive weights in the regression  
do not appear to be the same as the "names" in the hat matrix output.  
Do you know what mistake I am making?

 > obs <- 451  # the location/observation in the data for which we are
 >             # currently running the regression, for example
 > require(fields)
 > # calculate the distance all other observations are from this observation
 > Di = t(rdist.earth(cbind(housedata$longitude[obs], housedata$latitude[obs]),
+                    cbind(housedata$longitude, housedata$latitude)))
 >
 > ##########################
 > b = .3  # this is the relevant distance threshold
 >
 > housedata$w <- 0          # generate a "weights" variable
 > housedata$w[Di < b] <- 1  # give all observations closer than b a weight of 1
 > # this tells me which observations are included in this regression
 > print(which(housedata$w > 0))
 [1] 333 336 340 345 346 376 378 406 414 418 419 425 426 427 428 429 430 431
[19] 436 438 441 444 450 451 456 457 458 461 462 463 464 465 467 468 469 470
[37] 471 474 475 476 479 481 483 488 494 496 508 512 514 518 525 526 528 530
[55] 531 533 538 539 544 548 563 572 576 584 585 587 591 594 595 600 601 607
[73] 613 615 616 617 618 624 631 637 638 641 645 647 652 653 654 655 656 659
[91] 663 678 681 685 688 689 691 693 694 711 712
 > # run the linear regression only including the observations within
 > # the distance threshold
 > result.b <- lm(adjprice ~ lotsize + squareft + garagesqft +
+                  numbath + numbed + time,
+                data = housedata,
+                weights = w)
 > # collect the hat matrix
 > print(lm.influence(result.b)$hat)
       345        348        352        357        358        389        391
0.06332126 0.06332126 0.05592105 0.09368046 0.10605304 0.05592105 0.09757274
       419        427        431        432        438        439        440
0.03762151 0.10091480 0.04979739 0.05659565 0.05160888 0.03915642 0.10149422
       441        442        443        444        449        451        722
0.05572360 0.03086186 0.05624229 0.04658039 0.09087753 0.06436925 0.09952022
       725        731        732        737        738        739        742
0.08183102 0.06732644 0.05362610 0.04742278 0.05196055 0.02725287 0.03086186
       743        744        745        746        748        749        750
0.03848066 0.06161776 0.03352387 0.09729289 0.04968367 0.04588662 0.04620045
       751        752        755        756        757        760        762
0.08194437 0.07748418 0.20282956 0.05679513 0.05283027 0.08194437 0.05737857
       764        769        775        777        789        793        795
0.14753830 0.04742278 0.04409041 0.04675800 0.05739381 0.05739381 0.04125143
       799        806        807        809        811        812        814
0.11049178 0.05286319 0.04125143 0.13971558 0.03192842 0.04254609 0.06587966
       819        820        825        829        844        853        857
0.23414783 0.02942560 0.04627927 0.04968367 0.04968367 0.04627927 0.02689040
       865        866        868        872        875        876        881
0.10691998 0.09988275 0.06171944 0.08152409 0.11049178 0.04627927 0.05572857
       882        888        894        896        897        898        899
0.10646147 0.04149530 0.12769051 0.04092457 0.06117365 0.04092457 0.04316847
       905        912        918        919        922        926        928
0.17072235 0.04125143 0.06117365 0.14435872 0.04309004 0.06117365 0.05196055
       933        934        935        936        937        940        944
0.06065717 0.03094961 0.18271286 0.10755273 0.05196055 0.06117365 0.06117365
       959        962        966        969        971        973        975
0.13231524 0.06752826 0.06752826 0.06752826 0.06752826 0.06117365 0.06752826
       976        994        995
0.04149530 0.04125143 0.06158475

I only noticed this problem because several times the observation in
question wasn't even part of the hat matrix output. Am I incorrect in
assuming that the output from print(which(housedata$w>0)) should be
the same as the "names" from print(lm.influence(result.b)$hat)? Both
have the same length (101 observations in the output above), but they
don't appear to be the same observations.
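
One way to test that assumption is sketched below on made-up data in
which the row names deliberately differ from the row positions, as
happens when a data frame is a subset of a larger one. Whether this is
true of housedata is an assumption, not something the output above
establishes; the data frames full and d and the chosen positions are
hypothetical. The sketch compares the "names" of lm.influence()$hat
with the row positions from which() and with the row names of the
positively weighted rows:

```r
# Made-up data whose row names do not match row positions (hypothetical)
set.seed(2)
full <- data.frame(x = rnorm(30))
full$y <- 1 + full$x + rnorm(30)
d <- full[seq(2, 30, by = 2), ]   # 15 rows, row names "2", "4", ..., "30"

# Weights assigned by row POSITION, as with housedata$w[Di < b] <- 1
d$w <- 0
d$w[c(1, 4, 6, 7, 10, 13)] <- 1

fit <- lm(y ~ x, data = d, weights = w)
h <- lm.influence(fit)$hat

pos <- which(d$w > 0)             # row positions with positive weight

# Do the names look like row positions, or like row names?
print(identical(as.integer(names(h)), pos))
print(identical(names(h), rownames(d)[pos]))

# Look up the hat value for the row at a given position via its row name
obs_pos <- 7
print(h[rownames(d)[obs_pos]])
```

If the second comparison is the one that holds for the real data,
indexing the hat values by rownames(housedata)[obs] instead of by obs
would pick out the intended element.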

Thanks to anyone who can help me clear this up,

Aaron