[R] confusion over "names" of lm.influence()$hat
Aaron M. Swoboda
aaron.swoboda at gmail.com
Wed Apr 15 04:18:10 CEST 2009
I am performing a locally weighted regression model using housing
data, where I only include observations within a certain distance of
the house in question. For cross-validation of the bandwidth I am
collecting elements of the "hat matrix" (where y-hat = hat matrix %*% y).
I was convinced I could grab the diagonal elements of the hat matrix
using lm.influence()$hat. In particular, I am interested in grabbing
the one element of the hat matrix that corresponds with the
observation I am running my locally weighted regression at. When I
looked more closely at the lm.influence()$hat output, I realized that
the observations used in my regression do not appear to be the same
observations for which the hat matrix returns values. I had assumed
the "names" associated with lm.influence()$hat were the observation
numbers for the regression data. Am I wrong?
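To make concrete what I mean by the hat matrix in a weighted regression, here is a minimal sketch on made-up data (the data frame `d`, its columns, and the weights are hypothetical, not my housing data). It assumes the usual weighted least squares form H = X (X'WX)^{-1} X'W and compares its diagonal with what lm.influence()$hat returns.

```r
# Toy data (hypothetical), with strictly positive weights
set.seed(1)
n <- 20
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)
d$w <- runif(n, 0.5, 2)

fit <- lm(y ~ x1 + x2, data = d, weights = w)

# Hat matrix for weighted least squares: H = X (X'WX)^{-1} X'W
X <- model.matrix(fit)
W <- diag(d$w)
H <- X %*% solve(t(X) %*% W %*% X) %*% t(X) %*% W

# Fitted values are y-hat = H y
yhat_by_hand <- as.vector(H %*% d$y)

# Diagonal of H, by hand and from lm.influence()
h_by_hand <- diag(H)
h_lm <- lm.influence(fit)$hat

all.equal(unname(h_lm), unname(h_by_hand))
all.equal(yhat_by_hand, unname(fitted(fit)))
```

The comparison ignores names on purpose, since the names are exactly the part I am unsure about.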
I've included a code snippet and its output. I am confused as to why
the observations for which I give positive weights in the regression
do not appear to be the same as the "names" in the hat matrix output.
Do you know what mistake I am making?
> # this is the location/observation in the data for which we are
> # currently running the regression, for example
> obs <- 451
> require(fields)
> # calculate the distance all other observations are from this observation
> Di=t(rdist.earth(cbind(housedata$longitude[obs],housedata$latitude[obs]),
+                  cbind(housedata$longitude,housedata$latitude) ))
>
> ##########################
> b=.3 # this is the relevant distance threshold
>
> housedata$w <- 0 # generate a "weights" variable
> # give all observations closer than b a weight of 1
> housedata$w[Di<b] <- 1
> # this tells me which observations are included in this regression
> print(which(housedata$w>0))
 [1] 333 336 340 345 346 376 378 406 414 418 419 425 426 427 428 429 430 431
[19] 436 438 441 444 450 451 456 457 458 461 462 463 464 465 467 468 469 470
[37] 471 474 475 476 479 481 483 488 494 496 508 512 514 518 525 526 528 530
[55] 531 533 538 539 544 548 563 572 576 584 585 587 591 594 595 600 601 607
[73] 613 615 616 617 618 624 631 637 638 641 645 647 652 653 654 655 656 659
[91] 663 678 681 685 688 689 691 693 694 711 712
> # run the linear regression only including the observations within
> # the distance threshold
> result.b <- lm(adjprice~lotsize+squareft+garagesqft+numbath+numbed+time,
+                data=housedata,
+                weights=w )
> # collect the hat matrix
> print(lm.influence(result.b)$hat)
       345        348        352        357        358        389        391
0.06332126 0.06332126 0.05592105 0.09368046 0.10605304 0.05592105 0.09757274
       419        427        431        432        438        439        440
0.03762151 0.10091480 0.04979739 0.05659565 0.05160888 0.03915642 0.10149422
       441        442        443        444        449        451        722
0.05572360 0.03086186 0.05624229 0.04658039 0.09087753 0.06436925 0.09952022
       725        731        732        737        738        739        742
0.08183102 0.06732644 0.05362610 0.04742278 0.05196055 0.02725287 0.03086186
       743        744        745        746        748        749        750
0.03848066 0.06161776 0.03352387 0.09729289 0.04968367 0.04588662 0.04620045
       751        752        755        756        757        760        762
0.08194437 0.07748418 0.20282956 0.05679513 0.05283027 0.08194437 0.05737857
       764        769        775        777        789        793        795
0.14753830 0.04742278 0.04409041 0.04675800 0.05739381 0.05739381 0.04125143
       799        806        807        809        811        812        814
0.11049178 0.05286319 0.04125143 0.13971558 0.03192842 0.04254609 0.06587966
       819        820        825        829        844        853        857
0.23414783 0.02942560 0.04627927 0.04968367 0.04968367 0.04627927 0.02689040
       865        866        868        872        875        876        881
0.10691998 0.09988275 0.06171944 0.08152409 0.11049178 0.04627927 0.05572857
       882        888        894        896        897        898        899
0.10646147 0.04149530 0.12769051 0.04092457 0.06117365 0.04092457 0.04316847
       905        912        918        919        922        926        928
0.17072235 0.04125143 0.06117365 0.14435872 0.04309004 0.06117365 0.05196055
       933        934        935        936        937        940        944
0.06065717 0.03094961 0.18271286 0.10755273 0.05196055 0.06117365 0.06117365
       959        962        966        969        971        973        975
0.13231524 0.06752826 0.06752826 0.06752826 0.06752826 0.06117365 0.06752826
       976        994        995
0.04149530 0.04125143 0.06158475
I only noticed this problem because several times the observation in
question wasn't even a part of the hat matrix output. Am I incorrect
in assuming that the output from print(which(housedata$w>0)) should
be the same as the "names" from print(lm.influence(result.b)$hat)?
Both have the same length (in this case 101 observations), but they
don't appear to be the same observations.
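One way to test this assumption is on a toy data frame whose row names deliberately differ from its row positions, as happens after subsetting or reordering a larger data set. The sketch below uses made-up data (`full`, `d`, and the chosen weights are hypothetical, not my housing data). As far as I can tell from ?lm.influence, cases with zero weight are dropped, and the names appear to be the data frame's row names rather than positions, so the labels should match even when the positions do not.

```r
# Toy data (hypothetical): take rows of a larger data frame out of order,
# so that row names no longer equal row positions
set.seed(42)
full <- data.frame(x = rnorm(30))
full$y <- 1 + 2 * full$x + rnorm(30)
d <- full[c(17, 5, 22, 9, 30, 1, 14, 3, 27, 11, 8, 20), ]

# 0/1 weights, set by position within d
d$w <- 0
d$w[c(2, 3, 5, 7, 8, 11)] <- 1

fit <- lm(y ~ x, data = d, weights = w)
h <- lm.influence(fit)$hat

pos_used   <- which(d$w > 0)          # positions within d
names_used <- rownames(d)[pos_used]   # row labels at those positions

# The names of the hat values agree with the row labels ...
identical(names(h), names_used)
# ... but not with the positions returned by which()
identical(names(h), as.character(pos_used))

# Look up the hat value for one position of interest via its row label
obs <- 5
h[rownames(d)[obs]]
```

If this carries over to my data, the element I want would be indexed by rownames(housedata)[obs] rather than by obs itself.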
Thanks to anyone who can help me clear this up,
Aaron