[R] R package 'np' problems

Sun Nov 14 19:30:24 CET 2010

Hi List,

I'm trying to get a density estimate at a point of interest from an npudens
object fitted to a sample of points. I'm working with 4 variables in total
(3 continuous and 1 unordered discrete; the discrete variable is the
character column in training.csv).

Both when I evaluate the density at a point that was not in the training
dataset and when I extract the fitted values from the npudens object itself,
some of the values are much greater than 1. If I understand correctly, that
shouldn't be possible, since a pdf estimate can only be between 0 and 1. I
think I must be doing something wrong, but I can't see it.

I've attached the training data (training.csv) and the point of interest
(origin.csv); the code I'm using and the results I'm getting are below. I
also don't understand why, when I try to evaluate the npudens object at a
single point with predict(), I get back the same set of fitted values that
fitted() returns for the npudens object.

Note that I'm indexing the data frame of training data to get the samples
used for density estimation: the samples are from different geographic
locations measured on the same set of variables, hence the subsetting by [i]
and the removal of columns from the data frame before running the density
estimation. In the example here, the point of interest does happen to come
from the training dataset, but I get the same kind of results when I compare
the point of interest to samples it is not part of: density estimates that
are either extremely small, which is acceptable, or much greater than one,
which doesn't seem right to me.

Any thoughts would be greatly appreciated,

Chris

> fitted(npudens(tdat = training_df[training_cols_select][training_df$cat == i, ]))

[1] 7.762187e+18 9.385532e+18 6.514318e+18 7.583486e+18 6.283017e+18
 [6] 6.167344e+18 9.820551e+18 7.952821e+18 7.882741e+18 1.744266e+19
[11] 6.653258e+18 8.704722e+18 8.631365e+18 1.876052e+19 1.995445e+19
[16] 2.323802e+19 1.203780e+19 8.493055e+18 8.485279e+18 1.722033e+19
[21] 2.227207e+19 2.177740e+19 2.168679e+19 9.329572e+18 9.380505e+18
[26] 1.023311e+19 2.109676e+19 7.903112e+18 7.935457e+18 8.917777e+18
[31] 8.899827e+18 6.265440e+18 6.204720e+18 6.276559e+18 6.218002e+18

> npu_dens <- npudens(tdat = training_df[training_cols_select][training_df$cat == i, ])
> summary(npu_dens)

Density Data: 35 training points, in 4 variable(s)
              aster_srtm_aspect aster_srtm_dem_filled aster_srtm_slope
Bandwidth(s):          29.22422          2.500559e-24         3.111467
              class_unsup_pc_iso
Bandwidth(s):          0.2304616

Bandwidth Type: Fixed
Log Likelihood: 1531.598

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Vars.: 3

Unordered Categorical Kernel Type: Aitchison and Aitken
No. Unordered Categorical Vars.: 1

> predict(npu_dens, newdata = origin[training_cols_select])

[1] 7.762187e+18 9.385532e+18 6.514318e+18 7.583486e+18 6.283017e+18
 [6] 6.167344e+18 9.820551e+18 7.952821e+18 7.882741e+18 1.744266e+19
[11] 6.653258e+18 8.704722e+18 8.631365e+18 1.876052e+19 1.995445e+19
[16] 2.323802e+19 1.203780e+19 8.493055e+18 8.485279e+18 1.722033e+19
[21] 2.227207e+19 2.177740e+19 2.168679e+19 9.329572e+18 9.380505e+18
[26] 1.023311e+19 2.109676e+19 7.903112e+18 7.935457e+18 8.917777e+18
[31] 8.899827e+18 6.265440e+18 6.204720e+18 6.276559e+18 6.218002e+18