[R] R package 'np' problems

Mon Nov 15 10:34:01 CET 2010

Chris Carleton wrote:
> Hi List,
> 
> I'm trying to get a density estimate for a point of interest from an npudens
> object created for a sample of points. I'm working with 4 variables in total
> (3 continuous and 1 unordered discrete - the discrete variable is the
> character column in training.csv). When I try to evaluate the density for a
> point that was not used in the training dataset, and when I extract the
> fitted values from the npudens object itself, I'm getting values that are
> much greater than 1 in some cases, which, if I understand correctly,
> shouldn't be possible considering a pdf estimate can only be between 0 and
> 1. I think I must be doing something wrong, but I can't see it. Attached
> I've included the training data (training.csv) and the point of interest
> (origin.csv); below I've included the code I'm using and the results I'm
> getting. I also don't understand why, when trying to evaluate the npudens
> object at one point, I'm receiving the same set of fitted values from the
> npudens object with the predict() function. It should be noted that I'm
> indexing the dataframe of training data in order to get samples of the df
> for density estimation (the samples are from different geographic locations
> measured on the same set of variables; hence my use of sub-setting by [i]
> and removing columns from the df before running the density estimation).
> Moreover, in the example I'm providing here, the point of interest does
> happen to come from the training dataset, but I'm receiving the same results
> when I compare the point of interest to samples of which it is not a part
> (density estimates that are either extremely small, which is acceptable, or
> much greater than one, which doesn't seem right to me). Any thoughts would
> be greatly appreciated,
> 
> Chris
> 

I haven't looked at this in any detail, but why do you say that pdf values
cannot exceed 1? That's certainly not true in general.
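
To illustrate the point: a density only has to integrate to 1, so its height
can be far above 1 wherever the distribution is concentrated. A minimal sketch
in base R, using a narrow normal and a narrow uniform chosen purely for
illustration (neither comes from the attached data):

```r
# Density of N(0, sd = 0.1) at its mode: 1 / (0.1 * sqrt(2 * pi)), well above 1.
peak_normal <- dnorm(0, mean = 0, sd = 0.1)

# Density of Uniform(0, 0.5) is 1 / 0.5 = 2 everywhere on its support.
peak_unif <- dunif(0.25, min = 0, max = 0.5)

# Both still integrate to 1. For the normal, +/- 10 standard deviations
# covers effectively all of the probability mass.
area_normal <- integrate(dnorm, -1, 1, mean = 0, sd = 0.1)$value
area_unif   <- integrate(dunif, 0, 0.5, min = 0, max = 0.5)$value

print(c(peak_normal = peak_normal, peak_unif = peak_unif))
print(c(area_normal = area_normal, area_unif = area_unif))
```

The smaller the spread, the taller the density has to be to keep the area at 1.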

   -Peter Ehlers

>> fitted(npudens(tdat=training_df[training_cols_select][training_df$cat == i,]))
> 
> [1] 7.762187e+18 9.385532e+18 6.514318e+18 7.583486e+18 6.283017e+18
>  [6] 6.167344e+18 9.820551e+18 7.952821e+18 7.882741e+18 1.744266e+19
> [11] 6.653258e+18 8.704722e+18 8.631365e+18 1.876052e+19 1.995445e+19
> [16] 2.323802e+19 1.203780e+19 8.493055e+18 8.485279e+18 1.722033e+19
> [21] 2.227207e+19 2.177740e+19 2.168679e+19 9.329572e+18 9.380505e+18
> [26] 1.023311e+19 2.109676e+19 7.903112e+18 7.935457e+18 8.917777e+18
> [31] 8.899827e+18 6.265440e+18 6.204720e+18 6.276559e+18 6.218002e+18
> 
>> npu_dens <- npudens(tdat=training_df[training_cols_select][training_df$cat == i,])
>> summary(npu_dens)
> 
> Density Data: 35 training points, in 4 variable(s)
>               aster_srtm_aspect aster_srtm_dem_filled aster_srtm_slope
> Bandwidth(s):          29.22422          2.500559e-24         3.111467
>               class_unsup_pc_iso
> Bandwidth(s):          0.2304616
> 
> Bandwidth Type: Fixed
> Log Likelihood: 1531.598
> 
> Continuous Kernel Type: Second-Order Gaussian
> No. Continuous Vars.: 3
> 
> Unordered Categorical Kernel Type: Aitchison and Aitken
> No. Unordered Categorical Vars.: 1
> 
>> predict(npu_dens,newdata=origin[training_cols_select])
> 
> [1] 7.762187e+18 9.385532e+18 6.514318e+18 7.583486e+18 6.283017e+18
>  [6] 6.167344e+18 9.820551e+18 7.952821e+18 7.882741e+18 1.744266e+19
> [11] 6.653258e+18 8.704722e+18 8.631365e+18 1.876052e+19 1.995445e+19
> [16] 2.323802e+19 1.203780e+19 8.493055e+18 8.485279e+18 1.722033e+19
> [21] 2.227207e+19 2.177740e+19 2.168679e+19 9.329572e+18 9.380505e+18
> [26] 1.023311e+19 2.109676e+19 7.903112e+18 7.935457e+18 8.917777e+18
> [31] 8.899827e+18 6.265440e+18 6.204720e+18 6.276559e+18 6.218002e+18
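
As a rough check on the magnitudes above, and only a sketch: assuming npudens
uses the usual product-kernel estimator (Gaussian kernels for the continuous
variables, Aitchison and Aitken for the categorical one), each training point
contributes K(0)/h per continuous variable to its own fitted value. With the
bandwidth of 2.500559e-24 reported for aster_srtm_dem_filled, that
self-contribution alone is already of order 1e18. The variable names below are
illustrative; the numbers are copied from the summary() output.

```r
# Values copied from the summary() output above.
n      <- 35                                    # training points
h      <- c(29.22422, 2.500559e-24, 3.111467)   # continuous bandwidths
lambda <- 0.2304616                             # categorical bandwidth

# Second-order Gaussian kernel evaluated at zero distance.
k0 <- dnorm(0)

# Aitchison-Aitken weight when the category matches: 1 - lambda.
# Assumed estimator (not taken from the np source code): the term a
# training point contributes to its own density estimate is
#   (1 / n) * prod(k0 / h) * (1 - lambda)
self_term <- prod(k0 / h) * (1 - lambda) / n
print(self_term)
```

A bandwidth that small effectively puts a spike on each observed value of that
variable, so it may be worth checking that variable (ties, scale) before
reading much into the fitted values.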