[R] Root mean square on binned GAM results

Sat Jun 19 05:08:42 CEST 2010

On Jun 18, 2010, at 10:38 PM, David Jarvis wrote:

> Hi, David.
>
> accurately reflect how closely the model (GAM) fits the data. I was told
>
> This was my presumption; I could be mistaken.
>
> that the accuracy of the correlation can be improved using a root mean
> square deviation (RMSD) calculation on binned data.
>
> By whom? ...  and with what theoretical basis?
>
> I talked with Christian Schunn. He mentioned that using RMSD would  
> produce a better result for goodness-of-fit (if that term is not  
> synonymous with correlation, I apologise -- I'm still rather new to  
> this level of statistics):
>
> http://www.lrdc.pitt.edu/schunn/gof/index.html
>
> It was regarding a chart similar to:
>
> http://i.imgur.com/X0gxV.png
>
> In the chart, the calculations for Pearson's, Spearman's, and  
> Kendall's Tau provide, in my opinion, an incorrect indicator of  
> the strength of GAM's fit to the data. I could be wrong here, too.
>
> His suggestion was to bin the means (in groups of 5 or so) to  
> reduce the noise.
>
> I doubt that your strategy offers any statistical advantage, but if  
> you want to play around with it then consider:
>
> binned.x <- round( (x + 2.5)/5)
>
> > d <- c(1,3,5,4,3,6,3,1,5,7,8,9,4,3,2,7,3,6,8,9,5,3,1,4,5,8,9,3,3,2,5,7,8,8,5,4,3,2,6,4,3,1,4,5,6,8,9,0,7,7,5,4,3,3,2,1,3,4,5,6,7,9,0,2,4,3,3)
> > binned.d <- round( (d + 2.5)/5)
> > print(binned.d)
>  [1] 1 1 2 1 1 2 1 1 2 2 2 2 1 1 1 2 1 2 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2 2 2 1 1 1
> [39] 2 1 1 1 1 2 2 2 2 0 2 2 2 1 1 1 1 1 1 1 2 2 2 2 0 1 1 1 1
>
> That doesn't make sense to me.

Then I blame your powers of exposition. Without some sort of explicit  
example the parsing of English is very prone to error. If you want to  
pick out elements of x in some pre-specified order in groups of five  
then consider:

 > x <- 1:100
 >
 > rep(1:20, each=5)
   [1]  1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4  4  5  5  5
  [24]  5  5  6  6  6  6  6  7  7  7  7  7  8  8  8  8  8  9  9  9  9  9 10
  [47] 10 10 10 10 11 11 11 11 11 12 12 12 12 12 13 13 13 13 13 14 14 14 14
  [70] 14 15 15 15 15 15 16 16 16 16 16 17 17 17 17 17 18 18 18 18 18 19 19
  [93] 19 19 19 20 20 20 20 20
 > tapply(x, rep(1:20, each=5), mean)
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20   # this row is just indices
  3  8 13 18 23 28 33 38 43 48 53 58 63 68 73 78 83 88 93 98   # this row is the means

If you wanted them in random groups of roughly 5, then you could use a  
random grouping vector such as sample(1:(n/5), n, replace=TRUE), where  
n is length(x), in place of rep(1:20, each=5).
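
A minimal sketch of the random-grouping idea, assuming n is length(x)
and a multiple of 5. The names n, grp and grp2 are introduced here only
for illustration. Shuffling a balanced label vector gives groups of
exactly 5, while drawing labels with replacement gives groups of
roughly 5:

```r
set.seed(1)                        # only so the sketch is repeatable
x <- 1:100
n <- length(x)                     # assumed to be a multiple of 5

# Exactly 5 per group: shuffle a balanced vector of group labels
grp <- sample(rep(seq_len(n / 5), each = 5))
random.means <- tapply(x, grp, mean)

# Roughly 5 per group: draw each label independently
grp2 <- sample(seq_len(n / 5), n, replace = TRUE)
rough.means <- tapply(x, grp2, mean)
```

With the second form some groups can be much smaller or larger than 5
(or empty), so the balanced shuffle is probably the safer default.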

> My impression was that I should try to put every 5 values in a bin,  
> average that bin, then calculate the RMSD between the observed  
> values and the values from GAM. In other words (o is observed and m  
> is model):

Do you intend that m[n] would be the predicted value from a model? How  
are you forming the groups of 5? Are they ordered? If so, ordered by  
observed or by predicted? (In R a "model" is a complex list structure,  
but may in some cases have a simple predicted value for each case.)  
Again, a specific example might work wonders.
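
For concreteness, here is one possible reading of the question, sketched
under stated assumptions: o holds observed values, m holds the
predictions (simulated here, not taken from a real GAM fit), cases are
ordered by m, and consecutive groups of 5 are averaged. None of these
choices comes from the original post, and bin.size, keep and grp are
illustrative names.

```r
set.seed(42)                         # simulated data, for illustration only
m <- sort(runif(103, 0, 10))         # stand-in for model predictions, ordered
o <- m + rnorm(103)                  # stand-in for observed values

bin.size <- 5
# Keep only complete bins (drops the trailing 3 cases here)
keep <- seq_len(length(o) - length(o) %% bin.size)
grp  <- rep(seq_len(length(keep) / bin.size), each = bin.size)

omean <- tapply(o[keep], grp, mean)  # binned observed means
mmean <- tapply(m[keep], grp, mean)  # binned model means

# RMSD: square the differences first, then average, then take the root
rmsd <- sqrt(mean((omean - mmean)^2))
rmsd
```

Ordering by observed instead of by predicted would generally give a
different answer, which is why the grouping rule needs to be stated.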

-- 
David.

>
>   bins <- 5
>
>   while( length(o) %% bins != 0 ) {
>     o <- o[-length(o)]
>   }
>   omean <- apply( matrix(o, bins), 2, mean )
>
>   while( length(m) %% bins!= 0 ) {
>     m <- m[-length(m)]
>   }
>   mmean <- apply( matrix(m, bins), 2, mean )
>
>   sqrt( mean( ( omean - mmean ) ^ 2 ) )
>
> But that feels sloppy, error prone, and fragile.
>
> Joris mentioned that I could try using tapply with  
> cut(d,round(length(d)/5)). I couldn't figure out how to get the  
> means back from the factors.
>
> Dave
>
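
On the tapply/cut suggestion, a sketch assuming the goal is means of
consecutive runs of about 5 values. cut() applied to d itself bins by
value, so applying it to the positions seq_along(d) may be closer to
what was meant. Either way tapply() returns the means directly, named
by the bin. The names idx.bins, bin.means and val.means are
illustrative only.

```r
d <- c(1,3,5,4,3,6,3,1,5,7,8,9,4,3,2,7,3,6,8,9,5,3,1)  # first 23 values of d

# Binning by position: cut the index, not the data
idx.bins  <- cut(seq_along(d), breaks = round(length(d) / 5), labels = FALSE)
bin.means <- tapply(d, idx.bins, mean)   # one mean per positional bin
bin.means

# Binning by value: what cut(d, ...) does; means are per value range
val.means <- tapply(d, cut(d, breaks = 3), mean)
val.means
```

Because cut() makes equal-width intervals, the positional bins hold 4 or
5 values each here rather than exactly 5.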

David Winsemius, MD
West Hartford, CT