[R] Testing for strength of fit using R

Thu Nov 26 17:35:24 CET 2009

On Nov 26, 2009, at 9:48 AM, Steve Murray wrote:

>
> Dear all,
>
> I am trying to validate a model by comparing simulated output values  
> against observed values. I have produced a simple X-y scatter plot  
> with a 1:1 line, so that the closer the points fall to this line,  
> the better the 'fit' between the modelled data and the observation  
> data.
>
> I am now attempting to quantify the strength of this fit by using a  
> statistical test in R. I am no statistics guru, but from my limited  
> understanding, I suspect that I need to use the Chi Squared test (I  
> am more than happy to be corrected on this though!).
>
> However, this results in the following:
>
>
>> chisq.test(data$Simulation,data$Observation)
>
>     Pearson's Chi-squared test
>
> data:  data$Simulation and data$Observation
> X-squared = 567, df = 550, p-value = 0.2989
>
> Warning message:
> In chisq.test(data$Simulation, data$Observation) :
>   Chi-squared approximation may be incorrect
>
>
> The ?chisq.test document suggests that the objects should be of  
> vector or matrix format, so I tried the following, but still receive  
> a warning message (and different results):
>
>> chisq.test(as.matrix(data[,4:5]))
>
>     Pearson's Chi-squared test
>
> data:  as.matrix(data[, 4:5])
> X-squared = 130.8284, df = 26, p-value = 6.095e-16

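When you look at your "data" you see only 27 cases, so it is
implausible that your first invocation, with 550 degrees of freedom,
is giving you something meaningful. The two calls differ because
chisq.test handles its arguments differently. Given two vectors, it
converts both to factors and cross-tabulates them, so every distinct
value becomes its own category: df = 550 = 22 x 25 corresponds to 23
distinct values in one column and 26 in the other. Given a matrix, it
treats the matrix itself as a 27 x 2 contingency table, hence df = 26.
The second call is closer to what you intended, but it still treats
your measurements as counts. Naming the second argument should not
change the first result, which you can confirm with: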

chisq.test(data$Simulation, y=data$Observation)  # ?
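
The two df values can be reproduced on made-up data. A minimal sketch,
assuming hypothetical vectors obs and sim (not your data): two vectors
are cross-tabulated as factors, while a matrix is used as the
contingency table itself.

```r
# Hypothetical data, for illustration only (not the poster's values)
set.seed(1)
obs <- round(runif(27, 5, 45), 1)
sim <- round(runif(27, 1, 35), 1)

# Two vectors: both are coerced to factors and cross-tabulated, so
# df = (distinct sim values - 1) * (distinct obs values - 1)
t1 <- suppressWarnings(chisq.test(sim, obs))
df_expected <- (length(unique(sim)) - 1) * (length(unique(obs)) - 1)
unname(t1$parameter) == df_expected

# One matrix: the 27 x 2 matrix is itself taken as a contingency table,
# so df = (27 - 1) * (2 - 1)
t2 <- suppressWarnings(chisq.test(cbind(obs, sim)))
unname(t2$parameter)
```

In both cases the continuous values are being treated as categories or
counts, which is why the "approximation may be incorrect" warning
appears.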

All of that being said, chisq.test is primarily intended for  
contingency tables. Testing association between two paired continuous  
variables is usually approached with regression and correlation tests.  
E.g.:

?cor
?lm
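
As a rough sketch of that approach, assuming hypothetical vectors obs
and sim standing in for data$Observation and data$Simulation:

```r
# Hypothetical data standing in for data$Observation / data$Simulation
set.seed(42)
obs <- runif(27, 5, 45)
sim <- obs + rnorm(27, sd = 8)

# Pearson correlation, with a test of the null hypothesis rho = 0
ct <- cor.test(sim, obs)
ct$estimate
ct$p.value

# Regress simulated on observed; points lying on the 1:1 line would
# give an intercept of 0 and a slope of 1
fit <- lm(sim ~ obs)
coef(fit)
confint(fit)             # do the intervals cover 0 and 1?
summary(fit)$r.squared
```

Note that a high correlation alone only shows a linear relationship.
The intercept and slope are what tell you whether the points follow
the 1:1 line.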

You may also want to look at a Q-Q plot.

?qqplot
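
A minimal sketch of such a plot, again on hypothetical obs and sim
vectors rather than your data:

```r
# Hypothetical data, for illustration only
set.seed(42)
obs <- runif(27, 5, 45)
sim <- obs + rnorm(27, sd = 8)

# qqplot() sorts each vector and plots one set of quantiles against
# the other; it invisibly returns the plotted values as list(x, y)
qq <- qqplot(obs, sim,
             xlab = "Observed quantiles",
             ylab = "Simulated quantiles")
abline(0, 1, lty = 2)    # 1:1 reference line
```

Points close to the reference line suggest the two sets of values have
similar distributions. Because each vector is sorted separately, this
does not check whether the paired values agree case by case.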

-- 
David Winsemius

>
> Warning message:
> In chisq.test(as.matrix(data[, 4:5])) :
>   Chi-squared approximation may be incorrect
>
>
>
> What am I doing wrong and how can I successfully measure how well  
> the simulated values fit the observed values?
>
>
> If it's of any help, here is how my data are structured. Note that
> I am only using columns 4 and 5 (Observation and Simulation).
>
>> str(data)
> 'data.frame':    27 obs. of  5 variables:
>  $ Location        : Factor w/ 27 levels "Australia","Brazil",..: 8 2 13 19 22 14 16 23 6 7 ...
>  $ Vegetation      : Factor w/ 21 levels "Beech","Broadleaf evergreen laurel",..: 17 21 2 16 15 16 9 16 3 4 ...
>  $ Vegetation.Class: Factor w/ 4 levels "Boreal and Temperate Evergreen",..: 3 3 4 1 1 1 4 1 4 1 ...
>  $ Observation     : num  24 8.9 14.7 26.7 42.4 31.7 30.8 7.5 14 22 ...
>  $ Simulation      : num  33.9 7.8 9.74 7.6 11.8 10.7 12 28.1 1.7 1.7 ...
>
>
> I hope someone is able to point me in the right direction.
>
> Many thanks,
>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT