[R] methodology question: is ANOVA appropriate for these data?

Wed Oct 6 16:46:54 CEST 2010

Representative small sample of data:

algorithmID <- factor(c(rep('alg1',4),rep('alg2',4),rep('alg3',4)))
threshold <- factor(rep(c(.45,.50,.55,.60),times=3))
score <- c(30,32,31,30,10,12,13,14,22,21,20,24)
d <- data.frame(algorithmID,threshold,score)
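To make the layout of the sample explicit, here is a small sketch that tabulates it. The names cell_counts and alg_means are just illustrative; the point is that each algorithm/threshold combination contributes exactly one score, so there is no replication within a cell.

```r
# Rebuild the sample data so this block runs on its own
algorithmID <- factor(c(rep('alg1',4),rep('alg2',4),rep('alg3',4)))
threshold <- factor(rep(c(.45,.50,.55,.60),times=3))
score <- c(30,32,31,30,10,12,13,14,22,21,20,24)
d <- data.frame(algorithmID,threshold,score)

# One score per algorithm/threshold combination (no replication)
cell_counts <- xtabs(~ algorithmID + threshold, data = d)
print(cell_counts)

# Mean score per algorithm (illustrative summary only)
alg_means <- tapply(d$score, d$algorithmID, mean)
print(alg_means)
```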

algorithmID is the name of each algorithm; threshold is the value of a parameter that the algorithm uses when producing the score; score is a number that can take any integer value between 0 and 40.

I'd like to know whether different algorithms reliably produce different scores. Each score comes from running an algorithm with the specified value of 'threshold'. The value of threshold is fixed for a given run of each algorithm, so I think (though I'm not sure) that it should be treated as a fixed factor rather than a random factor.

I am tempted to try:

d.aov <- aov(score ~ algorithmID + Error(threshold/algorithmID), data = d)

but I am doubtful whether it is appropriate to treat 'threshold' in this way.
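For comparison, one alternative I can think of is to treat each threshold value as a block in which every algorithm is run once (a randomized complete block layout). The sketch below is only that: it assumes the sample data above, and the names fit_block and fit_fixed are mine. It is not a claim that this is the right model.

```r
# Rebuild the sample data so this block runs on its own
algorithmID <- factor(c(rep('alg1',4),rep('alg2',4),rep('alg3',4)))
threshold <- factor(rep(c(.45,.50,.55,.60),times=3))
score <- c(30,32,31,30,10,12,13,14,22,21,20,24)
d <- data.frame(algorithmID,threshold,score)

# Assumption: threshold acts as a blocking factor, entered as an error stratum
fit_block <- aov(score ~ algorithmID + Error(threshold), data = d)
print(summary(fit_block))

# The same layout with threshold as an ordinary fixed blocking term;
# algorithmID is tested against the residual after removing block differences
fit_fixed <- aov(score ~ threshold + algorithmID, data = d)
print(summary(fit_fixed))
```

With one score per cell there is no way to estimate an algorithm-by-threshold interaction separately from the residual, so both forms rely on the assumption that the algorithm effect is the same at every threshold.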

I have two queries:

1. How should I determine whether ANOVA is an appropriate test of the null hypothesis that mean score does not differ by algorithmID?

2. Sampling the values of threshold at random from the range 0.01 to 0.99, rather than fixing them, is also an option. Would that make any difference to whether ANOVA is suitable?
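On query 1, the checks I know of are sketched below, using only base R functions (shapiro.test, bartlett.test, friedman.test). The model formula is an assumption on my part (threshold as a fixed blocking term), and with 12 observations these tests will have little power, so this is a starting point, not a verdict.

```r
# Rebuild the sample data so this block runs on its own
algorithmID <- factor(c(rep('alg1',4),rep('alg2',4),rep('alg3',4)))
threshold <- factor(rep(c(.45,.50,.55,.60),times=3))
score <- c(30,32,31,30,10,12,13,14,22,21,20,24)
d <- data.frame(algorithmID,threshold,score)

# Assumed model: threshold as a fixed blocking term
fit <- aov(score ~ threshold + algorithmID, data = d)
res <- residuals(fit)

# Rough check of normality of the residuals
print(shapiro.test(res))

# Rough check of equal variances across algorithms
print(bartlett.test(score ~ algorithmID, data = d))

# Rank-based alternative that keeps threshold as the block and avoids
# the normality assumption (score is a bounded integer)
ft <- friedman.test(score ~ algorithmID | threshold, data = d)
print(ft)
```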

Any advice gratefully received,
Matt
Research Assistant, University of Aberdeen
