[R] [OT] 1 vs 2-way anova technical question

Mon Nov 21 20:04:05 CET 2011

Hello Rob,

Thank you for your suggestions. I tried glm too, without success. In any case, I am including all the information in case someone with good knowledge can give me a hand with this. I take the log of the response variable because:
- its values span multiple orders of magnitude
- the diagnostic plots (e.g. QQ, residuals vs fitted) improve with it.

Below I include:
1) general summary of my data
2) 1-way anova and summary of the model
3) 4-way anova and summary of the model  

Attached:
a) Overview of the data (where main interactions occur i.e. No_databases and No_middlewares)
b) diagnostic plots for 2) Here the Normality assumption of the residuals looks reasonable
c) diagnostic plots for 3) Here the Normality assumption of the residuals does not seem to hold. Does that invalidate the 4-way aov model?
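
One way to put a number on the visual impression from plots b) and c) is a Shapiro-Wilk test on the model residuals, read next to the QQ plot. The sketch below is only illustrative: it runs on synthetic log-normal data (the data frame `sim` and its effect sizes are made up, they are not the measurements from this post) with the same factor layout as `throughput`.

```r
set.seed(42)

# Synthetic stand-in for the real data: same factors, made-up effects
sim <- expand.grid(
  No_databases   = factor(c(1, 4)),
  Partitioning   = factor(c("sharding", "replication")),
  No_middlewares = factor(c(1, 2, 4)),
  Queue_size     = factor(c(40, 100)),
  rep            = 1:20
)
sim$Throughput <- exp(5 + 2 * (sim$No_databases == "4") +
                      rnorm(nrow(sim), sd = 0.7))

# Full factorial model on the log scale, as in 3)
fit <- aov(log(Throughput) ~ No_databases * Partitioning *
             No_middlewares * Queue_size, data = sim)
res <- residuals(fit)

# Shapiro-Wilk: a small p-value suggests the residuals are not Normal
sw <- shapiro.test(res)
print(sw)

# QQ plot of the same residuals for visual comparison
qqnorm(res)
qqline(res)
```

With several hundred observations such a test can flag even mild departures from Normality, so it is probably best read together with the plot rather than on its own.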

I tried glm and it delivers results similar to 3).

My impression is that my system is heavily polluted with outliers; plot a) shows how much the mean and the median differ because of them. That's just the way the system I implemented behaves. Btw, the system is a multi-tiered architecture that I developed in Java from scratch; it includes XA and different data access and partitioning patterns. I need to analyze this system quantitatively and draw conclusions from it. Most of my classmates keep it really simple: run 2^k experiments, take one grand mean out of each experiment, do the ANOVA on those means (i.e. 1 repetition), compute the fraction of variation, and that's it. I am trying to model it more deeply by checking model assumptions, etc.
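
For reference, the "fraction of variation" mentioned above is each term's sum of squares divided by the total sum of squares. A minimal sketch, using the sums of squares printed in the main-effects aov output further down in this message (the variable names `ss` and `frac` are just illustrative):

```r
# Sums of squares copied from the main-effects aov output in this post
ss <- c(No_databases   = 521.5264,
        Partitioning   = 5.6971,
        No_middlewares = 50.5814,
        Queue_size     = 0.4628,
        Residuals      = 476.6826)

# Fraction of the total variation attributed to each term
frac <- ss / sum(ss)

# Show as percentages
print(round(100 * frac, 2))

# With a fitted model at hand, the same column can be taken from
# summary(throughput.aov)[[1]][["Sum Sq"]] instead of typing it in.
```

Because the design is flagged as possibly unbalanced, these sequential sums of squares depend on the order of the terms, so the percentages should be taken as approximate.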

Many thanks in advance,
Best regards,
Giovanni

> str(throughput)
'data.frame':	479 obs. of  9 variables:
 $ Time              : num  7 8 9 10 11 12 13 14 15 16 ...
 $ Throughput        : int  155 155 154 157 155 214 4631 2118 136 132 ...
 $ Workload          : chr  "All" "All" "All" "All" ...
 $ No_databases      : Factor w/ 2 levels "1","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Partitioning      : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 1 1 1 1 1 ...
 $ No_middlewares    : Factor w/ 3 levels "1","2","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Queue_size        : Factor w/ 2 levels "40","100": 1 1 1 1 1 1 1 1 1 1 ...
 $ No_clients        : Factor w/ 1 level "64": 1 1 1 1 1 1 1 1 1 1 ...
 $ Experimental_error: Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...

> summary(throughput)
      Time         Throughput       Workload         No_databases      Partitioning No_middlewares
 Min.   : 7.00   Min.   :  35.0   Length:479         1:239        sharding   :240   1:160         
 1st Qu.:11.50   1st Qu.:  50.5   Class :character   4:240        replication:239   2:159         
 Median :16.00   Median : 744.0   Mode  :character                                  4:160         
 Mean   :16.48   Mean   : 830.3                                                                   
 3rd Qu.:21.00   3rd Qu.:1205.5                                                                   
 Max.   :26.00   Max.   :4631.0                                                                   
 Queue_size No_clients Experimental_error
 40 :240    64:479     1:479             
 100:239   

## #######################################################
##
##  ANOVA "one-way" (main effects only, no interactions)
##
## #######################################################
> throughput.aov <- aov(log(Throughput)~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
> throughput.aov
Call:
   aov(formula = log(Throughput) ~ No_databases + Partitioning + 
    No_middlewares + Queue_size, data = throughput)

Terms:
                No_databases Partitioning No_middlewares Queue_size Residuals
Sum of Squares      521.5264       5.6971        50.5814     0.4628  476.6826
Deg. of Freedom            1            1              2          1       473

Residual standard error: 1.003885 
Estimated effects may be unbalanced
> summary(throughput.aov)
                Df Sum Sq Mean Sq  F value    Pr(>F)    
No_databases     1 521.53  521.53 517.4974 < 2.2e-16 ***
Partitioning     1   5.70    5.70   5.6530   0.01782 *
No_middlewares   2  50.58   25.29  25.0953 4.381e-11 ***
Queue_size       1   0.46    0.46   0.4592   0.49833
Residuals      473 476.68    1.01                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> 

## #######################################################
##
##  ANOVA 4-way interaction
##
## #######################################################

> throughput.aov <- aov(log(Throughput)~No_databases*Partitioning*No_middlewares*Queue_size,data=throughput)
> throughput.aov
Call:
   aov(formula = log(Throughput) ~ No_databases * Partitioning * 
    No_middlewares * Queue_size, data = throughput)

Terms:
                No_databases Partitioning No_middlewares Queue_size No_databases:Partitioning
Sum of Squares      521.5264       5.6971        50.5814     0.4628                   96.9198
Deg. of Freedom            1            1              2          1                         1
                No_databases:No_middlewares Partitioning:No_middlewares No_databases:Queue_size
Sum of Squares                     110.4102                      8.4819                  0.0916
Deg. of Freedom                           2                           2                       1
                Partitioning:Queue_size No_middlewares:Queue_size
Sum of Squares                   0.0015                    0.2254
Deg. of Freedom                       1                         2
                No_databases:Partitioning:No_middlewares No_databases:Partitioning:Queue_size
Sum of Squares                                   23.6400                               0.0512
Deg. of Freedom                                        2                                    1
                No_databases:No_middlewares:Queue_size Partitioning:No_middlewares:Queue_size
Sum of Squares                                  0.1247                                 0.1511
Deg. of Freedom                                      2                                      2
                No_databases:Partitioning:No_middlewares:Queue_size Residuals
Sum of Squares                                               0.7391  235.8461
Deg. of Freedom                                                   2       455

Residual standard error: 0.7199605 
Estimated effects may be unbalanced
> summary(throughput.aov)
                                                     Df Sum Sq Mean Sq   F value    Pr(>F)
No_databases                                          1 521.53  521.53 1006.1413 < 2.2e-16 ***
Partitioning                                          1   5.70    5.70   10.9909 0.0009888 ***
No_middlewares                                        2  50.58   25.29   48.7914 < 2.2e-16 ***
Queue_size                                            1   0.46    0.46    0.8928 0.3452201
No_databases:Partitioning                             1  96.92   96.92  186.9800 < 2.2e-16 ***
No_databases:No_middlewares                           2 110.41   55.21  106.5030 < 2.2e-16 ***
Partitioning:No_middlewares                           2   8.48    4.24    8.1818 0.0003229 ***
No_databases:Queue_size                               1   0.09    0.09    0.1766 0.6744713
Partitioning:Queue_size                               1   0.00    0.00    0.0028 0.9576692
No_middlewares:Queue_size                             2   0.23    0.11    0.2174 0.8046764
No_databases:Partitioning:No_middlewares              2  23.64   11.82   22.8034 3.648e-10 ***
No_databases:Partitioning:Queue_size                  1   0.05    0.05    0.0988 0.7534090
No_databases:No_middlewares:Queue_size                2   0.12    0.06    0.1203 0.8866605
Partitioning:No_middlewares:Queue_size                2   0.15    0.08    0.1457 0.8644517
No_databases:Partitioning:No_middlewares:Queue_size   2   0.74    0.37    0.7129 0.4907654
Residuals                                           455 235.85    0.52
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
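
Since the main-effects model is nested in the full factorial one, the two fits can be compared with an F test on the drop in residual sum of squares. The sketch below does this by hand from the residual lines printed above; with both fits stored under separate names (say `fit_add` and `fit_full`, hypothetical names, since both models above were assigned to `throughput.aov`), `anova(fit_add, fit_full)` should report the same comparison.

```r
# Residual SS and df copied from the two aov outputs in this post
rss_add  <- 476.6826; df_add  <- 473   # main effects only
rss_full <- 235.8461; df_full <- 455   # full 4-way factorial

# F statistic for H0: all interaction terms are zero
df_num <- df_add - df_full
f_stat <- ((rss_add - rss_full) / df_num) / (rss_full / df_full)
p_val  <- pf(f_stat, df_num, df_full, lower.tail = FALSE)

cat("F =", round(f_stat, 2), "on", df_num, "and", df_full,
    "df; p =", format.pval(p_val), "\n")
```

This test relies on the same residual assumptions as the ANOVA tables themselves, so it is only as trustworthy as the diagnostics for the full model.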
