[R] Non-normal data issues in PhD software engineering experiment

Thu Jul 10 17:15:42 CEST 2008

Hi All,

This is a rather general post to begin with, because I need to
provide you with some important context. There are very R-specific
elements further along in this rather long posting, so I thank you in
advance for your patience.

I will do my best to clearly explain my experiment, data, goals and
problems I have. Please let me know if I have left out any vital
information or if there is any ambiguity that I need to address such that
you can help me.

I have a very limited background in statistics - I have just completed a
postgraduate course in Statistics at Trinity College Dublin, Ireland. So I
have the basics and not much more.

I would also like to say up front that I am not the most gifted in terms
of maths. With that in mind, I would appreciate it if, when you respond to
this with long equations and mathematical notation, you could also
describe at a high level what the equation does or represents.

*** Experimental setup ***
I have conducted a software engineering experiment in which I have
taken measures of quality for a software system built using 2 different
design paradigms (1 and 2) over 10 evolutionary versions of the system (1
- 10). So for each version I have a pair of systems identical in that they
do precisely the same thing and differ only in that they are built using 2
different design paradigms.

For each version and paradigm type I have collected a data set of measures
called sensitivity measures. So in total I have 20 different data
sets, 10 for the 10 versions of software under design paradigm 1 and 10
for the 10 versions of software under design paradigm 2.

*** Data ***

My data can be found at - https://www.cs.tcd.ie/~ajackso/data.csv

In this data file there are a number of columns -
"version","paradigm","location","coverage","execution","infection","propogation","sensitivity"

Sensitivity is the main response - please ignore
"coverage","execution","infection","propogation" as these were used to
calculate sensitivity.

All 20 of my data sets are in this file - the columns version (1 - 10) and
paradigm (1 or 2) differentiate them.
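
To illustrate the layout, here is a minimal sketch using a small invented
excerpt with the same column names as data.csv. The values, and the form
of the location column, are assumptions made up for the example; only the
column names come from the real file.

```r
# Invented excerpt with the same header as data.csv (values are hypothetical).
csv_text <- '"version","paradigm","location","coverage","execution","infection","propogation","sensitivity"
1,1,"a",1,0.9,0.5,0.4,0.18
1,1,"b",1,0.8,0.6,0.5,0.24
1,2,"a",1,0.7,0.4,0.3,0.08
2,1,"a",1,0.9,0.5,0.6,0.27
2,2,"a",1,0.6,0.5,0.5,0.15
2,2,"b",1,0.8,0.3,0.4,0.10
'
dat <- read.csv(text = csv_text)

# Number of observations in each version x paradigm data set.
counts <- table(version = dat$version, paradigm = dat$paradigm)
print(counts)

# Each data set is the sensitivity column for one version/paradigm pair.
sets <- split(dat$sensitivity, list(dat$version, dat$paradigm))
print(names(sets))
```

With the real file, the table of counts would show the 20 data sets and
their differing sizes.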

*** Goals ***
With this data collected I now want to do a number of things -

1) I want to look at the analysis of variance to see if there is a
difference in mean for each paradigm over the 10 versions. I want to
remove the version related variance by blocking on version. With this done
I should be able to get a picture of the variance related to paradigm
only. My null hypothesis is that the means of both paradigms are the
same. I also want to look at each version's pair of data sets individually
to see if there is any difference between the two system designs.

2) I want to create two regression models, one for each paradigm, to enable
me to see how the quality of each paradigm is affected over time
(versions). It would also be nice to have both confidence and prediction
boundaries.

3) I want to be able to look at the power of all of this and possibly see
how many times I would need to do this to have concrete evidence that one
paradigm is different/the same/better/worse than the other.

4) I am not 100% sure if it's relevant - but the analysis of divergence
(Something I came across when reading an R book - Introductory statistics
with R - Peter Dalgaard - Springer - p197) may fit what I am looking for
to assess the difference between the two regression models stated in goal
statement 2. I think that this will assess the degree to which the
regression models diverge over time.
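
For illustration, goals 1 and 2 could be set up roughly as follows. This
is only a sketch of the model structure: it uses simulated data in place
of the real file (the data frame sim and every value in it are invented),
and both aov and lm assume roughly normal residuals.

```r
# Simulated stand-in for data.csv: unequal group sizes, skewed response.
set.seed(1)
sim <- do.call(rbind, lapply(1:10, function(v) {
  do.call(rbind, lapply(1:2, function(p) {
    n <- sample(15:30, 1)
    data.frame(version = v, paradigm = p,
               sensitivity = rexp(n, rate = 1 + 0.05 * v + 0.2 * p))
  }))
}))

# Goal 1: paradigm effect after blocking on version (both as factors).
# version_f is entered first so that paradigm_f is adjusted for version.
sim$paradigm_f <- factor(sim$paradigm)
sim$version_f  <- factor(sim$version)
fit_block <- aov(sensitivity ~ version_f + paradigm_f, data = sim)
print(summary(fit_block))

# Goal 2: one straight-line model per paradigm, with confidence and
# prediction limits evaluated at versions 1 to 10.
new_versions <- data.frame(version = 1:10)
bands <- lapply(split(sim, sim$paradigm), function(d) {
  fit <- lm(sensitivity ~ version, data = d)
  list(confidence = predict(fit, new_versions, interval = "confidence"),
       prediction = predict(fit, new_versions, interval = "prediction"))
})
print(head(bands[["1"]]$confidence))
```

Each element of bands holds the fitted values with lower and upper limits
(columns fit, lwr and upr) for one paradigm.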

*** Problems ***

1) The problem I have is that the 20 data sets are of variable
size. These data sets are also non-normal. I have assessed this using
normality tests (ad.test etc. in R and Mini-tab). So as far as I understand
it I have two choices - the first is to transform my non-normal data into
normal data. The second is to look at using non-parametric approaches.

So I tried to use R to conduct a boxcox transformation for each of my 20
data sets. I couldn't figure it out past generating an optimal lambda. I
then turned to mini-tab and found that I could make transformations there
- the problem however was that there was a subset group option I didn't
understand. I set it at various numbers but always seemed to get the same
result so it didn't seem to upset the outcome that much/if at all. The
result of this was non-normal data again. I then turned to the Johnson
transformation and found that it also failed to transform my
non-normal data to normal data.
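
For reference, the step that follows finding the optimal lambda is to
apply the power transform itself. A minimal sketch, assuming strictly
positive values and using invented data in place of one of the 20 data
sets (y, lambda_hat and y_bc are illustrative names, not taken from the
real analysis):

```r
library(MASS)  # provides boxcox()

set.seed(2)
# Hypothetical positive, skewed values standing in for one data set.
y <- rexp(40, rate = 2) + 0.01

# Profile the log-likelihood over a grid of lambda values (no plot).
bc <- boxcox(y ~ 1, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Apply the Box-Cox transform; lambda = 0 corresponds to the log.
y_bc <- if (abs(lambda_hat) < 1e-8) log(y) else (y^lambda_hat - 1) / lambda_hat

print(lambda_hat)
print(shapiro.test(y_bc))  # re-check normality after transforming
```

If the real sensitivity values include zeros, a small positive shift would
be needed before transforming, and that shift is itself a modelling choice.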

3) I have looked at the Friedman test as a means of performing two way
analysis of variance to address my scenario. I have tried to execute
it in R and Mini-tab but really can't figure out what my arguments
should be.

Using R: I read my data into a frame using "read.table(data)". I
then proceed with the following - friedman.test( data$sensitivity ~
data$paradigm | data$version, data, data$version, na.action=na.exclude).
This produces the following error "incorrect specification for 'formula'".
I see that my formula needs to be of length == 3 for this test to be used
(https://svn.r-project.org/R/trunk/src/library/stats/R/friedman.test.R). I
don't even think that my formula should be like this, but I wanted to be as
close as possible to the example provided by R.

I then tried to use the kruskal.test as follows -
kruskal.test(data$sensitivity ~ data$sensitivity, data = data,
na.action=na.exclude) - this gave me a result - however there was no
account of the variance between versions.

-- kruskal.test(data$sensitivity ~ data$version + data$paradigm, data =
sensResults, na.action=na.exclude)
--
--	Kruskal-Wallis rank sum test
--
-- data:  data$sensitivity by data$version by data$paradigm
-- Kruskal-Wallis chi-squared = 12.1449, df = 9, p-value = 0.2053

I have no idea if these tests are the right thing to do here. This test is
advertised as a substitute for one way anova. My instinct tells me that I
need to use the friedman.test - but as you can see I am not having much
luck with it. I have looked at the code in R as you can see from the link
above and can see where it is rejecting my formula - I just don't
understand what I need to do to my model for it to be accepted.
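
One possible way to get a call that friedman.test accepts is sketched
below. It rests on two assumptions. First, judging from the linked
source, the formula seems to need bare column names (with the data frame
passed via data=) rather than data$ terms. Second, the test is documented
for unreplicated blocked data, so each version x paradigm cell is
collapsed to a single value here. Using the median for that is an
arbitrary choice, and the data are simulated, not the real file.

```r
# Simulated stand-in for data.csv with several observations per cell.
set.seed(3)
sim <- do.call(rbind, lapply(1:10, function(v) {
  do.call(rbind, lapply(1:2, function(p) {
    n <- sample(15:30, 1)
    data.frame(version = v, paradigm = p,
               sensitivity = rexp(n, rate = 1 + 0.2 * p))
  }))
}))

# Collapse to one value per version x paradigm cell (median is an assumption).
cell <- aggregate(sensitivity ~ paradigm + version, data = sim, FUN = median)

# response ~ treatment | block, using column names found in 'cell'.
res <- friedman.test(sensitivity ~ paradigm | version, data = cell)
print(res)
```

Collapsing each cell discards the within-cell spread, so this is a
starting point rather than a full replacement for a two way anova.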

4) I have looked at the outputs to the kruskal.test and friedman.test and
they differ from the anova table -

By following and executing the R man examples I can see the friedman.test
produces the following output:

-- > friedman.test(x ~ w | t, data = wb)
-- 
-- 	Friedman rank sum test
-- 
-- data:  x and w and t
-- Friedman chi-squared = 0.3333, df = 1, p-value = 0.5637

You can also see from the above point that the output of the kruskal.test
looks similar enough. This is a big contrast to an anova table. In an
anova table I can see the components of variance and the significance of
each F test. These alternative tests do not seem to provide me this
information.

Using Mini-tab: I go to stats->Nonparametrics->Friedman. This prompts me
to provide columns for response, treatment and blocks.

I provide the following:

response <- sensitivity
treatment <- paradigm
blocks <- version

When I try to execute this I get the following error

Friedman 'sensitivity' 'paradigm' 'version' 'RESI1' 'FITS1'.

* ERROR * Must have one observation per cell.
* ERROR * Completion of computation impossible.

5) I have looked briefly at the non-parametric approaches to regression -
there seem to be many
(http://socserv.mcmaster.ca/jfox/Courses/Oxford-2005/R-nonparametric-regression.html)
paths that can be taken. I need some guidance on which approach I should
follow. What are the trade-offs? How do I do this?

Thank you and best regards,
Andrew Jackson