[R] Non-normal data issues in PhD software engineering experiment

Thu Jul 10 18:06:20 CEST 2008

I hope you don't really want our patients :)

It looks like you have an experiment with two groups. You have several
trials for each group. And within each trial you observe your units at
distinct points in time.

The first advice for you is to graphically display your data. Before you
start modeling your data wrong, you should have a strong feeling for what the
right approach will be. If your data is nonlinear, for example, you will
take a different approach than when it is linear. So what I suggest you do is
plot your Ys (dependent variables) against time for each of your trials,
optimally two plots, one for each group (but multiple plots are also okay).
These plots should give you a firm intuition about how your dependent
variable develops over time for each group. The modeling of your data in a
regression model then depends on the presumed functional relationship
between your dependent variable and your independent variables (time and
group). An important question is the distribution of your dependent
variable. Is it normally distributed? Or is it a proportion? All this is
important information in deciding how to model your problem.
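
A minimal sketch of such plots in base R. Since I have not loaded your
file, the data frame below is simulated; only the column names (version,
paradigm, location, sensitivity) come from your post, and the values and
the number of locations per version are made up for illustration.

```r
# Simulated stand-in for data.csv: column names follow the original post,
# the values themselves are invented purely for illustration.
set.seed(1)
dat <- expand.grid(location = 1:30, version = 1:10, paradigm = 1:2)
dat$sensitivity <- rbeta(nrow(dat), 2, 5)   # skewed values in (0, 1)

# One panel per paradigm: spread of sensitivity at each version,
# with the per-version medians joined by a line.
op <- par(mfrow = c(1, 2))
for (p in sort(unique(dat$paradigm))) {
  sub <- dat[dat$paradigm == p, ]
  boxplot(sensitivity ~ version, data = sub,
          main = paste("Paradigm", p),
          xlab = "version", ylab = "sensitivity")
  med <- tapply(sub$sensitivity, sub$version, median)
  lines(seq_along(med), med, type = "b", col = "red")
}
par(op)

# A histogram per group gives a first look at the distribution.
hist(dat$sensitivity[dat$paradigm == 1],
     main = "Paradigm 1", xlab = "sensitivity")
```

With your real data you would replace the simulated `dat` with the result
of reading your CSV file and keep the plotting code unchanged.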

Can you try to answer these questions first? You will typically address your
questions 1 and 2 jointly in your regression model.
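
To illustrate what "jointly" could look like, here is a sketch that assumes,
only for the sake of the example, a linear trend over versions and roughly
normal errors. Whether that is appropriate depends on the answers to the
questions above. The data are simulated; the names conf_band and pred_band
are my own.

```r
# Simulated data with a made-up linear trend per paradigm plus noise.
set.seed(2)
dat <- expand.grid(location = 1:30, version = 1:10, paradigm = factor(1:2))
dat$sensitivity <- 0.5 + 0.01 * dat$version +
  ifelse(dat$paradigm == "2", 0.02 * dat$version, 0) +
  rnorm(nrow(dat), sd = 0.1)

# One model for both questions: the paradigm term is the group difference,
# the version:paradigm interaction is the difference in slope over time.
fit <- lm(sensitivity ~ version * paradigm, data = dat)
print(coef(summary(fit)))

# Confidence and prediction bands for each version and paradigm.
new <- expand.grid(version = 1:10, paradigm = factor(1:2))
conf_band <- predict(fit, newdata = new, interval = "confidence")
pred_band <- predict(fit, newdata = new, interval = "prediction")
head(cbind(new, conf_band))
```

The prediction band is always wider than the confidence band, because it
also accounts for the scatter of individual observations.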

Regarding question 3: power comes from sample size. Thus, a very easy way to
get a feel for it is to "multiply" your data. Copy your data a second time and
see how the results change (the duplicated rows are not independent
observations, so the resulting p-values are only illustrative and not a
valid analysis). The second approach is to look at whatever statistics
you get from your regression analysis. You know the difference in the
coefficients and you know the test statistic for this coefficient (e.g. its
t-value). From that you can project how big your sample must be. But that
should come at a later stage, when you know how to analyze your data.
Otherwise you may waste time and effort (or money) in conducting new
experiments that are unnecessary in the end.
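
A rough sketch of both ideas on simulated data (all numbers are invented).
It assumes an ordinary linear model, a fixed effect size and a fixed
residual spread, so that the t-value grows roughly like the square root of
the sample size. The threshold of 2 is my shorthand for the usual 5%
two-sided level in large samples.

```r
# Made-up data: a weak trend observed with noise.
set.seed(3)
d <- data.frame(x = rep(1:10, each = 4))
d$y <- 0.02 * d$x + rnorm(nrow(d), sd = 0.3)

# Approach 1: "multiply" the data and compare the t-values.
t_once  <- coef(summary(lm(y ~ x, data = d)))["x", "t value"]
t_twice <- coef(summary(lm(y ~ x, data = rbind(d, d))))["x", "t value"]
t_twice / t_once          # close to sqrt(2): doubling n inflates t

# Approach 2: project the sample size from the observed t-value.
# The standard error shrinks like 1/sqrt(n), so t scales like sqrt(n).
t_target <- 2
n_needed <- ceiling(nrow(d) * (t_target / abs(t_once))^2)
n_needed

# For a plain two-group comparison, stats::power.t.test does the same job
# more formally; delta and sd here are hypothetical.
pw <- power.t.test(delta = 0.05, sd = 0.1, sig.level = 0.05, power = 0.8)
ceiling(pw$n)             # observations needed per group
```

The projection is only as good as the observed effect it is based on, so
treat the result as an order of magnitude rather than an exact number.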

Best,
Daniel

Andrew Jackson-9 wrote:
> 
> Hi All,
> 
> This is a rather general post to begin with. This is because I need to
> provide you some important context. There are very R specific elements to
> this further along in this rather long posting so I thank you in advance
> for your patients.
> 
> I will do my best to clearly explain my experiment, data, goals and
> problems I have. Please let me know if I have left out any vital
> information or if there is any ambiguity that I need to address such that
> you can help me.
> 
> I have a very limited background in statistics - I have just completed a
> postgraduate course in Statistics at Trinity College Dublin, Ireland. So I
> have the basics and not much more.
> 
> I would also like to say up front that I am not the most gifted in terms
> of maths. With that in mind, I would appreciate it if, when you respond to
> this with long equations and mathematical notation, you could also
> describe at a high level what the equation does or represents.
> 
> *** Experimental setup ***
> I have conducted a software engineering experiment in which I have
> taken measures of quality for a software system built using 2 different
> design paradigms (1 and 2) over 10 evolutionary versions of the system (1
> - 10). So for each version I have a pair of systems identical in that they
> do precisely the same thing and differ only in that they are built using 2
> different design paradigms.
> 
> For each version and paradigm type I have collected a data set of measures
> called sensitivity measures. So for instance I have 20 different data
> sets, 10 for the 10 versions of software under design paradigm 1 and 10
> for the 10 versions of software under design paradigm 2.
> 
> *** Data ***
> 
> My data can be found at - https://www.cs.tcd.ie/~ajackso/data.csv
> 
> In this data file there are a number of columns -
> "version","paradigm","location","coverage","execution","infection","propogation","sensitivity"
> 
> Sensitivity is the main response - please ignore
> "coverage","execution","infection","propogation" as these were used to
> calculate sensitivity.
> 
> All 20 of my data sets are in this file - the columns version (1 - 10) and
> paradigm (1 or 2) differentiate them.
> 
> *** Goals ***
> With this data collected I now want to do a number of things -
> 
> 1) I want to look at the analysis of variance to see if there is a
> difference in mean for each paradigm over the 10 versions. I want to
> remove the version related variance by blocking on version. With this done
> I should be able to get a picture of the variance related to paradigm
> only. My null hypothesis is that the means of both data sets are the
> same. I also want to look at each data set individually to see if
> there is any difference between each pair of system designs.
> 
> 2) I want to create two regression models, one for each paradigm to enable
> me to see how the quality of each paradigm is affected over time
> (versions). It would also be nice to have both confidence and prediction
> boundaries.
> 
> 3) I want to be able to look at the power of all of this and possibly see
> how many times I would need to do this to have concrete evidence that one
> paradigm is different/the same/better/worse than the other.
> 
> 4) I am not 100% sure if it's relevant - but the analysis of divergence
> (Something I came across when reading an R book - Introductory Statistics
> with R - Peter Dalgaard - Springer - p197) may fit what I am looking for
> to assess the difference between the two regression models stated in goal
> statement 2. I think that this will assess the degree to which the
> regression models diverge over time.
> 
> *** Problems ***
> 
> 1) The problem I have is that the 20 data sets are of variable
> size. These data sets are also non-normal. I have assessed this using the
> normality tests (ad.test etc. in R and Mini-tab). So as far as I understand
> it I had two choices - the first is to transform my non-normal data into
> normal data. The second is to look at using non-parametric approaches.
> 
> So I tried to use R to conduct a boxcox transformation for each of my 20
> data sets. I couldn't figure it out past generating an optimal lambda. I
> then turned to mini-tab and found that I could make transformations there
> - the problem however was that there was a subset group option I didn't
> understand. I set it at various numbers but always seemed to get the same
> result so it didn't seem to upset the outcome that much/if at all. The
> result of this was non-normal data again. I then turned to the Johnson
> transformation and found that it also failed to transform my
> non-normal data to normal data.
> 
> 3) I have looked at the Friedman test as a means of performing two way
> analysis of variance to address my scenario. I have tried to execute
> it in R and Mini-tab but really can't figure out what my arguments
> should be.
> 
> Using R: I read my data into a data frame using "read.table(data)". I
> then proceed with the following - friedman.test( data$sensitivity ~
> data$paradigm | data$version, data, data$version, na.action=na.exclude).
> This produces the following error "incorrect specification for 'formula'".
> I see that my formula needs to be of length == 3 for this test to be used
> (https://svn.r-project.org/R/trunk/src/library/stats/R/friedman.test.R). I
> don't think that my formula should even be like this but I wanted to be as
> close as possible to the example provided by R.
> 
> I then tried to use the kruskal.test as follows -
> kruskal.test(data$sensitivity ~ data$sensitivity, data = data,
> na.action=na.exclude) - this gave me a result - however there was no
> account of the variance between versions.
> 
> -- kruskal.test(data$sensitivity ~ data$version + data$paradigm, data =
> sensResults, na.action=na.exclude)
> --
> --	Kruskal-Wallis rank sum test
> --
> -- data:  data$sensitivity by data$version by data$paradigm
> -- Kruskal-Wallis chi-squared = 12.1449, df = 9, p-value = 0.2053
> 
> I have no idea if these tests are the right thing to do here? This test is
> advertised as a substitute to one way anova. My instinct tells me that I
> need to use the friedman.test - but as you can see I am not having much
> luck with it. I have looked at the code in R as you can see from the link
> above and can see where it is rejecting my formula - I just don't
> understand what I need to do to my model for it to be accepted.
> 
> 4) I have looked at the outputs to the kruskal.test and friedman.test and
> they differ from the anova table -
> 
> By following and executing the R man examples I can see the friedman.test
> produces the following output:
> 
> -- > friedman.test(x ~ w | t, data = wb)
> -- 
> -- 	Friedman rank sum test
> -- 
> -- data:  x and w and t
> -- Friedman chi-squared = 0.3333, df = 1, p-value = 0.5637
> 
> You can also see from the above point that the output of the kruskal.test
> looks similar enough. This is a big contrast to an anova table. In an
> anova table I can see the components of variance and the significance of
> each F test. These alternative tests do not seem to provide me this
> information.
> 
> Using Mini-tab: I go to stats->Nonparametrics->Friedman. This prompts me
> to provide columns for response, treatment and blocks.
> 
> I provide the following:
> 
> response <- sensitivity
> treatment <- paradigm
> blocks <- version
> 
> When I try to execute this I get the following error
> 
> Friedman 'sensitivity' 'paradigm' 'version' 'RESI1' 'FITS1'.
> 
> * ERROR * Must have one observation per cell.
> * ERROR * Completion of computation impossible.
> 
> 5) I have looked briefly at the non-parametric approaches to regression -
> there seem to be many
> (http://socserv.mcmaster.ca/jfox/Courses/Oxford-2005/R-nonparametric-regression.html)
> paths that can be taken. I need some guidance on which approach I should
> follow? What are the trade-offs? How do I do this?
> 
> Thank you and best regards,
> Andrew Jackson
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 
