[R] queue waiting times comparison

Thu Aug 18 16:12:57 CEST 2011

Hi Jim

> 
> If those values represent response times in a system, then when I was
> responsible for characterizing what the system would do from the
> viewpoint of an SLA (service level agreement) with customers using the
> system, we usually specified that "90% of the transactions would have
> a response time of --- or less".  This took care of most "long tails".
>  So it depends on how you are planning to use this data.  We usually
> monitored the 90th or 95th percentile to see how a system was
> operating day to day.

I see the point. This could be an option; I will discuss it with my
colleagues.
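
A minimal sketch of the percentile comparison described above, assuming the ml list from the original post quoted further down in this thread. The helper name pct is hypothetical, and quantile() is used with its default interpolation (type 7), which may differ from what a given monitoring tool reports.

```r
# Waiting times from the original post
ml <- list(
  y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 18, 36, 5, 24, 15, 40, 10),
  y2 = c(97, 10, 26, 11, 11, 10, 5, 13, 19, 5, 5, 59, 4, 16, 10)
)

# Hypothetical helper: median plus the 90th and 95th percentiles
pct <- function(x) quantile(x, probs = c(0.5, 0.9, 0.95))

# One row per period, one column per percentile
res <- t(sapply(ml, pct))
print(res)
```

With these data the median is slightly lower in the second period while the 90th and 95th percentiles are higher, which matches the summary table quoted below.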

Thank you for your time and your answer.

Best regards
Petr

> 
> On Thu, Aug 18, 2011 at 8:52 AM, Petr PIKAL <petr.pikal at precheza.cz> wrote:
> > Hello Jim
> >
> > Thank you; see my comments inline.
> >
> > jim holtman <jholtman at gmail.com> wrote on 18.08.2011 14:09:11:
> >
> >> I am not sure why you say that "lapply(ml, mean)" shows (incorrectly)
> >> that the second year has a larger average; it is correct for the data:
> >>
> >> > lapply(ml, my.func)
> >> $y1
> >>     Count      Mean        SD       Min    Median       90%       95%       Max       Sum
> >>  18.00000  16.83333  12.42980   4.00000  12.50000  37.20000  41.05000  47.00000 303.00000
> >>
> >> $y2
> >>     Count      Mean        SD       Min    Median       90%       95%       Max       Sum
> >>  15.00000  20.06667  25.27694   4.00000  11.00000  45.80000  70.40000  97.00000 301.00000
> >>
> >>
> >> You have a larger "outlier" in the second year that causes the mean to
> >> be higher.  The median is lower, but I usually look at the 90th
> >> percentile if I am looking at response time from a system and again
> >> the second year has a higher value.
> >>
> >> So exactly why do you not "trust" your data?
> >
> > Well, I do trust them. However, the mean is a "correct" central value
> > only when the data are normally distributed or at least symmetric. As
> > the values are heavily skewed, I feel that I should not use the mean to
> > compare such sets. Anyway, t.test tells me that there is no significant
> > difference between y2 and y1.
> >
> > t.test(ml[[1]], ml[[2]])
> >
> >        Welch Two Sample t-test
> >
> > data:  ml[[1]] and ml[[2]]
> > t = -0.452, df = 19.557, p-value = 0.6563
> > alternative hypothesis: true difference in means is not equal to 0
> > 95 percent confidence interval:
> >  -18.17781  11.71115
> > sample estimates:
> > mean of x mean of y
> >  16.83333  20.06667
> >
> > So based on this I will probably never get a conclusive result, because
> > the sd will be quite high due to the "outliers".
> >
> > When I do
> > plot(ecdf(ml[[2]]))
> > plot(ecdf(ml[[1]]), add=TRUE, col=2)
> >
> > it seems to me that both sets are almost the same and that they differ
> > substantially only in those "outlier" values.
> >
> > If I halve the small values of y2, e.g.
> >
> > ml[[2]][ml[[2]]<20] <- ml[[2]][ml[[2]]<20]/2
> >
> > I get nearly the same mean
> >
> > lapply(ml, mean)
> > $y1
> > [1] 16.83333
> >
> > $y2
> > [1] 16.1
> >
> > and t.test tells me that there is no difference between those two sets,
> > although I know that most events take half the time and only a few last
> > longer. For me such a set is better (we improved performance most of the
> > time, but there are still rare events that take a long time).
> >
> > plot(ecdf(ml[[2]]))
> > plot(ecdf(ml[[1]]), add=TRUE, col=2)
> >
> > So the question remains: what procedure should I use to compare two or
> > more sets with such long-tailed distributions? Trimmed mean? Median? ...
> >
> > Thanks.
> >
> > Regards
> > Petr
> >
> >>
> >> On Thu, Aug 18, 2011 at 7:49 AM, Petr PIKAL <petr.pikal at precheza.cz> wrote:
> >> > Hello all
> >> >
> >> > I am trying to find a way to compare sets of waiting times from
> >> > different periods. I tried to learn something from queueing theory and
> >> > also used R search. There are plenty of approaches, but I need the
> >> > easiest and simplest one.
> >> > Here is a list with actual waiting times.
> >> >
> >> > ml <- structure(list(y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15,
> >> > 18, 36, 5, 24, 15, 40, 10), y2 = c(97, 10, 26, 11, 11, 10, 5,
> >> > 13, 19, 5, 5, 59, 4, 16, 10)), .Names = c("y1", "y2"))
> >> >
> >> > par(mfrow=c(1,2))
> >> > lapply(ml, hist)
> >> >
> >> > shows that the first year has more long waiting times
> >> >
> >> > lapply(ml, mean)
> >> >
> >> > shows (incorrectly) that in the second year there is a longer average
> >> > waiting time.
> >> >
> >> > lapply(ml, median)
> >> >
> >> > gives me completely reversed values.
> >> >
> >> > Can you please give me some hints on what to use for a "correct" and
> >> > "simple" comparison of waiting times in two or more periods?
> >> >
> >> > Thank you
> >> > Petr
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Jim Holtman
> >> Data Munger Guru
> >>
> >> What is the problem that you are trying to solve?