[R] queue waiting times comparison

Thu Aug 18 15:39:49 CEST 2011

If those values represent response times in a system, here is what we did
when I was responsible for characterizing system behavior for an SLA
(service level agreement) with the customers using the system: we usually
specified that "90% of the transactions would have a response time of ---
or less".  This took care of most "long tails".  So it depends on how you
are planning to use this data.  We usually monitored the 90th or 95th
percentile to see how a system was operating day to day.
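
A minimal sketch of that kind of percentile monitoring, using the waiting times from the original post quoted below. The 40-unit limit is an assumed, purely illustrative SLA threshold that does not come from this thread.

```r
# Waiting times copied from the original post quoted below
ml <- list(
  y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 18, 36, 5, 24, 15, 40, 10),
  y2 = c(97, 10, 26, 11, 11, 10, 5, 13, 19, 5, 5, 59, 4, 16, 10)
)

# Median, 90th and 95th percentile per period (quantile() default, type 7)
pct <- sapply(ml, quantile, probs = c(0.5, 0.9, 0.95))
print(pct)

# Hypothetical SLA check: share of waits at or below an assumed limit
sla_limit <- 40  # illustrative threshold, not from the thread
share_ok <- sapply(ml, function(x) mean(x <= sla_limit))
print(round(share_ok, 3))
```

With samples this small (18 and 15 values) the upper percentiles rest on only one or two observations each, so they are better read as rough indicators than as precise estimates.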

On Thu, Aug 18, 2011 at 8:52 AM, Petr PIKAL <petr.pikal at precheza.cz> wrote:
> Hallo Jim
>
> Thank you and see within text.
>
> jim holtman <jholtman at gmail.com> wrote on 18.08.2011 14:09:11:
>
>> I am not sure why you say that "lapply(ml, mean)" shows (incorrectly)
>> that the second year has a larger average; it is correct for the data:
>>
>> > lapply(ml, my.func)
>> $y1
>>     Count      Mean        SD       Min    Median       90%       95%       Max       Sum
>>  18.00000  16.83333  12.42980   4.00000  12.50000  37.20000  41.05000  47.00000 303.00000
>>
>> $y2
>>     Count      Mean        SD       Min    Median       90%       95%       Max       Sum
>>  15.00000  20.06667  25.27694   4.00000  11.00000  45.80000  70.40000  97.00000 301.00000
>>
>>
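
The definition of my.func is not shown anywhere in the thread. The following is a hypothetical reconstruction, assuming it simply collects the summary statistics named in the output header; the actual function may have differed.

```r
# Waiting times copied from the original post
ml <- list(
  y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 18, 36, 5, 24, 15, 40, 10),
  y2 = c(97, 10, 26, 11, 11, 10, 5, 13, 19, 5, 5, 59, 4, 16, 10)
)

# Hypothetical stand-in for my.func: one named vector of summaries per sample
my.func <- function(x) {
  c(Count  = length(x),
    Mean   = mean(x),
    SD     = sd(x),
    Min    = min(x),
    Median = median(x),
    quantile(x, c(0.90, 0.95)),  # contributes the names "90%" and "95%"
    Max    = max(x),
    Sum    = sum(x))
}

res <- lapply(ml, my.func)
print(res)
```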
>> You have a larger "outlier" in the second year that causes the mean to
>> be higher.  The median is lower, but I usually look at the 90th
>> percentile if I am looking at response time from a system and again
>> the second year has a higher value.
>>
>> So exactly why do you not "trust" your data?
>
> Well, I trust them; however, the mean is a "correct" central value only
> when the data are normally distributed or at least symmetric. As the
> values are heavily skewed, I feel that I should not use the mean to compare
> such sets. In any case, t.test finds no significant difference between y2
> and y1.
>
> > t.test(ml[[1]], ml[[2]])
>
>        Welch Two Sample t-test
>
> data:  ml[[1]] and ml[[2]]
> t = -0.452, df = 19.557, p-value = 0.6563
> alternative hypothesis: true difference in means is not equal to 0
> 95 percent confidence interval:
>  -18.17781  11.71115
> sample estimates:
> mean of x mean of y
>  16.83333  20.06667
>
> So based on this I will probably never get a conclusive result, because
> the sd will be quite high due to the "outliers".
>
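
One option that does not depend on the sd is a rank-based test. Below is a minimal sketch using wilcox.test() from the stats package; note that it addresses whether one sample tends to have larger values than the other, which is not the same question as a difference in means or in upper percentiles.

```r
# Waiting times copied from the original post
ml <- list(
  y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 18, 36, 5, 24, 15, 40, 10),
  y2 = c(97, 10, 26, 11, 11, 10, 5, 13, 19, 5, 5, 59, 4, 16, 10)
)

# Wilcoxon rank-sum (Mann-Whitney) test.
# exact = FALSE requests the normal approximation, since the data contain ties.
w <- wilcox.test(ml$y1, ml$y2, exact = FALSE)
print(w)
```

Because the test works on ranks, a single very long wait such as 97 counts only as "the largest value", so it cannot dominate the result the way it dominates the mean and sd.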
> When I do
> plot(ecdf(ml[[2]]))
> plot(ecdf(ml[[1]]), add=TRUE, col=2)
>
> it seems to me that both sets are almost the same and that they differ
> substantially only in those "outlier" values.
>
> If I decrease the small values of y2, e.g.
>
> ml[[2]][ml[[2]]<20] <- ml[[2]][ml[[2]]<20]/2
>
> I get nearly the same mean
>
> lapply(ml, mean)
> $y1
> [1] 16.83333
>
> $y2
> [1] 16.1
>
> and t.test again finds no significant difference between the two sets,
> although I know that most events take half the time and only a few last
> longer, so for me such a set is better (we improved performance most of
> the time, but there are still rare events that take a long time).
>
> plot(ecdf(ml[[2]]))
> plot(ecdf(ml[[1]]), add=TRUE, col=2)
>
> So the question remains: what procedure should I use to compare two
> or more sets with such long-tailed distributions? Trimmed mean? Median?
> ...
>
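
As a rough sketch of the options named above: the mean, a 10% trimmed mean, and the median can be compared directly, and a percentile bootstrap gives an interval for the difference in a tail statistic such as the 90th percentile. The seed, the number of resamples, and the choice of the 90th percentile are illustrative assumptions, not something settled in this thread.

```r
# Waiting times copied from the original post
ml <- list(
  y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 18, 36, 5, 24, 15, 40, 10),
  y2 = c(97, 10, 26, 11, 11, 10, 5, 13, 19, 5, 5, 59, 4, 16, 10)
)

# Location summaries that react differently to the long tail
robust <- sapply(ml, function(x) {
  c(mean = mean(x), trimmed = mean(x, trim = 0.1), median = median(x))
})
print(round(robust, 2))

# Percentile bootstrap for the difference in 90th percentiles (y2 - y1)
set.seed(1)     # arbitrary seed, only for reproducibility
n_boot <- 2000  # assumed number of resamples
boot_diff <- replicate(n_boot, {
  quantile(sample(ml$y2, replace = TRUE), 0.9, names = FALSE) -
    quantile(sample(ml$y1, replace = TRUE), 0.9, names = FALSE)
})
ci <- quantile(boot_diff, c(0.025, 0.975))
print(ci)
```

With only 15 to 18 observations per period, a bootstrap interval for an upper percentile will be wide and unstable, so this mainly shows how uncertain the tail comparison is.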
> Thanks.
>
> Regards
> Petr
>
>>
>> On Thu, Aug 18, 2011 at 7:49 AM, Petr PIKAL <petr.pikal at precheza.cz> wrote:
>> > Hallo all
>> >
>> > I am trying to find a way to compare sets of waiting times from
>> > different periods. I tried to learn something from queueing theory and
>> > also used R search. There are plenty of ways, but I need to find the
>> > easiest and simplest one.
>> > Here is a list with the actual waiting times.
>> >
>> > ml <- structure(list(y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15,
>> > 18, 36, 5, 24, 15, 40, 10), y2 = c(97, 10, 26, 11, 11, 10, 5,
>> > 13, 19, 5, 5, 59, 4, 16, 10)), .Names = c("y1", "y2"))
>> >
>> > par(mfrow=c(1,2))
>> > lapply(ml, hist)
>> >
>> > shows that the first year has more long waiting times
>> >
>> > lapply(ml, mean)
>> >
>> > shows (incorrectly) that in the second year there is longer average
>> > waiting time.
>> >
>> > lapply(ml, median)
>> >
>> > gives me completely reversed values.
>> >
>> > Can you please give me some hints on what to use for a "correct" and
>> > "simple" comparison of waiting times in two or more periods.
>> >
>> > Thank you
>> > Petr
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?