[R] Curve Fitting/Regression with Multiple Observations

Kyeong Soo (Joseph) Kim kyeongsoo.kim at gmail.com
Fri Apr 30 19:00:42 CEST 2010


Dear Andy,

You're the "kind soul" I mentioned in my previous e-mail!

Certainly yours is the kind of response I've been looking for, and now
I can start with that, especially "splinefun()" with the "monoH.FC"
method.
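
A minimal sketch of that starting point, assuming the replications are
first averaged per x value and that the resulting means are themselves
increasing (the "monoH.FC" interpolant is monotone only if the data
are). The data frame `sim`, its numbers, and the helper `f_inv` are
hypothetical and serve only as illustration; the inverse is obtained
numerically with uniroot().

```r
# Hypothetical replicated data: 3 runs at each of 5 x values (made-up numbers).
sim <- data.frame(
  x = rep(1:5, each = 3),
  y = c(1.2, 0.9, 1.3,  2.2, 1.7, 2.0,  3.1, 2.8, 3.3,
        4.6, 4.9, 4.4,  7.2, 6.8, 7.5)
)

# Collapse the replications to one mean per x value.
m <- aggregate(y ~ x, data = sim, FUN = mean)

# Monotone Hermite interpolation (Fritsch-Carlson) through the means.
f <- splinefun(m$x, m$y, method = "monoH.FC")

# Numerical inverse: for each y, solve f(x) = y on the observed x range.
f_inv <- function(y) {
  sapply(y, function(yi)
    uniroot(function(x) f(x) - yi, interval = range(m$x), tol = 1e-9)$root)
}

x_hat <- f_inv(c(1.5, 3, 6))   # x values whose fitted y is 1.5, 3 and 6
```

Note that this interpolates the means exactly, so any noise left in
the means shows up in the curve; the inverse is only defined for y
values between the smallest and largest mean.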

As for my simulation data, your understanding is correct; there are
multiple y values from different replications for the same x values.
Even though there are multiple y values for a given x value, this
could be interpreted as the combination of multiple, different random
components (inherent in any Monte Carlo simulation) + one fixed,
unknown deterministic component. So the underlying assumption is that
there is a one-to-one (monotone) function between x and y.
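
As a rough illustration of the "fixed component + random component"
view, the sketch below fits the same least-squares curve to the raw
replications and to the per-x means. It assumes a balanced design
(the same number of replications at every x) and uses a quadratic
polynomial purely as a stand-in for whatever curve is appropriate; the
data values are made up. Under those assumptions the fitted
coefficients coincide, and only the residual degrees of freedom (and
hence the error estimate) differ.

```r
# Hypothetical balanced data: 3 replications at each x (made-up numbers).
sim <- data.frame(
  x = rep(1:5, each = 3),
  y = c(1.2, 0.9, 1.3,  2.2, 1.7, 2.0,  3.1, 2.8, 3.3,
        4.6, 4.9, 4.4,  7.2, 6.8, 7.5)
)

# Fit directly on all replications.
fit_raw <- lm(y ~ poly(x, 2, raw = TRUE), data = sim)

# Fit on the per-x means only.
means    <- aggregate(y ~ x, data = sim, FUN = mean)
fit_mean <- lm(y ~ poly(x, 2, raw = TRUE), data = means)

# With equal replication the two sets of coefficients agree.
coef_gap <- max(abs(coef(fit_raw) - coef(fit_mean)))

# The raw fit keeps the replication-to-replication scatter in its residuals.
df_raw  <- df.residual(fit_raw)    # 15 observations - 3 coefficients
df_mean <- df.residual(fit_mean)   #  5 means        - 3 coefficients
```

With unequal numbers of replications the two fits would no longer
agree unless the fit on the means is weighted by the replication
counts.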

This is typical of many computer simulations in networking. As said
before, for instance, you can get a nice, closed-form (monotone)
function of utilization (i.e., \rho) for the average delay of
customers in an M/M/1 queue. The simulation with
different random seeds, however, gives slightly different average
delays for a given utilization per run. Still, we know from the
underlying model that there is one-to-one correspondence between the
utilization and the average delay. Of course, unlike the simple M/M/1
queue, for most actual networking systems under analysis, we don't know
the exact models, but it is well accepted and assumed in nearly all
existing work in this area that there is still one-to-one
correspondence between the utilization (or system load) and
performance measures like delay, throughput, and packet loss.
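
The M/M/1 case can be sketched as follows. For service rate mu and
utilization rho, the mean time in system is 1/(mu * (1 - rho)), which
is increasing in rho. The simulation below is only an illustrative
sketch using the Lindley recursion for waiting times; the function
names, the run length n, and the seeds are assumptions made for this
example, and the queue starts empty, so a small warm-up bias is
ignored.

```r
# Closed-form mean time in system for an M/M/1 queue.
mm1_theory <- function(rho, mu = 1) 1 / (mu * (1 - rho))

# One simulation run: mean time in system over n customers.
mm1_mean_delay <- function(rho, mu = 1, n = 20000, seed = 1) {
  set.seed(seed)
  a <- rexp(n, rate = rho * mu)   # interarrival times (arrival rate = rho * mu)
  s <- rexp(n, rate = mu)         # service times
  w <- numeric(n)                 # waiting time in queue; first customer waits 0
  for (i in 2:n) {
    # Lindley recursion: wait = previous wait + previous service - gap, floored at 0
    w[i] <- max(0, w[i - 1] + s[i - 1] - a[i])
  }
  mean(w + s)                     # time in system = waiting + service
}

# Five replications at rho = 0.5, differing only in the random seed.
runs <- sapply(1:5, function(seed) mm1_mean_delay(0.5, seed = seed))
```

Each element of `runs` is one noisy observation of the same
underlying value mm1_theory(0.5), which is the situation described
above: several y values per x, scattered around one deterministic
curve.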

I do appreciate your suggestion and this would be of tremendous help
for my current research.
Also, thanks for your assessment of this list, which I will take as
valuable advice for the future.

With Regards,
Joseph


On Fri, Apr 30, 2010 at 12:52 PM, Liaw, Andy <andy_liaw at merck.com> wrote:
> You may want to run
>
> RSiteSearch("monotone splines")
>
> at the R prompt.  The 3rd hit looks quite promising.  However, if I
> understand your data, you have multiple y values for the same x
> values.  If so, can you justify inverting the regression function?
>
> The traffic on this mailing list is very high, and the signal to
> noise ratio is rather low.  This has the tendency of burning out
> those who started with good intentions to help.
>
> Andy
>
> From: Kyeong Soo (Joseph) Kim
>>
>> Dear Keith,
>>
>> Thanks for the suggestion and taking your time to respond to it.
>>
>> But you misunderstand something, and it seems that you did not read all
>> my previous e-mails.
>> For instance, can a hand-drawn curve give you an inverse function
>> (analytically or numerically) so that you can find an x value given
>> the y value (not just for one, but for hundreds of points)?
>>
>> As for the statistical inferences, I admit that my communications were
>> not very clear. My intention is to get a smoothed curve from the
>> simulation data in a statistically meaningful way as much as possible
>> for my intended use of the resulting curve.
>>
>> As said before, I don't know all the thorough theoretical details
>> behind regression and curve fitting functions available in R (know the
>> basics though as one with PhD in Elec. Eng. unlike someone's
>> assessment), but am doing my best to catch up reading textbooks and
>> manuals, and posting this question to this list is definitely a way to
>> learn from many experts and advanced users of R.
>>
>> By the way, I wonder why most of the responses I've received from this
>> list are so cynical (or skeptical?) and in some sense done in a quite
>> arrogant way. It's very hard to imagine that one would receive such
>> responses in my own areas of computer simulation and optical
>> communications/networking. If a newbie asks a question to the list not
>> making much sense or another FAQ, that is usually ignored (i.e., no
>> response) because we are all too busy to deal with that. Sometimes,
>> though, a kind soul (like Gabor) takes his/her own valuable time and
>> doesn't mind explaining all the details from simple basics.
>>
>> Again, what I want to hear from the list is the proper use of
>> regression/curve fitting functions of R for my simulation data with
>> replications: Applying after taking means or directly on them? So far
>> I haven't heard anyone even specifically touching my question,
>> although there were several seemingly related suggestions.
>>
>> Regards,
>> Joseph
>>
>> On Fri, Apr 30, 2010 at 4:25 AM, kMan <kchamberln at gmail.com> wrote:
>> > Dear Joseph,
>> >
>> > If you do not need to make any inferences, that is, you just want
>> > it to look pretty, then drawing a curve by hand is as good a
>> > solution as any. Plus, there is no reason for expert testimony to
>> > say that the curve does not mean anything.
>> >
>> > Sincerely,
>> > KeithC.
>> >
>> > -----Original Message-----
>> > From: Kyeong Soo (Joseph) Kim [mailto:kyeongsoo.kim at gmail.com]
>> > Sent: Tuesday, April 27, 2010 2:33 PM
>> > To: Gabor Grothendieck
>> > Cc: r-help at r-project.org
>> > Subject: Re: [R] Curve Fitting/Regression with Multiple Observations
>> >
>> > Frankly speaking, I am not looking for such a framework.
>> >
>> > The system I'm studying is a communication network (like an M/M/1
>> > queue, but way too complicated to analyze mathematically using
>> > classical queueing theory) and the conclusion I want to make is
>> > qualitative rather than quantitative -- a high-level comparative
>> > study of various network architectures based on the "equivalence
>> > principle" (a concept specific to networking, not in the general
>> > sense).
>> >
>> > What I want in this regard is a smooth, non-decreasing (hence
>> > one-to-one) function built out of simulation data because later in
>> > my processing, I need an inverse function of the said curve to find
>> > out an x value given the y value. That was, in fact, the reason I
>> > used the exponential (i.e., non-decreasing function) curve fitting.
>> >
>> > Even though I don't need a statistical inference framework for my
>> > work, I want to make sure that my use of regression/curve fitting
>> > techniques with my simulation data (as a tool for getting the
>> > mentioned curve) is proper and a usual practice among experts like
>> > you.
>> >
>> > To get an answer to my question, I dug a lot through the Internet
>> > but found no clear explanation so far.
>> >
>> > Your suggestions and examples (always!) are much appreciated, but
>> > I am still not sure that using those regression procedures with
>> > the kind of data I described is the right thing to do.
>> >
>> > Again, many thanks for your prompt and kind answers, Joseph
>> >
>> >
>> > On Tue, Apr 27, 2010 at 8:46 PM, Gabor Grothendieck
>> > <ggrothendieck at gmail.com> wrote:
>> >> If you are looking for a framework for statistical inference you
>> >> could look at additive models as in the mgcv package which has a
>> >> book associated with it if you need more info. e.g.
>> >>
>> >> library(mgcv)
>> >> fm <- gam(dist ~ s(speed), data = cars)
>> >> summary(fm)
>> >> plot(dist ~ speed, cars, pch = 20)
>> >> fm.ci <- with(predict(fm, se = TRUE),
>> >>               cbind(0, -2*se.fit, 2*se.fit) + c(fit))
>> >> matlines(cars$speed, fm.ci, lty = c(1, 2, 2), col = c(1, 2, 2))
>> >>
>> >>
>> >> On Tue, Apr 27, 2010 at 3:07 PM, Kyeong Soo (Joseph) Kim
>> >> <kyeongsoo.kim at gmail.com> wrote:
>> >>> Hello Gabor,
>> >>>
>> >>> Many thanks for providing actual examples for the problem!
>> >>>
>> >>> In fact I know how to apply and generate plots using various R
>> >>> functions including loess, lowess, and smooth.spline procedures.
>> >>>
>> >>> My question, however, is whether applying those procedures
>> >>> directly on the data with multiple observations/duplicate
>> >>> points(?) is on a sound basis or not.
>> >>>
>> >>> Before asking my question to the list, I checked the smooth.spline
>> >>> manual page and found the mention of the "cv" option related to
>> >>> duplicate points, but I'm not sure "duplicate points" in the
>> >>> manual has the same meaning as "multiple observations" in my case.
>> >>> To me, the manual seems a bit unclear in this regard.
>> >>>
>> >>> Looking at the "cars" data, I found it has multiple points with the
>> >>> same "speed" but different "dist", which is exactly what I mean by
>> >>> multiple observations, but am still not sure.
>> >>>
>> >>> Regards,
>> >>> Joseph
>> >>>
>> >>>
>> >>> On Tue, Apr 27, 2010 at 7:35 PM, Gabor Grothendieck
>> >>> <ggrothendieck at gmail.com> wrote:
>> >>>> This will compute a loess curve and plot it:
>> >>>>
>> >>>> example(loess)
>> >>>> plot(dist ~ speed, cars, pch = 20)
>> >>>> lines(cars$speed, fitted(cars.lo))
>> >>>>
>> >>>> Also this directly plots it but does not give you the values of
>> >>>> the curve separately:
>> >>>>
>> >>>> library(lattice)
>> >>>> xyplot(dist ~ speed, cars, type = c("p", "smooth"))
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Apr 27, 2010 at 1:30 PM, Kyeong Soo (Joseph) Kim
>> >>>> <kyeongsoo.kim at gmail.com> wrote:
>> >>>>> I recently came to realize the true power of R for statistical
>> >>>>> analysis -- mainly for post-processing of data from large-scale
>> >>>>> simulations -- and have been converting many of existing
>> >>>>> Python(SciPy) scripts to those based on R and/or Perl.
>> >>>>>
>> >>>>> In the middle of this conversion, I revisited the problem of
>> >>>>> curve fitting for simulation data with multiple observations
>> >>>>> resulting from repetitions.
>> >>>>>
>> >>>>> In the past, I first processed simulation data (i.e., multiple
>> >>>>> y's from repetitions) to get a mean with a confidence interval
>> >>>>> for a given value of x (independent variable) and then applied a
>> >>>>> spline procedure to those mean values only (i.e., unique pairs
>> >>>>> of (x_i, y_i) for i=1, 2, ...) to get a smoothed curve. Because
>> >>>>> of rather large confidence intervals, however, the resulting
>> >>>>> curves were hardly smooth enough for my purpose, so I had to fix
>> >>>>> the function to an exponential and use least-squares methods to
>> >>>>> fit its parameters to the data.
>> >>>>>
>> >>>>> From a plot with confidence intervals, it's rather easy for one
>> >>>>> to visually and manually(?) figure out a smoothed curve for it.
>> >>>>> So I'm thinking right now of directly applying spline (or
>> >>>>> whatever regression procedures for this purpose) to the
>> >>>>> simulation data with repetitions rather than means. The
>> >>>>> simulation data in this case looks like this (assuming three
>> >>>>> repetitions):
>> >>>>>
>> >>>>> # x    y
>> >>>>> 1      1.2
>> >>>>> 1      0.9
>> >>>>> 1      1.3
>> >>>>> 2      2.2
>> >>>>> 2      1.7
>> >>>>> 2      2.0
>> >>>>> ...      ....
>> >>>>>
>> >>>>> So my idea is to let spline procedure handle the fluctuations in
>> >>>>> the data (i.e., in repetitions) by itself.
>> >>>>> But I wonder whether this direct application of spline
>> >>>>> procedures for data with multiple observations makes sense from
>> >>>>> the statistical analysis (i.e., theoretical) point of view.
>> >>>>>
>> >>>>> It may be a stupid question and quite obvious to many, but
>> >>>>> personally I don't know where to start.
>> >>>>> It would be greatly appreciated if anyone can shed some light
>> >>>>> on this.
>> >>>>>
>> >>>>> Many thanks in advance,
>> >>>>> Joseph
>> >>>>>
>> >>>>> ______________________________________________
>> >>>>> R-help at r-project.org mailing list
>> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >>>>> PLEASE do read the posting guide
>> >>>>> http://www.R-project.org/posting-guide.html
>> >>>>> and provide commented, minimal, self-contained, reproducible
>> >>>>> code.
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >
>> >
>> >
>> >
>>
>>


