upndown Package Vignette 1: Up-and-Down Basics

Background

Up-and-Down designs (UDDs) have the unique distinction of being, in all likelihood, the most popular experimental design that present-day statisticians have never heard of. Developed mostly by statisticians and successfully introduced in the mid-20th Century to a staggering array of application fields, in subsequent decades academic statistical interest has wandered elsewhere.

Nevertheless, UDDs have persisted. Studies using UDDs are routinely published in anesthesiology, toxicology, materials science, dentistry, electrical engineering, and other fields – as well as the fields where the design was originally developed: sensory studies and explosive testing. Thus, UDDs are far from being an “endangered” design; rather, they are statistically under-served.

The upndown package attempts to close some of this service gap. It presents basic computational tools and shortcuts to help modernize and standardize UDD usage, and to render insights about the design more accessible to practitioners and statisticians. It incorporates the most up-to-date methodological improvements. We also provide methodological references for statisticians and analysts wishing to learn more about UDDs.

For UDD estimation and data visualization we rely heavily upon the cir package. That package codifies Centered Isotonic Regression (CIR),¹ an improvement of isotonic regression motivated by the needs of UDDs and of dose-response studies in general. The upndown utilities using cir functions emphasize ease of use, and do not necessitate in-depth familiarity with cir.

This vignette presents a general overview of UDDs, as well as re-analysis of published experimental examples which demonstrate upndown’s data analysis and visualization utilities. For a broader overview, see Oron et al. (2022) published in Anesthesiology², and a somewhat more technical overview by Oron and Flournoy (2024)³

A second package vignette (still in planning) will go deeper into UDD design rules and statistical properties, demonstrating the package’s design-aid utilities.

What Makes a Design “Up-and-Down”?

UDDs belong to the methodological field of dose-finding. This field aims to estimate a target dose, by applying different dose strengths of the same treatment (or stimulus) to a sequence of individuals (or in some applications, a sequence of doses to the same individual), and collecting their responses. Depending upon application, the target dose might be called the \(ED_{100p}\), \(MEV_{100p}\), or \(LD_{100p}\) (\(p\in(0, 1)\)), the sensory threshold, the Maximum Tolerated Dose (MTD), etc. In abstract statistical terms, this field attempts to find a specific percentile of an underlying cumulative response-threshold distribution (CDF) \(F(x),\) i.e., to estimate \(F^{-1}(p).\)

There is substantial confusion regarding the term “Up-and-Down”. From a uselessly broad perspective, any dose-finding designs in which consecutive dose assignments might go “up” or “down” can be called a UDD, and some articles come pretty close to claiming that. We prefer a narrower and far more useful definition: the hallmark of a genuine UDD is the target-centered random walk it generates over the dose levels. To do that, it requires the following:

A binary response (an ordered ternary response should be easily UDD-compatiable, but to our knowledge it has been neither studied nor attempted in practice).
Allowable treatments are a discrete set of increasing dose levels of the same “dose variable” \(x.\)
Monotone dose-response relationship. For simplicity we assume monotone increasing, and denote the relationship via the function \(F(x)\), which as hinted above can often be interpreted as the CDF of the underlying positive-response thresholds.
Doses are allocated sequentially, and only allow for increasing the dose by one level, decreasing by one level, or repeating the same dose. Hence the design’s name “up-and-down”, or (in sensory studies and materials testing) “the Staircase Method.”⁴
Dose-transition rules are based on the doses and responses of the last patient or several patients, rather than on all patient data going back to the beginning of the experiment. Furthermore, the rules do not use any estimated quantity that changes during the study.
Under the monotonicity assumption for \(F(x)\), these rules guarantee that the probability of moving up decreases with increasing \(x\), and vice versa for the probability of moving down.
Lastly, UDDs have no intrinsic mandatory stopping rules.

The fifth element might be partially responsible for UDDs falling out of statistical fashion. In Phase I trial design literature in particular, which happens to be the field enjoying the most dose-finding method development resources, current statistical orthodoxy maintains that anything short of using the entire sample and an estimation process for each dosing decision via some type of greedy algorithm is an “indefensible waste of precious information” (to paraphrase typical rhetoric found in opinion articles on the topic).

In our humble opinion, this view -

confounds the task of data collection - which may not necessarily benefit from carrying the entire dataset “on its back” at each step⁵ - with the task of estimation, at which point UDDs do use the entire sample;
brushes aside some basic statistical realities regarding small-sample, binary-response uncertainty;
remains willfully ignorant of some serious detrimental side-effects of the literature-favored approach, despite those side-effects having been demonstrated beyond doubt (see the reference cited above), and despite them being a well-known downside of greedy algorithms;
last but not least, ignores UDDs’ proven properties and impressive track record.

But I digress.

Experimental Examples

In selecting examples, I sought relatively recent open-access publications, preferably from different fields. The publications also had to share clearly both the design and the raw data.

Material Strength Testing

We begin with a study testing the single-tooth bending failure (STBF) load of steel helicopter gears, published by Gorla et al. 2017.⁶ The article summarized a long-running series of STBF experiments and simulations, and can perhaps be seen as a meta-analysis. The experiments were 9 separate median-finding, or as we call them “Classical” UDD runs on gears made of 9 different types of steel. In each such run, each tested gear was subjected to up to 10 million high-frequency cycles at a specific load strength, presumably approximating an entire gear-lifespan worth of loads.

If the gear survived these cycles without breaking a tooth, it was considered a success and the next gear was subject to a load \(1kN\) higher. (“up”).
If a tooth broke, it was a failure and the next gear received \(1kN\) lower load (“down”).

Classical UDD was first developed during World War II, to estimate the median effective dropping height of explosives.⁷ Independently, Nobel-Prize winning audiologist von Bekesy developed a near-identical design to estimate the hearing threshold.⁸

Gorla et al. visualize the raw data of their 9 experiments (Tables 4-12) in the very same manner that Dixon and Mood visualized an explosive-testing dataset in their 1948 article: as a text graph with “x” and “o” symbols. We re-plot the data in the somewhat more modern fashion encountered in other fields. We share two of the experiments, here’s the first one:

library(upndown)

# From Gorla et al. (2017) Table 6
gorla751x = 39 + c(3:0, 1, 2, 1:3, 2, 3, 2, 3)
# With UD data, one can usually discern the y's (except the last one) from the x's:
gorla751y =  c( (1 - diff(gorla751x) ) / 2, 1)

udplot(x=gorla751x, y=gorla751y, main = "Gorla et al. 2017, Material 751", 
       ytitle = "Load (kN)", las = 1)

legend('bottomright', legend = c(expression('Survived 10'^7*' cycles'), 
                              'Failed'), pch=c(1, 19), bty = 'n')

This run had \(n=13,\) which for live-subject experiments would be on the low side. Given that these gears are industrial products which likely undergo tight process control, such a small \(n\) might suffice - as long as the experiment doesn’t start too far from the true target.
That said, the random walk nature can be seen in how the first half of the run is a bit different from the second. After 6-7 observations, one might jump to conclude that the target ( = median failure threshold) is surely below 41 kN, probably around 40 kN. But the last 6 tell a rather different story, of the target tightly locked between 41 and 42 kN.

Which half is correct? We don’t know. Each half provides a ridiculously small amount of information, and even combined they leave a lot of uncertainty (if the uncertainty is properly accounted for).

Here’s the dose-response plot, and below it the CIR target estimate which is our recommended standard:


drplot(x=gorla751x, y=gorla751y, addest = TRUE, target = 0.5, addcurve = TRUE, 
       percents = TRUE, main = "Gorla et al. 2017, Material 751", 
       xtitle = "Load (kN)", ytitle = "Percent Failure", las = 1)
legend('bottomright', legend = c('CIR target estimate and 90% CI',
              'CIR load-failure curve estimate'), lty = 1, 
              col = c('purple', 'blue'), bty = 'n')

       
udest(gorla751x, gorla751y, target = 0.5)
#   target    point lower90conf upper90conf
# 1    0.5 41.17241    39.57807     41.7665

The 'X' marks denote raw response frequencies, with symbol area proportional to sample size.

Note that indeed the confidence interval (CI) is fairly wide. It is also asymmetric, mostly because cir dose-finding functions (starting version 2.3.0) separately estimate the local slope of \(F(x)\) to the right and left of target. Note also that upndown default (and cir default too) is a \(90\%\) CI rather than \(95\%\). You hopefully agree that with 13 dependent binary observations, pretending to know \(95\%\) of anything is a bit overly ambitious.

CIR implicitly assumes that the dose-response (in this particular case, load-failure) curve is smooth and strictly monotone. Hence, CIR shrinks the characteristic flat stretches produced by isotonic regression onto single points, then interpolates between those points. The CI reported above and shown in the figure, is the cir package’s other flagship product: the first-ever realistic small-sample CI for practical isotonic-regression applications.

In their own article, Gorla et al. used a much older UD estimator suggested by Brownlee et al. in 1953.⁹ Like most early UD estimators, it is some average of the sequence of doses, and the first one to deploy the trick of adding the \(n+1\)st dose, which is pre-determined by the \(n\)th dose and response. In addition, the estimator excludes the first sequence of doses with identical responses - in UD parlance, the doses before the first reversal. This yields 40.91 kN. upndown includes a generic implementation of reversal-driven, dose-averaging estimates, so this number can be replicated by choosing the right arguments:

# Note the manual addition of "dose n+1"
reversmean(c(gorla751x, 41), gorla751y, rstart = 1, all = TRUE, conf = NULL)
# [1] 40.90909

The conf = NULL argument suppresses the default behavior which also calculates a bootstrap-based CI. Our CI for dose-averaging estimates is even newer than the CIR CI, and remains unpublished with no near-term plans to publish it - because its coverage generally falls somewhat short. Yet, we believe that point estimates should come with a CI, and our dose-averaging CI is better than any of the ones you might encounter in UDD studies with dose-averaging estimates. Just like CIR is a modification of isotonic regression, our UD bootstrap algorithm improves upon one published by Stylianou et al., which tends to produce CIs that are far too narrow.¹⁰

Without further ado, here’s the estimate with the bootstrap CI:

# Note the manual addition of "dose n+1"
reversmean(c(gorla751x, 41), gorla751y, rstart = 1, all = TRUE, 
                                   target = 0.5, design = krow, desArgs = list(k=1) )
# Loading required namespace: plyr
# Sun Jun 15 19:55:59 2025 
# .............
#  Sun Jun 15 19:56:00 2025
#       point lower90conf upper90conf 
#    40.90909    39.46154    41.53846

The dots report the bootstrap simulation’s progress; it progresses in parallel, one dot per observation (there is no explicit parallel processing; however, it is mostly matrices so it tends to run very fast). The default has 1000 replications, which on my laptop takes 1 second for this \(n=13\) dataset. Behind the scenes, \(F(x)\) is estimated via some version of CIR, which is then used to generate the bootstrap samples. For usage information, see the help files for dfboot, dfsim, and krow. You will need to read them in order to specify the arguments correctly and produce the relevant CI, because reversmean() will throw an error message about missing arguments unless conf = NULL.

Just like the CIR CI, our dose-averaging CI for this specific dataset is also left-skewed asymmetric, albeit slightly narrower. Gorla et al. did not report a CI nor a standard error for their point estimate.

We generally recommend using CIR for estimating the target dose and the confidence interval. It is more robust than dose-averaging estimators, and likewise its CI coverage is more consistent. See, e.g., the supplement to Oron et al. 2022.

Important Note: Observation Bias and Empirical Correction

Estimating the target with CIR is as simple as reading off the dose-response plot, and seeing where the CIR curve crosses \(y=target.\)

There is one catch though: the up-and-down process induces dependence between consecutive doses and responses. Therefore, even though each response in the Gorla et al. experiments is independent conditional upon the dose it received, responses are not marginally independent and they do not behave according to Binomial assumptions - contrary to common misconceptions in the dose-finding field (of which yours truly had also been guilty in the past).

In particular: the dependence induces a bias upon observed response rates, a bias that tends to “flare out” away from target.¹¹ Near target the bias is minimal, and this is why up-and-down and other dose-finding designs still work - but using observed frequencies away from target at face value is not recommended.

The blue curve in the dose-response plot does include an empirical bias fix, based on that “flaring out” property, which is why it doesn’t cross the middle of the ‘X’ marks denoting observed frequencies. The main motivation for putting that fix in as default, is not to get the curve right - that would be a bit of a fool’s errand, because dose-finding designs are really about estimating a point, not an entire curve - but to get the confidence intervals more realistic. The “flaring-out” bias makes \(F(x)\)’s apparent slope too steep. The bias correction mitigates that steepness and hence widens the CI.

What if we do try to estimate away from target? Here is the (relatively) proper way to do it. Assume we’re interested in knowing a “safe” load, under which only \(5\%\) of gears are expected to fail:

udest(x = gorla751x, y = gorla751y, target = 0.05, balancePt = .5)
# Warning in udest(x = gorla751x, y = gorla751y, target = 0.05, balancePt = 0.5): 
# We strongly advise against estimating targets this far from the design's balance point.
#   target    point lower90conf upper90conf
# 1   0.05 39.13333    37.44887    39.52263

We change the target argument to the desired percentile. But we must also inform udest() that the experiment was actually designed to revolve around another percentile, which we call the balance point.¹² The bias would flare out from there, not from where we want to estimate now. As you might note, udest() kindly reminds us that this is not recommended. (If balancePt is left empty, udest() assumes it is equal to target.)

Suppose you’re an engineer and you want to find out that safe load? Design an experiment that targets a lower percentile (we’ll discuss that soon). Or an experiment that estimates an entire stretch of the load-failure curve. The latter option would not be an UDD though.

On to the second experiment, presented below in one fell swoop.

op <- par(no.readonly = TRUE)

par(mfrow=1:2, mar=c(4,4,4,1))
# From Gorla et al. (2017) Table 9
gorla951x = 35 + c(1:0, 1:4, 3:2, 3:0, 1, 2, 1)
gorla951y =  c( (1 - diff(gorla951x) ) / 2, 1)

udplot(x=gorla951x, y=gorla951y, main = "Gorla et al. 2017, Material 951", 
       ytitle = "Load (kN)", las = 1)

drplot(x=gorla951x, y=gorla951y, addest = TRUE, target = 0.5, addcurve = TRUE, 
       percents = TRUE, main = "Gorla et al. 2017, Material 951", 
       xtitle = "Load (kN)", ytitle = "Percent Failure", las = 1)


udest(gorla951x, gorla951y, target = 0.5)
#   target    point lower90conf upper90conf
# 1    0.5 36.26829    35.28684     38.4467

par(op) # Back to business as usual ;)

This run was a bit less well-behaved than 751, and produced a monotonicity violation in observed response frequencies. CIR (and isotonic regression in general) was developed to deal with that, but as a result the confidence interval is even wider than the 751 estimate’s CI, despite having 2 (wow!) additional data points.

Curiously, Gorla et al. mixed and matched two estimators in their study. For this run, they used the original Dixon-Mood estimator (sometimes called “Dixon-Massey”), which is also a weighted average of doses but in fact a more sophisticated one than Brownlee et al.’s. Using that, they got 36.63 kN with a standard error of 0.62 kN, which translates to a \(90\%\) CI width of 2.0 kN (since Dixon and Mood recommend using Normal tables rather than, e.g., \(t\)). Our CI is \(\sim 1.6x\) wider.

We do have a version of the Dixon-Mood in our package as well.

dixonmood(x=gorla951x, y=gorla951y)
# [1] 36.78571

The value is a bit different from Gorla et al.’s, because there are a myriad variations in how that 1948-vintage estimator is implemented. Our package uses the formula as provided in the original article. However, we consider that estimator rather obsolete, and provide it only for comparisons like this. We also do not provide a CI or standard-error estimate for it, so as not to encourage usage of this obsolete estimator.

\(ED_{90}\) of Vasopressor during Caesarean Section

Whew! We move from the “Wild West” of engineering UDD applications, to a field where the design is used with substantially more recent methodology. Anesthesiology is probably where one finds nowadays the richest and best-informed variety of UDD application articles; much thanks to an influential outreach/tutorial article published in 2007 by Pace and Stylianou.¹³

Pace and Stylianou have successfully nudged anesthesiologists towards using isotonic regression for estimation instead of mid-20th-Century improvisations, and also introduced to that field a UDD that can target percentiles other than the median. Anesthesiologists find much use for estimating a high percentile - the 80th, 90th or the 95th - to identify a dose that would work most of the time, without being excessive.

The Up-and-Down Biased Coin Design (BCD) (not to be confused with BCDs in other applications) is part of a 1990s body of work by Durham, Flournoy, and others, the first substantial modern revisitation of UDDs after 20 years of near-silence about them in the methodological literature.¹⁴ BCD can target any percentile \(p.\) For targets above the median -

After a negative response, move up;
After a positive response, “toss a biased coin” and with probability \((1-p)/p\) move down;
If the toss “fails”, the next patient/specimen/etc. will receive the same dose as the last one.

For targets below the median, flip the coin-tossing to be carried out after a negative response, and invert the toss probability to \(\ p/(1-p).\)

The anesthesiology study we chose was run by an international team, including a leading popularizer of UDDs in Anesthesiology, Malachy Columb.¹⁵ They sought to estimate the \(ED_{90}\) of phenylephrine for treatment of hypotension during Cesarian birth.

Unlike machined helicopter gears, people do not undergo industrial process control; we come in a beautiful diversity of shapes, sizes, ages, and sensitivities. As seen in the previous examples, even with an industrially-controlled sample, \(n\) of 13-15 seems very questionable for dose-finding. With a sample of humans or other animals, such a sample size is woefully insufficient. Generally, anesthesiologists have been rather responsible in using more realistic sample sizes for UDD studies; typical samples encountered in that field are \(n=30-40\) for median-finding, and larger samples for extreme percentiles. The George et al. study described here ended up with \(n=45\) (Which all things considered, might still be a tad short for the target they chose).

The context was Cesarean delivery, during which many mothers experience hypotension. Phenylephrine is one of the recommended treatments, and the team wanted to estimate the \(ED_{90}.\) Inspired by Pace and Stylianou’s article, they chose BCD, albeit to simplify the “toss” probability they rounded it down from \(1/9\) to \(1/10.\) Since the treatment would be administered only to mothers experiencing hypotension, they initially enrolled 65 mothers, 45 of whom were treated for hypotension. The starting dose was the one most commonly used in practice: \(100\ \mu g.\)

george10x = 80 + 20 * c(1, rep(2, 5), 1, 1, 0, 0, rep(1, 7), 0:2, 2, 2, rep(1, 4), 2, 1, 1, 2, 2, 
                        rep(3, 5), 4, 5, 5, rep(4, 6))
george10y = c(ifelse(diff(george10x) > 0, 0, 1), 1)
udplot(x=george10x, y=george10y, ytitle = dlabel )

The study’s UDD balance point was \(p=10/11,\) even as the experimental target - the percentile to be estimated - was still \(0.9\). This is certainly close enough to not worry about that pesky adaptive-response bias, but we should use the actual balance point to get our numbers right.

On to the dose-response curve and the estimates!

The CI veering off further to the right rather than to the left (which could seem plausible looking at the ‘X’ marks) might raise questions. Here are some insights:

Remember \(y\) cannot exceed 1. Therefore, there’s a limit on how much \(F(x)\) can move to the left while remaining reasonably close to the data and below 1. Equivalently, the dose of \(100\ \mu g\) which remains outside the CI despite having a 13 of 17 (\(76\%\)) success rate is both “buffered” from the point estimates by the doses of 120 and 140 having even higher observed rates, and itself having relatively many observations increasing the confidence that indeed, the rate at that dose is below target.
(aside: to avoid misleading the reader, our CI process does not simulate various curves; the limited ability of curves to “move left” is represented via the forward (\(F(x)\)) CIs having to remain below 1)
Conversely, to the right of the point estimate, there’s more “room” for the curve to move (i.e., the lower forward bounds are much further down from the point estimate), and the shallowness of the estimated CIR curve suggests the ED\(_{90}\) estimate can indeed move quite a bit rightward (and/or become even shallower) while remaining close to the data.

Generally with high/low targets, with our CI method one might expect the “external” confidence bound to be further away. We feel this does represent genuine uncertainty regarding how \(F(x)\) levels off towards the \(y=0\) or \(y=1\) boundary. In many cases, only a much larger sample size would narrow that down considerably.

Ok, that’s it for now! I hope you find the package useful.

`upndown` Package Vignette 1: Up-and-Down Basics

Assaf Oron

2025-06-15