- Introduction
- Grouping by underlining
- Grouping using letters or symbols
- Simulated example
- Alternative CLDs
- Conclusions
- References

Compact letter displays (CLDs) are a popular way to display multiple comparisons, especially when there are more than a few means to compare. They are problematic, however, because they are prone to misinterpretation (more details later). Here we present some background on CLDs, and show some adaptations and alternatives that may be less prone to misinterpretation.

CLDs generalize an “underlining” technique shown in some old experimental design and analysis textbooks, where results may be displayed something like this:

```
trt1 ctrl trt3 trt2 trt4
----------
------------------
```

The observed means are sorted in increasing order, so in this
illustration, `trt1`

has the lowest mean, `ctrl`

has the next lowest, and `trt4`

has the highest. The
underlines group the means such that the extremes of each group are
*not* significantly different according to a statistical test
conducted at a specified alpha level. So in this illustration,
`trt1`

is significantly less than `trt3`

,
`trt2`

, and `trt4`

, but not `ctrl`

; and
in fact `trt4`

is significantly greater than all the
others.

This grouping also illustrates the dangers created by careless
interpretations. Some observers of this chart might say that
“`trt1`

and `ctrl`

are equal” and that
“`ctrl`

, `trt3`

, and `trt2`

are equal”
– when in fact we have merely failed to show they are different. And
further confusion results because mathematical equality is transitive –
that is, these two statements of equality would imply that
`trt`

and `trt2`

must be equal, seemingly
contradicting the finding that they are significantly different.
Statistical nonsignificance does *not* have the transitivity
property!

The underlining method becomes problematic in any case where the standard errors (SEs) of the comparisons are unequal – for example if we have unequal sample sizes, or a model with non-homogeneous variances. When the SEs are unequal, it is possible, for example, for two adjacent means to be significantly different, while two more distant ones do not differ significantly. If that happens, we can’t use underlines to group the means. The problem here is that lines are continuous, and that continuousness forces a continuum of groupings.

However, Piepho (2004) solved this problem by using *symbols*
instead of lines, and creating a display where any two means associated
with the same symbol are deemed to not be statistically different. Using
symbols, it is possible to have non-contiguous groupings, e.g., it is
possible for two means to share a symbol while an intervening one does
not share the same symbol. Such a display is called a *compact letter
display*. We do not absolutely require actual letters, just symbols
that can be distinguished from one another. In the case where all the
differences have equal SEs, the CLD will be the “same” as the result of
grouping lines, in that each distinct symbol will span a contiguous
range of means that can be interpreted as a grouping line.

The R package **multcompView** (Graves *et al.*,
2019) provides an implementation of the Piepho algorithm. The multcomp
package (Hothorn *et al.* 2008) provides a generic
`cld()`

function, and the **emmeans** package
provides a `cld()`

method for `emmGrid`

objects.

As a moving example, we simulate some data from an unbalanced design with 7 treatments labeled A, B, …, G; and fit a model to those

```
set.seed(22.10)
mu = c(16, 15, 19, 15, 15, 17, 16) # true means
n = c(19, 15, 16, 18, 29, 2, 14) # sample sizes
foo = data.frame(trt = factor(rep(LETTERS[1:7], n)))
foo$y = rnorm(sum(n), mean = mu[as.numeric(foo$trt)], sd = 1.0)
foo.lm = lm(y ~ trt, data = foo)
```

There are only four distinct true means underlying these seven
treatments: Treatments `B`

, `D`

, and
`E`

have mean 15, treatments `A`

and
`G`

have mean 16, and treatments `F`

and
`C`

are solo players with means 17 and 19 respectively.

Let’s see a compact letter display for the marginal means. (Call this
**CLD #1**)

```
foo.emm = emmeans(foo.lm, "trt")
library(multcomp)
cld(foo.emm)
```

```
## trt emmean SE df lower.CL upper.CL .group
## B 14.6 0.246 106 14.1 15.1 1
## E 15.0 0.177 106 14.6 15.3 1
## D 15.3 0.224 106 14.8 15.7 1
## G 15.3 0.254 106 14.8 15.9 1
## A 16.4 0.218 106 15.9 16.8 2
## F 16.6 0.673 106 15.2 17.9 12
## C 19.3 0.238 106 18.9 19.8 3
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 7 estimates
## significance level used: alpha = 0.05
## NOTE: If two or more means share the same grouping symbol,
## then we cannot show them to be different.
## But we also did not show them to be the same.
```

The default “letters” for the **emmeans** implementation
are actually numbers, and we have three groupings indicated by the
symbols `1`

, `2`

, and `3`

. This
illustrates a case where grouping lines would not have worked, as we see
in the fact that group `1`

is not contiguous. We have (among
other results) that treatment `A`

differs significantly from
treatments `B`

, `E`

, `D`

,
`G`

, and `C`

(at the default 0.05 significance
level, with Tukey adjustment for multiple testing). and that
`C`

is significantly greater than all the other means since
it is the only mean in group `3`

.

An annotation warns that two means in the same group are not
necessarily the same; yet CLDs present a strong visual message that they
are. The careless reader who makes this mistake will have trouble with
the gap in group `1`

, asking how `A`

can differ
from `G`

and yet `G`

and `F`

, are “the
same.” The explanation is that the SE of `F`

is huge, owing
to its very small sample size, so it is hard for it to be
*statistically* different from other means. It is almost a gift
to obtain a non-contiguous grouping like this, as it forces the user to
think more carefully about what these grouping do and do not imply.

Given the discussion above, one might wonder if it is possible to
construct a CLD in such a way that means sharing the same symbol
*are* actually shown to be the same? The answer is yes (otherwise
we wouldn’t have asked the question!) – and it is quite easy to do,
thanks to two things:

- The algorithm for making grouping letters is based on a matrix of
Boolean values associated with each pair of means, where we set the
value
`TRUE`

for any pair that is statistically different (those means must receive different grouping letters), and`FALSE`

otherwise; and the algorithm works for*any*such Boolean matrix - There is such a thing as
*equivalence testing*, by which we can establish with specified confidence that two means*do not*differ by more than a specified threshold \(\delta\). One simple way to do this is to conduct*two one-sided tests*(TOST) whereby we can conclude that two means are equivalent if we show both that the difference exceeds \(-\delta\) and is less than \(+\delta\). We can use this TOST method to set each Boolean pair to`FALSE`

is they are shown to be equivalent and`TRUE`

if not shown to be equivalent.

For our example, suppose, based on subject-matter considerations,
that two means that differ by less than 1.0 can be considered
equivalent. In the **emmeans** setup, we specify that we
want equivalence testing simply by providing this nonzero threshold
value as a `delta`

argument. In addition, we typically will
not make multiplicity adjustments to equivalence tests. Here is the
result we obtain (call this **CLD #2**)

`cld(foo.emm, delta = 1, adjust = "none")`

```
## trt emmean SE df lower.CL upper.CL .equiv.set
## B 14.6 0.246 106 14.1 15.1 1
## E 15.0 0.177 106 14.6 15.3 12
## D 15.3 0.224 106 14.8 15.7 2
## G 15.3 0.254 106 14.8 15.9 2
## A 16.4 0.218 106 15.9 16.8 3
## F 16.6 0.673 106 15.2 17.9 4
## C 19.3 0.238 106 18.9 19.8 5
##
## Confidence level used: 0.95
## Statistics are tests of equivalence with a threshold of 1
## P values are left-tailed
## significance level used: alpha = 0.05
## Estimates sharing the same symbol test as equivalent
```

So we obtain five groupings – but only two if we ignore those that
apply to only one mean. We have that treatments `B`

and
`E`

can be considered equivalent, and treatments
`E`

, `D`

, and `G`

are considered
equivalent. It is also important to know that we *cannot* say
that means in different groups are significantly different.

Unlike CLD #1, we are showing only groupings of means that we can
*show* to be the same. The first four means, which were grouped
together earlier, are now assigned to two equivalence groupings. And
treatment `F`

is not grouped with any other mean – which
makes sense because we have so little data on that treatment that we can
hardly say anything.

Another variation is to simply reverse all the Boolean flags we used
in constructing CLD #1. Then two means will receive the same letter only
if they are significantly different. Thus, we really obtain
*un*grouping letters. We label these groupings “significance
sets.” The resulting display has a distinctively different appearance,
because common symbols tend to be far apart rather than contiguous.
(Call this **CLD #3**)

`cld(foo.emm, signif = TRUE)`

```
## trt emmean SE df lower.CL upper.CL .signif.set
## B 14.6 0.246 106 14.1 15.1 1
## E 15.0 0.177 106 14.6 15.3 2
## D 15.3 0.224 106 14.8 15.7 3
## G 15.3 0.254 106 14.8 15.9 4
## A 16.4 0.218 106 15.9 16.8 1234
## F 16.6 0.673 106 15.2 17.9 5
## C 19.3 0.238 106 18.9 19.8 12345
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 7 estimates
## significance level used: alpha = 0.05
## Estimates sharing the same symbol are significantly different
```

Here we have five significance sets. By comparing with CLD #1, you
can confirm that each significant difference shown explicitly here
corresponds to one shown implicitly (by *not* sharing a group) in
CLD #1.

Compact letter displays show symbols based on statistical testing
results. In such tests, we have strong conclusions or *findings*
– those that have small *P* values, and weak conclusions or
*non-findings* – those where the *P* value is not less
than some \(\alpha\). When we create
visual flags such as grouping lines or symbols, those come across
visually as findings, and the problem with standard CLDs is that those
are the non-findings. We show two simple ways to use software that
creates CLDs so that actual findings are flagged with symbols. It is
hoped that people will find these modifications useful in visually
displaying comparisons among means.

Graves, Spencer, Piepho Hans-Pieter, Selzer, Luciano, and Dorai-Raj,
Sundar (2019). **multcompView**: Visualizations of Paired
Comparisons. R package version 0.1-8,
`https://CRAN.R-project.org/package=multcompView`

Hothorn, Torsten, Bretz, Frank, and Westfall, Peter (2008).
Simultaneous Inference in General Parametric Models. *Biometrical
Journal* **50(3)**, 346–363.

Piepho, Hans-Peter (2004). An algorithm for a letter-based
representation of all pairwise comparisons, *Journal of Computational
and Graphical Statistics* **13(2)** 456–466.