[R] Diamond graphs, again.
Richard A. O'Keefe
ok at cs.otago.ac.nz
Thu Sep 25 08:26:35 CEST 2003
Some time ago I was allowed to discuss "Diamond Graphs", and whether
they would be useful in R, in this mailing list.
The August 2003 issue of The American Statistician has finally arrived
here and I have been able to read the article. A number of points of
interest arise.
1. The article is
"A Diamond-Shaped Equiponderant Graphical Display of the
Effects of Two Categorical Predictors on Continuous Outcomes"
by Xiuhong Li, Jennifer M. Buechner, Patrick M. Tarwater,
and Alvaro Muñoz.
The American Statistician, August 2003, pages 193-199.
(All of the family names are displayed in small caps except for
Xiuhong Li's. Does anyone know why?)
2. There are three examples in the paper.
a. Figures 2 and 3 display likelihood of developing AIDS
as the thing to be explained, with plasma HIV-RNA level
(measured in copies per ml) and degree of immune deficiency
(measured in count of CD4+ T-lymphocyte cells per cubic mm)
as the explanatory variables.
The explanatory variables are continuous, not categorical.
If the raw data were used, it would be possible to estimate
a 2d probability density and display that. The variables have
been made categorical by cutting to 5 levels each.
I managed to get a copy of the article this is based on. It
certainly isn't clear from the diamond graph that the viral load
categories were approximately quartiles, with the bottom
quartile split in two. So two of the viral load categories have
less data than the other three. Much the same happened with the
CD4+ levels, which is also not apparent.
Using a diamond graph instead of a density plot with rugs on the
margins hides the amount of data available for estimating the
cells (6 of the 25 cells are empty because data are missing).
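To illustrate the point about unequal categories, here is a minimal sketch in R. It uses simulated data only (the variable names, distributions, and sample size are assumptions, not anything from the study) and cuts two continuous predictors at quartiles with the bottom quartile split in two, then tabulates how much data lands in each category and cell.

```r
# Simulated data only: nothing here comes from the AIDS study.
set.seed(1)
n <- 1000
viral_load <- rlnorm(n, meanlog = 10, sdlog = 1.5)  # hypothetical copies/ml
cd4        <- rgamma(n, shape = 4, scale = 100)     # hypothetical cells/mm^3

# Quartile cut points, with the bottom quartile split in two.
probs   <- c(0, 0.125, 0.25, 0.5, 0.75, 1)
vl_cat  <- cut(viral_load, quantile(viral_load, probs), include.lowest = TRUE)
cd4_cat <- cut(cd4, quantile(cd4, probs), include.lowest = TRUE)

# Marginal counts: two categories hold about half as much data as the others.
vl_counts <- table(vl_cat)
print(vl_counts)

# Cell counts for the 5 x 5 layout that a diamond graph would use.
cell_counts <- table(vl_cat, cd4_cat)
print(cell_counts)
```

The simulated predictors are independent, so no cell comes out empty here; with real, correlated predictors some cells can be sparse or empty, and a table like `cell_counts` is what the diamond graph does not show.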
Viral load and CD4+ count were respectively the best and second
best predictor out of five predictors (seven are listed at one
point; I guess this means that CD3+ and CD8+ levels weren't useful
at all). I have skimmed the article a couple of times, and cannot
figure out why just two predictors were chosen. It would be of
interest to see graphs for one predictor, two predictors, and
three predictors. I have not yet seen any diamond graphs with
three explanatory variables...
b. Figures 4-6 display the age-adjusted rate of end-stage renal
disease due to any cause (per 100,000 person-years) as the thing
to be explained, with systolic blood pressure (measured in mm Hg)
and diastolic pressure (measured in mm Hg) as the explanatory
variables.
Once again the explanatory variables are continuous, not
categorical. They are cut to 6 levels each. With the raw data,
one could perhaps get a contour plot of fitted disease rate
and a scatterplot of the explanatory variables on the same graph.
4 of the 36 cells are empty, but in this case the values in the
cells basically _are_ counts, so we are _not_ left wondering
how much data each cell is based on.
I have not yet seen the article this was based on.
c. Figure 7 has two graphs. On the left it's relative risk of
breast cancer as the thing to be explained, with adult weight
change (measured in kg; why not as a proportion of starting weight?)
and hormone use (never, past, current) as the explanatory variables.
On the right excess risk is to be explained, with the same
explanatory variables.
One of the explanatory variables is continuous. (Although it is
not obvious to me that a weight change of 10 kg in a 55kg woman
should have the same significance as a weight change of 10kg in
a 75kg woman.)
It strikes me that the other explanatory variable may well be
an approximation to a continuous predictor also (some kind of
exponentially weighted dose, perhaps).
I have not yet seen the article this was based on. I have seen
the abstract, though, which draws a conclusion somewhat at odds
with the apparent significance of these graphs. I expect that
this shows that the diamond graphs _are_ useful.
As in "a", we get no idea of how much data each cell is based on.
In no case were there really two categorical predictors to start with.
3. I finally pinned down what these graphs remind me of: the two-way
plots described in Tukey's 1977 book "Exploratory Data Analysis",
which is not cited in the article. Tukey's basic idea goes like this:
(1) Fit an additive model to the data (median polish, whatever).
(2) Tilt and spread the axes so that the vertical dimension is the
fitted values.
(3) Draw lines, not boxes, so that the fitted value for X=m Y=n
is the level at the intersection of the lines for X=m and Y=n.
(4) Now that you can see what the fitted values are from the
intersections, plot the residuals, either as sticks from the
intersection to the true value, or as variously sized/shaped
blobs to show the relative magnitudes of the residuals.
Once I realised this, I realised what really bothered me about these
graphs. They simply summarise the raw data (crudely). There is no
"data = fit + residuals". I found myself _itching_ for the raw data
so that I could see what was really going on.
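Steps (1) to (4) of the two-way plot can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the table `y` is made up, `medpolish()` from the stats package stands in for "median polish, whatever", and the horizontal coordinate (column effect minus row effect) is one conventional choice for tilting the axes, not necessarily the exact construction in Tukey's book.

```r
# Made-up 4 x 3 table of responses; rows and columns are the two factors.
y <- matrix(c(12, 15, 21,
              14, 18, 22,
              19, 21, 30,
              23, 27, 31),
            nrow = 4, byrow = TRUE,
            dimnames = list(X = paste0("x", 1:4), Y = paste0("y", 1:3)))

# (1) Additive fit by median polish.
mp  <- medpolish(y, trace.iter = FALSE)
fit <- mp$overall + outer(mp$row, mp$col, "+")

# (2) Tilted axes: the vertical position is the fitted value, the
#     horizontal position is column effect minus row effect.
h <- outer(mp$row, mp$col, function(re, ce) ce - re)

plot(range(h), range(fit, y), type = "n",
     xlab = "", ylab = "fitted value", xaxt = "n")

# (3) One line per row level and one per column level; the fitted value
#     for a cell is the height at which its two lines cross.
for (i in seq_len(nrow(y))) lines(h[i, ], fit[i, ])
for (j in seq_len(ncol(y))) lines(h[, j], fit[, j])
jmax <- which.max(mp$col)
imax <- which.max(mp$row)
text(h[, jmax], fit[, jmax], rownames(y), pos = 4)
text(h[imax, ], fit[imax, ], colnames(y), pos = 3)

# (4) Residuals as vertical sticks from each intersection to the data value.
segments(h, fit, h, fit + mp$residuals)
```

The point of the construction is that `fit + mp$residuals` reproduces `y` exactly, so the picture shows "data = fit + residuals" rather than only a summary of the cells.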
4. The paper compares a diamond graph (figure 5) with a trellis graph,
or rather, a pair of trellis graphs (figure 6) for the same data.
I felt much more comfortable with the trellis graph, largely because
the numbers (in the range 0..200ish) were well spread out. The
trellis graph was, however, much bigger, and is less immediately
accessible; the diamond graph conveys an impression of understanding
without needing a lot of explanation.
5. My analysis of perceptual issues was right in some respects and wrong
in others. The hexagons have been very carefully designed so that
- the area is proportional to p
- one length is proportional to p
- the difference of two other lengths is proportional to p
where p is the value in the interval [0,1] which is to be presented.
I must say that for me the visually most salient length is one which
is _not_ proportional to p (it's (1+p)/2).
6. The article neither presents nor cites any experimental data to show
that any task can be completed faster or more accurately using
diamond graphs than some other kind of display. As yet, it's a
matter of opinion.
7. It is unfortunate that the name "diamond graph" was chosen;
"diamond graph" is an established technical term in mathematics.
You could get much the same effect by plotting discs of varying sizes,
or sectors of varying width, or little thermometers, or practically
anything varying in area, on a standard horizontal & vertical table,
and then rotating the paper by hand. You won't be able to estimate
sizes accurately, but thanks to the cramped range available you aren't
going to estimate sizes accurately from a diamond graph anyway
(unless you can read the number displayed in the centre, which in my
photocopy of the article I can't).
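As a rough sketch of the "discs on a standard table" alternative, the following R code uses the base graphics function `symbols()`. The proportions in `p` are made up for illustration and are not data from the article; the radius is taken as sqrt(p) so that disc area, not diameter, is proportional to the value.

```r
# Made-up proportions in [0, 1] for a 3 x 4 table; not data from the article.
p <- matrix(c(0.05, 0.10, 0.20, 0.35,
              0.10, 0.25, 0.40, 0.60,
              0.20, 0.45, 0.70, 0.90),
            nrow = 3, byrow = TRUE)

# Grid position of every cell, in the same column-major order as as.vector(p).
pos <- expand.grid(row = seq_len(nrow(p)), col = seq_len(ncol(p)))

# Radius proportional to sqrt(p), so that disc area is proportional to p.
radius <- sqrt(as.vector(p))

symbols(pos$col, pos$row, circles = radius, inches = 0.3,
        xlim = c(0.5, ncol(p) + 0.5), ylim = c(0.5, nrow(p) + 0.5),
        xlab = "column category", ylab = "row category")

# Print the value in each cell, since sizes alone cannot be read accurately.
text(pos$col, pos$row, format(as.vector(p)), cex = 0.7)
```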
In short, diamond graphs offer a reasonably clear way to summarise some
kinds of data, particularly for non-statisticians, but neither express
nor lead to any kind of analysis.
Since R is a statistics package rather than a "business presentation
graphics" package, perhaps Tukey-style two-way plots (are they already
available somewhere?) would be more useful than diamond graphs.