# [R] OT: A test with dependent samples.

Emmanuel Charpentier charpent at bacbuc.dyndns.org
Sun Feb 15 18:00:45 CET 2009

Dear List,

Catching up with my backlog, I stumbled upon this :

On Wed, 11 Feb 2009 10:33:13 +1300, Rolf Turner wrote :

> I am appealing to the general collective wisdom of this list in respect
> of a statistics (rather than R) question.  This question comes to me
> from a friend who is a veterinary oncologist.  In a study that
> she is writing up there were 73 cats who were treated with a drug called
> piroxicam.  None of the cats were observed to be subject to vomiting
> prior to treatment; 12 of the cats were subject to vomiting after
> treatment commenced.  She wants to be able to say that the treatment
> had a ``significant'' impact with respect to this unwanted side-effect.
>
> Initially she did a chi-squared test.  (Presumably on the matrix
> matrix(c(73,0,61,12),2,2) --- she didn't give details and I didn't
> pursue this.) I pointed out to her that because of the dependence ---
> same 73 cats pre- and post- treatment --- the chi-squared test is
> inappropriate.
>
> So what *is* appropriate?  There is a dependence structure of some sort,
> but it seems to me to be impossible to estimate.
>
> After mulling it over for a long while (I'm slow!) I decided that a
> non-parametric approach, along the following lines, makes sense:
>
> We have 73 independent pairs of outcomes (a,b) where a or b is 0 if the
> cat didn't barf, and is 1 if it did barf.
>
> We actually observe 61 (0,0) pairs and 12 (0,1) pairs.
>
> If there is no effect from the piroxicam, then (0,1) and (1,0) are
> equally likely.  So given that the outcome is in {(0,1),(1,0)} the
> probability of each is 1/2.
>
> Thus we have a sequence of 12 (0,1)-s where (under the null hypothesis)
> the probability of each entry is 1/2.  Hence the probability of this
> sequence is (1/2)^12 = 0.00024.  So the p-value of the (one-sided) test
> is 0.00024.  Hence the result is ``significant'' at the usual levels,
> and my vet friend is happy.
>
> I would very much appreciate comments on my reasoning.  Have I made any
> goof-ups, missed any obvious pit-falls?  Gone down a wrong garden path?
>
> Is there a better approach?
>
> Most importantly (!!!): Is there any literature in which this approach
> is spelled out?  (The journal in which she wishes to publish will
> almost surely demand a citation.  They *won't* want to see the
> reasoning spelled out in the paper.)
>
> I would conjecture that this sort of scenario must arise reasonably
> often in medical statistics and the suggested approach (if it is indeed
> valid and sensible) would be ``standard''.  It might even have a name!
> But I have no idea where to start looking, so I thought I'd ask this
> wonderfully learned list.

I read with interest the answers given, but got frustrated by (among
other points) seeing the main point unanswered : what in hell do you
want to *test* ? And to prove what ?

Classical test theory (Neyman and Pearson's sin, pride and glory) gives
you a (conventionally accepted) way to check if your data support your
assertions. It starts by computing somehow the probability of getting
your data by sheer chance under the hypothesis of your assertion being
*false* (i. e. the dreaded "null hypothesis"); if this probability is
"too low" (less than 1 in 20, according to a whim of R. A. Fisher's, now
idolized as a standard), it proceeds by asserting that this "too low"
probability means "impossible" and, by way of modus tollens (a-->b and
not(b) ==> not(a), in propositional logic), rejects your null
hypothesis. Therefore, what your test "proves" is just the negation of
your null hypothesis.

The "null" hypothesis that "the drug does not cause cats to barf"
implies that the *probability* of seeing a cat barfing is zero. Any barf
is enough to disprove it and your alternative ("some cat(s) may barf
after having the drug") is therefore "supported" at all conventional
(and unconventional) levels. (See below for reasons for which this
reasoning is partially false).

Did you really bother to treat 73 cats to learn this ? In this case,
you've way too much money to burn and time on your hands. You might have
learned that much cheaper by treating cats one at a time and stopping at
the first barf. You could even have obtained a (low precision) estimate
of the post-treatment barf probability, by remembering that the number
of cats treated until the first barf follows a negative binomial
(here geometric) distribution...
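
A minimal sketch of that sequential scheme, in Python with only the standard library. If cats are treated one at a time until the first barf, the number treated N is geometric (the negative binomial with a single event), and the maximum-likelihood estimate of the barf probability is 1/N. The function names and the value N = 6 are illustrative assumptions, not data from the study.

```python
def geometric_pmf(n: int, p: float) -> float:
    """P(first barf occurs on the n-th cat) if each cat barfs with probability p."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return (1.0 - p) ** (n - 1) * p


def mle_barf_probability(n_treated: int) -> float:
    """Maximum-likelihood estimate of p after stopping at the first barf."""
    if n_treated < 1:
        raise ValueError("at least one cat must have been treated")
    return 1.0 / n_treated


# Hypothetical example: the first barf is seen on the 6th cat.
n_observed = 6
p_hat = mle_barf_probability(n_observed)

# Sanity check on a grid: the likelihood peaks next to p_hat.
grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = max(grid, key=lambda p: geometric_pmf(n_observed, p))
print(f"MLE = {p_hat:.3f}, best grid value = {best_on_grid:.3f}")
```

With a single stopping event the likelihood is very flat, which is why the estimate is of low precision.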

This (point- and interval-) estimation is probably much more
interesting than a nonsensical "test". Getting some precision on this
estimation might well be worth treating 73 cats. In this case, both
classical ("Fisherian") and Bayesian points of view give "interesting"
answers. You may note that "classical" confidence interval and Bayesian
credible interval with a noninformative prior have the same (numerical)
bounds, with very different interpretations (pick your poison, but be
aware that the Bayesian point of view | gospel | madness is quite
ill-accepted in most medical journals nowadays...). But the "test
significance level" is still 0, meaning that this test is sheer pure,
possible realistic meaning.
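
As an illustration of the point- and interval-estimation, here is a sketch of an "exact" (Clopper-Pearson) confidence interval for 12 barfing cats out of 73, in Python with only the standard library. The choice of the Clopper-Pearson interval is an assumption: the text above does not say which interval is meant. The bounds are found by bisection on binomial tail probabilities, and all function names are illustrative.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(f, target: float, tol: float = 1e-12) -> float:
    """Find p in (0, 1) with f(p) = target, for f increasing in p (bisection)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x: int, n: int, conf: float = 0.95):
    """Two-sided Clopper-Pearson interval for a binomial proportion."""
    alpha = 1.0 - conf
    # Lower bound: p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2.0)
    # Upper bound: p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: 1.0 - binom_cdf(x, n, p), 1.0 - alpha / 2.0)
    return lower, upper


barfing, cats = 12, 73  # counts quoted in the original question
lower, upper = clopper_pearson(barfing, cats)
print(f"estimate = {barfing / cats:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```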

Now, another "null" has been suggested : "the cats have the same
you to a symmetry (McNemar) test. This is much more interesting, and
might have some value ... unless, as it has been suggested, your
subjects are not "random cats" but "cats that do not barf before drug
null is effectively (almost) the same as before (i. e. "the drug does
not cause non-previously-barfing-cats to barf"), in which case the same
grilling can be applied to it.
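
For reference, a sketch of the exact form of the McNemar test, in Python with only the standard library. It uses only the discordant pairs: under the symmetry null, the number of (0,1) pairs among all discordant pairs is Binomial(b + c, 1/2). The function name is illustrative; the counts 12 and 0 are those of the original question.

```python
from math import comb


def exact_mcnemar(b: int, c: int):
    """Exact McNemar test from the discordant pair counts.

    b = pairs (no barf before, barf after)
    c = pairs (barf before, no barf after)
    Returns (one-sided p for "b is large", two-sided p).
    """
    n = b + c
    if n == 0:
        return 1.0, 1.0
    upper_tail = sum(comb(n, k) for k in range(b, n + 1)) / 2**n
    lower_tail = sum(comb(n, k) for k in range(0, b + 1)) / 2**n
    two_sided = min(1.0, 2.0 * min(upper_tail, lower_tail))
    return upper_tail, two_sided


one_sided, two_sided = exact_mcnemar(b=12, c=0)
print(one_sided, two_sided)
```

With 12 discordant pairs all in the same direction, the one-sided value is (1/2)^12, i. e. the 0.00024 computed in the original question, and the two-sided value is twice that.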

In both cases, the null hypothesis tested is so far away from any
"reasonable" hypothesis that the test turns into a farce. A much better
way to present these results to a referee would be to give a
(well-motivated) point- and interval-estimation and plainly refuse to
"test" it against nonsense (explaining why, of course). Medical
journal editors and referees have spent way too much time complying with
dead statisticians' fetishes, turning the whole methodological problem
in (para-)clinical research into an exercise of rubbing blue mud in the
same place and at the same time their ancestors did, with no place for
statistical (and probabilistic) thinking...

Another, more interesting problem, would be to know whether taking the
drug in question causes an "unacceptable" probability of barfs. This
would entail 1) defining the lowest "unacceptable" amount of barfing,
2) defining type I error risk and power, 3) computing the number of
subjects necessary for a non-superiority trial against a prespecified
fixed hypothesis and 4) actually running and analysing such a trial.
Such a test of a *realistic* hypothesis would indeed be worthwhile.
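
Step 3 can be sketched as an exact single-arm binomial design, in Python with only the standard library. Everything numeric here is a hypothetical assumption chosen for illustration: an "unacceptable" barf probability of 0.20, an expected probability of 0.10, a type I error risk of 0.05 and a power of 0.80. The drug is declared acceptable when at most c of n cats barf.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def single_arm_design(p_unacceptable: float, p_expected: float,
                      alpha: float = 0.05, power: float = 0.80,
                      max_n: int = 500):
    """Smallest n (with critical count c) such that
         P(X <= c | p_unacceptable) <= alpha   (type I error risk)
         P(X <= c | p_expected)     >= power
    Returns (n, c), or None if nothing is found up to max_n.
    """
    for n in range(1, max_n + 1):
        # Largest c keeping the type I error risk at or below alpha.
        c = -1
        while c + 1 <= n and binom_cdf(c + 1, n, p_unacceptable) <= alpha:
            c += 1
        if c >= 0 and binom_cdf(c, n, p_expected) >= power:
            return n, c
    return None


# Hypothetical thresholds, not taken from the study.
design = single_arm_design(p_unacceptable=0.20, p_expected=0.10)
if design is not None:
    n, c = design
    print(f"treat {n} cats; accept the drug if at most {c} of them barf")
```

Because the binomial is discrete, the achieved error rates do not vary smoothly with n, so the first n found is only the smallest one meeting both constraints.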

======== Cut here for a loosely related methodological rant ========

In any case, if I refereed this paper, I would not "buy" it : the causal
*drug*, it is *receiving the drug*, which is *quite* different, even in
animals (maybe especially in animals).

I happen to have worked on a very similar setup (testing an antiemetic
on dogs receiving platinum salts, which is a bit more intricate than
the present setup). I *know* for a fact that the combination of placebo
"platinum salts" and placebo "antiemetic" *will* cause some dogs to
barf (and show other stress manifestations) : it is even enough for one
dog in the same room (or even the same (small) building) to start
barfing to set off barfing in other, totally untouched (not even by
placebos) animals. You see, stress can be communicated by sight, sound
and smell, and it's bloody hard to isolate animals from each other on
these three aspects... Furthermore, if you were able to obtain such an
isolation, you'd get animals subject to a *major* stress : loneliness.
Dogs are social beasts, y'know...

And I don't suppose cats are any less sensitive (Ô Bubastis, forgive
them, they do not know what they do...).

Therefore, to study the possible emetic effect of the *drug*, i. e. to
test in a simple (maybe simplistic) way the (semi-realistic) null "the
*drug* doesn't modify the emetic risk entailed by taking any drug", I'd
compare the "barfing rates" of a *random* sample of cats receiving the
drug and another, *distinct*, *random* *control* sample of cats
receiving a suitable placebo (same presentation, same colour, same
smell, same "galenic", etc ...). I'd blind any person in contact with
any of the animals to the exact nature of the intervention (dogs and
cats will somehow "feel" your expectations, don't ask me how, but they
can be worse than human patients in this respect...). And I'd analyze
this exactly as a clinical trial (this *is* a veterinary clinical trial,
indeed).
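
One conventional way to compare the two barfing rates in such a two-arm trial is Fisher's exact test, sketched here in Python with only the standard library. This is only an assumption about the analysis, which the text above does not specify, and the counts in the example are invented for illustration.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test for the 2x2 table

                 barf   no barf
        drug       a       b
        placebo    c       d

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x barfing cats in the drug arm.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    p_value = sum(prob(x) for x in range(lo, hi + 1)
                  if prob(x) <= p_obs * (1.0 + 1e-9))
    return min(1.0, p_value)


# Invented counts: 12 of 73 cats barf on drug, 3 of 73 on placebo.
p_value = fisher_exact_two_sided(12, 61, 3, 70)
print(f"two-sided p = {p_value:.4f}")
```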

This simplistic scheme does not account for individual sensitivities of
animals. Various ways exist to "absorb" this, the simplest being of
course a properly randomized cross-over trial. The null becomes, of
course, triple : "the emetic properties of the administration do not
depend on the product administered", "the emetic properties of an
emetic properties of a product do not depend on the previous
administration of another product". The interpretation of such an
experiment may become ... delicate.

Other schemes are possible : e. g., repeated experiments on the same
animals may allow one to *describe* individual sensitivities and the
variability thereof (analysis by mixed models including a "subject"
factor). However, I'd be very wary of the validity of such an
experiment, given the possibility (almost certainty...) of inducing
stereotyped behaviour in the subjects.

And the importance of the ultimate goal would have to be bloody mighty
in order to justify such a treatment being inflicted on cats (or dogs,
for that matter)...

One might note that my previous proposal of a non-superiority trial does
not involve a control. That's because this trial has a *pragmatic*
goal : checking the acceptability of the administration of a drug on an
a priori set of criteria. It does not allow inferences on the effect of
the drug, and *postulates* that the non-administration of the drug will
result in nothing of interest. This allows us to pull a realistic,
interesting, null hypothesis out of our hats.

On the other hand, the controlled plans, by virtue of having a control,
allow us to be analytic, and separate the effect of the administration
from the effect of the drug itself : this latter one might indeed be
zero, the associated null hypothesis isn't nonsensical and the test of
this null isn't worthless.

======== End of the loosely related methodological rant ========

In any case, my point is : hypothesis testing is *not* the alpha and
omega of biostatistics, and other methods of describing and analysing
experimental results are often much more interesting, notwithstanding the
fetishes of journal referees. Furthermore, testing impossible or
worthless hypotheses leads to worthless conclusions. Corollary : do not
test for the sake of testing, because "everybody does it" or because a
referee threw a tantrum ; test realistic hypotheses, whose rejection
has at least some relation to your subject matter.

The two cents of someone tired of reading utter nonsense in prestigious
journals...

Emmanuel Charpentier