[R-sig-eco] Log transforming zero value data
Carsten Dormann
carsten.dormann at ufz.de
Wed Jun 24 11:42:05 CEST 2009
Dear Philippe,
while I don't like to quibble about rules-of-thumb (since they are, as
you rightly point out, without foundation in statistical theory), I
would like to correct the impression that you gave in your email.
Let's take a hypothetical example along the lines you proposed (or at
least along the lines of what I understand you to have proposed).
x <- rnorm(100, mean=37, sd=1) # human body temperature
How could we here have zeros?
If we had a zero (from a dead body in the snow, for example), then by
transforming the values to Kelvin the value for c should indeed be 273.15,
not 1! What a strange idea to argue that, because a body was found dead in
the snow, we can assume it has the same temperature as the universe! No,
sorry, I prefer to stick to the "VERY complicated" rule of thumb of
using half the non-zero minimum.
In my experience, zeros could arise in the following circumstance: We
transplant seedlings into some treatments. Some don't survive, others
thrive. So, at the end of the experiment, we have the choice of giving
the dead seedlings a weight of 0 or NA. If we choose 0, then we are
confounding two processes: survival and growth under treatment
conditions. That's what I meant when I wrote that the values come from
different processes. We could/should opt for a mixed distribution
approach (e.g. what Zuur et al. refer to as ZAP models, in the book that
Gavin mentioned).
Or we could choose to transform the 0s to match the biomass of the
surviving individuals.
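To make the ZAP idea mentioned above concrete, here is a minimal sketch
with entirely made-up variables ('treatment' and 'biomass' are
hypothetical, not anyone's real data): the two processes are modelled
separately, survival as a binomial GLM and growth of the survivors on the
log scale.

set.seed(1)
treatment <- gl(2, 50, labels=c("control", "shade"))  # hypothetical design
surv <- runif(100) < c(0.9, 0.6)[treatment]           # survival differs by treatment
biomass <- ifelse(surv, rlnorm(100, meanlog=c(2, 1.5)[treatment]), 0)
# process 1: survival (zero vs. non-zero biomass)
summary(glm(surv ~ treatment, family=binomial))
# process 2: growth of the survivors, on the log scale
summary(lm(log(biomass) ~ treatment, subset=biomass > 0))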
Here comes Philippe's criticism of the two rules of thumb: We can use
grams, or milligrams, or tons, and all the time the c-value would
change. Correct: it would. And rightly so, I think. Why should the value
of c be some natural constant (such as 1)? Of course we seek to adjust
it to the distribution of the data, because we are actually imputing (in
a sense) a value that we know cannot be right as it is. Therefore it
seems obvious to me that we must have rules of thumb that provide
different values for different data sets.
In fact, the effect of c=1 is not constant. Because we log-transform the
data, a c=1 added to 1 is large (turning log(1)=0 into log(2)=0.7), while
the same c=1 added to 28377 is small (turning log(28377)=10.25333 into
10.25337). I think this gives an impression of the distortion we add:
negligible at the upper end, substantial at the lower end. The second
rule of thumb tries to balance that.
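To make that concrete, this just re-does the arithmetic above in R:

log(1);     log(1 + 1)        # 0 becomes 0.69: a substantial shift at the low end
log(28377); log(28377 + 1)    # 10.25333 vs. 10.25337: negligible at the high end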
And, really, these rules of thumb are not "VERY complicated":
for c(342, 234, 132, 1441, 2, 4443, 23434, 0) rule 1 proposes c=1 (as
in 2/2) and rule 2 proposes c=4.5.
quantile() yields:
0% 25% 50% 75% 100%
0.0 99.5 288.0 2191.5 23434.0
99.5*99.5/2191.5 = 4.5
Do you think that is VERY complicated?
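For the record, the two rules in R (simply re-computing the numbers above):

x <- c(342, 234, 132, 1441, 2, 4443, 23434, 0)
0.5 * min(x[x > 0])                        # rule 1: half the smallest non-zero value = 1
unname(quantile(x)[2]^2 / quantile(x)[4])  # rule 2: squared first quartile / third quartile, approx. 4.5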
Anyone volunteering for a little simulation study, analysing which
rule-of-thumb would be best? Meet the candidates:
1. Eliminator (get rid of zeros)
2. Oner (log1p)
3. little-bitter (half of the smallest non-zero value)
4. quantiler (ratio of squared first and third quantile)
set.seed(11111)
x1 <- rlnorm(100, 5, 2)
x2 <- rlnorm(100, 5.5, 2)
t.test(log(x1), log(x2))
x1[sample(1:100, size=40)] <- 0
x2[sample(1:100, size=40)] <- 0
1. Eliminator:
t.test(log(x1[x1>0]), log(x2[x2>0])) #significant
2. Oner:
t.test(log1p(x1), log1p(x2)) # not significant
3. little-bitter:
sort(unique(c(x1,x2)))[2]/2 # approx. 0.6
t.test(log(x1+.6), log(x2+.6)) # not significant
4. quantiler
quantile(c(x1,x2))
# Ah, now that is interesting: too many zeros and you cannot use this
# rule of thumb (the first quartile is 0)!
quantile(c(x1,x2)[c(x1,x2)>0])
# (34^2)/541 = 2.1
t.test(log(x1+2.1), log(x2+2.1)) # not significant
I wouldn't go so far as to claim that this is a real test of the
contestants; it merely outlines a possible approach for doing so. In any
case, none of the add-a-constant rules of thumb recovers the "true"
significance here; only the eliminator does!
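In case anyone does pick this up, one possible skeleton (only a sketch:
the function name compare.rules, the 200 replicates and the 0.05 threshold
are arbitrary choices of mine; the candidate definitions simply follow the
code above):

compare.rules <- function(n=100, n.zero=40) {
  x1 <- rlnorm(n, 5, 2); x2 <- rlnorm(n, 5.5, 2)
  p.true <- t.test(log(x1), log(x2))$p.value             # "truth" before zeros are added
  x1[sample(n, n.zero)] <- 0; x2[sample(n, n.zero)] <- 0
  c.little <- sort(unique(c(x1, x2)))[2] / 2              # 3. little-bitter
  q <- quantile(c(x1, x2)[c(x1, x2) > 0])
  c.quant <- unname(q[2]^2 / q[4])                        # 4. quantiler
  c(true       = p.true,
    eliminator = t.test(log(x1[x1 > 0]), log(x2[x2 > 0]))$p.value,
    oner       = t.test(log1p(x1), log1p(x2))$p.value,
    littlebit  = t.test(log(x1 + c.little), log(x2 + c.little))$p.value,
    quantiler  = t.test(log(x1 + c.quant), log(x2 + c.quant))$p.value)
}
set.seed(11111)
res <- replicate(200, compare.rules())
rowMeans(res < 0.05)   # proportion of "significant" t-tests per candidate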
I'll be on holiday for the next few weeks, in order to avoid all further
discussions ...
Carsten
Philippe Grosjean wrote:
> Carsten Dormann wrote:
>> Dear Nate,
>>
>> although I learned from Philippe's response about the existence of
>> log1p, I don't think I will use it (for reasons below). Thierry's
>> response is true for Poisson data, but not for non-integer values.
>> Still, it points into an important direction: All too often zeros
>> emanate from a different process than the other values (see mixed
>> distributions, zero-inflated, hurdle and all that). In that case, you
>> should consult Ben Bolker's excellent book (which is probably still
>> available as a draft on his homepage, but also worth buying).
>>
>> If you want to transform, here is my take:
>>
>> My folklore guidelines on the c in log(x+c) are:
>> 1. c should roughly be 1/2 of the smallest, non-zero value:
>> signif(0.5*sort(unique(x))[2], 2)
>> 2. c should be the square of the first quartile divided by the third
>> quartile: (quantile(x)[2]^2)/quantile(x)[4]
>> For example:
>> set.seed(11011)
>> x <- c(runif(95), rep(0,5))
>>
>> Method 1: c=0.0015
>> Method 2: c=0.015
>> While this looks like a huge difference (an order of magnitude), it
>> actually isn't all that much, given the range of the data:
>>
>> plot(density(x))
>> abline(v=c(0.0015, 0.015))
>
> These are VERY complicated rules for just an empirical rule of thumb
> without connection to theoretical background. Moreover, c depends on
> the dataset, and it thus changes from dataset to dataset, which is
> NOT a desirable behaviour.
>
> So, provided the maximum values are large (100s or more) and the minimum
> value above zero is not too small (say 1, e.g. for counts), ln(x+1),
> alias log1p(), is convenient because log1p(0) = 0. So, given we use
> rules of thumb, this one looks good because (1) it transforms zero to
> zero, and (2) the transformation is independent of the content of the
> data. But I agree it is not a good choice when you deal with small
> values.
>
> Now, if you want to be more accurate, you have to determine the actual
> distribution of your data. If your data are (generalized) log-normally
> distributed, c must be defined according to what you know about the
> variable you measure. For instance, temperature expressed in °C would
> require choosing c to be 273.15 to be correct... very far away
> from 1, or from the c that would be used with the rules you propose!
>
> Best,
>
> Philippe
>
>> I do have a reference for method 2, but it is German (Stahel, W. A.
>> (2002) Statistische Datenanalyse. Eine Einführung für
>> Naturwissenschaftler. Vieweg, Braunschweig.).
>> Method 1 is what my PhD statistics adviser recommended. Since he
>> was right about everything else, I rely on his advice here, too. That
>> may, I acknowledge, not be good enough for you. But maybe someone
>> else finds a proper reference.
>>
>> The key thing for any value of c is that it doesn't distort the
>> analysis. But then, how do you detect distortion? I used a comparison
>> of rank-transformed data and various values of c. When c was large
>> (in the current example e.g. 0.5 or so), the analysis started to
>> differ from the rank-analysis. To use log1p here would be a dramatic
>> distortion!
>>
>> Another way to look at it is through the Box-Cox transformation. Since
>> Box-Cox transforms towards a symmetric (not necessarily normal)
>> distribution, c should also be chosen in such a way as to facilitate
>> the transformation towards symmetry.
>>
>> HTH,
>>
>> Carsten
>>
>>
>> Nate Upham wrote:
>>> I have a general stats question for you guys:
>>>
>>> How does one normally deal with zero (0) values when log
>>> transforming data? I would like to log transform (natural log, ln)
>>> several response variables for use in quantile
>>> regression. But one of my variables includes several zero values.
>>> Since ln(0) is undefined (negative infinity), this is
>>> not readily possible. Is it best to remove all data with zero
>>> values? Or should I add a very small
>>> number to each value (e.g., 0.00001)? This seems problematic. Is
>>> there an easy way to address this
>>> issue?
>>>
>>> Thanks much for your help,
>>> --Nate
>>>
>>> _________________________________
>>> Nathan S. Upham
>>> Ph.D. student
>>> Committee on Evolutionary Biology
>>> University of Chicago
>>> 1025 E. 57th St., Culver 402
>>> Chicago, IL 60637
>>> nsupham at uchicago.edu
>>>
>>> _______________________________________________
>>> R-sig-ecology mailing list
>>> R-sig-ecology at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>>>
>>>
>>
>
--
Dr. Carsten F. Dormann
Department of Computational Landscape Ecology
Helmholtz Centre for Environmental Research-UFZ
Permoserstr. 15
04318 Leipzig
Germany
Tel: ++49(0)341 2351946
Fax: ++49(0)341 2351939
Email: carsten.dormann at ufz.de
internet: http://www.ufz.de/index.php?de=4205