[R-sig-eco] Log transforming zero value data
Carsten Dormann
carsten.dormann at ufz.de
Wed Jun 24 09:45:45 CEST 2009
Dear Nate,
although I learned from Phillippe's response about the existence of
log1p, I don't think I will use it (for reasons below). Thierry's
response is true for Poisson data, but not for non-integer values.
Still, it points in an important direction: all too often, zeros
emanate from a different process than the other values (see mixed
distributions, zero-inflated, hurdle and all that). In that case, you
should consult Ben Bolker's excellent book (which is probably still
available as a draft on his homepage, but also worth buying).
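
As a rough illustration of the "different process" idea, here is a minimal two-part (hurdle-style) sketch in base R. The simulated data and the names covariate, nonzero, presence_model and amount_model are assumptions made for this example only; they come neither from Nate's data nor from Ben Bolker's book.

```r
# Hypothetical data: one covariate and a non-negative response with extra zeros.
set.seed(1)
n <- 200
covariate <- runif(n)
is_present <- rbinom(n, size = 1, prob = plogis(-1 + 3 * covariate))
response <- is_present * rlnorm(n, meanlog = 0.5 * covariate, sdlog = 0.5)

# Part 1: model zero versus non-zero with a binomial GLM.
nonzero <- as.numeric(response > 0)
presence_model <- glm(nonzero ~ covariate, family = binomial)

# Part 2: model only the positive values on the log scale.
# No constant c is needed here, because the zeros are handled in part 1.
positive <- data.frame(y = response[response > 0],
                       covariate = covariate[response > 0])
amount_model <- lm(log(y) ~ covariate, data = positive)

coef(presence_model)
coef(amount_model)
```

The point of the sketch is only that the zeros and the positive values get separate models, so the question of which c to add does not arise.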
If you want to transform, here is my take:
My folklore guidelines on the c in log(x+c) are:
1. c should be roughly 1/2 of the smallest non-zero value:
signif(0.5*sort(unique(x))[2], 2)
2. c should be the square of the first quartile divided by the third
quartile: (quantile(x)[2]^2)/quantile(x)[4]
For example:
set.seed(11011)
x <- c(runif(95), rep(0,5))
Method 1: c=0.0015
Method 2: c=0.015
While this looks like a huge difference (an order of magnitude), it
actually isn't all that much, given the range of the data:
plot(density(x))
abline(v=c(0.0015, 0.015))
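
The two rules can be wrapped into small helper functions, sketched below. The names c_half_min and c_quartile are made up for this illustration. Note that c_half_min uses min(x[x > 0]) rather than sort(unique(x))[2]; the two agree whenever x is non-negative and contains at least one zero.

```r
# Method 1: half of the smallest non-zero value, rounded to 2 significant digits.
c_half_min <- function(x) signif(0.5 * min(x[x > 0]), 2)

# Method 2: squared first quartile divided by the third quartile.
c_quartile <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  q[1]^2 / q[2]
}

set.seed(11011)
x <- c(runif(95), rep(0, 5))

c1 <- c_half_min(x)
c2 <- c_quartile(x)
c(method1 = c1, method2 = c2)

# The transformed values are finite as long as c > 0.
summary(log(x + c1))
summary(log(x + c2))
```

One caveat for method 2: if more than a quarter of the values are zero, the first quartile is zero, so c becomes zero and log(x+c) is -Inf for the zeros again.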
I do have a reference for method 2, but it is in German (Stahel, W. A.
(2002) Statistische Datenanalyse. Eine Einführung für
Naturwissenschaftler. Vieweg, Braunschweig.).
Method 1 is what my PhD's statistics adviser recommended. Since he was
right in everything else, I rely on his advice here, too. That may, I
acknowledge, not be good enough for you. But maybe someone else can find a
proper reference.
The key thing for any value of c is that it doesn't distort the
analysis. But then, how do you detect distortion? I used a comparison of
rank-transformed data and various values of c. When c was large (in the
current example e.g. 0.5 or so), the analysis started to differ from the
rank-analysis. To use log1p here would be a dramatic distortion!
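
One possible way to set up such a comparison is sketched below. It is an assumption about how the check could look, not the original analysis: the predictor g is simulated for the example, the helper slope_t is hypothetical, and the slope's t-value is used as a scale-free summary of "the analysis".

```r
set.seed(11011)
x <- c(runif(95), rep(0, 5))

# Hypothetical predictor, loosely related to x (not part of the original example).
g <- x + rnorm(length(x), sd = 0.3)

# t-value of the slope in a simple linear regression of a response on a predictor.
slope_t <- function(response, predictor) {
  summary(lm(response ~ predictor))$coefficients[2, "t value"]
}

# Reference: analysis of the rank-transformed data (does not depend on c).
t_rank <- slope_t(rank(x), g)

# The same analysis on log(x + c) for several constants; c = 1 is log1p(x).
c_values <- c(0.0015, 0.015, 0.1, 0.5, 1)
t_log <- sapply(c_values, function(cc) slope_t(log(x + cc), g))

data.frame(c = c_values, t_log = t_log, t_rank = t_rank)
```

Reading down the table shows how far each choice of c moves the result away from the rank-based one; which differences count as distortion remains a judgment call.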
Another way to look at it is through the Box-Cox transformation. Since
Box-Cox transforms towards a symmetric (not necessarily normal)
distribution, c should also be chosen in such a way as to facilitate the
transformation towards symmetry.
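
A minimal sketch of this idea: scan a grid of candidate constants and keep the one for which log(x+c) has sample skewness closest to zero. This only illustrates the symmetry criterion and is not a Box-Cox fit; the skewness helper, the grid range and the name c_best are assumptions of the example.

```r
# Sample skewness (third standardised moment); zero for a symmetric sample.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(11011)
x <- c(runif(95), rep(0, 5))

# Candidate constants on a log-spaced grid (the range is an arbitrary choice).
c_grid <- 10^seq(-4, 0, length.out = 41)
skew <- sapply(c_grid, function(cc) skewness(log(x + cc)))

# Constant whose log(x + c) is closest to symmetric by this measure.
c_best <- c_grid[which.min(abs(skew))]
c_best

head(data.frame(c = c_grid, skewness = skew))
```

If c_best sits at either end of the grid, no constant in the scanned range makes the transformed data symmetric, and the log transformation itself may be the wrong choice for these data.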
HTH,
Carsten
Nate Upham wrote:
> I have a general stats question for you guys:
>
> How does one normally deal with zero (0) values when log transforming data?
>
> I would like to log transform (natural log, ln) several response variables for use in quantile
> regression. But one of my variables includes several zero values. Since ln(0) = -infinity, this is
> not readily possible. Is it best to remove all data with zero values? Or should I add a very small
> number to each value (e.g., 0.00001)? This seems problematic. Is there an easy way to address this
> issue?
>
> Thanks much for your help,
> --Nate
>
> _________________________________
> Nathan S. Upham
> Ph.D. student
> Committee on Evolutionary Biology
> University of Chicago
> 1025 E. 57th St., Culver 402
> Chicago, IL 60637
> nsupham at uchicago.edu
>
--
Dr. Carsten F. Dormann
Department of Computational Landscape Ecology
Helmholtz Centre for Environmental Research-UFZ
Permoserstr. 15
04318 Leipzig
Germany
Tel: ++49(0)341 2351946
Fax: ++49(0)341 2351939
Email: carsten.dormann at ufz.de
internet: http://www.ufz.de/index.php?de=4205