[R-sig-eco] Log transforming zero value data

Carsten Dormann carsten.dormann at ufz.de
Wed Jun 24 09:45:45 CEST 2009


Dear Nate,

Although I learned from Phillippe's response about the existence of 
log1p, I don't think I will use it (for the reasons below). Thierry's 
response holds for Poisson data, but not for non-integer values. 
Still, it points in an important direction: all too often, zeros 
arise from a different process than the other values (see mixture 
distributions, zero-inflated models, hurdle models and all that). In 
that case, you should consult Ben Bolker's excellent book (which is 
probably still available as a draft on his homepage, but is also worth 
buying).

If you want to transform, here is my take:

My folklore guidelines for the c in log(x+c) are:
1. c should be roughly 1/2 of the smallest non-zero value: 
signif(0.5*sort(unique(x))[2], 2)
2. c should be the square of the first quartile divided by the third 
quartile: (quantile(x)[2]^2)/quantile(x)[4]
For example:
set.seed(11011)
x <- c(runif(95), rep(0,5))

Method 1: c=0.0015
Method 2: c=0.015
While this looks like a huge difference (an order of magnitude), it 
actually isn't all that much, given the range of the data:

plot(density(x))
abline(v=c(0.0015, 0.015))
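
For anyone who wants to reproduce the two values, here is a minimal 
sketch of both rules applied to the example data. The names c1, c2 and q 
are only illustrative, and min(x[x > 0]) is a variant of the expression 
in rule 1 that gives the same result whenever x contains zeros and no 
negative values.

```r
set.seed(11011)
x <- c(runif(95), rep(0, 5))

# Method 1: half of the smallest non-zero value, to 2 significant digits
c1 <- signif(0.5 * min(x[x > 0]), 2)

# Method 2: squared first quartile divided by the third quartile
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
c2 <- q[1]^2 / q[2]

c(method1 = c1, method2 = c2)
```

Either constant can then be used as log(x + c1) or log(x + c2).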

I do have a reference for method 2, but it is in German (Stahel, W. A. 
(2002) Statistische Datenanalyse. Eine Einführung für 
Naturwissenschaftler. Vieweg, Braunschweig.).
Method 1 is what the statistics adviser for my PhD recommended. Since he 
was right about everything else, I rely on his advice here, too. That 
may, I acknowledge, not be good enough for you. But maybe someone else 
can find a proper reference.

The key thing for any value of c is that it doesn't distort the 
analysis. But then, how do you detect distortion? I used a comparison of 
rank-transformed data and various values of c. When c was large (in the 
current example e.g. 0.5 or so), the analysis started to differ from the 
rank-analysis. Using log1p here, which amounts to c = 1, would be a 
dramatic distortion!
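
The kind of check described above might look roughly like the following 
sketch. It is not the analysis that was actually run: the variable z is 
a hypothetical companion variable invented for illustration, and a 
simple correlation stands in for whatever model you really want to fit. 
The rank-based result does not depend on c, because log(x + c) is 
monotone in x, so it serves as the reference.

```r
set.seed(11011)
x <- c(runif(95), rep(0, 5))

# Hypothetical second variable, invented only to have something to analyse
z <- x + rnorm(length(x), sd = 0.3)

# Candidate constants; c = 1 is what log1p(x) uses
cs <- c(0.0015, 0.015, 0.5, 1)

# Rank-based reference: Spearman correlation is unaffected by the choice of c
r_rank <- cor(x, z, method = "spearman")

# Pearson correlation after log(x + c), one value per candidate c
r_log <- sapply(cs, function(k) cor(log(x + k), z))

data.frame(c = cs, r_log = r_log, r_rank = r_rank)
```

Constants for which r_log drifts away from r_rank are the ones to be 
suspicious of; what counts as "too far" remains a judgement call.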

Another way to look at it is through the Box-Cox transformation. Since 
Box-Cox transforms towards a symmetric (not necessarily normal) 
distribution, c should likewise be chosen so as to facilitate the 
transformation towards symmetry.
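
One rough way to look at symmetry, without running a full Box-Cox 
analysis, is to compute the sample skewness of log(x + c) over a few 
candidate constants. The helper skew() below is a hypothetical 
moment-based function written for this sketch, not part of base R, and 
the grid of constants is arbitrary.

```r
set.seed(11011)
x <- c(runif(95), rep(0, 5))

# Moment-based sample skewness; 0 indicates a symmetric sample
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Candidate constants, including c = 1 (equivalent to log1p)
cs <- c(0.0015, 0.015, 0.1, 0.5, 1)
sk <- sapply(cs, function(k) skew(log(x + k)))

data.frame(c = cs, skewness = sk)
```

Values of c whose skewness is closest to zero give the most symmetric 
transformed sample by this measure.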

HTH,

Carsten


Nate Upham wrote:
> I have a general stats question for you guys:
>
> How does one normally deal with zero (0) values when log transforming data?  
>
> I would like to log transform (natural log, ln) several response variables for use in quantile
> regression.  But one of my variables includes several zero values.  Since ln(0) = -infinity, this is
> not readily possible.  Is it best to remove all data with zero values?  Or should I add a very small
> number to each value (e.g., 0.00001)?  This seems problematic.  Is there an easy way to address this
> issue?
>
> Thanks much for your help,
> --Nate
>
> _________________________________
> Nathan S. Upham
> Ph.D. student
> Committee on Evolutionary Biology
> University of Chicago
> 1025 E. 57th St., Culver 402
> Chicago, IL 60637
> nsupham at uchicago.edu

-- 
Dr. Carsten F. Dormann
Department of Computational Landscape Ecology
Helmholtz Centre for Environmental Research-UFZ 
Permoserstr. 15
04318 Leipzig
Germany

Tel: ++49(0)341 2351946
Fax: ++49(0)341 2351939
Email: carsten.dormann at ufz.de
internet: http://www.ufz.de/index.php?de=4205