[R-sig-eco] Log transforming zero value data
Philippe Grosjean
phgrosjean at sciviews.org
Wed Jun 24 10:16:14 CEST 2009
Carsten Dormann wrote:
> Dear Nate,
>
> although I learned from Philippe's response about the existence of
> log1p, I don't think I will use it (for reasons below). Thierry's
> response is true for Poisson data, but not for non-integer values.
> Still, it points into an important direction: All too often zeros
> emanate from a different process than the other values (see mixed
> distributions, zero-inflated, hurdle and all that). In that case, you
> should consult Ben Bolker's excellent book (which is probably still
> available as a draft on his homepage, but also worth buying).
>
> If you want to transform, here is my take:
>
> My folklore guidelines for the c in log(x+c) are:
> 1. c should be roughly 1/2 of the smallest non-zero value:
> signif(0.5*sort(unique(x))[2], 2)
> 2. c should be the square of the first quartile divided by the third
> quartile: (quantile(x)[2]^2)/quantile(x)[4]
> For example:
> set.seed(11011)
> x <- c(runif(95), rep(0,5))
>
> Method 1: c=0.0015
> Method 2: c=0.015
> While this looks like a huge difference (an order of magnitude), it
> actually isn't all that much, given the range of the data:
>
> plot(density(x))
> abline(v=c(0.0015, 0.015))
These are VERY complicated rules for what is just an empirical rule of thumb
with no connection to any theoretical background. Moreover, c depends
on the dataset and thus changes from dataset to dataset, which
is NOT a desired behavior.
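To see this dataset dependence concretely, here is a minimal sketch in R.
The helper names c_half_min() and c_quartile() are made up for illustration,
and min(x[x > 0]) stands in for sort(unique(x))[2] (the two agree when x
contains zeros and no negative values). Rescaling the same data, e.g. through
a change of units, changes c by the same factor:

```r
# Hypothetical helpers wrapping the two quoted rules of thumb
c_half_min <- function(x) signif(0.5 * min(x[x > 0]), 2)
c_quartile <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75)))  # first and third quartiles
  q[1]^2 / q[2]
}

set.seed(11011)
x <- c(runif(95), rep(0, 5))   # example data from the quoted message
x_big <- 1000 * x              # same data expressed in a unit 1000 times smaller

c_small <- c(half_min = c_half_min(x), quartile = c_quartile(x))
c_big   <- c(half_min = c_half_min(x_big), quartile = c_quartile(x_big))

# Both constants follow the scale of the data
print(rbind(c_small, c_big))
```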
So, provided the maximum values are large (100s or more) and the smallest
non-zero value is not too small (say 1, e.g., for counts), ln(x+1), alias
log1p(), is convenient because log1p(0) = 0. So, given that we are using rules
of thumb, this one looks good because (1) it transforms zero to zero, and
(2) the transformation is independent of the content of the data. But I
agree it is not a good choice when you deal with small values.
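A short illustration of both points, as a sketch only (the numbers below are
invented for the example and are not from any real dataset):

```r
# Hypothetical count data containing a zero
counts <- c(0, 1, 3, 12, 150, 800)

# log1p(x) computes log(1 + x), so zero maps to zero
transformed <- log1p(counts)
print(transformed[1])
print(all.equal(transformed, log(counts + 1)))

# expm1() is the inverse of log1p(), which makes back-transformation easy
print(all.equal(expm1(transformed), counts))

# For small values log1p(x) is nearly x itself, so it hardly acts
# as a log transformation at all
small <- c(0, 0.001, 0.01, 0.1)
print(cbind(small, log1p = log1p(small)))
```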
Now, if you want to be more accurate, you have to determine the actual
distribution of your data. If your data are (generalized) log-normally
distributed, c must be defined according to what you know about the
variable you measure. For instance, temperature expressed in °C would
require c = 273.15 to be correct... very far away from
1, or from the c that would be obtained with the rules you propose!
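As a sketch of the temperature case (the values are hypothetical), adding
273.15 simply converts °C to kelvin, which is strictly positive, so the log
is always defined. log1p() would fail for anything at or below -1 °C:

```r
# Hypothetical temperatures in degrees Celsius, including zero and a negative
temp_c <- c(-5, 0, 12.5, 30)

# c = 273.15 shifts the data to the absolute (kelvin) scale
temp_k <- temp_c + 273.15
log_temp <- log(temp_k)
print(log_temp)

# With c = 1 the first value cannot be transformed (NaN, with a warning)
print(suppressWarnings(log1p(temp_c)))
```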
Best,
Philippe
> I do have a reference for method 2, but it is German (Stahel, W. A.
> (2002) Statistische Datenanalyse. Eine Einführung für
> Naturwissenschaftler. Vieweg, Braunschweig.).
> Method 1 is what my PhD's statistics adviser recommended. Since he was
> right in everything else, I rely on his advice here, too. That may, I
> acknowledge, not be good enough for you. But maybe someone else finds a
> proper reference.
>
> The key thing for any value of c is that it doesn't distort the
> analysis. But then, how do you detect distortion? I used a comparison of
> rank-transformed data and various values of c. When c was large (in the
> current example e.g. 0.5 or so), the analysis started to differ from the
> rank-analysis. To use log1p here would be a dramatic distortion!
>
> Another way to look at it is through Box-Cox-transformation. Since
> Box-Cox transforms towards symmetric (not necessarily normal)
> distribution, also c should be chosen in such a way as to facilitate the
> transformation towards symmetry.
>
> HTH,
>
> Carsten
>
>
> Nate Upham wrote:
>> I have a general stats question for you guys:
>>
>> How does one normally deal with zero (0) values when log transforming
>> data?
>> I would like to log transform (natural log, ln) several response
>> variables for use in quantile
>> regression. But one of my variables includes several zero values.
>> Since ln(0) = -infinity, this is
>> not readily possible. Is it best to remove all data with zero
>> values? Or should I add a very small
>> number to each value (e.g., 0.00001)? This seems problematic. Is
>> there an easy way to address this
>> issue?
>>
>> Thanks much for your help,
>> --Nate
>>
>> _________________________________
>> Nathan S. Upham
>> Ph.D. student
>> Committee on Evolutionary Biology
>> University of Chicago
>> 1025 E. 57th St., Culver 402
>> Chicago, IL 60637
>> nsupham at uchicago.edu
>>
>> _______________________________________________
>> R-sig-ecology mailing list
>> R-sig-ecology at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>>
>>
>