[R-sig-ME] adding a constant to a continuous dataset with zero values- to use the log transformation

Tue Apr 3 12:41:10 CEST 2018

Ahmad,

I agree with Thierry that it is important to know where your zeros are coming from. As for whether you can use log(x+1), the answer depends on what you want to know and the characteristics of your data. If you want to know only whether there is a relationship between your predictor and response variables (i.e., significance testing of a slope), then using a log(x+1) transform can have pretty good statistical properties (see Ives 2015 listed below) in terms of type I errors. Generalized Linear Models (GLMs) are the next option, but naïve application of GLMs can give inflated type I errors. If your dataset is small (<100 points), I’d do simulations to check the type I error rate. Okay, honestly, I’d do simulations regardless of the size of your data. Ways to correct for problems with GLMs are discussed by Warton et al. (2016).
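The simulation check suggested above could be sketched roughly as follows. This is only a minimal illustration, not the procedure from Ives (2015) or Warton et al. (2016): the data-generating model (30% exact zeros, otherwise lognormal), the sample size, and the function names are all assumptions made for the example, and the critical value uses a normal approximation to the t distribution. The response is generated independently of the predictor, so every rejection is a type I error.

```python
import math
import random
from statistics import NormalDist


def slope_t_stat(x, y):
    """t-statistic for the slope in an ordinary least-squares fit of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se


def simulate_null_response(n, rng, p_zero=0.3):
    """Hypothetical zero-heavy continuous response, unrelated to any predictor."""
    return [0.0 if rng.random() < p_zero else rng.lognormvariate(0.0, 1.0)
            for _ in range(n)]


def type1_error_rate(n=100, n_sims=2000, alpha=0.05, seed=1):
    """Fraction of null simulations in which the log(y + 1) slope test rejects."""
    rng = random.Random(seed)
    # Normal approximation to the t critical value (assumption; fine for n ~ 100).
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = simulate_null_response(n, rng)
        t = slope_t_stat(x, [math.log(yi + 1.0) for yi in y])
        if abs(t) > crit:
            rejections += 1
    return rejections / n_sims


rate = type1_error_rate()
print(f"Estimated type I error rate: {rate:.3f}")
```

A rate close to the nominal 0.05 would suggest the test is behaving for data like these; to make the check meaningful, the simulated data should mimic the real dataset (sample size, proportion of zeros, any grouping structure) as closely as possible.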

Thierry suggests more complicated models depending on the source of your zeros. I don’t know how these methods perform in terms of type I error rates and power, but again, I’d check with simulations.

I’ve nearly finished writing a book and software tutorial titled “A Conceptual Introduction to Correlated Data: Mixed and Phylogenetic Models”. My goal is to discuss basic statistical issues, such as properties of estimators, so that people understand the conceptual ideas underlying the tests they are performing. There is a chapter on the statistical properties of estimators, the problems they can have, and how to identify and fix them using bootstrapping. I’d be happy to send a draft version if you want. Then again, it might be more information than you are interested in.

Cheers, Tony

Ives, A. R. 2015. For testing the significance of regression coefficients, go ahead and log-transform count data. Methods in Ecology and Evolution 6:828–835.

Warton, D. I., M. Lyons, J. Stoklosa, and A. R. Ives. 2016. Three points to consider when choosing a LM or GLM test for count data. Methods in Ecology and Evolution 7:882–890.

On 4/3/18, 5:05 AM, "R-sig-mixed-models on behalf of Thierry Onkelinx" <r-sig-mixed-models-bounces at r-project.org on behalf of thierry.onkelinx at inbo.be> wrote:

    Dear Ahmad,

    Don't do log(x+1). If you want to see why, then do the analysis with
    log(x+1), log(x+100), log(x+0.001), ... and compare the results.
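
The comparison described above can be tried on simulated data. The sketch below is only an illustration under assumed settings: the data (30% exact zeros, otherwise lognormal with a log-mean of 0.5 times the predictor), the seed, and the helper name `ols_slope` are made up for the example. It fits the same simple regression after adding each constant and prints the slope.

```python
import math
import random


def ols_slope(x, y):
    """Slope of an ordinary least-squares fit of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


rng = random.Random(42)
n = 500
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Hypothetical response: 30% exact zeros, otherwise lognormal with log-mean 0.5 * x.
y = [0.0 if rng.random() < 0.3 else rng.lognormvariate(0.5 * xi, 0.5) for xi in x]

constants = [0.001, 1.0, 100.0]
# Same data, same model; only the added constant changes.
slopes = {c: ols_slope(x, [math.log(yi + c) for yi in y]) for c in constants}
for c in constants:
    print(f"log(y + {c:g}): slope = {slopes[c]:.4f}")
```

The estimated slope depends on the constant because the constant sets how far the zeros sit from the positive values on the log scale, so the coefficient has no interpretation that is independent of that arbitrary choice.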

    What is causing the zeros? Are they non-detects? Then you need to treat
    them as censored data (see the NADA package). If they are not, then a
    zero-inflated gamma distribution might be an option.
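
To make the zero-inflated gamma idea concrete, here is a rough two-part sketch on simulated data. It is not a fitted mixed model and not the approach of any particular package: it estimates the proportion of zeros directly and describes the positive values with a gamma distribution fitted by the method of moments. The true parameter values, the seed, and the variable names are assumptions chosen for the example.

```python
import random
from statistics import mean, variance

rng = random.Random(7)

# Hypothetical truth: 30% structural zeros, positives ~ Gamma(shape=2, scale=1.5).
true_p_zero, true_shape, true_scale = 0.3, 2.0, 1.5
data = [0.0 if rng.random() < true_p_zero
        else rng.gammavariate(true_shape, true_scale)
        for _ in range(5000)]

# Part 1: probability of a zero.
positives = [v for v in data if v > 0]
p_zero_hat = 1 - len(positives) / len(data)

# Part 2: gamma fitted to the positive values by the method of moments
# (mean = shape * scale, variance = shape * scale**2).
m = mean(positives)
v = variance(positives)
shape_hat = m * m / v
scale_hat = v / m

# Mean of the full distribution, combining both parts.
overall_mean_hat = (1 - p_zero_hat) * shape_hat * scale_hat

print(f"P(zero) = {p_zero_hat:.3f}, shape = {shape_hat:.3f}, scale = {scale_hat:.3f}")
print(f"Overall mean = {overall_mean_hat:.3f}")
```

With predictors or random effects, each part would need its own regression model; the point here is only that the zeros and the positive values are described separately instead of being forced onto one log scale.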

    Best regards,

    ir. Thierry Onkelinx
    Statisticus / Statistician

    Vlaamse Overheid / Government of Flanders
    INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
    AND FOREST
    Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
    thierry.onkelinx at inbo.be
    Havenlaan 88 bus 73, 1000 Brussel
    www.inbo.be

    2018-04-03 11:42 GMT+02:00 Ahmad <ahmadr215 at tpg.com.au>:
    > Hi all
    >
    > My question here is related to my previous query on Geometric mean of log
    > data.
    > I have a continuous dataset with a considerable number of zero values,
    > and it is not normally distributed. Because of the zero values, I cannot
    > take the log of this variable. Some have suggested adding a constant
    > (e.g. +1) to all data so that the log can be taken. I can then
    > back-transform the output of lm() or a mixed model to the original scale
    > using exp() or the emmeans function with type = "response", as suggested
    > by Russell (russell-lenth at uiowa.edu).
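
A small worked example may help show what the back-transformation described above returns. The numbers are made up for illustration. Averaging on the log(x + 1) scale and then applying exp() and subtracting 1 gives a geometric-type mean, not the arithmetic mean of the original data.

```python
import math
from statistics import mean

# Made-up values with zeros, for illustration only.
values = [0.0, 0.0, 1.0, 3.0, 10.0, 50.0]

# Mean on the log(x + 1) scale, then back-transformed with exp() - 1.
mean_log = mean([math.log1p(v) for v in values])
back_transformed = math.expm1(mean_log)

# Ordinary arithmetic mean on the original scale.
arithmetic_mean = mean(values)

print(f"Back-transformed mean of log(x + 1): {back_transformed:.3f}")
print(f"Arithmetic mean:                     {arithmetic_mean:.3f}")
```

The back-transformed value can never exceed the arithmetic mean (Jensen's inequality), and the gap grows as the data become more skewed, so the two should not be reported interchangeably.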
    >
    > I searched for this approach (adding a constant) and found that views
    > on it are not consistent. I would like to hear from anyone who has
    > experience in dealing with such data.
    >
    > Your help is greatly appreciated!
    >
    > Ahmad
    >
    > _______________________________________________
    > R-sig-mixed-models at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
