[R-sig-ME] adding a constant to a continuous dataset with zero values- to use the log transformation

Cole, Tim tim.cole at ucl.ac.uk
Tue Apr 3 19:46:53 CEST 2018

Thierry is of course right that using 1 or 100 or 0.001 as the offset gives different answers. But for me that doesn’t rule out using an offset, it just means that it needs treating as an extra model parameter to be estimated. This is easy to do with a grid search to minimise the deviance.
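
A minimal sketch of such a grid search, assuming a simple straight-line regression of log(y + c) on a single predictor rather than the mixed model discussed in this thread. The data, the grid, and the helper names (fit_rss, deviance, best_offset) are hypothetical. Note that the deviance is computed for y on its original scale, so it includes the Jacobian of the transformation; without that term, deviances for different offsets are not comparable.

```python
import math
import random


def fit_rss(x, z):
    """Residual sum of squares from the least-squares line z = a + b*x."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    b = sxz / sxx
    a = mz - b * mx
    return sum((zi - a - b * xi) ** 2 for xi, zi in zip(x, z))


def deviance(x, y, c):
    """-2 * profile log-likelihood of y on its original scale, constants dropped."""
    z = [math.log(yi + c) for yi in y]
    n = len(y)
    # 2 * sum(z) is the Jacobian term for y -> log(y + c); without it,
    # deviances computed for different offsets are not on a common scale.
    return n * math.log(fit_rss(x, z) / n) + 2.0 * sum(z)


def best_offset(x, y, grid):
    """Return the offset in `grid` with the smallest deviance."""
    return min(grid, key=lambda c: deviance(x, y, c))


# Hypothetical data: log(y + 0.75) is linear in x, and y is floored at zero.
rng = random.Random(1)
x = [3.0 * i / 199 for i in range(200)]
y = [max(math.exp(0.8 * xi + rng.gauss(0.0, 0.2)) - 0.75, 0.0) for xi in x]

grid = [0.05 * k for k in range(1, 61)]  # candidate offsets 0.05 ... 3.00
c_hat = best_offset(x, y, grid)
print(f"offset minimising the deviance: {c_hat:.2f}")
```

For a mixed model the same idea should carry over by adding the Jacobian term to the fitted model's own -2 log-likelihood at each candidate offset, though that is not shown here.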

You are right to emphasise the continuous nature of your data, and that zero values are not intrinsically different from non-zero values. The case where I come across this is relating body size with age, where age 0 corresponds to birth. Biologically it makes sense to think of time 0 being at -9 months, i.e. conception rather than birth, in which case the appropriate offset is 0.75 years or 9 months.

Best wishes,
tim.cole at ucl.ac.uk Phone 020 7905 2666
Population Policy and Practice Programme
UCL Great Ormond Street Institute of Child Health,
30 Guilford Street, London WC1N 1EH, UK

Date: Tue, 3 Apr 2018 22:45:49 +1000
From: "Ahmad" <ahmadr215 at tpg.com.au>
To: "'Anthony R. Ives'" <arives at wisc.edu>
Cc: "'r-sig-mixed-models'" <r-sig-mixed-models at r-project.org>,
                "'Thierry Onkelinx'" <thierry.onkelinx at inbo.be>
Subject: Re: [R-sig-ME] adding a constant to a continuous dataset with
                zero values- to use the log transformation
Message-ID: <000201d3cb49$b3fb7a50$1bf26ef0$@tpg.com.au>
Content-Type: text/plain; charset="utf-8"

Hi Tony + Thierry

Thanks for your comments and thoughts.
To answer your first question, these are injection site volume data (cm^3) after injecting a vaccine at 2X and 4X (safety study).
The injection site reaction may resolve sooner in some groups or participants than in others and then become undetectable (it can neither be seen with the naked eye nor measured), so these are recorded as zero volume. Because this is a repeated-measures dataset (continuous data) over a period of 60 days, I assume I need to compare the data over the entire 60 days for all treatment groups, including those with zero values. I hope this answers your question.

I can see and agree with your point, Thierry (not adding a constant, as in log(x+1)), and have seen similar responses from others. Regarding the zero-inflated gamma approach you suggest: I've never done it before. Given the nature of the data (positive continuous, with an almost log-normal distribution), would the results be easily interpretable to non-analytical people? I don't mind giving it a go if I can understand how to do it.

Tony, thanks for your comprehensive comments. Yes, I am testing the relationship between a predictor and a response variable (volume of the injection site reaction to a vaccine), and I intend to use a mixed model with the nlme package. I noticed that the title of your paper refers to count data (I haven't read it yet, but I will), whereas my data are positive continuous. Are you still suggesting that log(x+1) should be OK?

There are 25 participants per group (3 groups), each with 10 observations (repeats) over a 60-day period. I assume this should be considered a small sample size, although there are more than 100 data points. So I am not sure how simulation can sort out the zero values and the non-normal distribution of the data. Your thoughts?

Thanks for your offer. It would be good if I could have a look at the chapter of your book on statistical properties of estimators, to go back to basics and learn this stuff.


-----Original Message-----
From: Anthony R. Ives <arives at wisc.edu>
Sent: Tuesday, 3 April 2018 8:41 PM
To: Ahmad <ahmadr215 at tpg.com.au>
Cc: r-sig-mixed-models <r-sig-mixed-models at r-project.org>; Thierry Onkelinx <thierry.onkelinx at inbo.be>
Subject: Re: [R-sig-ME] adding a constant to a continuous dataset with zero values- to use the log transformation


I agree with Thierry that it is important to know where your zeros are coming from. As for whether you can use log(x+1), the answer depends on what you want to know and the characteristics of your data. If you want to know only whether there is a relationship between your predictor and response variables (i.e., significance testing of a slope), then using a log(x+1) transform can have pretty good statistical properties (see Ives 2015 listed below) in terms of type I errors. Generalized Linear Models (GLMs) are the next option, but naïve application of GLMs can give inflated type I errors. If your dataset is small (<100 points), I’d do simulations to check the type I error rate. Okay, honestly, I’d do simulations regardless of the size of your data. Ways to correct for problems with GLMs are discussed by Warton et al. (2016).
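
A minimal sketch of such a simulation check, under assumptions that are not from this thread: a plain straight-line regression of log(y + 1) on x (no random effects), 30 independent observations, and a response that is zero with probability 0.3 and log-normal otherwise, generated independently of x so that the null hypothesis is true. The function names and all parameter values are hypothetical. The standard library has no t distribution, so the two-sided 5% critical value for 28 degrees of freedom (2.048) is hard-coded, which fixes the sample size at 30.

```python
import math
import random

N_OBS = 30           # sample size; tied to the hard-coded critical value below
T_CRIT_DF28 = 2.048  # two-sided 5% critical value of Student's t, 28 df


def slope_t(x, z):
    """t statistic for the slope of the least-squares line z = a + b*x."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    b = sxz / sxx
    a = mz - b * mx
    rss = sum((zi - a - b * xi) ** 2 for xi, zi in zip(x, z))
    if rss == 0.0:  # degenerate sample (e.g. all zeros): no evidence of a slope
        return 0.0
    return b / math.sqrt(rss / (n - 2) / sxx)


def type1_rate(n_sim=2000, p_zero=0.3, seed=1):
    """Share of null simulations in which the slope test rejects at 5%."""
    rng = random.Random(seed)
    x = [i / (N_OBS - 1) for i in range(N_OBS)]
    rejections = 0
    for _ in range(n_sim):
        # y does not depend on x, so every rejection is a type I error
        y = [0.0 if rng.random() < p_zero else rng.lognormvariate(0.0, 1.0)
             for _ in range(N_OBS)]
        z = [math.log(yi + 1.0) for yi in y]
        if abs(slope_t(x, z)) > T_CRIT_DF28:
            rejections += 1
    return rejections / n_sim


print(f"estimated type I error rate: {type1_rate():.3f}")
```

A rejection rate well above the nominal 0.05 would indicate an inflated type I error rate. To be relevant to a real analysis, the simulated data would have to mimic the actual design (groups, repeated measures, share of zeros) and the test would have to be the one actually used.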

Thierry suggests more complicated models depending on the source of your zeros. I don’t know how these methods perform in terms of type I error rates and power, but again, I’d check with simulations.

I’ve nearly finished writing a book and software tutorial titled “A Conceptual Introduction to Correlated Data: Mixed and Phylogenetic Models”. My goal is to discuss basic statistical issues, such as properties of estimators, so that people understand the conceptual ideas underlying the tests they are performing. There is a chapter on the statistical properties of estimators, the problems they can have, and how to identify and fix them using bootstrapping. I’d be happy to send a draft version if you want. Then again, it might be more information than you are interested in.

Cheers, Tony

Ives, A. R. 2015. For testing the significance of regression coefficients, go ahead and log-transform count data. Methods in Ecology and Evolution 6:828–835.

Warton, D. I., M. Lyons, J. Stoklosa, and A. R. Ives. 2016. Three points to consider when choosing a LM or GLM test for count data. Methods in Ecology and Evolution 7:882–890.

On 4/3/18, 5:05 AM, "R-sig-mixed-models on behalf of Thierry Onkelinx" <r-sig-mixed-models-bounces at r-project.org on behalf of thierry.onkelinx at inbo.be> wrote:

    Dear Ahmad,

    Don't do log(x+1). If you want to see why, then do the analysis with
    log(x+1), log(x+100), log(x+0.001), ... and compare the results.
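
The comparison suggested here can be sketched as follows, assuming a plain straight-line regression instead of a mixed model. The six data points and the helper name slope_and_t are hypothetical and chosen only for illustration.

```python
import math


def slope_and_t(x, z):
    """Slope and its t statistic for the least-squares line z = a + b*x."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    b = sxz / sxx
    a = mz - b * mx
    rss = sum((zi - a - b * xi) ** 2 for xi, zi in zip(x, z))
    return b, b / math.sqrt(rss / (n - 2) / sxx)


# Hypothetical response with two zeros, and an evenly spaced predictor
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.0, 1.0, 2.0, 5.0, 10.0]

results = {}
for c in (0.001, 1.0, 100.0):
    z = [math.log(yi + c) for yi in y]  # same data, different added constant
    results[c] = slope_and_t(x, z)
    print(f"c = {c:>7}: slope = {results[c][0]:.4f}, t = {results[c][1]:.2f}")
```

With a small constant the zeros sit far from the other values on the log scale and dominate the fit; with a large constant the transformation is nearly linear in y. The fitted slope and its test statistic therefore depend on an arbitrary choice.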

    What is causing the zeros? Are they non-detects? Then you need to treat
    them as censored data (see the NADA package). If they are not, then a
    zero-inflated gamma distribution might be an option.

    Best regards,

    ir. Thierry Onkelinx
    Statisticus / Statistician

    Vlaamse Overheid / Government of Flanders
    Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
    thierry.onkelinx at inbo.be
    Havenlaan 88 bus 73, 1000 Brussel

    To call in the statistician after the experiment is done may be no
    more than asking him to perform a post-mortem examination: he may be
    able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
    The plural of anecdote is not data. ~ Roger Brinner
    The combination of some data and an aching desire for an answer does
    not ensure that a reasonable answer can be extracted from a given body
    of data. ~ John Tukey

    2018-04-03 11:42 GMT+02:00 Ahmad <ahmadr215 at tpg.com.au>:
    > Hi all
    > My question here is related to my previous query on Geometric mean of log
    > data.
    > I have a continuous dataset with considerable number of zero values, and
    > not-normally distributed. Because of zero values, I won't be able to take
    > the log of this variable. It has been suggested by some to add a constant
    > (e.g. +1) to all data to be able to take the log of data. I can then
    > transform back the output of lm() or Mixed-model to the original scale using
    > exp() or emmeans function with "response" method as suggested by Russell
    > (russell-lenth at uiowa.edu).
    > I searched this (adding a constant) and found that views on this approach
    > are not consistent- I would like to see if anyone has experience on how to
    > deal with such data.
    > Your help is greatly appreciated!
    > Ahmad
