[R-sig-ME] adding a constant to a continuous dataset with zero values- to use the log transformation

Thu Apr 5 00:57:56 CEST 2018

Hi Tony

Thanks for your comments and thoughts,
You raised interesting points that I hadn't thought about before. You are right, I may not have explained the objective of this exercise. I assume it is similar to censoring in survival analysis (but with continuous data): the aim is to show, using the volume data, in which group the injection site volume resolves sooner. So a larger number of zeros in a group means that injection sites in that group resolved sooner.

Re aggregation vs. hierarchical approaches: as you pointed out, I have 25/group (75 total) with 10-13 observations for each participant.
So I am not sure whether to consider this a small or a large sample size (or number of data points) when deciding between these options. I assume I can run both and see the difference, as you suggested.

Thanks again!
Ahmad

-----Original Message-----
From: Anthony R. Ives <arives at wisc.edu> 
Sent: Wednesday, 4 April 2018 7:28 PM
To: Ahmad <ahmadr215 at tpg.com.au>
Cc: 'r-sig-mixed-models' <r-sig-mixed-models at r-project.org>
Subject: Re: [R-sig-ME] adding a constant to a continuous dataset with zero values- to use the log transformation

Ahmad,

I committed the sin of answering your question before understanding your question. Sorry. My answer now is completely different.

It seems like in your study, you have 75 independent data points (participants). For each participant, the information is the rate of resolution, so the question is how to best characterize this. This depends on the pattern of the data. I’m assuming that you have enough observations per participant, so that you can get a reasonable estimate of the resolution rate for each participant. I’m also assuming that you are interested in log-transforming the data because the injection site volume decays exponentially. If you ignore the zeros for the moment and look at the residuals from fits to each of the 75 participants taken separately, if they look okay (i.e., linear and no obvious heteroscedasticity) on the log scale, then taking the slope is probably fine. I might just ignore the zeros altogether if they don’t add information about the rate of resolution. The goal of the fitting is to get a single number for each participant that best describes the resolution rate. Getting a good description of the resolution rate is a biological question, not a statistical one. You are not going to be using the error structure of these fits to the data from individual participants in your hypothesis test. The hypothesis test would be on the 75 points (resolution rates) for the participants, for which you could use an ANOVA or regression treating each participant as independent.
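A minimal sketch of this two-stage idea, written in Python with the standard library only (the thread itself is about R, so treat it as an illustration rather than the analysis Tony has in mind). The helper names `log_slope` and `one_way_f` and the example numbers are hypothetical. Stage one reduces each participant to a single slope of log(volume) against day, skipping zeros; stage two computes a one-way ANOVA F statistic on those per-participant slopes.

```python
import math
from statistics import mean


def log_slope(days, volumes):
    """Least-squares slope of log(volume) against day, skipping zero volumes."""
    pts = [(d, math.log(v)) for d, v in zip(days, volumes) if v > 0]
    if len(pts) < 2:
        return None  # too few positive observations to estimate a rate
    xbar = mean(d for d, _ in pts)
    ybar = mean(y for _, y in pts)
    sxx = sum((d - xbar) ** 2 for d, _ in pts)
    if sxx == 0:
        return None  # all positive observations fall on the same day
    sxy = sum((d - xbar) * (y - ybar) for d, y in pts)
    return sxy / sxx


def one_way_f(groups):
    """One-way ANOVA F statistic for a dict mapping group label -> list of values."""
    values = [v for g in groups.values() for v in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up participant: volume halves every 5 days, then becomes undetectable.
days = [0, 5, 10, 15, 20, 30]
volumes = [8.0, 4.0, 2.0, 1.0, 0.0, 0.0]
rate = log_slope(days, volumes)

# Made-up per-participant slopes for three groups (one number per participant).
f_stat = one_way_f({"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5]})
```

The F statistic would still need to be referred to an F distribution with (k - 1, n - k) degrees of freedom to get a p-value; that step is left out here because the standard library has no F distribution.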

Of course, this suggestion assumes that you can get reasonable resolution rates for each participant. 

I’ve often had colleagues enthuse about hierarchical models because they “use all of the data”; there are more data points than would be the case if you aggregate data (as I’m suggesting: aggregating the observations for each individual to a single value). However, your data are highly correlated, so if you correctly account for this correlation, then you really have only 75 points. A hierarchical model might help in getting better fits for each participant if you have few observations, but a hierarchical model might also cause more statistical problems that are hard to detect. If your treatment effect is strong, aggregating observations to the participant level should detect it. If a test on the aggregated data doesn’t detect a pattern while a hierarchical model does, I’d be highly suspicious of the hierarchical model.

Cheers, Tony

On 4/3/18, 7:46 AM, "Ahmad" <ahmadr215 at tpg.com.au> wrote:

    Hi Tony + Thierry

    Thanks for your comments and thoughts,
    To answer your first question, these are injection site volume data (cm^3) after injecting a vaccine at 2X and 4X (safety study).
    So the injection site volume may resolve sooner in some groups/participants than in others and then become undetectable (it can be neither seen with the naked eye nor measured), so these are considered zero volume. Because this is a repeated measures dataset (continuous data) over a period of 60 days, I assume I need to compare the data over the entire 60 days for all treatment groups (some with zero values). Hope this answers your question.

    I can see and agree with your point, Thierry (not adding a constant, as in log(x+1)), and have seen similar responses from others. Re the zero-inflated gamma distribution approach that you are suggesting: I've never done it before. Given the nature of the data (positive continuous with an almost log-normal distribution), would the results be easily interpretable to non-analytical people? I don't mind giving it a go if I can understand how to do it.

    Tony, thanks for your comprehensive comments. Yes, I am testing the relationship between a predictor and a response variable (volume of injection site reaction to a vaccine), and I intend to use a mixed model with the nlme package. I noticed that the title of your paper refers to count data (I haven't read it yet, but I will), whereas my data are positive continuous data. Are you still suggesting that log(x+1) should be OK?

    The number of participants is 25/group (3 groups), each with 10 observations (repeats) over a 60-day period. I assume this should be considered a small sample size, although there are >100 data points. So I am not sure how simulation can sort out the zero values and the non-normal distribution of the data. Your thoughts?

    Thanks for your offer. It would be good if I could have a look at the chapter of your book on the statistical properties of estimators, to go back to basics and learn this material.

    Thanks
    Ahmad

    -----Original Message-----
    From: Anthony R. Ives <arives at wisc.edu> 
    Sent: Tuesday, 3 April 2018 8:41 PM
    To: Ahmad <ahmadr215 at tpg.com.au>
    Cc: r-sig-mixed-models <r-sig-mixed-models at r-project.org>; Thierry Onkelinx <thierry.onkelinx at inbo.be>
    Subject: Re: [R-sig-ME] adding a constant to a continuous dataset with zero values- to use the log transformation

    Ahmad,

    I agree with Thierry that it is important to know where your zeros are coming from. As for whether you can use log(x+1), the answer depends on what you want to know and the characteristics of your data. If you want to know only whether there is a relationship between your predictor and response variables (i.e., significance testing of a slope), then using a log(x+1) transform can have pretty good statistical properties (see Ives 2015 listed below) in terms of type I errors. Generalized Linear Models (GLMs) are the next option, but naïve application of GLMs can give inflated type I errors. If your dataset is small (<100 points), I’d do simulations to check the type I error rate. Okay, honestly, I’d do simulations regardless of the size of your data. Ways to correct for problems with GLMs are discussed by Warton et al. (2016).
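One way such a simulation check could look, as a rough sketch in standard-library Python rather than R. Everything here is an assumption made for illustration: two groups of 25 with one value per participant, zeros occurring with probability 0.3, lognormal positive values, and a pooled two-sample t test on log(x+1) in place of the slope test discussed above. Data are generated with no group difference, so the fraction of rejections estimates the type I error rate.

```python
import math
import random
from statistics import mean, variance

# Two-sided 5% critical value of Student's t with 48 df (25 + 25 - 2).
# It is tied to n = 25 per group; change it if n changes.
T_CRIT = 2.0106


def simulate_group(rng, n, p_zero):
    """Draw n volumes: zero with probability p_zero, otherwise lognormal."""
    return [0.0 if rng.random() < p_zero else rng.lognormvariate(0.0, 1.0)
            for _ in range(n)]


def pooled_t(a, b):
    """Equal-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))


def type1_rate(n_sims=2000, n=25, p_zero=0.3, seed=1):
    """Fraction of null datasets in which the log(x+1) t test rejects at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution, so any rejection is a false positive.
        a = [math.log(x + 1) for x in simulate_group(rng, n, p_zero)]
        b = [math.log(x + 1) for x in simulate_group(rng, n, p_zero)]
        if abs(pooled_t(a, b)) > T_CRIT:
            rejections += 1
    return rejections / n_sims


rate = type1_rate()
print(f"estimated type I error rate: {rate:.3f}")
```

A rate well above the nominal 0.05 would indicate inflated type I errors for that data-generating setup. The same loop can wrap any other test by swapping out the transformation and the test statistic.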

    Thierry suggests more complicated models depending on the source of your zeros. I don’t know how these methods perform in terms of type I error rates and power, but again, I’d check with simulations.

    I’ve nearly finished writing a book and software tutorial titled “A Conceptual Introduction to Correlated Data: Mixed and Phylogenetic Models”. My goal is to discuss basic statistical issues, such as properties of estimators, so that people understand the conceptual ideas underlying the tests they are performing. There is a chapter on the statistical properties of estimators, the problems they can have, and how to identify and fix them using bootstrapping. I’d be happy to send a draft version if you want. Then again, it might be more information than you are interested in.

    Cheers, Tony

    Ives, A. R. 2015. For testing the significance of regression coefficients, go ahead and log-transform count data. Methods in Ecology and Evolution 6:828–835.

    Warton, D. I., M. Lyons, J. Stoklosa, and A. R. Ives. 2016. Three points to consider when choosing a LM or GLM test for count data. Methods in Ecology and Evolution 7:882-890.

    On 4/3/18, 5:05 AM, "R-sig-mixed-models on behalf of Thierry Onkelinx" <r-sig-mixed-models-bounces at r-project.org on behalf of thierry.onkelinx at inbo.be> wrote:

        Dear Ahmad,

        Don't do log(x+1). If you want to see why, then do the analysis with
        log(x+1), log(x+100), log(x+0.001), ... and compare the results.
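A small sketch of the comparison suggested here, in standard-library Python with made-up numbers (the two groups and the helper `log_mean_difference` are hypothetical). It computes the difference in mean log(x+c) between two groups for several constants c, along with the back-transformed ratio, to show how strongly the estimate depends on the arbitrary constant when zeros are present.

```python
import math
from statistics import mean

# Made-up volumes for two groups, each containing zeros.
group_a = [0, 0, 1, 2, 4]
group_b = [0, 2, 4, 8, 16]


def log_mean_difference(a, b, c):
    """Difference in mean log(x + c) between group b and group a."""
    return mean(math.log(x + c) for x in b) - mean(math.log(x + c) for x in a)


diffs = {c: log_mean_difference(group_a, group_b, c) for c in (0.001, 1, 100)}

for c, d in diffs.items():
    # exp(d) is the back-transformed "ratio" of shifted geometric means.
    print(f"c = {c:>7}: difference = {d:.3f}, back-transformed ratio = {math.exp(d):.2f}")
```

With these numbers the estimated difference shrinks as the constant grows, so the effect size reported depends on a choice that has nothing to do with the data.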

        What is causing the zeros? Are they non-detects? Then you need to treat
        them as censored data (see the NADA package). If they are not, then a
        zero-inflated gamma distribution might be an option.

        Best regards,

        ir. Thierry Onkelinx
        Statisticus / Statistician

        Vlaamse Overheid / Government of Flanders
        INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
        AND FOREST
        Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
        thierry.onkelinx at inbo.be
        Havenlaan 88 bus 73, 1000 Brussel
        www.inbo.be


        2018-04-03 11:42 GMT+02:00 Ahmad <ahmadr215 at tpg.com.au>:
        > Hi all
        >
        > My question here is related to my previous query on Geometric mean of log
        > data.
        > I have a continuous dataset with a considerable number of zero values, and
        > it is not normally distributed. Because of the zero values, I am not able to take
        > the log of this variable. Some have suggested adding a constant
        > (e.g. +1) to all data so that the log can be taken. I can then
        > back-transform the output of lm() or a mixed model to the original scale using
        > exp() or the emmeans function with the "response" method, as suggested by Russell
        > (russell-lenth at uiowa.edu).
        >
        > I searched this (adding a constant) and found that views on this approach
        > are not consistent- I would like to see if anyone has experience on how to
        > deal with such data.
        >
        > Your help is greatly appreciated!
        >
        > Ahmad
        >
        > _______________________________________________
        > R-sig-mixed-models at r-project.org mailing list
        > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
