[R-sig-ME] adding a constant to a continuous dataset with zero values- to use the log transformation
John Maindonald
john.maindonald at anu.edu.au
Tue Apr 3 23:01:21 CEST 2018
If it is just a choice between adding 1, adding 1/6 (as Tukey suggested), and adding 0.001,
that is not in principle different from choosing between a negative binomial, a Poisson,
a quasi-Poisson, negative binomial types I or II, Delaporte, Poisson inverse Gaussian, etc.
(these are choices in the gamlss package; there are others in glmmTMB and in VGAM),
with choices of link function for the various parameters multiplying the range of possibilities.
The issues that this raises for inference, in cases where the choice is to an extent arbitrary,
are insufficiently acknowledged.
On the point about adding a constant to a continuous dataset with zero values to use the
log transformation, see the discussion at:
https://stats.stackexchange.com/questions/114848/negative-binomial-glm-vs-log-transforming-for-count-data-increased-type-i-erro/215080#215080
For the hurricanes dataset (available as DAAG::hurricNamed) that is the basis for an analysis
and for graphs that I posted, a log(count+1) model with Gaussian error actually does better, as
judged by comparing the quantiles (use, e.g., gamlss::centiles()). The comparison is least
favorable to the negative binomial at the lower end of the damage category.
[It would be good to have a predict(..., type="quantile"), or suchlike, generally available.]
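A sketch of the kind of quantile comparison described above (assuming the DAAG and gamlss packages are installed, and that hurricNamed has columns 'deaths' and 'BaseDam2014'; adjust names to the variables actually analysed):

```r
## Compare a Gaussian fit to log(deaths + 1) against a negative
## binomial fit to the raw counts, by overlaying fitted centile
## curves on the data with gamlss::centiles().
library(DAAG)
library(gamlss)
hurr <- hurricNamed

## Gaussian error on the log(count + 1) scale
fit.log <- gamlss(log(deaths + 1) ~ log(BaseDam2014),
                  family = NO, data = hurr)
## Negative binomial (type I) on the raw counts
fit.nb  <- gamlss(deaths ~ log(BaseDam2014),
                  family = NBI, data = hurr)

## Plot fitted centiles against the observed points; the printed
## table shows the percentage of points below each fitted centile
centiles(fit.log, xvar = log(hurr$BaseDam2014))
centiles(fit.nb,  xvar = log(hurr$BaseDam2014))
```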
It is curious that there is such a range of alternatives for count data, but that the only widely
canvassed alternative to the binomial has been the beta-binomial, with zero-inflation and
hurdle effects adding to the mix. I have been looking at data recently where the choice
between using glmer(), with its limitation to binomial errors, and glmmTMB() with beta-binomial
errors, with a scale parameter that is a function of the explanatory variable, makes a huge
difference.
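A hypothetical sketch of that comparison (variable, grouping, and data-frame names here are made up): glmer() is restricted to the standard binomial family, whereas glmmTMB() offers a beta-binomial whose dispersion can itself be modelled as a function of a covariate via dispformula.

```r
## Binomial GLMM vs beta-binomial GLMM with covariate-dependent
## dispersion, for a two-column (successes, failures) response.
library(lme4)
library(glmmTMB)

fit.bin <- glmer(cbind(succ, fail) ~ x + (1 | site),
                 family = binomial, data = dat)

fit.bb  <- glmmTMB(cbind(succ, fail) ~ x + (1 | site),
                   family      = betabinomial,
                   dispformula = ~ x,   # dispersion varies with x
                   data        = dat)
```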
John Maindonald email: john.maindonald at anu.edu.au
On 4/04/2018, at 06:24, Farrar, David <Farrar.David at epa.gov> wrote:
If you do that with a grid search, how would you get standard errors? Bootstrap?
-----Original Message-----
From: R-sig-mixed-models [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf Of Cole, Tim
Sent: Tuesday, April 03, 2018 1:47 PM
To: r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] adding a constant to a continuous dataset with zero values- to use the log transformation
Thierry is of course right that using 1 or 100 or 0.001 as the offset gives different answers. But for me that doesn’t rule out using an offset, it just means that it needs treating as an extra model parameter to be estimated. This is easy to do with a grid search to minimise the deviance.
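A minimal sketch of that grid-search idea, on toy data (all names and settings here are illustrative). Because deviances computed on different transformed scales are not directly comparable, the log-Jacobian of each transformation is added back before comparing across offsets:

```r
## Grid search for the offset c in log(y + c) that minimises
## -2 * log-likelihood, evaluated on the original scale of y.
set.seed(42)
x <- rnorm(80)
y <- pmax(rlnorm(80, meanlog = 0.4 * x) - 0.5, 0)  # skewed, with zeros

offset.grid <- seq(0.01, 2, by = 0.01)
m2loglik <- sapply(offset.grid, function(cc) {
  fit <- lm(log(y + cc) ~ x)
  ## logLik on the z = log(y + c) scale, corrected by the
  ## log-Jacobian sum(-log(y + c)) to refer to the scale of y
  -2 * (as.numeric(logLik(fit)) - sum(log(y + cc)))
})
best <- offset.grid[which.min(m2loglik)]
best
```

Standard errors that account for estimating c would then need profiling or a bootstrap, as David's question suggests.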
You are right to emphasise the continuous nature of your data, and that zero values are not intrinsically different from non-zero values. The case where I come across this is relating body size with age, where age 0 corresponds to birth. Biologically it makes sense to think of time 0 being at -9 months, i.e. conception rather than birth, in which case the appropriate offset is 0.75 years or 9 months.
Best wishes,
Tim
--
tim.cole at ucl.ac.uk   Phone 020 7905 2666
Population Policy and Practice Programme
UCL Great Ormond Street Institute of Child Health,
30 Guilford Street, London WC1N 1EH, UK
Date: Tue, 3 Apr 2018 22:45:49 +1000
From: "Ahmad" <ahmadr215 at tpg.com.au>
To: "'Anthony R. Ives'" <arives at wisc.edu>
Cc: "'r-sig-mixed-models'" <r-sig-mixed-models at r-project.org>,
"'Thierry Onkelinx'" <thierry.onkelinx at inbo.be>
Subject: Re: [R-sig-ME] adding a constant to a continuous dataset with
zero values- to use the log transformation
Message-ID: <000201d3cb49$b3fb7a50$1bf26ef0$@tpg.com.au>
Content-Type: text/plain; charset="utf-8"
Hi Tony + Thierry
Thanks for your comments and thoughts,
To answer your first question, these are injection site volume data (cm^3) after injecting a vaccine at 2X and 4X doses (safety study).
The injection site volume may resolve sooner in some groups/participants than in others, and then become undetectable (can't be seen with the naked eye nor measured), so these are recorded as zero volume. Because this is a repeated-measures dataset (continuous data) over a period of 60 days, I assume I need to compare the dataset for the entire 60 days for all treatment groups (some with zero values). Hope this answers your question.
I can see and agree with your point, Thierry (not adding a constant, log(x+1)), and have seen similar responses from others. Re the zero-inflated gamma distribution approach that you are suggesting: I've never done it before. Given the nature of the data (positive continuous with an almost log-normal distribution), would the results be easily interpretable to non-analytical people? I don't mind giving it a go, if I can understand how to do it.
Tony, thanks for your comprehensive comments. Yes, I am testing the relationship between a predictor and a response variable (volume of injection site reaction to a vaccine); I intend to use a mixed model with the nlme package. I noticed that the title of your paper is on count data (haven't read it yet, but I will), whereas my data are positive continuous data. Are you still suggesting that log(x+1) should be OK?
There are 25 participants per group (3 groups), with 10 observations (repeats) over a 60-day period. I assume this should be considered a small sample size, although the data points number >100. So I'm not sure how simulation can sort out the zero values and the non-normal distribution of the data. Your thoughts?
Thanks for your offer; it would be good to have a look at the chapter of your book on the statistical properties of estimators, back to basics to learn this stuff.
Thanks
Ahmad
-----Original Message-----
From: Anthony R. Ives <arives at wisc.edu>
Sent: Tuesday, 3 April 2018 8:41 PM
To: Ahmad <ahmadr215 at tpg.com.au>
Cc: r-sig-mixed-models <r-sig-mixed-models at r-project.org>; Thierry Onkelinx <thierry.onkelinx at inbo.be>
Subject: Re: [R-sig-ME] adding a constant to a continuous dataset with zero values- to use the log transformation
Ahmad,
I agree with Thierry that it is important to know where your zeros are coming from. As for whether you can use log(x+1), the answer depends on what you want to know and the characteristics of your data. If you want to know only whether there is a relationship between your predictor and response variables (i.e., significance testing of a slope), then using a log(x+1) transform can have pretty good statistical properties (see Ives 2015 listed below) in terms of type I errors. Generalized Linear Models (GLMs) are the next option, but naïve application of GLMs can give inflated type I errors. If your dataset is small (<100 points), I’d do simulations to check the type I error rate. Okay, honestly, I’d do simulations regardless of the size of your data. Ways to correct for problems with GLMs are discussed by Warton et al. (2016).
Thierry suggests more complicated models depending on the source of your zeros. I don’t know how these methods perform in terms of type I error rates and power, but again, I’d check with simulations.
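A sketch of the simulation check described above: generate data under the null (no effect of x), apply the log(y+1) analysis, and record how often p < 0.05. All settings here are illustrative toy choices, not a recommendation for any particular data-generating model:

```r
## Type I error check for lm() on log(y + 1), under the null.
set.seed(1)
n <- 75
x <- rnorm(n)

pval <- replicate(2000, {
  y <- rgamma(n, shape = 0.5, rate = 0.5)  # skewed response, no x effect
  y[sample(n, 15)] <- 0                    # inject some exact zeros
  fit <- lm(log(y + 1) ~ x)
  summary(fit)$coefficients["x", "Pr(>|t|)"]
})

mean(pval < 0.05)  # estimated type I error; compare with nominal 0.05
```

The same template works for power: add a true effect of x to the simulated y and count rejections.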
I’ve nearly finished writing a book and software tutorial titled “A Conceptual Introduction to Correlated Data: Mixed and Phylogenetic Models”. My goal is to discuss basic statistical issues, such as properties of estimators, so that people understand the conceptual ideas underlying the tests they are performing. There is a chapter on the statistical properties of estimators, the problems they can have, and how to identify and fix them using bootstrapping. I’d be happy to send a draft version if you want. Then again, it might be more information than you are interested in.
Cheers, Tony
Ives, A. R. 2015. For testing the significance of regression coefficients, go ahead and log-transform count data. Methods in Ecology and Evolution 6:828–835.
Warton, D. I., M. Lyons, J. Stoklosa, and A. R. Ives. 2016. Three points to consider when choosing a LM or GLM test for count data. Methods in Ecology and Evolution 7:882-890.
On 4/3/18, 5:05 AM, "R-sig-mixed-models on behalf of Thierry Onkelinx" <r-sig-mixed-models-bounces at r-project.org on behalf of thierry.onkelinx at inbo.be> wrote:
Dear Ahmad,
Don't do log(x+1). If you want to see why, then do the analysis with
log(x+1), log(x+100), log(x+0.001), ... and compare the results.
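A toy illustration of that sensitivity (simulated data; nothing here reflects the actual dataset): refit the same model with different added constants and compare the slope estimates.

```r
## Slope of lm(log(y + c) ~ x) for three choices of c.
set.seed(1)
x <- rnorm(50)
y <- rgamma(50, shape = 2, rate = 1 / exp(0.5 * x))  # positive, skewed
y[sample(50, 10)] <- 0                               # add some zeros

sapply(c(0.001, 1, 100),
       function(cc) coef(lm(log(y + cc) ~ x))["x"])
## the fitted slope changes substantially with the choice of constant
```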
What is causing the zeros? Are they non-detects? Then you need to treat
this as censored data (see the NADA package). If they are not, then a
zero-inflated gamma distribution might be an option.
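One way to fit a zero-inflated gamma of the kind mentioned above is via glmmTMB (a sketch only; 'volume', 'treatment', 'day', and the subject identifier 'id' are illustrative names for the repeated-measures setting described in this thread):

```r
## Zero-inflated gamma mixed model: a log-link gamma for the positive
## volumes, plus a logistic model for the probability of a zero.
library(glmmTMB)

fit.zig <- glmmTMB(volume ~ treatment * day + (1 | id),
                   ziformula = ~ treatment,       # P(structural zero)
                   family    = ziGamma(link = "log"),
                   data      = dat)
summary(fit.zig)
```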
Best regards,
ir. Thierry Onkelinx
Statisticus / Statistician
Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be
///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no
more than asking him to perform a post-mortem examination: he may be
able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does
not ensure that a reasonable answer can be extracted from a given body
of data. ~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////
2018-04-03 11:42 GMT+02:00 Ahmad <ahmadr215 at tpg.com.au>:
Hi all
My question here is related to my previous query on the geometric mean of log
data.
I have a continuous dataset with a considerable number of zero values, and it
is not normally distributed. Because of the zero values, I won't be able to
take the log of this variable. It has been suggested by some to add a constant
(e.g. +1) to all data so as to be able to take the log. I can then transform
the output of lm() or a mixed model back to the original scale using exp(), or
with the emmeans function and the "response" option, as suggested by Russell
(russell-lenth at uiowa.edu).
I searched this (adding a constant) and found that views on this approach are
inconsistent. I would like to hear from anyone with experience of how to deal
with such data.
Your help is greatly appreciated!
Ahmad
_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models