[R] Statistical question about logistic regression simulation
Ravi Varadhan
RVaradhan at jhmi.edu
Wed Aug 26 17:07:04 CEST 2009
Your exposure variable has very large values, so all your probabilities are
1. You also get a bunch of NaN's because the `expit' (inverse logit)
function to calculate the probabilities cannot be evaluated. You need to use
values of exposure that will yield some 0's and 1's so that the binomial
model can be estimated.
Ravi.
----------------------------------------------------------------------------
-------
Ravi Varadhan, Ph.D.
Assistant Professor, The Center on Aging and Health
Division of Geriatric Medicine and Gerontology
Johns Hopkins University
Ph: (410) 502-2619
Fax: (410) 614-9625
Email: rvaradhan at jhmi.edu
Webpage:
http://www.jhsph.edu/agingandhealth/People/Faculty_personal_pages/Varadhan.h
tml
----------------------------------------------------------------------------
--------
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Denis Aydin
Sent: Wednesday, August 26, 2009 10:18 AM
To: r-help at r-project.org
Subject: [R] Statistical question about logistic regression simulation
Hi R help list
I'm simulating logistic regression data with a specified odds ratio
(beta) and have a problem/unexpected behaviour that occurs.
The datasets includes a lognormal exposure and diseased and healthy
subjects.
Here is my loop:
ors <- vector()
for(i in 1:200){
# First, I create a vector with a lognormally distributed exposure:
n <- 10000 # number of study subjects
mean <- 6
sd <- 1
expo <- rlnorm(n, mean, sd)
# Then I assign each study subject a probability of disease with a
# specified Odds ratio (or beta coefficient) according to a logistic
# model:
inter <- 0.01 # intercept
or <- log(1.5) # an odds ratio of 1.5 or a beta of ln(1.5)
p <- exp(inter + or * expo)/(1 + exp(inter + or * expo))
# Then I use the probability to decide who is having the disease and who
# is not:
disease <- rbinom(length(p), 1, p) # 1 = disease, 0 = healthy
# Then I calculate the logistic regression and extract the odds ratio
model <- glm(disease ~ expo, family = binomial)
ors[i] <- exp(summary(model)$coef[2]) # exponentiated beta = OR
}
Now to my questions:
1. I was expecting the mean of the odds ratios over all simulations to
be close to the specified one (1.5 in this case). This is not the case
if the mean of the lognormal distribution is, say 6.
If I reduce the mean of the exposure distribution to say 3, the mean of
the simulated ORs is very close to the specified one. So the simulation
seems to be quite sensitive to the parameters of the exposure distribution.
2. Is it somehow possible to "stabilize" the simulation so that it is
not that sensitive to the parameters of the lognormal exposure
distribution? I can't make up the parameters of the exposure
distribution, they are estimations from real data.
3. Are there general flaws or errors in my approach?
Thanks a lot for any help on this!
All the best,
Denis
--
Denis Aydin
Institute of Social and Preventive Medicine at Swiss Tropical Institute
Basel
Associated Institute of the University of Basel
Steinengraben 49 - 4051 Basel - Switzerland
Phone: +41 (0)61 270 22 04
Fax: +41 (0)61 270 22 25
denis.aydin at unibas.ch
www.ispm-unibasel.ch
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list