[R] Adjusting for heaping in data
Ted.Harding at manchester.ac.uk
Sun Oct 14 12:39:51 CEST 2007
On 14-Oct-07 08:33:41, Thomas Fröjd wrote:
> Hi R users. I am new to the community and have got myself into
> a little problem.
It does not look as though it was yourself who got you into this
problem! You have been given the bathwater along with the baby.
> I have a dataset of birth weights recorded by nurses at a delivery
> clinic in a developing country.
> The weights are entered in KiloGrams with one decimal. However
> there is substantial heaping at each 500g when looking at the
> sample in a histogram. Does anyone of you know an easy way to adjust
> for this and, if it exists, an R package to implement the method?
> Best regards
> Thomas Fröjd
It is quite a common problem for data to be badly recorded in
this kind of way (as well as other bad kinds of ways).
You can't "adjust" for it (in the sense of "compensate") directly
since such a rounding does not tell where it was rounded from.
There may, however, be information in the covariates which could
be relevant to that question.
I'll comment on two extreme approaches and a possible intermediate one.
1) If you want to treat all data on the same footing, then
you can round every weight to the nearest 500gm. This has
the disadvantage of losing the information in the weights
which have been recorded more precisely. The potential
difference of up to 250gm, in a typical birth weight of
say 2-2.5kgm, could result in a serious distortion.
However, you could assess the effect of this by performing
your intended analysis using the data as you have them,
then repeating it with the fully rounded data, and seeing
how much difference it makes.
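
A minimal sketch of the comparison in (1), under stated assumptions: the vector 'wt' is simulated here purely for illustration (it is not the clinic data), and the mean and SD stand in for whatever analysis is actually intended.

```r
set.seed(1)
# Hypothetical data: birth weights in kg, recorded to one decimal
wt <- round(rnorm(200, mean = 2.8, sd = 0.5), 1)

# Round every weight to the nearest 0.5 kg (500 gm)
wt500 <- round(wt * 2) / 2

# Stand-in "analysis": compare mean and SD under both versions
res <- rbind(original = c(mean = mean(wt),    sd = sd(wt)),
             rounded  = c(mean = mean(wt500), sd = sd(wt500)))
print(res)
```

If the two rows differ by less than matters for the question being asked, the heaping is probably not doing much harm.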
2) You could attempt to evaluate the extra uncertainty which
results from this rounding which has been done by the nurses.
One approach could be to fit a Normal distribution (say)
to the data as you have them. Say this estimates mu0 for
the mean and s0 for the standard deviation.
You can then "un-round" the rounded data at random, on
the basis that, given that a weight is say 2.5 kgm, it
might be anywhere from 2.25 to 2.75 according to that
distribution conditional on being in that range. This
is quite easily done in R: if wt=2.5, say,
p0 <- pnorm((wt - 0.25 - mu0)/s0)
p1 <- pnorm((wt + 0.25 - mu0)/s0)
X <- runif(1,p0,p1)
rwt <- mu0 + s0*qnorm(X)
rwt <- round(rwt,1) ## see below
If you do this for every truly rounded 'wt', and perform
your intended analysis for the resulting "un-rounded"
dataset (of course after rounding the results to 100gm,
to be compatible with the 0.1kgm general rounding), and
then repeat this unrounding+analysis a few times, you
will have an estimate of the uncertainty, in your final
results, which has been introduced by the gross rounding.
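
The repeated unrounding+analysis could be sketched as follows. This is only an illustration: the data are simulated, the helper name 'unround' is made up here, the logical vector 'rounded' (which values were grossly rounded) is assumed known, and the mean stands in for the real analysis.

```r
# Hypothetical helper: draw from Normal(mu0, s0) restricted to
# [wt - half, wt + half], then round to 0.1 kg like the other data
unround <- function(wt, mu0, s0, half = 0.25) {
  p0 <- pnorm((wt - half - mu0) / s0)
  p1 <- pnorm((wt + half - mu0) / s0)
  u  <- runif(length(wt), p0, p1)
  round(mu0 + s0 * qnorm(u), 1)
}

set.seed(42)
# Simulated data: 30% of weights rounded to 0.5 kg (an assumption)
true_wt <- rnorm(500, mean = 2.8, sd = 0.5)
rounded <- runif(500) < 0.3
wt <- ifelse(rounded, round(true_wt * 2) / 2, round(true_wt, 1))

mu0 <- mean(wt)
s0  <- sd(wt)

# Repeat unrounding + analysis a few times
est <- replicate(20, {
  w <- wt
  w[rounded] <- unround(wt[rounded], mu0, s0)
  mean(w)                      # stand-in for the intended analysis
})
print(range(est))              # spread due to the gross rounding
```

The spread of 'est' across repetitions gives a rough idea of how much the gross rounding contributes to the uncertainty of the result.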
However, you will have to make a decision about what proportion
of the data at each whole 500gm have really been rounded!
Some of these are likely to be measurements which would have
been quite appropriately rounded to the nearest 500gm -- e.g.
2.05kgm -> 2.0kgm.
You may be able to estimate this proportion from the heights
of the "factory chimneys" in the histogram. Then apply the
above procedure to that fraction.
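
One crude way to read that proportion off the histogram, offered as an assumption rather than an established method: take the count expected at each 500gm value to be the average of its two 100gm neighbours, and treat the excess over that as "truly rounded". The data below are simulated for illustration.

```r
set.seed(7)
# Simulated data: 30% of weights rounded to 0.5 kg (an assumption)
true_wt <- rnorm(2000, mean = 2.8, sd = 0.5)
heaped  <- runif(2000) < 0.3
wt <- ifelse(heaped, round(true_wt * 2) / 2, round(true_wt, 1))

# Work in integer units of 100 gm to avoid floating-point comparisons
k      <- as.integer(round(wt * 10))
grid   <- seq(min(k) - 1L, max(k) + 1L)
counts <- tabulate(match(k, grid), nbins = length(grid))
names(counts) <- grid

# Heap positions: multiples of 500 gm that have both neighbours in the grid
heaps <- grid[grid %% 5L == 0L & grid > min(grid) & grid < max(grid)]
obs   <- unname(counts[as.character(heaps)])
nb    <- unname(counts[as.character(heaps - 1L)] +
                counts[as.character(heaps + 1L)]) / 2
excess <- pmax(obs - nb, 0)               # height of the "chimney"
frac   <- ifelse(obs > 0, excess / obs, 0)
print(data.frame(kg = heaps / 10, observed = obs,
                 expected = nb, frac_rounded = round(frac, 2)))

# Flag that fraction of the values at each heap, chosen at random
rounded <- logical(length(wt))
for (i in seq_along(heaps)) {
  idx <- which(k == heaps[i])
  if (length(idx) == 0L) next
  n_pick <- round(frac[i] * length(idx))
  rounded[idx[sample.int(length(idx), n_pick)]] <- TRUE
}
print(mean(rounded))   # estimated overall share of grossly rounded values
```

The flagged values are then the ones to "un-round"; the unflagged values at 2.0, 2.5, etc. are left as recorded.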
3) If you have covariates with your weight data, you may be
able to fit an appropriate model to your original data
which would enable you to estimate, for any given "rounded"
weight, the mu0 and s0 for that weight in terms of the
values of the covariates. Then proceed as in (2).
However, having done that, it may transpire that you should
re-estimate the model, which would imply re-estimating the
mu0 and s0 used for the "random unrounding", and then going
round the loop again. You're moving into Multiple Imputation
territory now, and again there are resources in R for doing
it; but it's deeper and more complex territory!
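
A rough sketch of the loop in (3), not a proper Multiple Imputation: the covariate 'gest' (gestational age in weeks), the simulated data, the linear model, and the known flag 'rounded' are all assumptions made for illustration.

```r
set.seed(11)
n    <- 400
gest <- round(rnorm(n, mean = 39, sd = 1.5))     # hypothetical covariate
true_wt <- 2.8 + 0.15 * (gest - 39) + rnorm(n, 0, 0.4)
rounded <- runif(n) < 0.3                        # assumed known here
wt  <- ifelse(rounded, round(true_wt * 2) / 2, round(true_wt, 1))
dat <- data.frame(wt, gest, rounded)

fit <- lm(wt ~ gest, data = dat)                 # initial fit, data as recorded
for (iter in 1:5) {
  mu <- fitted(fit)[dat$rounded]                 # covariate-specific mu0
  s  <- summary(fit)$sigma                       # residual SD plays the role of s0
  p0 <- pnorm((dat$wt[dat$rounded] - 0.25 - mu) / s)
  p1 <- pnorm((dat$wt[dat$rounded] + 0.25 - mu) / s)
  w  <- dat$wt
  w[dat$rounded] <- round(mu + s * qnorm(runif(sum(dat$rounded), p0, p1)), 1)
  dat$wt_unrounded <- w
  fit <- lm(wt_unrounded ~ gest, data = dat)     # re-estimate, go round again
}
print(coef(fit))
```

Each pass always un-rounds from the originally recorded weight, so the draws stay within 250gm of what the nurse wrote down.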
In both (2) and (3), the same check as in (1) should be carried
out: Has it made any difference that matters to the results,
compared with what you get from the original data?
Hoping this helps (at least a bit).
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 14-Oct-07 Time: 11:39:47
------------------------------ XFMail ------------------------------