[R] Survival with different probabilities of censoring

anthonywaldron anthonywaldron at hotmail.com
Wed May 30 16:50:14 CEST 2012


Dear all
I have a fairly funky problem that I think demands some sort of survival
analysis. There are two Red List assessments for mammals: 1986 and 2008.
Some mammals changed their Red List status between those dates. Those
changes can be regarded as "events" and are "interval censored" in the sense
that we don't know at what point between 1986 and 2008 each species declined
far enough to move into another category of extinction risk.

We then allocate fractional responsibility for each decline among the
countries of the world and attempt to model factors in each country that
might cause the species declines. For example, if a declining species is
found in two countries and we decide that the countries share 50:50
responsibility for the decline, then the blame score of each of those
countries gets augmented by 0.50 "species fractions".
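
As an illustration of this bookkeeping, here is a minimal sketch in Python.
The function name, the country codes and the shares are all hypothetical,
made up for the example; it assumes each status change carries a sign (+1 for
a change to a worse category, -1 for a change to a better one) and a set of
per-country responsibility fractions that sum to 1.

```python
from collections import defaultdict


def blame_scores(status_changes):
    """Sum fractional responsibility per country.

    status_changes: iterable of (direction, shares) pairs, where direction
    is +1 for a change to a worse Red List category and -1 for a change to
    a better one, and shares maps country -> fraction of responsibility.
    """
    scores = defaultdict(float)
    for direction, shares in status_changes:
        for country, fraction in shares.items():
            scores[country] += direction * fraction
    return dict(scores)


# Hypothetical data: three species that changed status between assessments.
example = [
    (+1, {"A": 0.5, "B": 0.5}),    # decline shared 50:50 by two countries
    (+1, {"A": 1.0}),              # decline in a single-country endemic
    (-1, {"B": 0.25, "C": 0.75}),  # improvement, so it counts negatively
]

scores = blame_scores(example)
for country in sorted(scores):
    print(country, scores[country])
```

Countries that appear in no status change never enter the result; in the real
data set they would be the zeroes.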

The data set therefore looks like:

Y variable: a set of non-integer values representing "blame scores", being
the sum of fractions of status-changing species for which each country
should be blamed. Note that these are mostly changes to worse status but
some are changes to the better i.e. negative values. There are also a lot of
zeroes, for countries where there is no species that changed status.

X variables: various things like governance, population density etc. The
multivariate analysis also includes the total number of species fractions in
each country (not just the species fractions that changed status). The
latter term controls for the influence of total species richness on the
number of species experiencing the event. Please note that total species
richness also influences some of the other x variables, so it is included as
an x term and should not be used as a simple scaling factor for the y term.

The fun part:
The data are double censored. As I said, they might be considered interval
censored. The countries with zero species fractions changing status in 2008
are right censored (all species do indeed eventually die). However, and this
is the big problem, the probability of a zero is very different for each
country. Countries with very few species fractions are far more likely to
have zeroes i.e. to be right censored. They are therefore far less
informative regarding the influence of the x variables. Indeed, there is a
very clear pattern whereby, if I run a normal regression on data that
excludes the zeroes, I get a statistical expectation at y=0. If I then
include the empirical y=0 values, the values that depart furthest from this
expectation are exactly the ones that have the highest probability of being
a zero at random (big residuals patterned in an S shape on the fitted
values, as you would instinctively imagine will occur when plotting a range
of y=0 points on a sloping regression line). Zeroes which are rather
UNLIKELY to be zero at random represent countries where you would expect a
species decline AND NONE HAPPENED, and those countries sit very close to the
expectation. Countries effectively get further and further below the ability
of my "instrument" to detect an effect as they become less and less species
rich, until there are so few species that the probability of observing any
event becomes tiny. Countries where the probability of observing an event is
tiny sit furthest from the expectation. If all countries are given equal
weight, therefore, the noise from species-poor countries all but obscures
the signal.
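
To put rough numbers on how strongly the chance of a zero depends on species
richness, here is a minimal sketch in Python. It rests on a simplifying
assumption that is not part of the analysis described here: every species
changes status independently with the same probability p_change, so a country
with n_species species shows no change with probability
(1 - p_change)^n_species. The function name and the value 0.1 are
hypothetical, and fractional richness is simply used as a real-valued
exponent.

```python
def prob_zero(n_species, p_change):
    """Chance that none of n_species changes status.

    Assumes independent species sharing one change probability p_change
    (an illustrative assumption only).
    """
    return (1.0 - p_change) ** n_species


# Hypothetical per-species change probability over the assessment interval.
P_CHANGE = 0.1

for richness in (0.5, 1, 5, 20, 50):
    print(richness, round(prob_zero(richness, P_CHANGE), 4))
```

Under that assumption a zero from a species-poor country is close to what
chance alone would give, while a zero from a species-rich country is strong
evidence that expected declines did not happen.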

We've tried various approaches for zero-heavy data but I think this
increasingly looks like a survival analysis to me. The question is, how can
I adapt a survival analysis so that it takes into account the different
probability of censoring (the different random probability of being a zero)
and downweights the uninformative zeroes? (With the added fun of double
censoring, non-integer values and a small number of negative values).
Remember that what we have is the number of species fractions changing
status in a single time period, not the more usual "time to event". 

Please also remember: although we can calculate the percentage of species
fractions in each country that changed status, we can't really use
percentage as the y variable because the denominator (total fractional
species richness) also affects the x variables. We therefore need to use the
raw number of species fractions changing status.

I'm hoping that somebody experienced in survival analysis might have come
across something like this before (including how to deal with the multiple
censoring, the non-integer values and importantly, the different
informativeness of each censored data point).
Best regards,
Anthony Waldron
Universidade de Santa Cruz, Bahia, Brazil




