[R] Aggregation of data frame with calculations of proportions
Mark Wardle
mark at wardle.org
Tue Jun 26 21:14:00 CEST 2007
Dear all,
I have been stuck on this problem, am rather struggling and would
appreciate some advice if anyone can help. I apologise if this is a
bit long-winded, but I've tried to limit it to the bare essentials,
but don't know how to make it more generic!
I have some slightly odd real world data that I'm looking at
representing number of positive diagnoses for different diseases, plus
the number tested. It's all in a data frame:
This should all be directly runnable with cut+paste:
sp = read.csv('http://www.wardle.org/sca-prev.csv', header=T)[,c(-1,-2)]
str(sp)
#'data.frame': 46 obs. of 19 variables:
# $ sca1 : int 5 1 NA NA 48 1 NA 17 21 4 ...
# $ sca2 : int 7 3 NA NA 53 1 NA NA 7 7 ...
# $ sca3 : int 3 1 NA NA 2 0 NA 29 0 0 ...
# $ sca6 : int 1 0 NA NA 2 2 NA NA 0 0 ...
# $ sca7 : int NA NA NA NA 2 0 NA NA 1 0 ...
# $ sca8 : int NA NA NA NA 2 1 NA NA NA NA ...
# $ sca10 : int NA NA NA NA 0 0 NA NA NA NA ...
# $ sca12 : int NA NA NA NA 0 0 NA NA NA NA ...
# $ sca17 : int NA NA NA NA 2 1 NA NA NA NA ...
# $ frda : int NA NA NA NA 0 1 NA NA NA NA ...
# $ drpla : int NA NA 1 1 1 1 NA NA 0 0 ...
# $ fmr1 : int NA NA NA NA NA NA 6 NA NA NA ...
# $ diagnosis : int 16 5 1 1 112 8 6 46 29 11 ...
# $ unexplained : int 10 26 415 392 71 35 137 51 0 11 ...
# $ total : int 26 31 416 393 183 43 143 97 29 22 ...
# $ patient.type: Factor w/ 8 levels "","ADCA","ADCA I",..: 2 5 2 5 2
4 8 3 2 2 ...
# $ age.group : Factor w/ 5 levels "","<40",">40",..: 5 5 5 5 1 1 4 5 1 1 ...
# $ country : Factor w/ 19 levels "","Australia",..: 7 7 1 1 8 8 8 1 8 8 ...
# $ region : Factor w/ 6 levels "Americas","Asia",..: 4 4 3 3 3 3
3 3 3 3 ...
# It is straightforward to aggregate data:
smart.sum <- function(x,...) {
if (is.numeric(x)) return(sum(x, na.rm=T))
else return(paste(unique(x), collapse=", ")) # for natbib citations
}
sp.ag1 = aggregate(sp, by=list(region=sp$region), FUN=smart.sum)
sp.ag2 = aggregate(sp, by=list(patient.type=sp$patient.type), FUN=smart.sum)
print(sp.ag1)
# and even to calculate crude percentages
spp = sp.ag1
spp[,2:15] = sapply(spp[,2:15], FUN=function(x) { round(x*100/ spp$total, 1)})
# *BUT*, the problem is that this is underestimating the true
proportions, because some of the data is NA. This means the diagnosis
was not tested rather than not found (which would be zero).
# What I want to do is calculate a true proportion, with the number
found divided by the number tested.
# I have managed to do this:
# calculates true proportion by only including values in the
denominator that have non-NAs in the numerator
# x= vector of numbers
# total = vector of numbers
calc.prevalence <- function(x, total) {
sum.x = sum(x, na.rm=T) # calculate numerator
sum.y = sum(total[!is.na(x)]) # calculate denominator
return(sum.x/sum.y)
}
correct.prevalence <- calc.prevalence(sp$drpla, sp$total)
incorrect.prevalence <- sum(sp$drpla, na.rm=T) / sum(sp$total, na.rm=T)
print(correct.prevalence)
print(incorrect.prevalence)
This is easy to apply to one column in one table, but I'm finding it
very difficult to do when I manipulate the data using the aggregate()
function above.
Is there a straightforward way, while aggregating columns by a
variable (number of) factors, in generating the sum of a column and
dividing by the sum of another column, only including data from the
second column when the first column is not NA. AND is it possible to
do that to all of the relevant columns (3:15)?
You won't believe how many iterations of by(), aggregate(), tapply(),
apply() and (finally) "for loops" I have tried with no success. The
best partial solution involved a combination of by() calling a
function calling aggregate(), but I can't parse the data returned. I'm
sure I am missing something! Can anyone help?
Many (many many) thanks!
Best wishes,
Mark
P.S. this is enough to drive me to drink!
--
Dr. Mark Wardle
Clinical research fellow and specialist registrar, Neurology
Cardiff, UK
More information about the R-help
mailing list