[R] Create new data frame with conditional sums
Jason Stout, M.D.
j@@on@@tout @end|ng |rom duke@edu
Sat Oct 14 16:16:27 CEST 2023
That's very helpful and instructive, thank you!
Jason Stout, MD, MHS
Box 102359-DUMC
Durham, NC 27710
FAX 919-681-7494
________________________________
From: John Fox <jfox using mcmaster.ca>
Sent: Saturday, October 14, 2023 10:13 AM
To: Jason Stout, M.D. <jason.stout using duke.edu>
Cc: r-help using r-project.org <r-help using r-project.org>
Subject: Re: [R] Create new data frame with conditional sums
Dear Jason,
I don't think that there's anything wrong with using a loop to solve
this problem, but it's generally a good idea to pre-allocate space for
the result rather than build it up one value at a time, which may cause
unnecessary copying of the object.
Here are three solutions:
f1 <- function(Cutoff, Pct, Totpop){
Pop <- numeric(0)
for (i in seq_along(Cutoff))
Pop[i] <- sum(Totpop[Pct >= Cutoff[i]])
cbind(Cutoff, Pop)
}
f2 <- function(Cutoff, Pct, Totpop){
Pop <- numeric(length(Cutoff))
for (i in seq_along(Cutoff))
Pop[i] <- sum(Totpop[Pct >= Cutoff[i]])
cbind(Cutoff, Pop)
}
f3 <- function(Cutoff, Pct, Totpop){
Pop <- sapply(Cutoff, function(c) sum(Totpop[Pct >= c]))
cbind(Cutoff, Pop)
}
The first is similar to yours; the second pre-allocates space for the
result but still uses a loop; and the third avoids the loop. All produce
the same result, for example,
> with(dummydata, f3(seq(0, 0.15, by=0.01), Pct, Totpop))
Cutoff Pop
[1,] 0.00 43800
[2,] 0.01 43800
[3,] 0.02 39300
[4,] 0.03 39300
[5,] 0.04 31000
[6,] 0.05 26750
[7,] 0.06 22750
[8,] 0.07 17800
[9,] 0.08 12700
[10,] 0.09 12700
[11,] 0.10 8000
[12,] 0.11 8000
[13,] 0.12 8000
[14,] 0.13 3900
[15,] 0.14 3900
[16,] 0.15 3900
Here are some timings:
> microbenchmark::microbenchmark(
+ preallocate=with(dummydata, f2(seq(0, 0.15, by=0.01),
+ Pct, Totpop)),
+ yourloop=with(dummydata, f1(seq(0, 0.15, by=0.01),
+ Pct, Totpop)),
+ sapply=with(dummydata, f3(seq(0, 0.15, by=0.01),
+ Pct, Totpop)),
+ times=1000
+ )
Unit: microseconds
expr min lq mean median uq max neval cld
preallocate 13.776 14.3910 15.74195 14.9240 16.318 56.908 1000 a
yourloop 15.129 15.7645 17.26809 16.3795 18.368 73.964 1000 b
sapply 22.304 23.2060 25.19868 24.1080 26.814 48.544 1000 c
So, for this very small problem, there are small but reliable
differences in timing among the three solutions, and the version that
avoids the loop is slowest. I suspect, but haven't verified, that for a
much larger problem, your solution would be slowest.
I hope this helps,
John
--
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
web: https://urldefense.com/v3/__https://www.john-fox.ca/__;!!OToaGQ!s5vzmg4dxnnS0zohDtpWBBey7cb53uSXIPTTqs5fgaz-BKlNnWzpCfBz6aP0YhCGemy-bP6xEtJEwobdDQ$
On 2023-10-13 4:13 p.m., Jason Stout, M.D. wrote:
> Caution: External email.
>
>
> This seems like it should be simple but I can't get it to work properly. I'm starting with a data frame like this:
>
> Tract Pct Totpop
> 1 0.05 4000
> 2 0.03 3500
> 3 0.01 4500
> 4 0.12 4100
> 5 0.21 3900
> 6 0.04 4250
> 7 0.07 5100
> 8 0.09 4700
> 9 0.06 4950
> 10 0.03 4800
>
> And I want to end up with a data frame with two columns, a "Cutoff" column that is a simple sequence of equally spaced cutoffs (let's say in this case from 0-0.15 by 0.01) and a "Pop" column which equals the sum of "Totpop" in the prior data frame in which "Pct" is greater than or equal to "cutoff." So in this toy example, this is what I want for a result:
>
> Cutoff Pop
> 1 0.00 43800
> 2 0.01 43800
> 3 0.02 39300
> 4 0.03 39300
> 5 0.04 31000
> 6 0.05 26750
> 7 0.06 22750
> 8 0.07 17800
> 9 0.08 12700
> 10 0.09 12700
> 11 0.10 8000
> 12 0.11 8000
> 13 0.12 8000
> 14 0.13 3900
> 15 0.14 3900
> 16 0.15 3900
>
> I can do this with a for loop but it seems there should be an easier, vectorized way that would be more efficient. Here is a reproducible example:
>
> dummydata<-data.frame(Tract=seq(1,10,by=1),Pct=c(0.05,0.03,0.01,0.12,0.21,0.04,0.07,0.09,0.06,0.03),Totpop=c(4000,3500,4500,4100,
> 3900,4250,5100,4700,
> 4950,4800))
> dfrm<-data.frame(matrix(ncol=2,nrow=0,dimnames=list(NULL,c("Cutoff","Pop"))))
> for (i in seq(0,0.15,by=0.01)) {
> temp<-sum(dummydata[dummydata$Pct>=i,"Totpop"])
> dfrm[nrow(dfrm)+1,]<-c(i,temp)
> }
>
> Jason Stout, MD, MHS
> Division of Infectious Diseases
> Dept of Medicine
> Duke University
> Box 102359-DUMC
> Durham, NC 27710
> FAX 919-681-7494
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!OToaGQ!s5vzmg4dxnnS0zohDtpWBBey7cb53uSXIPTTqs5fgaz-BKlNnWzpCfBz6aP0YhCGemy-bP6xEtL8RrekaA$
> PLEASE do read the posting guide https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!OToaGQ!s5vzmg4dxnnS0zohDtpWBBey7cb53uSXIPTTqs5fgaz-BKlNnWzpCfBz6aP0YhCGemy-bP6xEtKGvEhDNw$
> and provide commented, minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]
More information about the R-help
mailing list