[R] Speeding up code?
jim holtman
jholtman at gmail.com
Thu Jul 16 19:37:05 CEST 2015
Here is one improvement. Avoid dataframes in some of these cases. This
create a character matrix and then converts to a dataframe after doing the
transpose of the matrix. This just takes less than 10 seconds on my system:
> library(stringr)
> # create character matrix; avoid dataframes in this case
> print(proc.time())
user system elapsed
15.52 5.24 587.70
> xm <- do.call(rbind, str_split(string = AllpairsTmp, pattern = "-"))
> # convert to dataframe and do transpose on matrix and not dataframe
> separoPairs <- as.data.frame(t(xm), stringsAsFactors = FALSE)
> print(proc.time()
+
+ )
user system elapsed
20.90 5.36 596.57
>
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
On Thu, Jul 16, 2015 at 7:56 AM, Ignacio Martinez <ignacio82 at gmail.com>
wrote:
> Hi Collin,
>
> The objective of the gen.names function is to generate N *unique *random
> names, where N is a *large *number. In my computer `gen.names(n = 50000)`
> takes under a second, so is probably not the root problem in my code. That
> said, I would love to improve it. I'm not exactly sure how you propose to
> change it using sample. What is the object that I would be sampling? I
> would love to run a little benchmark to compare my version with yours.
>
> What really takes a long time to run is:
>
> separoPairs <- as.data.frame(str_split(string = AllpairsTmp, pattern =
> "-"))
>
> So that and the chunk of code before that is probably where I would get big
> gains in speed. Sadly, I have no clue how to do it differently
>
> Thanks a lot for the help!!
>
> Ignacio
>
>
> On Wed, Jul 15, 2015 at 11:34 PM Collin Lynch <cflynch at ncsu.edu> wrote:
>
> > Hi Ignacio, If I am reading your code correctly then the top while loop
> is
> > essentially seeking to select a random set of names from the original
> set,
> > then using unique to reduce it down, you then iterate until you have
> built
> > your quota. Ultimately this results in a very inefficient attempt at
> > sampling without replacement. Why not just sample without replacement
> > rather than loop iteratively and use unique? Or if the set of possible
> > names are short enough why not just randomize it and then pull the first
> n
> > items off?
> >
> > Best,
> > Collin.
> >
> > On Wed, Jul 15, 2015 at 11:15 PM, Ignacio Martinez <ignacio82 at gmail.com>
> > wrote:
> >
> >> Hi R-Help!
> >>
> >> I'm hoping that some of you may give me some tips that could make my
> code
> >>
> > more efficient. More precisely, I would like to make the answer to my
> >> stakoverflow
> >> <
> >>
> http://stackoverflow.com/questions/31137940/randomly-assign-teachers-to-classrooms-imposing-restrictions
> >> >
> >
> >
> >> question more efficient.
> >>
> >> This is the code:
> >>
> >> library(dplyr)
> >> library(randomNames)
> >> library(geosphere)
> >>
> > set.seed(7142015)# Define Parameters
> >
> >
> >> n.Schools <- 20
> >> first.grade<-3
> >> last.grade<-5
> >> n.Grades <-last.grade-first.grade+1
> >> n.Classrooms <- 20 # THIS IS WHAT I WANTED TO BE ABLE TO CHANGE
> >> n.Teachers <- (n.Schools*n.Grades*n.Classrooms)/2 #Two classrooms per
> >> teacher
> >> # Define Random names function:
> >> gen.names <- function(n, which.names = "both", name.order =
> "last.first"){
> >> names <- unique(randomNames(n=n, which.names = which.names,
> >> name.order = name.order))
> >> need <- n - length(names)
> >> while(need>0){
> >> names <- unique(c(randomNames(n=need, which.names = which.names,
> >> name.order = name.order), names))
> >> need <- n - length(names)
> >> }
> >> return(names)}
> >> # Generate n.Schools names
> >> gen.schools <- function(n.schools) {
> >> School.ID <-
> >> paste0(gen.names(n = n.schools, which.names = "last"), ' School')
> >> School.long <- rnorm(n = n.schools, mean = 21.7672, sd = 0.025)
> >> School.lat <- rnorm(n = n.schools, mean = 58.8471, sd = 0.025)
> >> School.RE <- rnorm(n = n.schools, mean = 0, sd = 1)
> >> Schools <-
> >> data.frame(School.ID, School.lat, School.long, School.RE) %>%
> >> mutate(School.ID = as.character(School.ID)) %>%
> >> rowwise() %>% mutate (School.distance = distHaversine(
> >> p1 = c(School.long, School.lat),
> >> p2 = c(21.7672, 58.8471), r = 3961
> >> ))
> >> return(Schools)}
> >>
> >> Schools <- gen.schools(n.schools = n.Schools)
> >> # Generate Grades
> >> Grades <- c(first.grade:last.grade)
> >> # Generate n.Classrooms
> >>
> >> Classrooms <- LETTERS[1:n.Classrooms]
> >> # Group schools and grades
> >>
> >> SchGr <- outer(paste0(Schools$School.ID, '-'), paste0(Grades, '-'),
> >> FUN="paste")#head(SchGr)
> >> # Group SchGr and Classrooms
> >>
> >> SchGrClss <- outer(SchGr, paste0(Classrooms, '-'),
> >> FUN="paste")#head(SchGrClss)
> >> # These are the combination of School-Grades-Classroom
> >> SchGrClssTmp <- as.matrix(SchGrClss, ncol=1, nrow=length(SchGrClss) )
> >> SchGrClssEnd <- as.data.frame(SchGrClssTmp)
> >> # Assign n.Teachers (2 classroom in a given school-grade)
> >> Allpairs <- as.data.frame(t(combn(SchGrClssTmp, 2)))
> >> AllpairsTmp <- paste(Allpairs$V1, Allpairs$V2, sep=" ")
> >>
> >> library(stringr)
> >> separoPairs <- as.data.frame(str_split(string = AllpairsTmp, pattern =
> >> "-"))
> >> separoPairs <- as.data.frame(t(separoPairs))
> >> row.names(separoPairs) <- NULL
> >> separoPairs <- separoPairs %>% select(-V7) %>% #Drops empty column
> >> mutate(V1=as.character(V1), V4=as.character(V4), V2=as.numeric(V2),
> >> V5=as.numeric(V5)) %>% mutate(V4 = trimws(V4, which = "both"))
> >>
> >> separoPairs[120,]$V4#Only the rows with V1=V4 and V2=V5 are valid
> >
> >
> >> validPairs <- separoPairs %>% filter(V1==V4 & V2==V5) %>% select(V1, V2,
> >> V3, V6)
> >> # Generate n.Teachers
> >>
> >> gen.teachers <- function(n.teachers){
> >> Teacher.ID <- gen.names(n = n.teachers, name.order = "last.first")
> >> Teacher.exp <- runif(n = n.teachers, min = 1, max = 30)
> >> Teacher.Other <- sample(c(0,1), replace = T, prob = c(0.5, 0.5),
> >> size = n.teachers)
> >> Teacher.RE <- rnorm(n = n.teachers, mean = 0, sd = 1)
> >> Teachers <- data.frame(Teacher.ID, Teacher.exp, Teacher.Other,
> >> Teacher.RE)
> >> return(Teachers)}
> >> Teachers <- gen.teachers(n.teachers = n.Teachers) %>%
> >> mutate(Teacher.ID = as.character(Teacher.ID))
> >> # Randomly assign n.Teachers teachers to the "ValidPairs"
> >> TmpAssignments <- validPairs[sample(1:nrow(validPairs), n.Teachers), ]
> >> Assignments <- cbind.data.frame(Teachers$Teacher.ID, TmpAssignments)
> >> names(Assignments) <- c("Teacher.ID", "School.ID", "Grade", "Class_1",
> >> "Class_2")
> >> # Tidy Data
> >> library(tidyr)
> >> TeacherClassroom <- Assignments %>%
> >> gather(x, Classroom, Class_1,Class_2) %>%
> >> select(-x) %>%
> >> mutate(Teacher.ID = as.character(Teacher.ID))
> >> # Merge
> >> DF_Classrooms <- TeacherClassroom %>% full_join(Teachers,
> >> by="Teacher.ID") %>% full_join(Schools, by="School.ID")
> >> rm(list=setdiff(ls(), "DF_Classrooms")) # Clean the work space!
> >>
> >> *I want to end up with the same* 'DF_Classrooms *data frame* but
> getting
> >
> >
> >> there in a more efficient way. In particular, when is use n.Classrooms
> >> <-4 the
> >>
> > code run fast, but *if I increase it to something like 20 it is painfully
> >> slow.*
> >>
> >> Thanks!!!
> >>
> >> [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
[[alternative HTML version deleted]]
More information about the R-help
mailing list