[R] how to get rid of 2 for-loops and optimize runtime
William Dunlap
wdunlap at tibco.com
Mon Oct 19 18:07:32 CEST 2009
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Ian Willems
> Sent: Monday, October 19, 2009 6:50 AM
> To: 'r-help at r-project.org'
> Subject: [R] how to get rid of 2 for-loops and optimize runtime
>
> Short: get rid of the loops I use and optimize runtime
>
> Dear all,
>
> I want to calculate for each row the amount of the month ago.
> I use a matrix with 2100 rows and 22 colums (which is still a
> very small matrix. nrows of other matrixes can easily be more
> then 100000)
>
> Table before
> Year month quarter yearmonth Service ... Amount
> 2009 9 Q3 092009 A
> ... 120
> 2009 9 Q3 092009 B
> ... 80
> 2009 8 Q3 082009 A
> ... 40
> 2009 7 Q3 072009 A
> ... 50
>
> The result I want
> Year month quarter yearmonth Service ... Amount amound_lastmonth
> 2009 9 Q3 092009 A ... 120 40
> 2009 9 Q3 092009 B ... 80 ...
> 2009 8 Q3 082009 A ... 40 50
> 2009 7 Q3 072009 A ... 50 ...
>
> Table is not exactly the same but gives a good idea what I
> have and what I want
>
> The code I have written (see below) does what I want but it
> is very very slow. It takes 129s for 400 rows. And the time
> gets four times higher each time I double the amount of rows.
> I'm new in programming in R, but I found that you can use
> Rprof and summaryRprof to analyse your code (output see below)
> But I don't really understand the output
> I guess I need code that requires linear time and need to get
> rid of the 2 for loops.
> can someone help me or tell me what else I can do to optimize
> my runtime
>
> I use R 2.9.2
> windows Xp service pack3
>
> Thank you in advance
>
> Best regards,
>
> Willems Ian
>
>
> *****************************
> dataset[,5]= month
> dataset[,3]= year
> dataset[,22]= amount
> dataset[,14]= servicetype
>
> [CODE]
> #for each row of the matrix check if each row has..
> > for (j in 1:Number_rows) {
> + sum<-0
> + for(i in 1:Number_rows){
> + if (dataset[j,14]== dataset[i,14]) #..the same service type
> + {if (dataset[j,18]== dataset[i,18]) # .. the same department
> + {if (dataset[j,5]== "1") # if month=1, month ago is
> 12 and year is -1
> + {if ("12"== dataset[i,5])
> + {if ((dataset[j,3]-1)== dataset[i,3])
> +
> + { sum<-sum + dataset[i,22]}
> + }}
> + else {
> + if ((dataset[j,5]-1)== dataset[i,5]) " if month != 1,
> month ago is month -1
> + { if (dataset[j,3]== dataset[i,3])
> + {sum<-sum + dataset[i,22]}
> + }}}}}}
match() is often useful for quickly finding the locations of
many items in a vector. It has no special methods for data.frames
so you must combine the columns of interest into a character
vector of keys and use match on the key vectors. E.g.
# your test data in a format that mail readers
# can copy and paste into R:
d <- read.table(header=TRUE, textConnection("
Year month quarter yearmonth Service Amount
2009 9 Q3 092009 A 120
2009 9 Q3 092009 B 80
2009 8 Q3 082009 A 40
2009 7 Q3 072009 A 50
"))
# The key functions
dKey <- function(d) {
with(d, paste(d$Year, d$month, d$Service, sep=";"))
}
keyThisMonth <- function(d)dKey(d)
keyPrevMonth <- function(d) {
stopifnot(!is.null(d$Year), !is.null(d$month), !is.null(d$Service))
isJan <- d$month==1
d$Year[isJan] <- d$Year[isJan] - 1
d$month[isJan] <- 12
d$month[!isJan] <- d$month[!isJan] - 1
dKey(d)
}
# Make the new column:
d$AmountPrevMonth <- d$Amount[ match(keyPrevMonth(d), keyThisMonth(d)) ]
# The result
print(d)
Year month quarter yearmonth Service Amount AmountPrevMonth
1 2009 9 Q3 92009 A 120 40
2 2009 9 Q3 92009 B 80 NA
3 2009 8 Q3 82009 A 40 50
4 2009 7 Q3 72009 A 50 NA
This assumes there is only one match per row. Is this the
result you are looking for?
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
>
> [\Code]
>
> > summaryRprof()
> $by.self
> self.time self.pct total.time total.pct
> [.data.frame 33.92 26.2 80.90 62.5
> NextMethod 12.68 9.8 12.68 9.8
> [.factor 8.60 6.6 18.36 14.2
> Ops.factor 8.10 6.3 40.08 31.0
> sort.int 6.82 5.3 13.70 10.6
> [ 6.70 5.2 85.44 66.0
> names 6.54 5.1 6.54 5.1
> length 5.66 4.4 5.66 4.4
> == 5.04 3.9 44.92 34.7
> levels 4.80 3.7 5.56 4.3
> is.na 4.24 3.3 4.24 3.3
> dim 3.66 2.8 3.66 2.8
> switch 3.60 2.8 3.80 2.9
> vector 2.68 2.1 8.02 6.2
> inherits 1.90 1.5 1.90 1.5
> any 1.68 1.3 1.68 1.3
> noNA.levels 1.46 1.1 7.84 6.1
> .Call 1.40 1.1 1.40 1.1
> ! 1.26 1.0 1.26 1.0
> attr<- 1.06 0.8 1.06 0.8
> .subset 1.00 0.8 1.00 0.8
> class<- 0.82 0.6 0.82 0.6
> != 0.80 0.6 0.80 0.6
> levels.default 0.68 0.5 0.76 0.6
> all 0.62 0.5 0.62 0.5
> < 0.54 0.4 0.54 0.4
> - 0.48 0.4 0.48 0.4
> is.factor 0.44 0.3 2.34 1.8
> .subset2 0.38 0.3 0.38 0.3
> attr 0.36 0.3 0.36 0.3
> is.character 0.28 0.2 0.28 0.2
> is.null 0.28 0.2 0.28 0.2
> | 0.26 0.2 0.26 0.2
> oldClass<- 0.20 0.2 0.20 0.2
> is.atomic 0.16 0.1 0.16 0.1
> nzchar 0.10 0.1 0.10 0.1
> is.numeric 0.06 0.0 0.06 0.0
> oldClass 0.06 0.0 0.06 0.0
> ( 0.04 0.0 0.04 0.0
> [.data 0.02 0.0 0.02 0.0
>
> $by.total
> total.time total.pct self.time self.pct
> [ 85.44 66.0 6.70 5.2
> [.data.frame 80.90 62.5 33.92 26.2
> == 44.92 34.7 5.04 3.9
> Ops.factor 40.08 31.0 8.10 6.3
> [.factor 18.36 14.2 8.60 6.6
> sort.int 13.70 10.6 6.82 5.3
> NextMethod 12.68 9.8 12.68 9.8
> vector 8.02 6.2 2.68 2.1
> noNA.levels 7.84 6.1 1.46 1.1
> names 6.54 5.1 6.54 5.1
> length 5.66 4.4 5.66 4.4
> levels 5.56 4.3 4.80 3.7
> is.na 4.24 3.3 4.24 3.3
> switch 3.80 2.9 3.60 2.8
> dim 3.66 2.8 3.66 2.8
> is.factor 2.34 1.8 0.44 0.3
> inherits 1.90 1.5 1.90 1.5
> any 1.68 1.3 1.68 1.3
> .Call 1.40 1.1 1.40 1.1
> ! 1.26 1.0 1.26 1.0
> attr<- 1.06 0.8 1.06 0.8
> .subset 1.00 0.8 1.00 0.8
> class<- 0.82 0.6 0.82 0.6
> != 0.80 0.6 0.80 0.6
> levels.default 0.76 0.6 0.68 0.5
> all 0.62 0.5 0.62 0.5
> < 0.54 0.4 0.54 0.4
> - 0.48 0.4 0.48 0.4
> .subset2 0.38 0.3 0.38 0.3
> attr 0.36 0.3 0.36 0.3
> is.character 0.28 0.2 0.28 0.2
> is.null 0.28 0.2 0.28 0.2
> | 0.26 0.2 0.26 0.2
> oldClass<- 0.20 0.2 0.20 0.2
> is.atomic 0.16 0.1 0.16 0.1
> nzchar 0.10 0.1 0.10 0.1
> is.numeric 0.06 0.0 0.06 0.0
> oldClass 0.06 0.0 0.06 0.0
> ( 0.04 0.0 0.04 0.0
> [.data 0.02 0.0 0.02 0.0
>
> $sampling.time
> [1] 129.38
>
> Warning message:
> In readLines(filename, n = chunksize) :
> incomplete final line found on 'Rprof.out'
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list