[R] how to get rid of 2 for-loops and optimize runtime

Ian Willems ian.willems at uz.kuleuven.ac.be
Tue Oct 20 15:46:00 CEST 2009


Hi William,

Your programs works perfect and very fast for the table I'm using right now (only one match per row)
If I want to reuse this code other tables, it can match with more than one row.
Is it possible to adapt your code easily, if I have to sum the values of last month from different rows?

Thank u for your help
regards,

Ian


-----Oorspronkelijk bericht-----
Van: William Dunlap [mailto:wdunlap at tibco.com] 
Verzonden: maandag 19 oktober 2009 18:08
Aan: Ian Willems; r-help at r-project.org
Onderwerp: RE: [R] how to get rid of 2 for-loops and optimize runtime


> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Ian Willems
> Sent: Monday, October 19, 2009 6:50 AM
> To: 'r-help at r-project.org'
> Subject: [R] how to get rid of 2 for-loops and optimize runtime
> 
> Short: get rid of the loops I use and optimize runtime
> 
> Dear all,
> 
> I want to calculate for each row the amount of the month ago. 
> I use a matrix with 2100 rows and 22 colums (which is still a 
> very small matrix. nrows of other matrixes can easily be more 
> then 100000)
> 
> Table before
> Year  month quarter yearmonth Service ...  Amount
> 2009  9        Q3            092009          A                
> ...    120
> 2009  9        Q3            092009          B                
>  ...     80
> 2009  8        Q3           082009           A                
>   ...     40
> 2009  7        Q3           072009           A                
>    ...      50
> 
> The result I want
> Year month  quarter yearmonth Service ...    Amount   amound_lastmonth
> 2009 9      Q3      092009          A ...    120      40
> 2009 9      Q3      092009          B ...    80       ...
> 2009 8      Q3      082009          A ...    40       50
> 2009 7      Q3      072009          A ...    50       ...
> 
> Table is not exactly the same but gives a good idea what I 
> have and what I want
> 
> The code I have written (see below) does what I want but it 
> is very very slow. It takes 129s for 400 rows. And the time 
> gets four times higher each time I double the amount of rows.
> I'm new in programming in R, but I found that you can use 
> Rprof and summaryRprof to analyse your code (output see below)
> But I don't really understand the output
> I guess I need code that requires linear time and need to get 
> rid of the 2 for loops.
> can someone help me or tell me what else I can do to optimize 
> my runtime
> 
> I use R 2.9.2
> windows Xp service pack3
> 
> Thank you in advance
> 
> Best regards,
> 
> Willems Ian
> 
> 
> *****************************
> dataset[,5]= month
> dataset[,3]= year
> dataset[,22]= amount
> dataset[,14]= servicetype
> 
> [CODE]
> #for each row of the matrix check if each row has..
> > for (j in 1:Number_rows) {
> + sum<-0
> + for(i in 1:Number_rows){
> + if (dataset[j,14]== dataset[i,14]) #..the same service type
> +   {if (dataset[j,18]== dataset[i,18]) # .. the same department
> +        {if (dataset[j,5]== "1")  # if month=1, month ago is 
> 12 and year is -1
> +           {if ("12"== dataset[i,5])
> +            {if ((dataset[j,3]-1)== dataset[i,3])
> +
> +         { sum<-sum + dataset[i,22]}
> +      }}
> +      else {
> +       if ((dataset[j,5]-1)== dataset[i,5]) " if month != 1, 
> month ago is month -1
> +         { if (dataset[j,3]== dataset[i,3])
> +         {sum<-sum + dataset[i,22]}
> +      }}}}}}

match() is often useful for quickly finding the locations of
many items in a vector.  It has no special methods for data.frames
so you must combine the columns of interest into a character
vector of keys and use match on the key vectors.  E.g.

# your test data in a format that mail readers
# can copy and paste into R:
d <- read.table(header=TRUE, textConnection("
   Year  month quarter yearmonth Service  Amount
   2009  9        Q3   092009          A     120
   2009  9        Q3   092009          B      80
   2009  8        Q3   082009          A      40
   2009  7        Q3   072009          A      50
   "))
# The key functions
dKey <- function(d) {
   with(d, paste(d$Year, d$month, d$Service, sep=";"))
}
keyThisMonth <- function(d)dKey(d)
keyPrevMonth <- function(d) {
   stopifnot(!is.null(d$Year), !is.null(d$month), !is.null(d$Service))
   isJan <- d$month==1
   d$Year[isJan] <- d$Year[isJan] - 1
   d$month[isJan] <- 12
   d$month[!isJan] <- d$month[!isJan] - 1
   dKey(d)
}
# Make the new column:
d$AmountPrevMonth <- d$Amount[ match(keyPrevMonth(d), keyThisMonth(d)) ]
# The result
print(d)

  Year month quarter yearmonth Service Amount AmountPrevMonth
1 2009     9      Q3     92009       A    120              40
2 2009     9      Q3     92009       B     80              NA
3 2009     8      Q3     82009       A     40              50
4 2009     7      Q3     72009       A     50              NA

This assumes there is only one match per row.  Is this the
result you are looking for?

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> 
> [\Code]
> 
> > summaryRprof()
> $by.self
>                self.time self.pct total.time total.pct
> [.data.frame       33.92  26.2    80.90      62.5
> NextMethod         12.68  9.8     12.68       9.8
> [.factor            8.60  6.6      18.36      14.2
> Ops.factor          8.10  6.3      40.08      31.0
> sort.int            6.82  5.3      13.70      10.6
> [                   6.70  5.2      85.44      66.0
> names               6.54  5.1       6.54       5.1
> length              5.66  4.4       5.66       4.4
> ==                  5.04  3.9      44.92      34.7
> levels              4.80  3.7       5.56       4.3
> is.na               4.24  3.3       4.24       3.3
> dim                 3.66  2.8       3.66       2.8
> switch              3.60  2.8       3.80       2.9
> vector              2.68  2.1       8.02       6.2
> inherits            1.90  1.5       1.90       1.5
> any                 1.68  1.3       1.68       1.3
> noNA.levels         1.46  1.1       7.84       6.1
> .Call               1.40  1.1       1.40       1.1
> !                   1.26  1.0       1.26       1.0
> attr<-              1.06  0.8       1.06       0.8
> .subset             1.00  0.8       1.00       0.8
> class<-             0.82  0.6       0.82       0.6
> !=                  0.80  0.6       0.80       0.6
> levels.default      0.68  0.5       0.76       0.6
> all                 0.62  0.5       0.62       0.5
> <                   0.54  0.4       0.54       0.4
> -                   0.48  0.4       0.48       0.4
> is.factor           0.44  0.3       2.34       1.8
> .subset2            0.38  0.3       0.38       0.3
> attr                0.36  0.3       0.36       0.3
> is.character        0.28  0.2       0.28       0.2
> is.null             0.28  0.2       0.28       0.2
> |                   0.26  0.2       0.26       0.2
> oldClass<-          0.20  0.2       0.20       0.2
> is.atomic           0.16  0.1       0.16       0.1
> nzchar              0.10  0.1       0.10       0.1
> is.numeric          0.06  0.0       0.06       0.0
> oldClass            0.06  0.0       0.06       0.0
> (                   0.04  0.0       0.04       0.0
> [.data              0.02  0.0       0.02       0.0
> 
> $by.total
>                total.time total.pct self.time self.pct
> [                   85.44  66.0      6.70      5.2
> [.data.frame        80.90  62.5     33.92     26.2
> ==                  44.92  34.7      5.04      3.9
> Ops.factor          40.08  31.0      8.10      6.3
> [.factor            18.36  14.2      8.60      6.6
> sort.int            13.70  10.6      6.82      5.3
> NextMethod          12.68  9.8     12.68      9.8
> vector               8.02  6.2      2.68      2.1
> noNA.levels          7.84  6.1      1.46      1.1
> names                6.54  5.1      6.54      5.1
> length               5.66  4.4      5.66      4.4
> levels               5.56  4.3      4.80      3.7
> is.na                4.24  3.3      4.24      3.3
> switch               3.80  2.9      3.60      2.8
> dim                  3.66  2.8      3.66      2.8
> is.factor            2.34  1.8      0.44      0.3
> inherits             1.90  1.5      1.90      1.5
> any                  1.68  1.3      1.68      1.3
> .Call                1.40  1.1      1.40      1.1
> !                    1.26  1.0      1.26      1.0
> attr<-               1.06  0.8      1.06      0.8
> .subset              1.00  0.8      1.00      0.8
> class<-              0.82  0.6      0.82      0.6
> !=                   0.80  0.6      0.80      0.6
> levels.default       0.76  0.6      0.68      0.5
> all                  0.62  0.5      0.62      0.5
> <                    0.54  0.4      0.54      0.4
> -                    0.48  0.4      0.48      0.4
> .subset2             0.38  0.3      0.38      0.3
> attr                 0.36  0.3      0.36      0.3
> is.character         0.28  0.2      0.28      0.2
> is.null              0.28  0.2      0.28      0.2
> |                    0.26  0.2      0.26      0.2
> oldClass<-           0.20  0.2      0.20      0.2
> is.atomic            0.16  0.1      0.16      0.1
> nzchar               0.10  0.1      0.10      0.1
> is.numeric           0.06  0.0      0.06      0.0
> oldClass             0.06  0.0      0.06      0.0
> (                    0.04  0.0      0.04      0.0
> [.data               0.02  0.0      0.02      0.0
> 
> $sampling.time
> [1] 129.38
> 
> Warning message:
> In readLines(filename, n = chunksize) :
>   incomplete final line found on 'Rprof.out'
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 




More information about the R-help mailing list