[R] Band-wise Conditional Sum - Actual problem

Mon Aug 30 16:43:03 CEST 2010

On Aug 30, 2010, at 4:05 AM, Vincy Pyne wrote:

> Dear R helpers,
>
> Thanks a lot for your earlier guidance esp. Mr Davind Winsemius Sir.  
> However, there seems to be mis-communication from my end  
> corresponding to my requirement. As I had mentioned in my earlier  
> mail, I am dealing with a very large database of borrowers and I had  
> given a part of it in my earlier mail as given below. For a given  
> rating say "A", I needed to have the bad-wise sums of ead's (where  
> bands are constructed using the ead size itself.) and not the number  
> of borrowers falling in a particular band.
>
> I am reproducing the data and solution as provided by Winsemius Sir  
> (which generates the number of band-wise borrowers for a given rating.
>
> rating <- c("A", "AAA", "A", "BBB","AA","A","BB", "BBB", "AA", "AA",  
> "AA", "A", "A", "AA","BB","BBB","AA", "A", "AAA","BBB","BBB", "BB",  
> "A", "BB", "A", "AA", "B","A", "AA", "BBB", "A", "BBB")
>
> ead <- c(169229.93,100, 5877794.25, 9530148.63, 75040962.06, 21000,  
> 1028360,  6000000, 17715000,  14430325.24, 1180946.57, 150000,  
> 167490, 81255.16, 54812.5, 3000, 1275702.94, 9100, 1763142.3,  
> 3283048.61, 1200000, 11800, 3000,  96894.02,  453671.72,  7590,  
> 106065.24, 940711.67,  2443000, 9500000, 39000, 1501939.67)
>
> df$ead.cat <- cut(df$ead, breaks=c(0, 100000, 500000, 1000000,  
> 2000000, 5000000 , 10000000, 100000000) )
>
> df
>
> df_sorted <- df[order(df$rating),]      # the output is as given  
> below.
>
> > df_sorted
>    rating         ead                     ead.cat
> 1       A          169229.93        (1e+05,5e+05]
> 3       A         5877794.25        (5e+06,1e+07]
> 6       A            21000.00               (0,1e+05]
> 12      A          150000.00       (1e+05,5e+05]
> 13      A          167490.00       (1e+05,5e+05]
> 18      A             9100.00               (0,1e+05]
> 23      A             3000.00               (0,1e+05]
> 25      A          453671.72       (1e+05,5e+05]
> 28      A          940711.67       (5e+05,1e+06]
> 31      A            39000.00              (0,1e+05]
> 5      AA       75040962.06      (1e+07,1e+08]
> 9      AA       17715000.00      (1e+07,1e+08]
> 10     AA      14430325.24      (1e+07,1e+08]
> 11     AA        1180946.57      (1e+06,2e+06]
> 14     AA            81255.16             (0,1e+05]
> 17     AA         1275702.94     (1e+06,2e+06]
> 26     AA              7590.00            (0,1e+05]
> 29     AA         2443000.00     (2e+06,5e+06]
> 2     AAA               100.00             (0,1e+05]
> 19    AAA       1763142.30      (1e+06,2e+06]
> 27      B           106065.24      (1e+05,5e+05]
> 7      BB         1028360.00      (1e+06,2e+06]
> 15     BB            54812.50             (0,1e+05]
> 22     BB            11800.00             (0,1e+05]
> 24     BB            96894.02             (0,1e+05]
> 4     BBB        9530148.63      (5e+06,1e+07]
> 8     BBB        6000000.00      (5e+06,1e+07]
> 16    BBB            3000.00              (0,1e+05]
> 20    BBB       3283048.61       (2e+06,5e+06]
> 21    BBB       1200000.00       (1e+06,2e+06]
> 30    BBB       9500000.00       (5e+06,1e+07]
> 32    BBB       1501939.67       (1e+06,2e+06]
>
>
> ## The following command fetches rating-wise and ead size no of  
> borrowers. Thus, for rating A, there are 4 borrowers in the ead  
> range (0, 1e+05], 4 borrowers in the range (1e+05 to 5e+05] and so  
> on......
>
> > with(df, tapply(ead.cat, rating, table))
> $A
>
>     (0,1e+05] (1e+05,5e+05] (5e+05,1e+06] (1e+06,2e+06] (2e+06,5e 
> +06] (5e+06,1e+07] (1e+07,1e+08]
>             4             4             1             0              
> 0             1             0
>
> $AA
>
>     (0,1e+05] (1e+05,5e+05] (5e+05,1e+06] (1e+06,2e+06] (2e+06,5e 
> +06] (5e+06,1e+07] (1e+07,1e+08]
>             2             0             0             2              
> 1             0             3
>
> $AAA
>
>     (0,1e+05] (1e+05,5e+05] (5e+05,1e+06] (1e+06,2e+06] (2e+06,5e 
> +06] (5e+06,1e+07] (1e+07,1e+08]
>             1             0             0             1              
> 0             0             0
>
> $B
>
>     (0,1e+05] (1e+05,5e+05] (5e+05,1e+06] (1e+06,2e+06] (2e+06,5e 
> +06] (5e+06,1e+07] (1e+07,1e+08]
>             0             1             0             0              
> 0             0             0
>
> $BB
>
>     (0,1e+05] (1e+05,5e+05] (5e+05,1e+06] (1e+06,2e+06] (2e+06,5e 
> +06] (5e+06,1e+07] (1e+07,1e+08]
>             3             0             0             1              
> 0             0             0
>
> $BBB
>
>     (0,1e+05] (1e+05,5e+05] (5e+05,1e+06] (1e+06,2e+06] (2e+06,5e 
> +06] (5e+06,1e+07] (1e+07,1e+08]
>             1             0             0             2              
> 1             3             0
>
>
> #### My ACTUAL REQUIREMENT
>
> Actually for a given rating, I don't want the number of borrowers  
> falling in each of the ead_range. What I want is sum of eads falling  
> in each range. Thus, say for rating "A", I need following.
>
>
>          rating        ead.cat                  ead_total
> 1          A           (0,1e+05]                 72100.00    #  
> (21000+9100+3000+39000)
> 2          A           (1e+05, 5e+05]       940391.65
>
> #(169229.93+150000.00+167490.00+453671.72)

So you just wanted simple sums within rating and ead.cat:

with(df_sorted, tapply(ead, list(rating,ead.cat), sum, na.rm=TRUE))

     (0,1e+05] (1e+05,5e+05] (5e+05,1e+06] (1e+06,2e+06] (2e+06,5e+06]  
(5e+06,1e+07]
A    72100.00      940391.6      940711.7            NA             
NA       5877794
AA   88845.16            NA            NA       2456650        
2443000            NA
AAA    100.00            NA            NA       1763142             
NA            NA
B          NA      106065.2            NA            NA             
NA            NA
BB  163506.52            NA            NA       1028360             
NA            NA
BBB   3000.00            NA            NA       2701940        
3283049      25030149
     (1e+07,1e+08]
A              NA
AA      107186287
AAA            NA
B              NA
BB             NA
BBB            NA

-- 
David.

> and so on.
>
> I am extremely sorry for any mis-communication in my earlier mail. I  
> could test the reply sent to me earlier by Winsemius Sir only today  
> as I was traveling over weekends. Also, I have tried to go through  
> earlier emails dealing with such conditional sums. Unfortunately, I  
> couldn't understand as I have recently started my venture with R.
>
>
> Thanking you in advance and sincerely apologize for any mis- 
> communication if it had occurred in my earlier mail.
>
> Regards
>
> Vincy
>
>
> --- On Fri, 8/27/10, David Winsemius <dwinsemius at comcast.net> wrote:
>
> From: David Winsemius <dwinsemius at comcast.net>
> Subject: Re: [R] Band-wise Sum
> To: "Vincy Pyne" <vincy_pyne at yahoo.ca>
> Cc: r-help at r-project.org
> Received: Friday, August 27, 2010, 2:36 PM
>
>
> On Aug 27, 2010, at 9:49 AM, Vincy Pyne wrote:
>
> > Hi
> >
> > I have a large credit portfolio (exceeding 50000 borrowers). For  
> particular process I need to add up the exposures based on the  
> bands. I am giving a small test data below.
>
> I would think that cut() would be the accepted method for defining a  
> factor variable based on specified cutpoints. If you then wanted to  
> see what the cumsum() was across the range of possible levels, that  
> to would be a fairly simple task.
>
> df$ead.cat <- cut(df$ead, breaks=c(0, 100000, 500000, 1000000,  
> 2000000, 5000000 , 10000000, 100000000) )
> df
> with(df, tapply(ead.cat, rating, length))
> #  A  AA AAA   B  BB BBB
> # 10   8   2   1   4   7
> with(df, tapply(ead.cat, rating, table))
> # returns a list of table objects by bond rating
>
> lapply( with(df, tapply(ead.cat, rating, table)) , cumsum)
> #returns the cumsum of those tables
>
> # sapply gives a more compact output of that result:
> sapply( with(df, tapply(ead.cat, rating, table)) , cumsum)
>                A AA AAA B BB BBB
> (0,1e+05]      4  2   1 0  3   1
> (1e+05,5e+05]  8  2   1 1  3   1
> (5e+05,1e+06]  9  2   1 1  3   1
> (1e+06,2e+06]  9  4   2 1  4   3
> (2e+06,5e+06]  9  5   2 1  4   4
> (5e+06,1e+07] 10  5   2 1  4   7
> (1e+07,1e+08] 10  8   2 1  4   7
>
> Loops, you say we need loops? We don't need no stinkin' loops.
>
> --David.
>
> >
> > rating <- c("A", "AAA", "A", "BBB","AA","A","BB", "BBB", "AA",  
> "AA", "AA", "A", "A", "AA","BB","BBB","AA", "A", "AAA","BBB","BBB",  
> "BB", "A", "BB", "A", "AA", "B","A", "AA", "BBB", "A", "BBB")
> >
> > ead <- c(169229.93,100, 5877794.25, 9530148.63, 75040962.06,  
> 21000, 1028360,  6000000, 17715000,  14430325.24, 1180946.57,  
> 150000, 167490, 81255.16, 54812.5, 3000, 1275702.94, 9100,  
> 1763142.3, 3283048.61, 1200000, 11800, 3000,  96894.02,  453671.72,   
> 7590, 106065.24, 940711.67,  2443000, 9500000, 39000, 1501939.67)
> >
> > ## First I have sorted the data rating-wise as
> >
> > df <- data.frame(rating, ead)
> >
> > df_sorted <-
> > df[order(df$rating),]
> >
> > df_sorted_AAA <- subset(df_sorted, rating=="AAA")
> > df_sorted_AA <- subset(df_sorted, rating=="AA")
> > df_sorted_A <- subset(df_sorted, rating=="A")
> > df_sorted_BBB <- subset(df_sorted, rating=="BBB")
> > df_sorted_BB <- subset(df_sorted, rating=="BB")
> > df_sorted_B <- subset(df_sorted, rating=="B")
> > df_sorted_CCC <- subset(df_sorted, rating=="CCC")
> >
> > ## we begin with BBB rating. The R output for df_sorted_BBB is as  
> follows
> >
> >> df_sorted_BBB
> >       rating      ead
> > 4     BBB      9530149
> > 8     BBB      6000000
> > 16    BBB     3000
> > 20    BBB     3283049
> > 21    BBB     1200000
> > 30    BBB     9500000
> > 32    BBB     1501940
> >
> > My problem is I need to totals of eads falling in the respective  
> bands
> >
> > I
> > am defining bands in millions as
> >
> > seq_BBB <- seq(1000000, max(df_sorted_BBB$ead), by = 1000000)
> >
> > # The output is
> > [1] 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 7e+06 8e+06 9e+06
> >
> > So for the sub data pertaining to Rating "BBB", I want  
> corresponding ead totals i.e. I want ead totals where ead < 1e+06,  
> then I want ead totals where 1+e06 < ead < 2e+06, 2e+06 < ead < 3e 
> +06 ...and so on.
> >
> > I have tried the following code
> >
> > s_BBB <- NULL
> >
> > for (i in 1:length(s_BBB))
> > {
> > s_BBB[i] = sum(subset(df_sorted_BBB$ead, df_sorted_BBB$ead <  
> s_BBB[i]))
> > }
> >
> > I was trying to find totals ofads < 1e+06, ead < 2e+06, ead<3e 
> +06and so on.
> >
> > but the result is
> >
> >> s_BBB
> > [1] 0
> >
> >
> > I apologize if I am not able to express my problem properly. My  
> only objective is first to sort the whole portfolio rating-wise and  
> then within each of these rating-wise sorted data, I wish to find  
> out total of eads based
> > on various bands starting <1000000,  1000000 - 200000, 2000000 -  
> 3000000, 3000000 - 4000000 and so on. Since the database contains  
> more than 50000 records, various ead amounts ranging from few 000's  
> to billion are available.
> >
> > Please guide
> >
> > Thanking  you all in advance

>

David Winsemius, MD
West Hartford, CT