[R] Subseting a data.frame

Thu Oct 17 23:33:23 CEST 2013

Hi Bill,

seq_along() also worked in the cases you showed:

ave(seq_along(fac), fac, FUN=length)
#[1] 3 1 3 3
ave(seq_along(num), num, FUN=length)
#[1] 3 1 3 3
ave(seq_along(char), char, FUN=length)
#[1] 3 1 3 3

I thought there might be a speed advantage, but all three variants took about the same time:

set.seed(195)
num1 <- sample(1e3, 1e7, replace=TRUE)
system.time(res1 <- ave(integer(length(num1)), num1, FUN=length))
#   user  system elapsed
#  4.148   0.228   4.382
system.time(res2 <- ave(seq_along(num1), num1, FUN=length))
#   user  system elapsed
#  3.944   0.228   4.181
system.time(res3 <- ave(num1, num1, FUN=length))
#   user  system elapsed
#  3.740   0.264   4.012
identical(res1, res2)
#[1] TRUE
identical(res2, res3)
#[1] TRUE
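
For comparison, group sizes can also be computed without ave(), using match() and tabulate(). This is only a sketch: the helper name count_by_match is hypothetical and not from this thread, it has not been timed here, and it treats NA as a group of its own, which differs from ave(). It is checked against the ave() result on a smaller vector than the one timed above.

```r
# Hypothetical helper (not from the thread): per-element group sizes
# computed with match() and tabulate() instead of ave().
count_by_match <- function(g) {
  idx <- match(g, unique(g))  # position of each element's group among the unique values
  tabulate(idx)[idx]          # size of each group, expanded back to one value per element
}

set.seed(195)
num_small <- sample(1e3, 1e5, replace = TRUE)  # smaller than num1 above, to keep the check quick

res_ave   <- ave(integer(length(num_small)), num_small, FUN = length)
res_match <- count_by_match(num_small)

# Both approaches should give the same integer vector of group sizes.
stopifnot(identical(res_ave, res_match))
```

Whether this is actually faster depends on the data and the R version, so it is worth running system.time() on both before switching.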

A.K. 

On Thursday, October 17, 2013 4:34 PM, William Dunlap <wdunlap at tibco.com> wrote:
  May I ask why:
    count_by_class <- with(dat, ave(numeric(length(basel_asset_class)), basel_asset_class, FUN=length))
  should not be more simply done as:
    count_by_class <- with(dat, ave(basel_asset_class, basel_asset_class, FUN=length))

The way I did it would work if basel_asset_class were non-numeric.
In ave(x, group, FUN=FUN), FUN's return value should be the same type as x (or
you can get some odd type conversions).  E.g.,

   > num <- c(2,3,2,2) ;  char <- c("Two","Three","Two","Two")
   > ave(num, num, FUN=length) # good
   [1] 3 1 3 3
   > ave(char, char, FUN=length) # bad
   [1] "3" "1" "3" "3"
   > fac <- factor(char, levels=c("One","Two","Three"))
   > ave(fac, fac, FUN=length)
   [1] <NA> <NA> <NA> <NA>
   Levels: One Two Three
   Warning messages:
   1: In `[<-.factor`(`*tmp*`, i, value = 0L) :
     invalid factor level, NA generated
   2: In `[<-.factor`(`*tmp*`, i, value = 3L) :
     invalid factor level, NA generated
   3: In `[<-.factor`(`*tmp*`, i, value = 1L) :
     invalid factor level, NA generated
but x=integer(length(group)) works in all cases:
   > ave(integer(length(fac)), fac, FUN=length)
   [1] 3 1 3 3
   > ave(integer(length(char)), char, FUN=length)
   [1] 3 1 3 3
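
The rule above (give ave() an x whose type matches what FUN returns) can be wrapped in a small helper. This is a minimal sketch; the name count_by_group is hypothetical and does not come from any package.

```r
# Hypothetical helper: number of elements in each element's group.
# x is an integer vector because length() returns an integer, so ave()
# never has to coerce the counts into the type of the grouping variable.
count_by_group <- function(group) {
  ave(integer(length(group)), group, FUN = length)
}

num  <- c(2, 3, 2, 2)
char <- c("Two", "Three", "Two", "Two")
fac  <- factor(char, levels = c("One", "Two", "Three"))

# The same integer counts come back whatever the type of the grouping variable.
stopifnot(
  identical(count_by_group(num),  c(3L, 1L, 3L, 3L)),
  identical(count_by_group(char), c(3L, 1L, 3L, 3L)),
  identical(count_by_group(fac),  c(3L, 1L, 3L, 3L))
)
```

Because the grouping variable is used only for grouping and never as x, an unused factor level such as "One" does no harm.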

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

From: Bert Gunter [mailto:gunter.berton at gene.com]
Sent: Thursday, October 17, 2013 1:06 PM
To: William Dunlap
Cc: Katherine Gobin; r-help at r-project.org
Subject: Re: [R] Subseting a data.frame

May I ask why:

count_by_class <- with(dat, ave(numeric(length(basel_asset_class)), basel_asset_class, FUN=length))
should not be more simply done as:

count_by_class <- with(dat, ave(basel_asset_class, basel_asset_class, FUN=length))

?
-- Bert

On Thu, Oct 17, 2013 at 12:36 PM, William Dunlap <wdunlap at tibco.com> wrote:
> What I need is to select only those records for which there are more than two default
> frequencies (defa_frequency),

Here is one way.  There are many others:
   > dat <- data.frame( # slightly less trivial example
        basel_asset_class=c(4,8,8,8,74,3,74),
        defa_frequency=(1:7)/8)
   > count_by_class <- with(dat, ave(numeric(length(basel_asset_class)), basel_asset_class, FUN=length))
   > cbind(dat, count_by_class) # see what we just computed
     basel_asset_class defa_frequency count_by_class
   1                 4          0.125              1
   2                 8          0.250              3
   3                 8          0.375              3
   4                 8          0.500              3
   5                74          0.625              2
   6                 3          0.750              1
   7                74          0.875              2
   > dat[count_by_class>1, ] # I think this is what you are asking for
     basel_asset_class defa_frequency
   2                 8          0.250
   3                 8          0.375
   4                 8          0.500
   5                74          0.625
   7                74          0.875
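
As one of the "many others", the same subset can be obtained with table() and %in%. This is a sketch only: the names min_n, class_counts, keep_classes and dat_kept are hypothetical, and the threshold of 2 is an assumption taken from the "at least two values" requirement in the question.

```r
dat <- data.frame(
  basel_asset_class = c(4, 8, 8, 8, 74, 3, 74),
  defa_frequency    = (1:7) / 8
)

min_n <- 2  # assumed threshold: keep classes with at least this many records

# Count the records per class, then keep the labels of the classes that qualify.
class_counts <- table(dat$basel_asset_class)
keep_classes <- names(class_counts)[class_counts >= min_n]

# table() labels its cells with character names, so compare as character.
dat_kept <- dat[as.character(dat$basel_asset_class) %in% keep_classes, ]

# Same rows as the ave()-based selection: classes 8 and 74 are kept.
stopifnot(identical(rownames(dat_kept), c("2", "3", "4", "5", "7")))
```

The ave() version keeps the counts aligned with the rows, which is handy if you also want them as a column; the table() version is convenient when you only need the list of qualifying classes.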

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Katherine Gobin
> Sent: Thursday, October 17, 2013 11:05 AM
> To: Bert Gunter
> Cc: r-help at r-project.org
> Subject: Re: [R] Subseting a data.frame
>
> Correction (2nd paragraph, first three lines).
>
> Please read the following lines
>
> What I need is to select only those records for which there are more than two default
> frequencies (defa_frequency), Thus, there is only one default frequency = 0.150 w.r.t
> basel_asset_class = 4 whereas there are default frequencies w.r.t. basel asset class 4,
>
>
> as
>
> What I need is to select only those records for which there are more than two default
> frequencies (defa_frequency), Thus, there is only one default frequency = 0.150 w.r.t
> basel_asset_class = 4 whereas there are THREE default frequencies w.r.t. basel asset
> class 8,
>
>
>
> I apologize for the inconvenience.
>
> Regards
>
> Katherine
>
>
> On , Katherine Gobin <katherine_gobin at yahoo.com> wrote:
>
> I am sorry; perhaps I was not able to put the question properly. I am not looking for the
> subset of the data.frame where the basel_asset_class is > 2. I do agree that would have
> been a basic requirement. Let me try to put the question again.
>
> I have a data frame as
>
> mydat = data.frame(basel_asset_class = c(4, 8, 8 ,8), defa_frequency = c(0.15, 0.07, 0.03,
> 0.001))
>
> # Please note I have changed the basel_asset_class to 4 from 2, to avoid confusion.
>
> > mydat
>   basel_asset_class defa_frequency
> 1                 4          0.150
> 2                 8          0.070
> 3                 8          0.030
> 4                 8          0.001
>
>
>
> This is just a representative example. In reality, I may have any number of basel asset
> classes. 4, 8 etc. are the IDs and can be anything, thus I can't hard-code it as subset(mydat,
> mydat$basel_asset_class > 2).
>
>
> What I need is to select only those records for which there are more than two default
> frequencies (defa_frequency), Thus, there is only one default frequency = 0.150 w.r.t
> basel_asset_class = 4 whereas there are default frequencies w.r.t. basel asset class 4,
> similarly there could be another basel asset class having say 5 default frequencies. Thus, I
> need to take the subset of the data.frame s.t. the number of corresponding defa_frequencies is
> greater than 2.
>
> The idea is that we try to fit an exponential curve Y = A exp( BX ) for each of the basel asset
> classes, and to estimate the values of A and B, mathematically one needs to have at least two
> values of X.
>
> I hope I have been able to express my requirement. It's not that I need the subset of mydat
> s.t. basel asset class is > 2 (now 4 in the revised example), but the subset s.t. the number of default
> frequencies is greater than or equal to 2. This 2 is not the same as basel asset class 2.
>
> Kindly guide
>
> With warm regards
>
> Katherine Gobin
>
>
>
>
> On Thursday, 17 October 2013 9:33 PM, Bert Gunter <gunter.berton at gene.com> wrote:
>
> "Kindly guide" ...
>
> This is a very basic question, so the kindest guide I can give is to read an Introduction to R
> (ships with R) or a R web tutorial of your choice so that you can learn how R works
> instead of posting to this list.
>
> Cheers,
> Bert
>
>
>
>
> On Wed, Oct 16, 2013 at 11:55 PM, Katherine Gobin <katherine_gobin at yahoo.com> wrote:
>
> Dear Forum,
> >
> >I have a data frame as
> >
> >mydat = data.frame(basel_asset_class = c(2, 8, 8 ,8), defa_frequency = c(0.15, 0.07,
> 0.03, 0.001))
> >
> >> mydat
> >  basel_asset_class defa_frequency
> >1                 2          0.150
> >2                 8          0.070
> >3                 8          0.030
> >4                 8          0.001
> >
> >
> >I need to get the subset of this data.frame where no of records for the given
> basel_asset_class is > 2, i.e. I need to obtain subset of above data.frame as (since there
> is only 1 record, against basel_asset_class = 2, I want to filter it)
> >
> >> mydat_a
> >  basel_asset_class defa_frequency
> >1                 8          0.070
> >2                 8          0.030
> >3                 8          0.001
> >
> >Kindly guide
> >
> >Katherine
> >
> >
> >______________________________________________
> >R-help at r-project.org mailing list
> >https://stat.ethz.ch/mailman/listinfo/r-help
> >PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >and provide commented, minimal, self-contained, reproducible code.
> >
> >
>
>
> --
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> (650) 467-7374
