[R] drop rare factors

William Dunlap wdunlap at tibco.com
Thu Jan 19 22:11:33 CET 2012


> That's the only thing I see, *except* that df() and drop() are base functions,
> so you shouldn't use those as variable names.

I don't think that is much of a problem.  The local
versions will be used in the function.
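
A minimal sketch of why this is safe (the function f and its arguments are made up for illustration): when a name appears in call position, R skips bindings that are not functions, so base::drop() is still found even though a local variable is also called drop. The one caveat is that a local variable that itself holds a function would mask the base one.

```r
# Hypothetical function: 'df' and 'drop' are ordinary local values here.
f <- function(df, drop) {
  n <- nrow(df)
  # In call position R looks for a *function* named drop, so it skips
  # the local character value and finds base::drop().
  v <- drop(matrix(seq_len(n), ncol = 1))
  list(n = n, v = v, drop = drop)
}

out <- f(data.frame(x = 1:3), drop = "C")
str(out)
```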

A bigger problem is naming your function 'drop.levels'.
There is a core R function called 'droplevels' that drops
unused levels from factors.  I would hate to have to
remember the difference between the dotted and dotless
versions.
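
One possible way to avoid the clash, as a sketch only: give the function a clearly different name (drop_rare_levels below is a made-up name, not an existing R function) and let droplevels() do the final clean-up of the unused levels. The printing from the original is left out to keep it short, and the column is extracted with [[ ]] as Sarah suggested.

```r
# Hypothetical name, chosen so it cannot be confused with base::droplevels()
drop_rare_levels <- function(data, column, threshold) {
  counts <- table(data[[column]])
  # As in the original, a threshold below 1 is a fraction of the row count
  if (threshold < 1) threshold <- threshold * nrow(data)
  keep <- names(counts)[counts > threshold]
  # Keep only rows whose level is frequent enough
  data <- data[data[[column]] %in% keep, , drop = FALSE]
  # Remove the now-unused levels from the factor
  data[[column]] <- droplevels(data[[column]])
  data
}

mydata <- data.frame(
  MyFactor = factor(rep(LETTERS[1:4], times = c(1000, 2000, 30, 4))),
  something = runif(3034)
)
res <- drop_rare_levels(mydata, "MyFactor", 5)
str(res)
```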

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Sarah Goslee
> Sent: Thursday, January 19, 2012 1:01 PM
> To: sds at gnu.org; Sarah Goslee; r-help at r-project.org
> Subject: Re: [R] drop rare factors
> 
> Everywhere that you use
> df[column]
> 
> should be
> 
> df[[column]]
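
A small sketch of the difference, using the data from the original post (the variable names one_col_df, the_factor, bad and good are only for illustration): df[column] is a one-column data frame, so %in% returns a single value for the whole column, while df[[column]] is the factor itself and gives one value per row.

```r
mydata <- data.frame(
  MyFactor = factor(rep(LETTERS[1:4], times = c(1000, 2000, 30, 4))),
  something = runif(3034)
)
keep <- c("A", "B", "C")

one_col_df <- mydata["MyFactor"]    # still a data frame (a list of length 1)
the_factor <- mydata[["MyFactor"]]  # the factor column itself

bad  <- one_col_df %in% keep   # one logical value for the whole column
good <- the_factor %in% keep   # one logical value per row

length(bad)
length(good)
nrow(mydata[good, ])
```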
> 
> That's the only thing I see, *except* that df() and drop() are base functions,
> so you shouldn't use those as variable names.
> 
> >> Remind the list what you're trying to do. The list gets lots of traffic;
> >> if you delete out all the context nobody will remember what you need.
> >
> > Sorry, I assumed that people can easily access the parent messages.
> 
> I at least don't save the entire R-help archive in my inbox. And if you're
> asking folks for help, why not make it easy for them?
> 
> Next time, please follow the posting guide and include context.
> 
> Sarah
> 
> 
> On Thu, Jan 19, 2012 at 3:43 PM, Sam Steingold <sds at gnu.org> wrote:
> > create data:
> >
> > mydata <- data.frame(MyFactor = factor(rep(LETTERS[1:4], times=c(1000, 2000, 30, 4))), something = runif(3034))
> >
> > define function:
> >
> > drop.levels <- function (df, column, threshold) {
> >  size <- nrow(df)
> >  if (threshold < 1) threshold <- threshold * size
> >  tab <- table(df[column])
> >  keep <- names(tab)[tab >  threshold]
> >  drop <- names(tab)[tab <= threshold]
> >  cat("Keep(",column,")",length(keep),"\n"); print(tab[keep])
> >  cat("Drop(",column,")",length(drop),"\n"); print(tab[drop])
> >  str(df)
> >  df <- df[df[column] %in% keep, ]
> >  str(df)
> >  size1 <- nrow(df)
> >  cat("Rows:",size,"-->",size1,"(dropped",100*(size-size1)/size,"%)\n")
> >  df[column] <- factor(df[column], levels=keep)
> >  df
> > }
> >
> > call the function on the data:
> >
> > drop.levels(mydata,"MyFactor",5)
> > Keep( MyFactor ) 3
> >
> >   A    B    C
> > 1000 2000   30
> > Drop( MyFactor ) 1
> > D
> > 4
> > 'data.frame':   3034 obs. of  2 variables:
> >  $ MyFactor : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
> >  $ something: num  0.725 0.741 0.608 0.681 0.993 ...
> > 'data.frame':   0 obs. of  2 variables:
> >  $ MyFactor : Factor w/ 4 levels "A","B","C","D":
> >  $ something: num
> > Rows: 3034 --> 0 (dropped 100 %)
> > Error in `[<-.data.frame`(`*tmp*`, column, value = NA_integer_) :
> >  replacement has 1 rows, data has 0
> >
> > ----- why is there a blank line between "Keep( MyFactor ) 3" and "A    B    C"
> >  but no blank line between "Drop" and "D"?
> >
> > ----- why does "df[df[column] %in% keep, ]" empty out the data frame?
> >
> > thanks!
> >
> >
> >
> 
> 
> --
> Sarah Goslee
> http://www.functionaldiversity.org
> 


