[R] drop rare factors
William Dunlap
wdunlap at tibco.com
Thu Jan 19 22:11:33 CET 2012
> That's the only thing I see, *except* that df() and drop() are base functions,
> so you shouldn't use those as variable names.
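I don't think that is much of a problem. Inside the function
the local variables named df and drop will be used, and a call
such as drop(x) would still find the base function, because R
skips non-function objects when it looks up a function name.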
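As a small illustration of that point, here is a sketch in which
the function name count_rows and its data are made up for the
example. The arguments df and drop shadow the base functions only
as variables; a call to drop() inside the function still reaches
base::drop().

```r
# Hypothetical example: 'df' and 'drop' are argument names here,
# so they hide stats::df() and base::drop() only as variables.
count_rows <- function(df, drop) {
  # Uses the local data frame 'df' and the local vector 'drop'
  kept <- df[!(df$g %in% drop), ]
  # This call still finds base::drop(): when evaluating a call,
  # R ignores objects of that name that are not functions.
  drop(as.matrix(nrow(kept)))
}

d <- data.frame(g = c("a", "b", "a", "c"), x = 1:4)
count_rows(d, drop = "c")
```

The names are still best avoided for readability, but they do not
break anything here.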
A bigger problem is naming your function 'drop.levels'.
There is a core R function called 'droplevels' that drops
unused levels from factors. I would hate to have to
remember the difference between the dotted and dotless
versions.
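For reference, here is a rough sketch of how the function might
look with the [[ fix applied and a name that cannot be confused
with droplevels(). The name drop_rare_levels and the argument name
data are my own choices for illustration, not anything in base R,
and the diagnostic printing is left out.

```r
# Hypothetical name, chosen to stay clear of base::droplevels()
drop_rare_levels <- function(data, column, threshold) {
  size <- nrow(data)
  # A threshold below 1 is treated as a fraction of the row count
  if (threshold < 1) threshold <- threshold * size
  tab <- table(data[[column]])         # [[ ]] extracts the factor itself
  keep <- names(tab)[tab > threshold]  # levels frequent enough to keep
  data <- data[data[[column]] %in% keep, , drop = FALSE]
  data[[column]] <- factor(data[[column]], levels = keep)
  data
}

mydata <- data.frame(
  MyFactor = factor(rep(LETTERS[1:4], times = c(1000, 2000, 30, 4))),
  something = runif(3034)
)

result <- drop_rare_levels(mydata, "MyFactor", 5)
nrow(result)             # 3030
levels(result$MyFactor)  # "A" "B" "C"

# With single brackets the left-hand side is a one-column data frame
# (a list of length 1), so %in% gives one value instead of one per row.
length(mydata["MyFactor"] %in% c("A", "B", "C"))    # 1
length(mydata[["MyFactor"]] %in% c("A", "B", "C"))  # 3034

# The base function only removes levels that are already unused:
levels(droplevels(mydata[mydata$MyFactor != "D", ])$MyFactor)  # "A" "B" "C"
```

The single FALSE from the one-column data frame is recycled over
all rows, which is why the original subsetting returned an empty
data frame.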
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Sarah Goslee
> Sent: Thursday, January 19, 2012 1:01 PM
> To: sds at gnu.org; Sarah Goslee; r-help at r-project.org
> Subject: Re: [R] drop rare factors
>
> Everywhere that you use
> df[column]
>
> should be
>
> df[[column]]
>
> That's the only thing I see, *except* that df() and drop() are base functions,
> so you shouldn't use those as variable names.
>
> >> Remind the list what you're trying to do. The list gets lots of traffic;
> >> if you delete out all the context nobody will remember what you need.
> >
> > Sorry, I assumed that people can easily access the parent messages.
>
> I at least don't save the entire R-help archive in my inbox. And if
> you're asking folks for help, why not make it easy for them?
>
> Next time, please follow the posting guide and include context.
>
> Sarah
>
>
> On Thu, Jan 19, 2012 at 3:43 PM, Sam Steingold <sds at gnu.org> wrote:
> > create data:
> >
> > mydata <- data.frame(MyFactor = factor(rep(LETTERS[1:4], times=c(1000, 2000, 30, 4))),
> >                      something = runif(3034))
> >
> > define function:
> >
> > drop.levels <- function (df, column, threshold) {
> > size <- nrow(df)
> > if (threshold < 1) threshold <- threshold * size
> > tab <- table(df[column])
> > keep <- names(tab)[tab > threshold]
> > drop <- names(tab)[tab <= threshold]
> > cat("Keep(",column,")",length(keep),"\n"); print(tab[keep])
> > cat("Drop(",column,")",length(drop),"\n"); print(tab[drop])
> > str(df)
> > df <- df[df[column] %in% keep, ]
> > str(df)
> > size1 <- nrow(df)
> > cat("Rows:",size,"-->",size1,"(dropped",100*(size-size1)/size,"%)\n")
> > df[column] <- factor(df[column], levels=keep)
> > df
> > }
> >
> > call the function on the data:
> >
> > drop.levels(mydata,"MyFactor",5)
> > Keep( MyFactor ) 3
> >
> > A B C
> > 1000 2000 30
> > Drop( MyFactor ) 1
> > D
> > 4
> > 'data.frame': 3034 obs. of 2 variables:
> > $ MyFactor : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
> > $ something: num 0.725 0.741 0.608 0.681 0.993 ...
> > 'data.frame': 0 obs. of 2 variables:
> > $ MyFactor : Factor w/ 4 levels "A","B","C","D":
> > $ something: num
> > Rows: 3034 --> 0 (dropped 100 %)
> > Error in `[<-.data.frame`(`*tmp*`, column, value = NA_integer_) :
> > replacement has 1 rows, data has 0
> >
> > ----- why is there a blank line between "Keep( MyFactor ) 3" and "A B C"
> > but no blank line between "Drop" and "D"?
> >
> > ----- why does "df[df[column] %in% keep, ]" empty out the data frame?
> >
> > thanks!
> >
> >
> >> Remind the list what you're trying to do. The list gets lots of traffic;
> >> if you delete out all the context nobody will remember what you need.
> >
> > Sorry, I assumed that people can easily access the parent messages.
> >
>
>
> --
> Sarah Goslee
> http://www.functionaldiversity.org
>