[R] drop rare factors

Sarah Goslee sarah.goslee at gmail.com
Thu Jan 19 22:00:34 CET 2012


Everywhere that you use
df[column]

it should be

df[[column]]
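
To illustrate the difference, here is a small sketch based on the example data from the original message (the names d, one and per_row are made up for this illustration). Single brackets return a one-column data frame, which %in% treats as a list with a single element, so the test yields one FALSE and the row selection comes back empty. Double brackets return the factor itself, so there is one test result per row.

```r
# Same factor as in the original example; 'd' is just an illustrative name
d <- data.frame(MyFactor = factor(rep(LETTERS[1:4], times = c(1000, 2000, 30, 4))))
keep <- c("A", "B", "C")

class(d["MyFactor"])     # a one-column data frame
class(d[["MyFactor"]])   # the factor itself

# With [ ], %in% sees a list of length 1, so the result has length 1
one <- d["MyFactor"] %in% keep
length(one)

# With [[ ]], there is one TRUE/FALSE per row
per_row <- d[["MyFactor"]] %in% keep
length(per_row)
sum(per_row)             # number of rows whose level is in 'keep'
```

The same thing happens in the factor() call: given a one-column data frame instead of a vector, it produces a single value, which is why the replacement "has 1 rows".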

That's the only thing I see, *except* that df() and drop() are base functions,
so you shouldn't use those as variable names.
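
Putting both points together, the function might look something like the sketch below. This is only one way to write it: the names dat and rare are my own substitutes for df and drop, and I have left out the str() calls, which were only there for debugging.

```r
# Sketch of a corrected drop.levels(); 'dat' and 'rare' are substitute names
# chosen here so that the base functions df() and drop() are not masked.
drop.levels <- function(dat, column, threshold) {
  size <- nrow(dat)
  # A threshold below 1 is treated as a fraction of the number of rows
  if (threshold < 1) threshold <- threshold * size
  tab <- table(dat[[column]])
  keep <- names(tab)[tab > threshold]
  rare <- names(tab)[tab <= threshold]
  cat("Keep(", column, ")", length(keep), "\n"); print(tab[keep])
  cat("Drop(", column, ")", length(rare), "\n"); print(tab[rare])
  # [[ ]] extracts the column as a vector, so %in% tests every row
  dat <- dat[dat[[column]] %in% keep, ]
  size1 <- nrow(dat)
  cat("Rows:", size, "-->", size1, "(dropped", 100 * (size - size1) / size, "%)\n")
  # Rebuild the factor so the dropped levels disappear from levels()
  dat[[column]] <- factor(dat[[column]], levels = keep)
  dat
}

mydata <- data.frame(MyFactor = factor(rep(LETTERS[1:4], times = c(1000, 2000, 30, 4))),
                     something = runif(3034))
res <- drop.levels(mydata, "MyFactor", 5)
nrow(res)              # 3030: the four "D" rows are gone
levels(res$MyFactor)   # "A" "B" "C"
```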

>> Remind the list what you're trying to do. The list gets lots of traffic;
>> if you delete out all the context nobody will remember what you need.
>
> Sorry, I assumed that people can easily access the parent messages.

I at least don't save the entire R-help archive in my inbox. And if
you're asking folks for help, why not make it easy for them?

Next time, please follow the posting guide and include context.

Sarah


On Thu, Jan 19, 2012 at 3:43 PM, Sam Steingold <sds at gnu.org> wrote:
> create data:
>
> mydata <- data.frame(MyFactor = factor(rep(LETTERS[1:4], times=c(1000, 2000, 30, 4))), something = runif(3034))
>
> define function:
>
> drop.levels <- function (df, column, threshold) {
>  size <- nrow(df)
>  if (threshold < 1) threshold <- threshold * size
>  tab <- table(df[column])
>  keep <- names(tab)[tab >  threshold]
>  drop <- names(tab)[tab <= threshold]
>  cat("Keep(",column,")",length(keep),"\n"); print(tab[keep])
>  cat("Drop(",column,")",length(drop),"\n"); print(tab[drop])
>  str(df)
>  df <- df[df[column] %in% keep, ]
>  str(df)
>  size1 <- nrow(df)
>  cat("Rows:",size,"-->",size1,"(dropped",100*(size-size1)/size,"%)\n")
>  df[column] <- factor(df[column], levels=keep)
>  df
> }
>
> call the function on the data:
>
> drop.levels(mydata,"MyFactor",5)
> Keep( MyFactor ) 3
>
>   A    B    C
> 1000 2000   30
> Drop( MyFactor ) 1
> D
> 4
> 'data.frame':   3034 obs. of  2 variables:
>  $ MyFactor : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
>  $ something: num  0.725 0.741 0.608 0.681 0.993 ...
> 'data.frame':   0 obs. of  2 variables:
>  $ MyFactor : Factor w/ 4 levels "A","B","C","D":
>  $ something: num
> Rows: 3034 --> 0 (dropped 100 %)
> Error in `[<-.data.frame`(`*tmp*`, column, value = NA_integer_) :
>  replacement has 1 rows, data has 0
>
> ----- why is there a blank line between "Keep( MyFactor ) 3" and "A    B    C"
>  but no blank line between "Drop" and "D"?
>
> ----- why does "df[df[column] %in% keep, ]" empty out the data frame?
>
> thanks!
>


-- 
Sarah Goslee
http://www.functionaldiversity.org


