[R] anyone know why package "RandomForest" na.roughfix is so slow??

Fri Jul 2 02:07:54 CEST 2010

Here's another version that's a bit easier to read:

na.roughfix2 <- function (object, ...) {
  res <- lapply(object, roughfix)
  structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
}

roughfix <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)

  if (is.numeric(x)) {
    x[missing] <- median.default(x[!missing])
  } else if (is.factor(x)) {
    freq <- table(x)
    x[missing] <- names(freq)[which.max(freq)]
  } else {
    stop("na.roughfix only works for numeric or factor")
  }
  x
}
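
A quick sanity check of the two functions above on a toy data frame. The data and the names `df` and `fixed` are made up for illustration and are not from the thread; the definitions are repeated only so the snippet runs on its own.

```r
# Definitions repeated from the message above so this runs standalone.
roughfix <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)

  if (is.numeric(x)) {
    x[missing] <- median.default(x[!missing])
  } else if (is.factor(x)) {
    freq <- table(x)
    x[missing] <- names(freq)[which.max(freq)]
  } else {
    stop("na.roughfix only works for numeric or factor")
  }
  x
}

na.roughfix2 <- function(object, ...) {
  res <- lapply(object, roughfix)
  structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
}

# Hypothetical toy data: one numeric and one factor column, each with one NA.
df <- data.frame(
  num = c(1, 2, NA, 10),
  fac = factor(c("a", "b", "b", NA))
)

fixed <- na.roughfix2(df)
fixed$num  # the NA is replaced by the median of the observed values
fixed$fac  # the NA is replaced by the most frequent level
```

Any other column type (character, logical, ...) that contains an NA hits the stop() branch, so such columns would need converting first.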

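I'm cheating a bit by setting the class and row names directly with
structure(), because as.data.frame is so slow. Note that this resets the
row names to 1..n.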
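
To illustrate the shortcut: for a list of equal-length columns, attaching the class and row names by hand should produce the same kind of object that as.data.frame() builds, without going through its checks. A minimal sketch with made-up columns (`cols`, `fast` and `slow` are hypothetical names); it is only safe when every element already has the same length, since nothing verifies that.

```r
# Hypothetical list of equal-length columns.
cols <- list(a = c(1, 2, 3), b = factor(c("x", "y", "x")))

# The shortcut: set the class and row names directly.
fast <- structure(cols, class = "data.frame", row.names = seq_len(3))

# The checked route.
slow <- as.data.frame(cols)

is.data.frame(fast)
all.equal(fast, slow)
```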

Hadley

On Thu, Jul 1, 2010 at 6:44 PM, Mike Williamson <this.is.mvw at gmail.com> wrote:
> Jim, Andy,
>
>    Thanks for your suggestions!
>
>    I found some time today to futz around with it, and I found a homemade
> script for filling in NA values to be much quicker.  For those who are
> interested, instead of using:
>
>          dataSet <- na.roughfix(dataSet)
>
>
>
>    I used:
>
>          origCols <- names(dataSet)
>          ## Fix numeric values: replace each NA with the column median
>          dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
>              if(!is.numeric(x)) { x } else {
>                  ifelse(is.na(x), median(x, na.rm=TRUE), x) } } ),
>              row.names=row.names(dataSet) )
>          ## Fix factors: replace each NA with the most frequent level.
>          ## which.max(table(x)) is the index of that level; the original
>          ## table(max(table(x))) always evaluated to 1 (the first level).
>          dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
>              if(!is.factor(x)) { x } else {
>                  factor(levels(x)[ifelse(!is.na(x), x, which.max(table(x)))],
>                         levels=levels(x)) } } ),
>              row.names=row.names(dataSet) )
>          names(dataSet) <- origCols
>
>
>
>    In one case study that I ran, the na.roughfix() algo took 296 seconds
> whereas the homemade one above took 16 seconds.
>
>                                      Regards,
>                                            Mike
>
>
>
> "Telescopes and bathyscaphes and sonar probes of Scottish lakes,
> Tacoma Narrows bridge collapse explained with abstract phase-space maps,
> Some x-ray slides, a music score, Minard's Napoleonic war:
> The most exciting frontier is charting what's already here."
>  -- xkcd
>
> --
> Help protect Wikipedia. Donate now:
> http://wikimediafoundation.org/wiki/Support_Wikipedia/en
>
>
> On Thu, Jul 1, 2010 at 10:05 AM, Liaw, Andy <andy_liaw at merck.com> wrote:
>
>>  You need to isolate the problem further, or give more detail about your
>> data.  This is what I get:
>>
>> R> nr <- 2134
>> R> nc <- 14037
>> R> x <- matrix(runif(nr*nc), nr, nc)
>> R> n.na <- round(nr*nc/10)
>> R> x[sample(nr*nc, n.na)] <- NA
>> R> system.time(x.fixed <- na.roughfix(x))
>>    user  system elapsed
>>    8.44    0.39    8.85
>> R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB
>> RAM.
>>
>> Andy
>>
>>  ------------------------------
>> *From:* Mike Williamson [mailto:this.is.mvw at gmail.com]
>> *Sent:* Thursday, July 01, 2010 12:48 PM
>> *To:* Liaw, Andy
>> *Cc:* r-help
>> *Subject:* Re: [R] anyone know why package "RandomForest" na.roughfix is
>> so slow??
>>
>> Andy,
>>
>>     You're right, I didn't supply any code, because my call was very simple
>> and it was the call itself that was in question.  However, here is the
>> associated code I am using:
>>
>>
>>         naFixTime <- system.time( {
>>             if (fltrResponse) {
>>                 ## TRUE: there are no NA's in the response...
>>                 ## cleared via earlier steps
>>                 message(paste(iAm, ": Missing values will now be imputed...\n",
>>                               sep=""))
>>                 try( dataSet <- rfImpute(
>>                          dataSet[,!is.element(names(dataSet), response)],
>>                          dataSet[,response]) )
>>             } else {
>>                 ## In this case, there is no "response" column in the data set
>>                 message(paste(iAm, ": Missing values will now be filled in",
>>                               " with median values or most frequent levels",
>>                               sep=""))
>>                 try( dataSet <- na.roughfix(dataSet) )
>>             }
>>         } )
>>
>>
>>
>>     As you can see, the "na.roughfix" call is made as simply as possible:
>> I supply the entire dataSet (only parameters, no responses).  I am not doing
>> the prediction here (that is done later, and the prediction itself is not
>> taking very long).
>>     Here are some calculation times that I experienced:
>>
>> # rows       # cols       time to run na.roughfix
>> =======     =======     ====================
>>   2046          2833             ~ 2 minutes
>>   2066          5626             ~ 6 minutes
>>   2134         14037             ~ 30 minutes
>>
>>     These numbers are on a Windows server using the 64-bit version of 'R'.
>>
>>                                           Regards,
>>                                                    Mike
>>
>>
>> On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_liaw at merck.com> wrote:
>>
>>> You have not shown any code on exactly how you use na.roughfix(), so I
>>> can only guess.
>>>
>>> If you are doing something like:
>>>
>>>  randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)
>>>
>>> I would not be surprised that it's taking very long on large datasets.
>>> Most likely it's caused by the formula interface, not na.roughfix()
>>> itself.
>>>
>>> If that is your case, try doing the imputation beforehand and run
>>> randomForest() afterward; e.g.,
>>>
>>> myroughfixed <- na.roughfix(mybigdata)
>>> randomForest(myroughfixed[list.of.predictor.columns],
>>> myroughfixed[[myresponse]],...)
>>>
>>> HTH,
>>> Andy
>>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>>> On Behalf Of Mike Williamson
>>> Sent: Wednesday, June 30, 2010 7:53 PM
>>> To: r-help
>>> Subject: [R] anyone know why package "RandomForest" na.roughfix is so
>>> slow??
>>>
>>> Hi all,
>>>
>>>    I am using the package "randomForest" for random forest predictions.  I
>>> like the package.  However, I have fairly large data sets, and it can often
>>> take *hours* just to go through the "na.roughfix" call, which simply goes
>>> through and replaces any NA values with either the median (numerical data)
>>> or the most frequent level (factors).
>>>    I am going to start doing some comparisons between na.roughfix() and
>>> some apply() functions which, it seems, are able to do the same job more
>>> quickly.  But I hesitate to duplicate a function that is already in the
>>> package, since I presume na.roughfix() should be as quick as possible and
>>> should also be well "tailored" to the requirements of random forest.
>>>
>>>    Has anyone else seen that this is really slow?  (I haven't noticed
>>> rfImpute to be nearly as slow, but I cannot say for sure:  my "predict" data
>>> sets are MUCH larger than my model data sets, so cleaning the prediction
>>> data set simply takes much longer.)
>>>    If so, any ideas how to speed this up?
>>>
>>>                              Thanks!
>>>                                   Mike
>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> Notice:  This e-mail message, together with any attachments, contains
>>> information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
>>> New Jersey, USA 08889), and/or its affiliates Direct contact information
>>> for affiliates is available at
>>> http://www.merck.com/contact/contacts.html) that may be confidential,
>>> proprietary copyrighted and/or legally privileged. It is intended solely
>>> for the use of the individual or entity named on this message. If you are
>>> not the intended recipient, and have received this message in error,
>>> please notify us immediately by reply e-mail and then delete it from
>>> your system.
>>>
>>>
>
>

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/