[R] anyone know why package "RandomForest" na.roughfix is so slow??
Hadley Wickham
hadley at rice.edu
Fri Jul 2 02:07:54 CEST 2010
Here's another version that's a bit easier to read:
na.roughfix2 <- function(object, ...) {
  res <- lapply(object, roughfix)
  structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
}

roughfix <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)

  if (is.numeric(x)) {
    x[missing] <- median.default(x[!missing])
  } else if (is.factor(x)) {
    freq <- table(x)
    x[missing] <- names(freq)[which.max(freq)]
  } else {
    stop("na.roughfix only works for numeric or factor")
  }
  x
}
I'm cheating a bit because as.data.frame is so slow.
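
A minimal usage sketch of the two functions above, assuming a small made-up data frame `df` (the column names `num` and `fac` are hypothetical). The definitions are repeated so the block runs on its own, and `median()` stands in for `median.default()` purely for illustration:

```r
# Column-wise fill: median for numeric columns, most frequent level for factors.
roughfix <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)

  if (is.numeric(x)) {
    x[missing] <- median(x[!missing])
  } else if (is.factor(x)) {
    freq <- table(x)
    x[missing] <- names(freq)[which.max(freq)]
  } else {
    stop("na.roughfix only works for numeric or factor")
  }
  x
}

# Rebuild the data frame with structure() instead of as.data.frame().
# Note that this replaces the original row names with 1..n.
na.roughfix2 <- function(object, ...) {
  res <- lapply(object, roughfix)
  structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
}

# Hypothetical example data: one numeric and one factor column, each with an NA.
df <- data.frame(
  num = c(1, NA, 3, 10),
  fac = factor(c("a", "b", NA, "b"))
)

fixed <- na.roughfix2(df)
print(fixed)
```

Because the result is assembled with `structure()`, any non-default row names on the input are lost, which may or may not matter for a given data set.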
Hadley
On Thu, Jul 1, 2010 at 6:44 PM, Mike Williamson <this.is.mvw at gmail.com> wrote:
> Jim, Andy,
>
> Thanks for your suggestions!
>
> I found some time today to futz around with it, and I found a "home
> made" script to fill in NA values to be much quicker. For those who are
> interested, instead of using:
>
> dataSet <- na.roughfix(dataSet)
>
>
>
> I used:
>
>   origCols <- names(dataSet)
>   ## Fix numeric values...
>   dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
>       if (!is.numeric(x)) { x } else {
>           ifelse(is.na(x), median(x, na.rm=TRUE), x) } }),
>       row.names=row.names(dataSet))
>   ## Fix factors...
>   dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
>       if (!is.factor(x)) { x } else {
>           ## which.max(table(x)) is the index of the most frequent level
>           levels(x)[ifelse(!is.na(x), x, which.max(table(x)))] } }),
>       row.names=row.names(dataSet))
>   names(dataSet) <- origCols
>
>
>
> In one case study that I ran, the na.roughfix() algo took 296 seconds
> whereas the homemade one above took 16 seconds.
>
> Regards,
> Mike
>
>
>
> "Telescopes and bathyscaphes and sonar probes of Scottish lakes,
> Tacoma Narrows bridge collapse explained with abstract phase-space maps,
> Some x-ray slides, a music score, Minard's Napoleonic war:
> The most exciting frontier is charting what's already here."
> -- xkcd
>
> --
> Help protect Wikipedia. Donate now:
> http://wikimediafoundation.org/wiki/Support_Wikipedia/en
>
>
> On Thu, Jul 1, 2010 at 10:05 AM, Liaw, Andy <andy_liaw at merck.com> wrote:
>
>> You need to isolate the problem further, or give more detail about your
>> data. This is what I get:
>>
>> R> nr <- 2134
>> R> nc <- 14037
>> R> x <- matrix(runif(nr*nc), nr, nc)
>> R> n.na <- round(nr*nc/10)
>> R> x[sample(nr*nc, n.na)] <- NA
>> R> system.time(x.fixed <- na.roughfix(x))
>>    user  system elapsed
>>    8.44    0.39    8.85
>> R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB
>> ram.
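
The simulation above fills a numeric matrix, whereas the timings reported elsewhere in the thread come from a call on `dataSet`, which appears to be a data frame. A rough sketch of the same experiment on a data frame, at a much smaller size so it runs quickly, might look like the following. The helper `fill_median` and the reduced dimensions are assumptions for illustration, and the comparison against `na.roughfix()` only runs if randomForest is installed:

```r
set.seed(1)

# Smaller than the 2134 x 14037 case in the thread, to keep the run short.
nr <- 200
nc <- 300
x <- matrix(runif(nr * nc), nr, nc)
x[sample(nr * nc, round(nr * nc / 10))] <- NA
df <- as.data.frame(x)

# Hypothetical helper: replace NAs in one numeric column with its median.
fill_median <- function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}

# Baseline: column-wise fill with lapply().
t_home <- system.time(df_fixed <- as.data.frame(lapply(df, fill_median)))
print(t_home)

# Optional comparison, skipped when the package is not available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  t_rf <- system.time(df_rf <- randomForest::na.roughfix(df))
  print(t_rf)
  print(all.equal(df_fixed, df_rf, check.attributes = FALSE))
}
```

Increasing `nc` step by step and recording the elapsed times for both calls is one way to see whether the slowdown depends on the number of columns.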
>>
>> Andy
>>
>> ------------------------------
>> *From:* Mike Williamson [mailto:this.is.mvw at gmail.com]
>> *Sent:* Thursday, July 01, 2010 12:48 PM
>> *To:* Liaw, Andy
>> *Cc:* r-help
>> *Subject:* Re: [R] anyone know why package "RandomForest" na.roughfix is
>> so slow??
>>
>> Andy,
>>
>> You're right, I didn't supply any code, because my call was very simple
>> and it was the call itself in question. However, here is the associated
>> code I am using:
>>
>>
>> naFixTime <- system.time( {
>>     if (fltrResponse) {
>>         ## TRUE: there are no NA's in the response... cleared via earlier steps
>>         message(paste(iAm, ": Missing values will now be imputed...\n", sep=""))
>>         try( dataSet <- rfImpute(dataSet[,!is.element(names(dataSet), response)],
>>                                  dataSet[,response]) )
>>     } else {
>>         ## In this case, there is no "response" column in the data set
>>         message(paste(iAm, ": Missing values will now be filled in with median",
>>                       " values or most frequent levels", sep=""))
>>         try( dataSet <- na.roughfix(dataSet) )
>>     }
>> } )
>>
>>
>>
>> As you can see, the "na.roughfix" call is made as simply as possible:
>> I supply the entire dataSet (only parameters, no responses). I am not doing
>> the prediction here (that is done later, and the prediction itself is not
>> taking very long).
>> Here are some calculation times that I experienced:
>>
>>    # rows    # cols    time to run na.roughfix
>>    =======   =======   =======================
>>    2046      2833      ~ 2 minutes
>>    2066      5626      ~ 6 minutes
>>    2134      14037     ~ 30 minutes
>>
>> These numbers are on a Windows server using the 64-bit version of 'R'.
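
The three timings above can be turned into a rough scaling estimate. The sketch below is only back-of-the-envelope arithmetic on the approximate figures quoted in the table (the data frame `timings` and its column names are made up for illustration). It computes seconds per column and fits a log-log slope of run time against column count:

```r
# Approximate figures copied from the table above.
timings <- data.frame(
  rows    = c(2046, 2066, 2134),
  cols    = c(2833, 5626, 14037),
  minutes = c(2, 6, 30)
)

# Seconds spent per column; a constant value would mean linear scaling.
timings$sec_per_col <- timings$minutes * 60 / timings$cols
print(timings)

# Slope of log(time) against log(columns): about 1 means linear,
# larger values mean the cost per column grows with the column count.
fit <- lm(log(minutes) ~ log(cols), data = timings)
slope <- coef(fit)[["log(cols)"]]
print(slope)
```

With only three rounded measurements this is an indication at best, but the per-column cost rises with each row of the table and the fitted exponent comes out above 1, which is consistent with the cost growing faster than the number of columns.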
>>
>> Regards,
>> Mike
>>
>>
>> On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_liaw at merck.com> wrote:
>>
>>> You have not shown any code on exactly how you use na.roughfix(), so I
>>> can only guess.
>>>
>>> If you are doing something like:
>>>
>>> randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)
>>>
>>> I would not be surprised that it's taking very long on large datasets.
>>> Most likely it's caused by the formula interface, not na.roughfix()
>>> itself.
>>>
>>> If that is your case, try doing the imputation beforehand and run
>>> randomForest() afterward; e.g.,
>>>
>>> myroughfixed <- na.roughfix(mybigdata)
>>> randomForest(myroughfixed[list.of.predictor.columns],
>>> myroughfixed[[myresponse]],...)
>>>
>>> HTH,
>>> Andy
>>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>>> On Behalf Of Mike Williamson
>>> Sent: Wednesday, June 30, 2010 7:53 PM
>>> To: r-help
>>> Subject: [R] anyone know why package "RandomForest" na.roughfix is so
>>> slow??
>>>
>>> Hi all,
>>>
>>> I am using the package "randomForest" for random forest predictions. I
>>> like the package. However, I have fairly large data sets, and it can often
>>> take *hours* just to go through the "na.roughfix" call, which simply goes
>>> through and replaces any NA values with either the median (numerical data)
>>> or the most frequent level (factors).
>>> I am going to start doing some comparisons between na.roughfix() and
>>> some apply() functions which, it seems, are able to do the same job more
>>> quickly. But I hesitate to duplicate a function that is already in the
>>> package, since I presume that na.roughfix should be as quick as possible
>>> and should also be well "tailored" to the requirements of random forest.
>>>
>>> Has anyone else seen that this is really slow? (I haven't noticed
>>> rfImpute to be nearly as slow, but I cannot say for sure: my "predict"
>>> data sets are MUCH larger than my model data sets, so cleaning the
>>> prediction data set simply takes much longer.)
>>> If so, any ideas how to speed this up?
>>>
>>> Thanks!
>>> Mike
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/