[R] Can't seem to finish a randomForest.... Just goes and goes!

David L. Van Brunt, Ph.D. dvanbrunt at well-wired.com
Tue Apr 6 03:58:59 CEST 2004


Removing that first 30-level variable, the trees ran just fine. I had also
taken the shorter categoricals (day of week, for example) and read them in
as numerics.
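
The arithmetic behind the split count in Torsten's reply (quoted below) can be checked in a few lines. This is only a sketch: the helper names are made up for illustration, and whether a given randomForest version really searches every split exhaustively is taken from his recollection, not verified here. The exact count of distinct two-way splits is one less than his 2^(30-1), which does not change the picture.

```r
# Number of distinct binary splits of an unordered factor with k levels:
# each level goes left or right (2^k), mirror images are the same split (/2),
# and the "split" that leaves one side empty is excluded (-1).
n_nominal_splits <- function(k) 2^(k - 1) - 1   # hypothetical helper name

# An ordered factor (or a numeric) with k distinct values has only
# k - 1 candidate cut points.
n_ordered_splits <- function(k) k - 1           # hypothetical helper name

k <- c(7, 30)   # e.g. day of week versus the 30-level variable
print(data.frame(levels  = k,
                 nominal = n_nominal_splits(k),
                 ordered = n_ordered_splits(k)))
```

A 7-level variable such as day of week stays cheap either way, which fits with the forest only stalling once the 30-level variable was included.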

Still working on it. Need that 30 level puppy in there somehow, but it
really is not anything like a rank... It is a nominal variable.
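
One possible way to act on the ordered-factor suggestion quoted below, when the levels have no natural rank, is to order them by the observed proportion of the 0/1 outcome. The sketch below uses made-up toy data (the column names subject and y are placeholders, not the real file's columns) and base R only. It is an assumption that this ordering is "sensible" for the real data, and because it is derived from the outcome it should be computed on the training cases only.

```r
set.seed(1)

# Toy stand-in for the real data: a 30-level nominal ID with 20 repeated
# observations per level, and a 0/1 outcome. All names are hypothetical.
toy <- data.frame(
  subject = factor(rep(sprintf("s%02d", 1:30), each = 20)),
  y       = rbinom(600, 1, 0.4)
)

# Proportion of y == 1 within each level, then the level names sorted
# by that proportion (lowest first).
rate      <- tapply(toy$y, toy$subject, mean)
lvl_order <- names(sort(rate))

# Same labels as before, but declared as an ordered factor.
toy$subject_ord <- factor(toy$subject, levels = lvl_order, ordered = TRUE)

print(is.ordered(toy$subject_ord))
print(nlevels(toy$subject_ord))
```

The labels themselves are unchanged; only the declared order of the levels differs, so the variable can still be read as a subject identifier afterwards.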

With everything read in as numeric values, and only the outcome (last
column) converted to a factor using "as.factor()", it runs fine, and fast.
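
A minimal sketch of that workaround, on made-up data: the column names V1, V2, V3 and V46 only mimic the naming used in the thread, and the values are random. Recoding a nominal factor as integer codes imposes an arbitrary order on its levels, which is exactly the concern raised above. The fit is attempted only if the randomForest package happens to be installed.

```r
set.seed(2)

# Hypothetical toy data; not the real 46-column file.
Mydata <- data.frame(
  V1  = factor(sample(c("Mon", "Tue", "Wed"), 200, replace = TRUE)),
  V2  = rnorm(200),
  V3  = runif(200),
  V46 = rbinom(200, 1, 0.5)
)

# Recode every factor predictor as its integer level code.
is_fac <- vapply(Mydata, is.factor, logical(1))
Mydata[is_fac] <- lapply(Mydata[is_fac], as.integer)

# Only the outcome is a factor, which requests classification trees
# rather than regression.
Mydata$V46 <- as.factor(Mydata$V46)

str(Mydata)

# Fit only if the randomForest package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  Myforest.rf <- randomForest::randomForest(V46 ~ ., data = Mydata,
                                            ntree = 100, mtry = 2)
  print(Myforest.rf)
}
```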

I may be misusing this analysis. That first column is indeed nominal, and I
was including it because the data within that name are repeated observations
of that subject. But I suppose there's no guarantee that that information
would be selected, so what does that do to the forest?  Sigh. I'm not much
of a lumberjack. Logistic regression is more my style, but this is pretty
interesting stuff.

If interested, here's a link to the data:
http://www.well-wired.com/reflibrary/uploads/1081216314.txt

 

On 4/5/04 1:40, "Bill.Venables at csiro.au" <Bill.Venables at csiro.au> wrote:

> Alternatively, if you can arrive at a sensible ordering of the levels
> you can declare them ordered factors and make the computation feasible
> once again.
> 
> Bill Venables.
> 
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Torsten Hothorn
> Sent: Monday, 5 April 2004 4:27 PM
> To: David L. Van Brunt, Ph.D.
> Cc: R-Help
> Subject: Re: [R] Can't seem to finish a randomForest.... Just goes and
> goes!
> 
> 
> On Sun, 4 Apr 2004, David L. Van Brunt, Ph.D. wrote:
> 
>> Playing with randomForest, samples run fine. But on real data, no go.
>> 
>> Here's the setup: OS X, same behavior whether I'm using R-Aqua 1.8.1
>> or the Fink compile-of-my-own with X-11, R version 1.8.1.
>> 
>> This is on OS X 10.3 (aka "Panther"), G4 800 MHz with 512 MB physical
>> RAM.
>> 
>> I have not altered the Startup options of R.
>> 
>> Data set is read in from a text file with "read.table", and has 46
>> variables and 1,855 cases. Trying the following:
>> 
>> The DV is categorical, 0 or 1. Most of the IV's are either continuous,
>> or correctly read in as factors. The largest factor has 30 levels....
>                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> Only the
> 
> This means: there are 2^(30-1) = 536,870,912 possible splits to be
> evaluated every time this variable is picked up (minus something due to
> empty levels). At least the last time I looked at the code, randomForest
> used an exhaustive search over all possible splits. Try reducing the
> number of levels to something reasonable (or for a first shot: remove
> this variable from the learning sample).
> 
> Best,
> 
> Torsten
> 
> 
>> DV seems to need identifying as a factor to force class trees over
>> regression:
>> 
>>> Mydata$V46 <- as.factor(Mydata$V46)
>>> Myforest.rf <- randomForest(V46 ~ ., data = Mydata, ntree = 100, mtry = 7,
>>>                             proximity = FALSE, importance = FALSE)
>> 
>> 5 hours later, R.bin was still taking up 75% of my processor. When
>> I've tried this with larger data, I get errors referring to the buffer
>> (sorry, not in front of me right now).
>> 
>> Any ideas on this? The data don't seem horrifically large. Seems like
>> there are a few options for setting memory size, but I'm not sure
>> which of them to try tweaking, or if that's even the issue.
>> 
>> ______________________________________________
>> R-help at stat.math.ethz.ch mailing list
>> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide!
>> http://www.R-project.org/posting-guide.html
>> 

-- 
David L. Van Brunt, Ph.D.
Outlier Consulting & Development
mailto: <ocd at well-wired.com>
