[R] caretNWS and training data set sizes

Max Kuhn mxkuhn at gmail.com
Mon Mar 10 19:03:16 CET 2008


Peter,

You are certainly up to date. Can you try replicating this using only
two nodes (since you only have two processors)? I'm not sure that
specifying 5 really helps. Using 2 nodes on my Mac usually gets me
about a 30-40% decrease in time.

Also, are the processes just hanging or is there an error? These
models may take a while. Perhaps testing with pls, lm or some other
fast model might help troubleshoot.
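For example, something along these lines (an untested sketch with small
simulated data; I'm assuming trainNWS mirrors the arguments of caret's
train function):

    library(caretNWS)

    set.seed(1)
    trainX <- matrix(rnorm(100 * 10), ncol = 10)  # small fake data
    trainY <- rnorm(100)

    ## a fast model: if even this hangs, the problem is in the nws
    ## setup, not in the model fitting
    fit <- trainNWS(trainX, trainY, method = "pls", tuneLength = 3)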

If you are not passing a sleigh object into the trainNWS call, you can
set the number of workers with

trainNWSControl(
  start = makeSleighStarter(workerCount = 2))
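If you would rather manage the workers yourself, the usual nws pattern
looks roughly like this (an untested sketch: sleigh(), eachWorker() and
stopSleigh() are from the nws package; how the sleigh gets wired into
trainNWS is left to however you are doing it now):

    library(nws)

    s <- sleigh(workerCount = 2)  # two workers to match your two cores

    ## quick smoke test: if this returns promptly, both workers are alive
    eachWorker(s, function() Sys.getpid())

    ## ... pass s into your trainNWS call here ...

    stopSleigh(s)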

The only other thing I can suggest is to send me the data (or an
anonymized knock-off) so that I can test. You certainly should be able
to do this, but you may be limited by your machine.

Max

On Mon, Mar 10, 2008 at 1:18 PM, Tait, Peter <ptait at skura.com> wrote:
> Hi Max,
>  Thank you for the fast response.
>
>  Here are the versions of the R packages I am using:
>
>  caret 3.13
>  caretNWS 0.16
>  nws 1.62
>
>  Here are the python versions
>
>  Active Python 2.5.1.1
>  nws server 1.5.2 for py2.5
>  twisted 2.5.9 py2.5
>
>  The computer I am using has one dual-core Xeon CPU at 1.86 GHz and 4 GB of RAM. R is currently set up to use 2 GB of it (it starts with "C:\Program Files\R\R-2.6.2\bin\Rgui.exe" --max-mem-size=2047M). The OS is Windows Server 2003 R2 with SP2.
>
>  I am running one R job/process (Rgui.exe) and almost nothing else on the computer while R is running (no databases, web servers, office apps, etc.).
>
>  I really appreciate your help.
>  Cheers
>  Peter
>
>
>
>
>  >-----Original Message-----
>  >From: Max Kuhn [mailto:mxkuhn at gmail.com]
>  >Sent: Monday, March 10, 2008 12:41 PM
>  >To: Tait, Peter
>  >Cc: r-help at R-project.org
>  >Subject: Re: [R] caretNWS and training data set sizes
>  >
>  >What version of caret and caretNWS are you using? Also, what version
>  >of the nws server and twisted are you using? What kind of machine (#
>  >processors, how much physical memory etc)?
>  >
>  >I haven't seen any real limitations with one exception: if you are
>  >running P jobs on the same machine, you are replicating the memory
>  >needs P times.
>  >
>  >I've been running jobs with 4K to 90K samples and 1200 predictors
>  >without issues, so I'll need a lot more information to help you.
>  >
>  >Max
>  >
>  >
>  >On Mon, Mar 10, 2008 at 12:04 PM, Tait, Peter <ptait at skura.com> wrote:
>  >> Hi,
>  >>
>  >>  I am using the caretNWS package to train some supervised regression
>  >>  models (gbm, lasso, random forest and mars). The problem started when
>  >>  my training data set grew in both the number of predictors and the
>  >>  number of observations.
>  >>
>  >>  The training data set has 347 numeric columns. The problem is that
>  >>  when there are more than 2500 observations, the 5 sleigh workers
>  >>  start but do not use any CPU resources and do not process any data.
>  >>
>  >>  N=100                     CPU (%)      Memory (K)
>  >>  Rgui.exe                    0            91737
>  >>  5x sleighs (RTerm.exe)    15-25         ~27000
>  >>
>  >>  N=2500                    CPU (%)      Memory (K)
>  >>  Rgui.exe                    0           160000
>  >>  5x sleighs (RTerm.exe)    15-25         ~74000
>  >>
>  >>  N=5000                    CPU (%)      Memory (K)
>  >>  Rgui.exe                   50           193000
>  >>  5x sleighs (RTerm.exe)      0           ~19000
>  >>
>  >>
>  >>  A 10% sample of my overall data is ~22000 observations.
>  >>
>  >>  Can someone give me an idea of the limitations of the nws and caretNWS
>  >>  packages in terms of the number of columns and rows of the training
>  >>  matrices, and whether there are other tuning/training functions that
>  >>  work faster on large datasets?
>  >>
>  >>  Thanks for your help.
>  >>  Peter
>  >>
>  >>
>  >>  > version
>  >>                _
>  >>  platform       i386-pc-mingw32
>  >>  arch           i386
>  >>  os             mingw32
>  >>  system         i386, mingw32
>  >>  status
>  >>  major          2
>  >>  minor          6.2
>  >>  year           2008
>  >>  month          02
>  >>  day            08
>  >>  svn rev        44383
>  >>  language       R
>  >>  version.string R version 2.6.2 (2008-02-08)
>  >>
>  >>  > memory.limit()
>  >>  [1] 2047
>  >>
>  >>  ______________________________________________
>  >>  R-help at r-project.org mailing list
>  >>  https://stat.ethz.ch/mailman/listinfo/r-help
>  >>  PLEASE do read the posting guide
>  >>  http://www.R-project.org/posting-guide.html
>  >>  and provide commented, minimal, self-contained, reproducible code.
>  >>
>  >
>  >
>  >
>  >--
>  >
>  >Max
>



-- 

Max
