[R] Ranger could not work with caret

Sat Jul 2 23:58:35 CEST 2022

@Rui Barradas <ruipbarradas using sapo.pt>

I tried the code according to your comments and it works. However, when I
try it for another dataset with a different number of input features, it
again shows the same error message. I tried it with different types of
datasets and the same error appeared.
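
For what it is worth, one way to avoid reusing a grid sized for the first dataset is to rebuild it from whichever data frame is being modelled. The sketch below uses only base R; `make_ranger_grid` and its `max_mtry` argument are hypothetical names, not part of caret or ranger. It assumes the model sees one column per numeric predictor plus the dummy columns that `model.matrix` creates for factors. That count may be too high when a preprocessing step such as "nzv" removes columns, in which case `max_mtry` would have to be lowered by hand.

```r
# Hypothetical helper (not part of caret or ranger): build a tuning grid whose
# largest mtry does not exceed the number of predictor columns in the data.
make_ranger_grid <- function(formula, data, len = 3,
                             splitrule = c("variance", "extratrees"),
                             min.node.size = 1:10,
                             max_mtry = Inf) {
  # Count predictor columns after factor expansion, dropping the intercept.
  x <- model.matrix(formula, data = data)
  p <- min(ncol(x) - 1L, max_mtry)
  # Evenly spaced candidate values between 2 and p, without duplicates.
  mtry <- unique(floor(seq(min(2, p), p, length.out = len)))
  # One row per combination, with the column names used for tuneGrid
  # in the quoted reply.
  expand.grid(mtry = mtry,
              splitrule = splitrule,
              min.node.size = min.node.size,
              stringsAsFactors = FALSE)
}

# Toy illustration with a built-in data set that contains one factor column.
grid <- make_ranger_grid(Sepal.Length ~ ., data = iris, len = 3)
n_cols <- ncol(model.matrix(Sepal.Length ~ ., data = iris)) - 1L
stopifnot(max(grid$mtry) <= n_cols)
```

The resulting data frame has the three columns (mtry, splitrule, min.node.size) that the quoted reply passes to train() as tuneGrid, so it could be rebuilt for each new dataset instead of being carried over.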

Best regards

On Fri, Jul 1, 2022 at 9:18 PM Neha gupta <neha.bologna90 using gmail.com> wrote:

> @Rui Barradas <ruipbarradas using sapo.pt>
>
> Thank you again for the useful explanation.
>
> Best regards
>
> On Fri, Jul 1, 2022 at 8:26 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
>
>> Hello,
>>
>> The error doesn't arise in randomForest because rf has a function tuneRF
>> that looks for the best mtry (best relative to OOB error estimate). And
>> it's this value that it uses.
>>
>> The question's code gives Ranger errors but it also gives R warnings:
>>
>> Warning messages:
>> 1: model fit failed for Fold01: mtry=48, min.node.size=5,
>> splitrule=variance Error in ranger::ranger(dependent.variable.name =
>> ".outcome", data = x,  :
>>    User interrupt or internal error.
>>
>>
>> As you can see, mtry=48 is double ncol(tr), when it should *never* be
>> greater than the number of variables in the data set. Why it is using
>> this value, I don't know. Function bug? Ask the package maintainer?
>>
>> And, by the way, package caret does or can do a grid search for optimal
>> parameter values. If that is giving errors and you are calling rf
>> directly, why bother with caret's error? Use the original function. Here
>> is an example with tuneRF. With argument doBest set to TRUE, you'll have
>> both the optimal value for mtry and the fitted random forest. 2 in 1.
>>
>>
>> library(randomForest)
>> #  randomForest 4.7-1.1
>> #  Type rfNews() to see new features/changes/bug fixes.
>>
>> c2 <- tuneRF(
>>    x = tr[-ncol(tr)],
>>    y = tr$act_effort,
>>    mtryStart = ncol(tr)/2,
>>    doBest = TRUE
>> )
>> #  mtry = 12  OOB error = 139920.7
>> #  Searching left ...
>> #  mtry = 6     OOB error = 170909.3
>> #  -0.2214729 0.05
>> #  Searching right ...
>> #  mtry = 23    OOB error = 128566.7
>> #  0.08114586 0.05
>>
>> c2
>> #
>> #  Call:
>> #   randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
>> #                 Type of random forest: regression
>> #                       Number of trees: 500
>> #  No. of variables tried at each split: 23
>> #
>> #            Mean of squared residuals: 129734.8
>> #                      % Var explained: 39.98
>>
>>
>> Hope this helps,
>>
>> Rui Barradas
>>
>>
>>
>> At 17:18 on 01/07/2022, Neha gupta wrote:
>> > Thank you so much for your help. I hope it will work.
>> >
>> > However, why doesn't the same error arise when I am using rf? Both
>> > have the same parameters and default values.
>> >
>> > Best regards
>> >
>> > On Friday, July 1, 2022, Rui Barradas <ruipbarradas using sapo.pt> wrote:
>> >
>> >     Hello,
>> >
>> >     The error is caused by the ranger parameter mtry becoming greater
>> >     than the number of variables (columns).
>> >     mtry can be set manually in caret::train argument tuneGrid. But for
>> >     random forests you must also set the split rule and the minimum
>> >     node size.
>> >
>> >
>> >     library(caret)
>> >     library(farff)
>> >
>> >     boot <- trainControl(method = "cv", number = 10)
>> >
>> >     # set the maximum mtry manually to ncol(tr)
>> >     # this creates a sequence of mtry values
>> >     mtry <- var_seq(ncol(tr), len = 3)  # 3 is the default value
>> >     mtry
>> >     #  [1]  2 13 24
>> >
>> >     splitrule <- c("variance", "extratrees")
>> >     min.node.size <- 1:10
>> >     mtrygrid <- expand.grid(mtry, splitrule, min.node.size)
>> >     names(mtrygrid) <- c("mtry", "splitrule", "min.node.size")
>> >
>> >     c1 <- train(act_effort ~ ., data = tr,
>> >                 method = "ranger",
>> >                 tuneLength = 5,
>> >                 metric = "MAE",
>> >                 preProc = c("center", "scale", "nzv"),
>> >                 tuneGrid = mtrygrid,
>> >                 trControl = boot)
>> >     c1
>> >     #  Random Forest
>> >     #
>> >     #  30 samples
>> >     #  23 predictors
>> >     #
>> >     #  Pre-processing: centered (48), scaled (48), remove (58)
>> >     #  Resampling: Cross-Validated (10 fold)
>> >     #  Summary of sample sizes: 28, 27, 27, 28, 27, 27, ...
>> >     #  Resampling results across tuning parameters:
>> >     #
>> >     #    mtry  splitrule   min.node.size  RMSE      Rsquared   MAE
>> >     #     2    variance     1             256.6391  0.8103759  186.3609
>> >     #     2    variance     2             249.7120  0.8628109  183.6696
>> >     #     2    variance     3             258.8240  0.8284449  189.0712
>> >     #
>> >     # [...omit...]
>> >     #
>> >     #    13    extratrees  10             254.9569  0.8918014  191.2524
>> >     #    24    variance     1             177.7188  0.9458652  112.2800
>> >     #    24    variance     2             172.6826  0.9204287  108.5943
>> >     #    24    variance     3             172.9954  0.9271006  109.2554
>> >     #    24    variance     4             172.2467  0.9523067  110.0776
>> >     #    24    variance     5             175.2485  0.9283317  112.8798
>> >     #    24    variance     6             177.9285  0.9369881  115.8970
>> >     #    24    variance     7             180.5959  0.9485035  117.5816
>> >     #    24    variance     8             178.8037  0.9358033  117.8725
>> >     #    24    variance     9             176.5849  0.9210959  117.0055
>> >     #    24    variance    10             178.6439  0.9257969  119.8035
>> >     #    24    extratrees   1             219.1368  0.8801770  141.0720
>> >     #    24    extratrees   2             216.1900  0.8550002  140.9263
>> >     #    24    extratrees   3             212.4138  0.8979379  141.4282
>> >     #    24    extratrees   4             218.2631  0.9121471  146.2908
>> >     #    24    extratrees   5             212.5679  0.9279598  144.2715
>> >     #    24    extratrees   6             218.9856  0.9141754  152.2099
>> >     #    24    extratrees   7             222.8540  0.9412682  152.4614
>> >     #    24    extratrees   8             228.1156  0.9423414  161.8456
>> >     #    24    extratrees   9             226.6182  0.9408306  160.5264
>> >     #    24    extratrees  10             226.9280  0.9429413  165.6878
>> >     #
>> >     #  MAE was used to select the optimal model using the smallest value.
>> >     #  The final values used for the model were mtry = 24, splitrule = variance
>> >     #   and min.node.size = 2.
>> >     plot(c1)
>> >
>> >
>> >
>> >     Hope this helps,
>> >
>> >     Rui Barradas
>> >
>> >
>> >     At 23:03 on 30/06/2022, Neha gupta wrote:
>> >
>> >         Ok, the data is pasted below
>> >
>> >         But on the same data (everything the same) and with other models
>> >         like RF, SVM etc, it works fine.
>> >
>> >           > dput(head(tr, 30))
>> >         structure(list(recordnumber = c(0, 0.02, 0.04, 0.06, 0.07, 0.08,
>> >         0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.16, 0.17, 0.18, 0.23, 0.24,
>> >         0.25, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.35, 0.36, 0.37, 0.38,
>> >         0.4, 0.41), projectname = structure(c(1L, 1L, 1L, 1L, 2L, 3L,
>> >         3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
>> >         4L, 4L, 4L, 4L, 4L, 4L, 5L, 6L), levels = c("de", "erb", "gal",
>> >         "X", "hst", "slp", "spl", "Y"), class = "factor"), cat2 =
>> >         structure(c(3L,
>> >         3L, 3L, 3L, 3L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L,
>> >         9L, 11L, 5L, 4L, 6L, 8L, 3L, 9L, 9L, 9L, 9L, 6L, 7L), levels =
>> >         c("Avionics",
>> >         "application_ground", "avionicsmonitoring",
>> "batchdataprocessing",
>> >         "communications", "datacapture", "launchprocessing",
>> >         "missionplanning",
>> >         "monitor_control", "operatingsystem", "realdataprocessing",
>> >         "science",
>> >         "simulation", "utility"), class = "factor"), forg =
>> structure(c(2L,
>> >         2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>> >         2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels =
>> c("f",
>> >         "g"), class = "factor"), center = structure(c(2L, 2L, 2L, 2L,
>> >         2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
>> >         2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 6L), levels = c("1", "2",
>> >         "3", "4", "5", "6"), class = "factor"), year = c(0.5, 0.5, 0.5,
>> >         0.5, 0.6875, 0.5625, 0.5625, 0.8125, 0.5625, 0.875, 0.5625,
>> 0.75,
>> >         0.5625, 0.8125, 0.75, 0.9375, 0.9375, 0.9375, 0.6875, 0.6875,
>> >         0.6875, 0.6875, 0.875, 1, 0.9375, 0.9375, 0.9375, 0.9375,
>> 0.5625,
>> >         0.25), mode = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
>> >         3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
>> >         3L, 3L, 3L, 3L, 3L), levels = c("embedded", "organic",
>> >         "semidetached"
>> >         ), class = "factor"), rely = structure(c(4L, 4L, 4L, 4L, 4L,
>> >         4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 3L, 3L, 3L, 3L,
>> >         3L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 4L), levels = c("vl", "l", "n",
>> >         "h", "vh", "xh"), class = "factor"), data = structure(c(2L, 2L,
>> >         2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
>> >         5L, 5L, 5L, 5L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 2L), levels =
>> c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), cplx =
>> >         structure(c(4L,
>> >         4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L,
>> >         3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), time =
>> >         structure(c(3L,
>> >         3L, 3L, 3L, 3L, 6L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L,
>> >         3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 3L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), stor =
>> >         structure(c(3L,
>> >         3L, 3L, 3L, 3L, 6L, 3L, 3L, 3L, 3L, 3L, 3L, 6L, 3L, 3L, 3L, 3L,
>> >         3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), virt =
>> >         structure(c(2L,
>> >         2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 3L, 3L,
>> >         3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), turn =
>> >         structure(c(2L,
>> >         2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
>> >         3L, 4L, 4L, 4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 2L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), acap =
>> >         structure(c(3L,
>> >         3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L,
>> >         3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), aexp =
>> >         structure(c(3L,
>> >         3L, 3L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 4L, 5L, 5L, 4L, 5L, 4L, 4L,
>> >         4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), pcap =
>> >         structure(c(3L,
>> >         3L, 3L, 3L, 3L, 4L, 5L, 4L, 5L, 3L, 4L, 4L, 5L, 4L, 4L, 4L, 4L,
>> >         4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L, 4L, 4L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), vexp =
>> >         structure(c(3L,
>> >         3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
>> >         3L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), lexp =
>> >         structure(c(4L,
>> >         4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 1L, 4L, 4L, 4L, 4L, 3L, 3L,
>> >         3L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 4L, 3L, 4L, 3L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), modp =
>> >         structure(c(4L,
>> >         4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
>> >         3L, 5L, 5L, 5L, 5L, 4L, 4L, 3L, 3L, 4L, 3L, 4L, 4L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), tool =
>> >         structure(c(3L,
>> >         3L, 3L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
>> >         3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 4L, 3L, 3L, 1L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), sced =
>> >         structure(c(2L,
>> >         2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
>> >         3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 3L), levels =
>> >         c("vl",
>> >         "l", "n", "h", "vh", "xh"), class = "factor"), equivphyskloc =
>> >         c(0.025534,
>> >         0.006945, 0.008988, 0.002655, 0.067102, 0.006741, 0.019508,
>> >         0.005209,
>> >         0.101215, 0.010622, 0.101215, 0.019508, 0.152283, 0.031253,
>> >         0.014401,
>> >         0.014401, 0.037892, 0.009294, 0.015729, 0.012154, 0.032377,
>> >         0.035339,
>> >         0.004698, 0.009703, 0.00572, 0.012358, 0.091002, 0.007252,
>> 0.180778,
>> >         0.307527), act_effort = c(117.6, 31.2, 25.2, 10.8, 352.8, 72,
>> >         72, 24, 360, 36, 215, 48, 324, 60, 48, 90, 210, 48, 82, 62, 170,
>> >         192, 18, 50, 42, 60, 444, 42, 1248, 2400)), row.names = c(1L,
>> >         3L, 5L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 17L, 18L, 19L,
>> >         24L, 25L, 26L, 29L, 30L, 31L, 32L, 33L, 34L, 36L, 37L, 38L, 39L,
>> >         41L, 42L), class = "data.frame")
>> >
>> >
>> >
>> >         On Thu, Jun 30, 2022 at 11:28 PM Rui Barradas
>> >         <ruipbarradas using sapo.pt> wrote:
>> >
>> >              Hello,
>> >
>> >              Please post data in dput format; without it, it's difficult
>> >              to tell.
>> >              If I substitute
>> >
>> >              mpg for act_effort
>> >              mtcars for tr
>> >
>> >              keeping everything else, I don't get any errors.
>> >              And the error message says clearly that the error is in tr
>> >              (data).
>> >
>> >              Can you post the output of dput(head(tr, 30))?
>> >
>> >              Rui Barradas
>> >
>> >
>> >              At 19:32 on 30/06/2022, Neha gupta wrote:
>> >               > I posted it for the second time as I didn't get any
>> >               > response from group members. I am not sure if there is
>> >               > some problem with the question.
>> >               >
>> >               >
>> >               >
>> >               > I cannot run the "ranger" model with caret. I am only
>> >               > using the farff and caret libraries and the following
>> >               > code:
>> >               >
>> >               > boot <- trainControl(method = "cv", number=10)
>> >               >
>> >               > c1 <-train(act_effort ~ ., data = tr,
>> >               >                method = "ranger",
>> >               >                 tuneLength = 5,
>> >               >                metric = "MAE",
>> >               >                preProc = c("center", "scale", "nzv"),
>> >               >                trControl = boot)
>> >               >
>> >               > The error I get is the following message, repeated
>> >               > until I interrupt it:
>> >               >
>> >               > Error: mtry can not be larger than number of variables
>> >               > in data. Ranger will EXIT now.
>> >               >
>> >
>>
>
