Type: Package
Title: Predictive Power of Linear and Tree Modeling
Version: 3.8.0
Author: Seyma Kalay <seymakalay@hotmail.com>
Maintainer: Seyma Kalay <seymakalay@hotmail.com>
Description: Fits generalized linear and multinomial logistic (GLM and MLM) models, as well as random forest (RF), bagging (BAG), and gradient boosting (GBM) models, and reports the predictive performance of each for the selected data and data splits.
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.2
URL: https://github.com/seymakalay/pomodoro, https://seymakalay.github.io/pomodoro/
BugReports: https://github.com/seymakalay/pomodoro/issues
Suggests: knitr, rmarkdown
VignetteBuilder: knitr
Imports: tibble, caret, gbm, stats, randomForest, pROC, ipred
Depends: R (≥ 2.10)
NeedsCompilation: no
Packaged: 2022-03-26 11:46:43 UTC; Seyma
Repository: CRAN
Date/Publication: 2022-03-26 12:10:02 UTC
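A minimal sketch of installing the CRAN release and attaching the package:

# Install the released version from CRAN and attach it
install.packages("pomodoro")
library(pomodoro)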
Bagging Model
Description
Bagging Model
Usage
BAG_Model(Data, xvar, yvar)
Arguments
Data: The name of the dataset.
xvar: X variables.
yvar: Y variable.
Details
Decision trees suffer from high variance: if we split the training data set randomly into two parts and fit a decision tree to each part, the results can be quite different. Bagging is an ensemble procedure that reduces this variance and increases the prediction accuracy of a statistical learning method by fitting the method on many training sets drawn from the population and combining the resulting predictions \hat{f}^{1}(x), \hat{f}^{2}(x), \ldots, \hat{f}^{B}(x). Since we cannot obtain multiple training sets, we instead generate B different bootstrapped training data sets from the single training data set, fit a tree \hat{f}^{*b}(x) to each of the B bootstrapped sets, and take a majority vote. Bagging for a classification problem is therefore defined as
\hat{f}(x) = \arg\max_{k} \sum_{b=1}^{B} I(\hat{f}^{*b}(x) = k)
where I(\hat{f}^{*b}(x) = k) indicates that the b-th tree predicts class k.
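As a small illustration of the majority-vote idea (not part of this package), here is a sketch that resamples and votes by hand, assuming the rpart package and using iris as a stand-in data set:

# Illustrative sketch of bagging by hand (assumes rpart; not part of pomodoro)
library(rpart)
set.seed(1)
B <- 25
# Fit one classification tree per bootstrapped training set
trees <- lapply(seq_len(B), function(b) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  rpart(Species ~ ., data = boot, method = "class")
})
# Each tree votes; the bagged prediction is the most frequent class
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == iris$Species)  # training accuracy of the majority vote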
Value
The output from BAG_Model.
Examples
yvar <- c("Loan.Type")
sample_data <- sample_data[c(1:750),]
xvar <- c("sex", "married", "age", "havejob", "educ", "political.afl",
"rural", "region", "fin.intermdiaries", "fin.knowldge", "income")
BchMk.BAG <- BAG_Model(sample_data, c(xvar, "networth"), yvar )
BchMk.BAG$Roc$auc
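The $auc element above suggests that $Roc is a pROC ROC object; under that assumption, the curve can also be plotted with pROC (a sketch, not part of the package examples):

# Assuming BchMk.BAG$Roc is a pROC "roc" object, plot the ROC curve
library(pROC)
plot(BchMk.BAG$Roc)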
Combined Performance of the Data Splits
Description
Combined Performance of the Data Splits
Usage
Combined_Performance(Sub.Est.Mdls)
Arguments
Sub.Est.Mdls: a list of the models estimated on the exog data splits, whose performance is to be combined.
Value
The output from Combined_Performance.
Examples
sample_data <- sample_data[c(1:750),]
yvar <- c("Loan.Type")
xvar <- c("sex", "married", "age", "havejob", "educ", "political.afl",
"rural", "region", "fin.intermdiaries", "fin.knowldge", "income")
CCP.RF <- Estimate_Models(sample_data, yvar, xvec = xvar, exog = "political.afl",
xadd = c("networth", "networth_homequity", "liquid.assets"),
type = "RF", dnames = c("0","1"))
Sub.CCP.RF <- list(Mdl.1 = CCP.RF$EstMdl$`D.1+networth`,
                   Mdl.0 = CCP.RF$EstMdl$`D.0+networth`)
CCP.NoCCP.RF <- Combined_Performance(Sub.CCP.RF)
Results for Each Data Set and Data Split
Description
Results for Each Data Set and Data Split
Usage
Estimate_Models(DataSet, yvar, exog = NULL, xvec, xadd, type, dnames)
Arguments
DataSet: The name of the dataset.
yvar: Y variable.
exog: a variable to be excluded from the calculation; the data are split by its unique values (see dnames).
xvec: a vector of the predictor variables to be used.
xadd: a vector of additional predictor variables to be used.
type: the model type; one of "RF", "GLM", "MLM", "BAG", or "GBM".
dnames: the unique values of exog.
Value
The output from Estimate_Models.
Examples
sample_data <- sample_data[c(1:750),]
m2.xvar0 <- c("sex","married","age","havejob","educ","rural","region","income")
CCP.RF <- Estimate_Models(sample_data, yvar = c("Loan.Type"),
exog = "political.afl", xvec = m2.xvar0,
xadd = "networth", type = "RF", dnames = c("0","1"))
Gradient Boosting Model
Description
Gradient Boosting Model
Usage
GBM_Model(Data, xvar, yvar)
Arguments
Data: The name of the dataset.
xvar: X variables.
yvar: Y variable.
Details
Unlike bagged trees, boosting does not use bootstrap sampling; rather, each tree is fit using information from the previous trees. The event probability of a stochastic gradient boosting model is given by
\hat{\pi}_{i} = \frac{1}{1 + \exp[-f(x)]}
where f(x) is in the range of [-\infty, \infty] and its initial estimate is
f^{(0)}_{i} = \log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right),
where \hat{\pi} is the estimated sample proportion of a single class in the training set.
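As a quick worked illustration of the initial estimate and of the link between f(x) and the event probability, here is a base R sketch with a hypothetical 0/1 response vector y:

# Sketch: initial estimate f^(0) as the log-odds of the sample event proportion
y <- c(1, 0, 0, 1, 1, 0, 1, 1)      # hypothetical binary response
pi_hat <- mean(y)                   # estimated sample proportion of the event class
f0 <- log(pi_hat / (1 - pi_hat))    # initial model estimate f^(0)
# Mapping a model score f back to an event probability
f <- f0
1 / (1 + exp(-f))                   # equals pi_hat here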
Value
The output from GBM_Model.
Examples
yvar <- c("Loan.Type")
sample_data <- sample_data[c(1:120),]
xvar <- c("sex", "married", "age", "havejob", "educ", "political.afl",
"rural", "region", "fin.intermdiaries", "fin.knowldge", "income")
BchMk.GBM <- GBM_Model(sample_data, c(xvar, "networth"), yvar )
BchMk.GBM$finalModel
BchMk.GBM$Roc$auc
Generalized Linear Model
Description
Generalized Linear Model
Usage
GLM_Model(Data, xvar, yvar)
Arguments
Data: The name of the dataset.
xvar: X variables.
yvar: Y variable.
Details
Let y be the vector of the response variable indicating access to credit for the n applicants, such that y_{i} = 1 if applicant i has access to credit and zero otherwise. Furthermore, let \bold{x} = (x_{ij}), where i = 1,\ldots,n indexes the applicants and j = 1,\ldots,p their characteristics. The log-odds can be defined as
\log\left(\frac{\pi_{i}}{1-\pi_{i}}\right) = \beta_{0} + \bold{x}_{i}\beta = \beta_{0} + \sum_{j=1}^{p} \beta_{j} x_{ij}
where \beta_{0} is the intercept, \beta = (\beta_{1},\ldots,\beta_{p}) is a p x 1 vector of coefficients, and \bold{x}_{i} is the i-th row of \bold{x}.
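For illustration, this is exactly the log-odds fitted by an ordinary logistic regression; a minimal base R sketch using mtcars as a stand-in data set (not the package's sample_data):

# Sketch: logistic regression log-odds with base R (mtcars as a stand-in)
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)
coef(fit)                              # beta_0 and the beta_j
head(predict(fit, type = "link"))      # log-odds  beta_0 + x_i beta
head(predict(fit, type = "response"))  # pi_i = 1 / (1 + exp(-log-odds))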
Value
The output from GLM_Model.
Examples
yvar <- c("multi.level")
sample_data <- sample_data[c(1:750),]
xvar <- c("sex", "married", "age", "havejob", "educ", "political.afl",
"rural", "region", "fin.intermdiaries", "fin.knowldge", "income")
BchMk.GLM <- GLM_Model(sample_data, c(xvar, "networth"), yvar )
BchMk.GLM$finalModel
BchMk.GLM$Roc$auc
Multinomial Logistic Model
Description
Multinomial Logistic Model
Usage
MLM_Model(Data, xvar, yvar)
Arguments
Data: The name of the dataset.
xvar: X variables.
yvar: Y variable.
Details
The multinomial model is the generalized form of the logistic model and can be defined as
\pi_{i}^{h} = P(y_{i}^{h} = 1 | \bold{x}_{i}^{h})
where h indexes the class labels ("1-of-h") predicted from an input vector x_{j}; in our case the classes are the loan types ("Formal Loan", "Informal Loan", "Both Loan", and "No Loan"). Furthermore, y_{i}^{h} = 1 if observation i belongs to class h, and y_{i}^{h} = 0 otherwise. For i \in 1,\ldots,h, the weight vector \bold{w}^{i} corresponds to class i. We set \bold{w}^{h} = 0, so the parameters to be learned are the weight vectors \bold{w}^{i} for i \in 1,\ldots,h-1, and the class probabilities must satisfy
\sum_{i=1}^{h} P(y_{i}^{h} = 1 | \bold{x}_{i}^{h}, \bold{w}) = 1.
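A small base R sketch of how such class probabilities can be formed and sum to one when the last weight vector is fixed at zero (the input vector and weights below are hypothetical):

# Sketch: multinomial class probabilities with w^h fixed at 0 (hypothetical numbers)
x <- c(1, 0.5, -1)                        # one input vector
W <- rbind(c(0.3, -0.2, 0.1),             # w^1
           c(-0.4, 0.6, 0.2),             # w^2
           c(0, 0, 0))                    # w^3 = w^h, set to 0
scores <- as.vector(W %*% x)              # one score per class
probs <- exp(scores) / sum(exp(scores))   # softmax over the classes
probs
sum(probs)                                # equals 1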
Value
The output from MLM_Model.
Examples
yvar <- c("Loan.Type")
sample_data <- sample_data[c(1:750),]
xvar <- c("sex", "married", "age", "havejob", "educ", "political.afl",
"rural", "region", "fin.intermdiaries", "fin.knowldge", "income")
BchMk.MLM <- MLM_Model(sample_data, c(xvar, "networth"), yvar )
BchMk.MLM$finalModel
BchMk.MLM$Roc$auc
Random Forest
Description
Random Forest
Usage
RF_Model(Data, xvar, yvar)
Arguments
Data: The name of the dataset.
xvar: X variables.
yvar: Y variable.
Details
At each split, a random forest considers only a fresh random sample of m_{try} predictors out of the total of p predictors, so that a majority of the p predictors is never even considered at a given split; we usually set m_{try} \approx \sqrt{p}. By de-correlating the trees in this way, random forests with m_{try} \approx \sqrt{p} show an improvement over bagged trees, for which m = p.
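A minimal sketch of the m_try choice using the randomForest package (an import of this package), with iris as a stand-in data set:

# Sketch: random forest vs. bagging via mtry
library(randomForest)
set.seed(1)
p <- ncol(iris) - 1                                                    # number of predictors
rf  <- randomForest(Species ~ ., data = iris, mtry = floor(sqrt(p)))  # m_try ~ sqrt(p)
bag <- randomForest(Species ~ ., data = iris, mtry = p)               # m = p, i.e. bagging
rf$err.rate[nrow(rf$err.rate), "OOB"]    # out-of-bag error with de-correlated trees
bag$err.rate[nrow(bag$err.rate), "OOB"]  # out-of-bag error of the bagged trees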
Value
The output from RF_Model.
Examples
sample_data <- sample_data[c(1:750),]
yvar <- c("Loan.Type")
xvar <- c("sex", "married", "age", "havejob", "educ", "political.afl",
"rural", "region", "fin.intermdiaries", "fin.knowldge", "income")
BchMk.RF <- RF_Model(sample_data, c(xvar, "networth"), yvar )
BchMk.RF
Sample data for analysis: a dataset containing information on access to credit.
Description
Sample data for analysis.
A dataset containing information on access to credit.
Usage
sample_data
Format
A data_frame with 53940 rows and 10 variables:
- x1: hhid, household id number
- x2: swgt, survey weight
- x3: region, a 3-level factor: west, east, and center
- x4: No.Loan, whether the household has no loan
- x5: Formal, whether the household has a formal loan
- x6: Both, whether the household has both formal and informal loans
- x7: Informal, whether the household has an informal loan
- x8: sex, whether the household head is male
- y1: Loan.Type, a 4-level factor giving the type of the loan
- y2: multi.level, a 2-level factor indicating whether the household has access to a loan
...
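A quick sketch of inspecting the bundled data once the package is attached:

# Inspect the bundled sample data
library(pomodoro)
data(sample_data)
str(sample_data)               # variable types and factor levels
table(sample_data$Loan.Type)   # distribution of the loan types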