--- title: "Introduction to LDATree" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to LDATree} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(LDATree) ``` `LDATree` is an R modeling package for fitting classification trees with oblique splits. * If you are unfamiliar with classification trees, here is a [tutorial](http://www.sthda.com/english/articles/35-statistical-machine-learning-essentials/141-cart-model-decision-tree-essentials/) about the traditional CART and its R implementation `rpart`. * More details about the LDATree can be found in Wang, S. (2024). *FoLDTree: A ULDA-Based Decision Tree Framework for Efficient Oblique Splits and Feature Selection*. arXiv preprint arXiv:2410.23147. [Link](https://arxiv.org/abs/2410.23147). # Why use the `LDATree` package? Compared to other similar trees, `LDATree` distinguishes itself in the following ways: * Using Uncorrelated Linear Discriminant Analysis (ULDA) from the `folda` package, it can **efficiently find oblique splits**. * It provides both ULDA and forward ULDA as the splitting rule and node model. Forward ULDA has intrinsic **variable selection**, which helps mitigate the influence of noise variables. * It automatically **handles missing values**. * It can output both predicted class and **class probability**. * It supports **downsampling**, which can be used to balance classes or accelerate the model fitting process. * It includes several **visualization** tools to provide deeper insights into the data. # Basic Usage of `LDATree` We offer two main tree types in the `LDATree` package: LDATree and FoLDTree. For the splitting rule and node model, LDATree uses ULDA, while FoLDTree uses forward ULDA. To build an LDATree (or FoLDTree): ```{r,fig.asp=0.618,out.width = "100%",fig.align = "center"} library(LDATree) set.seed(443) diamonds <- as.data.frame(ggplot2::diamonds)[sample(53940, 2000),] datX <- diamonds[, -2] response <- diamonds[, 2] # we try to predict "cut" fit <- Treee(datX = datX, response = response, verbose = FALSE) # by default, it is a pre-stopping FoLDTree # fit <- Treee(datX = datX, response = response, verbose = FALSE, ldaType = "all", pruneMethod = "post") # if you want to fit a post-pruned LDATree. ``` To plot the LDATree (or FoLDTree): ```{r,fig.asp=0.618,out.width = "80%",fig.align = "center", eval=FALSE} # View the overall tree. plot(fit) ``` ```{r out.width = '100%',fig.align = "center", echo = FALSE} knitr::include_graphics("README-plot1-1.png") ``` ```{r,echo=TRUE, eval=FALSE} # Three types of individual plots # 1. Scatter plot on first two LD scores plot(fit, datX = datX, response = response, node = 1) ``` ```{r, out.width = '100%',fig.align = "center", echo = FALSE} knitr::include_graphics("README-plot2-1.png") ``` ```{r,echo=TRUE, eval=FALSE} # 2. Density plot on the first LD score plot(fit, datX = datX, response = response, node = 7) ``` ```{r, out.width = '100%',fig.align = "center", echo = FALSE} knitr::include_graphics("README-plot2-2.png") ``` ```{r} # 3. A message plot(fit, datX = datX, response = response, node = 2) ``` To make predictions: ```{r,fig.asp=0.618,out.width = "100%",fig.align = "center", echo=TRUE} # Prediction only. predictions <- predict(fit, datX) head(predictions) ``` ```{r,fig.asp=0.618,out.width = "100%",fig.align = "center", echo=TRUE} # A more informative prediction predictions <- predict(fit, datX, type = "all") head(predictions) ``` # Additional Features * **Missing values**: The solution to the missing value problem is inherited from the `folda` package. Check [here](https://iamwangsiyu.com/folda/articles/folda.html#handling-missing-values) for more details. * **Downsampling**: Optional downsampling occurs only when fitting the ULDA model. Check [here](https://iamwangsiyu.com/folda/articles/folda.html#downsampling) for more details. * **`misClassCost`**: This parameter is useful in situations where misclassifying certain classes has a more severe impact than others. Check [here](https://iamwangsiyu.com/folda/articles/folda.html#additional-features) for more details. # References * Wang, S. (2024). A new forward discriminant analysis framework based on Pillai's trace and ULDA. *arXiv preprint*, arXiv:2409.03136. Retrieved from https://arxiv.org/abs/2409.03136. * Wang, S. (2024). FoLDTree: A ULDA-based decision tree framework for efficient oblique splits and feature selection. *arXiv preprint*, arXiv:2410.23147. Retrieved from https://arxiv.org/abs/2410.23147.