---
title: "AnalysisLin-vignette"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{AnalysisLin-vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

- [**Introduction**](#introduction)
- [**Descriptive Statistic**](#descriptive-statistic)
- [**Data Visualization**](#data-visualization)
    - [Numeric Plot](#numeric-plot)
    - [Categorical Plot](#categorical-plot)
- [**Correlation Analysis**](#correlation-analysis)
    - [Correlation Matrix](#correlation-matrix)
    - [Correlation Clustering](#correlation-clustering)
- [**Feature Engineering**](#feature-engineering)
    - [Missing Value Plot](#missing-value-plot)
    - [Missing Value Imputation](#missing-value-imputation)
    - [Principal Component Analysis](#principle-component-analysis)

# Introduction

Hi everyone, this page introduces AnalysisLin, my personal package for exploratory data analysis (EDA). It includes several functions designed to assist with EDA. They are based on what I learned throughout my academic years, and I use them for my own EDA work.

The table below summarizes the functions covered on this page.

```{r,echo=FALSE}
library(knitr)
library(AnalysisLin)
```

```{r}
# Overview table: one column per topic, padded with "" to equal length.
# Function names match the calls used later on this page.
df <- data.frame(
  Descriptive_Statistics = c("desc_stat()", "", "", "", "", ""),
  Data_Visualization = c("hist_plot()", "dens_plot()", "bar_plot()", "pie_plot()", "qq_plot()", "missing_values_plot()"),
  Correlation_Analysis = c("corr_matrix()", "corr_cluster()", "", "", "", ""),
  Feature_Engineering = c("impute_missing()", "pca()", "", "", "", "")
)
kable(df)
```

Some well-known built-in datasets, such as iris, mtcars, and airquality, are used to demonstrate what each function in the package does.

If you have not installed the package, run the following:

    install.packages("AnalysisLin")

```{r}
data("iris")
data("mtcars")
data("Titanic")
data("airquality")
```

Exploratory Data Analysis, in simple words, is the process of getting to know your data.

## Descriptive Statistic

The first function in the package is `desc_stat()`. It computes a range of statistical metrics that give you a thorough picture of your data:

- **Count**: Number of values in a variable.
- **Unique**: Number of values that are unique in a variable.
- **Duplicate**: Number of rows that are duplicate in a dataset.
- **Null**: Number of values that are missing in a variable.
- **Null Rate**: Percentage of values that are missing in a variable.
- **Type**: Type of variable (e.g., numeric, character, factor).
- **Min**: Smallest value.
- **P25**: 25th percentile (first quartile).
- **Mean**: Mean value.
- **Median**: Median value.
- **P75**: 75th percentile (third quartile).
- **Max**: Largest value.
- **SD**: Standard deviation.
- **Kurtosis**: A measure of the tailedness of a distribution.
- **Skewness**: A measure of the asymmetry of a distribution.
- **Shapiro-Wilk Test**: Checks whether a sample follows a normal distribution by comparing its ordered values to those expected under normality.
- **Kolmogorov-Smirnov Test**: Checks whether a sample follows a normal distribution by comparing its empirical cumulative distribution function to the normal one.
- **Anderson-Darling Test**: Assesses whether a sample conforms to a specified distribution (here, the normal), giving more weight to the tails than Kolmogorov-Smirnov.
- **Lilliefors Test**: A variant of Kolmogorov-Smirnov for testing normality when the mean and variance are estimated from the sample.
- **Jarque-Bera Test P-value**: Checks whether the skewness and kurtosis of the data match those of a normal distribution.

```{r,eval=FALSE}
desc_stat(mtcars)
```

```{r,eval=FALSE}
desc_stat(iris)
```

```{r,eval=FALSE}
desc_stat(airquality)
```

Together, these metrics describe the size, completeness, location, spread, and shape of each variable. If you don't want one of these metrics to be computed, set its argument to `FALSE`, and it won't appear in the output.
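
To make a few of the metrics above concrete, here is a minimal base R sketch that computes them by hand for `airquality`. It is only an illustration, not how `desc_stat()` is implemented: `describe_numeric()` is a hypothetical helper, and its choices (for example, counting `NA` values in **Count** and using the default `quantile()` method for P25 and P75) are assumptions.

```{r,eval=FALSE}
# Hypothetical helper (not part of AnalysisLin): summary metrics for one numeric vector.
describe_numeric <- function(x) {
  observed <- x[!is.na(x)]  # statistics below use non-missing values only
  c(
    count     = length(x),                # assumption: count includes missing values
    unique    = length(unique(observed)),
    null      = sum(is.na(x)),
    null_rate = mean(is.na(x)) * 100,     # percentage of missing values
    min       = min(observed),
    p25       = unname(quantile(observed, 0.25)),
    mean      = mean(observed),
    median    = median(observed),
    p75       = unname(quantile(observed, 0.75)),
    max       = max(observed),
    sd        = sd(observed),
    shapiro_p = shapiro.test(observed)$p.value  # Shapiro-Wilk normality p-value
  )
}

# Apply the helper to every numeric column; rows are variables, columns are metrics.
numeric_cols <- vapply(airquality, is.numeric, logical(1))
summary_table <- t(vapply(airquality[numeric_cols], describe_numeric, numeric(12)))
round(summary_table, 3)
```

Comparing this table with the output of `desc_stat(airquality)` is a quick way to check how each metric is defined in the package.
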

`desc_stat()` can also compute kurtosis, skewness, and the Shapiro-Wilk, Anderson-Darling, Lilliefors, and Jarque-Bera tests:

```{r,eval=FALSE}
desc_stat(mtcars, max = FALSE, min = FALSE, sd = FALSE, kurtosis = TRUE, skewness = TRUE,
          shapiro = TRUE, anderson = TRUE, lilliefors = TRUE, jarque = TRUE)
```

## Data Visualization

### Numeric Plot

To visualize histograms for all numerical variables:

```{r,eval=FALSE}
hist_plot(iris, subplot = FALSE)
```

To visualize density plots for all numerical variables in two rows of subplots:

```{r,eval=FALSE}
dens_plot(iris, subplot = TRUE, nrow = 2)
```

A Quantile-Quantile (QQ) plot is a graphical tool used to assess whether a dataset follows a given distribution, here the normal distribution. It compares the quantiles of the observed data to the quantiles of the expected distribution. To check the normality of the numerical variables with QQ plots:

```{r,eval=FALSE}
qq_plot(iris, subplot = TRUE)
```

### Categorical Plot

To visualize bar charts for all categorical variables:

```{r,eval=FALSE}
bar_plot(iris)
```

If you want pie charts instead:

```{r,eval=FALSE}
pie_plot(iris)
```

## Correlation Analysis

### Correlation Matrix

To display the correlation table for all variables:

```{r,eval=FALSE}
corr_matrix(mtcars)
```

To display a correlation map along with the correlation table:

```{r,eval=FALSE}
corr_matrix(mtcars, corr_plot = TRUE)
```

You may also choose the type of correlation, Pearson or Spearman:

```{r,eval=FALSE}
corr_matrix(mtcars, type = 'pearson')
corr_matrix(mtcars, type = 'spearman')
```

### Correlation Clustering

Correlation clustering partitions data points into groups based on their similarity (correlation).

```{r,eval=FALSE}
corr_cluster(mtcars, type = 'pearson')
```

```{r,eval=FALSE}
corr_cluster(mtcars, type = 'spearman')
```

## Feature Engineering

### Missing Value Plot

To visualize the percentage of missing values in each variable:

```{r,eval=FALSE}
missing_values_plot(airquality)
```

### Missing Value Imputation

Imputing missing values is a way to get more information out of a dataset with missing values. However, the imputation method needs to be chosen carefully so that the imputed data stay as accurate as possible.

- **mean**: replace missing values with the mean.

```{r,results='hide'}
impute_missing(airquality, method = 'mean')
```

- **mode**: replace missing values with the most frequent value.
- **median**: replace missing values with the median.
- **locf**: replace missing values with the last observed value (last observation carried forward).
- **knn**: replace missing values using the k-nearest neighbors; k needs to be chosen.

```{r,results='hide'}
impute_missing(airquality, method = 'mode')
impute_missing(airquality, method = 'median')
impute_missing(airquality, method = 'locf')
impute_missing(airquality, method = 'knn', k = 5)
```

### Principal Component Analysis {#principle-component-analysis}

Principal Component Analysis (PCA) can help you reduce the number of variables in a dataset.

To perform PCA on `mtcars` with scaled variables and a variance threshold of 0.9:

```{r,eval=FALSE}
pca(mtcars, variance_threshold = 0.9, scale = TRUE)
```

To also display the scree plot and biplot:

```{r,eval=FALSE}
pca(mtcars, variance_threshold = 0.9, scale = TRUE, scree_plot = TRUE, biplot = TRUE)
```
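
For readers who want to see what a variance threshold means in practice, the following base R sketch uses `stats::prcomp()` rather than the package's `pca()`. It assumes that a threshold of 0.9 means keeping the smallest number of components whose cumulative share of variance reaches 90%; whether `pca()` applies `variance_threshold` in exactly this way is an assumption, so treat the sketch as an illustration only.

```{r,eval=FALSE}
# Base R sketch using stats::prcomp(); this is not AnalysisLin's pca() internals.
pca_fit <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each principal component.
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
cum_var <- cumsum(var_explained)

# Assumed meaning of the threshold: the smallest number of components
# whose cumulative share of variance is at least 90%.
threshold <- 0.9
n_components <- which(cum_var >= threshold)[1]

# Scores of the observations on the retained components (the reduced dataset).
reduced <- pca_fit$x[, seq_len(n_components), drop = FALSE]

n_components
dim(reduced)
```

Plotting `var_explained` against the component index gives a basic scree plot, which is a common way to judge how many components are worth keeping.
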