# Data Transformation

## Preface

After you have acquired the data, you should do the following:

• Diagnose data quality.
• If there is a problem with data quality,
• The data must be corrected or re-acquired.
• Explore data to understand the data and find scenarios for performing the analysis.
• Derive new variables or perform variable transformations.

The dlookr package makes these steps fast and easy:

• Performs an data diagnosis or automatically generates a data diagnosis report.
• Discover data in a variety of ways, and automatically generate EDA(exploratory data analysis) report.
• Imputate missing values and outliers, resolve skewed data, and binarize continuous variables into categorical variables. And generates an automated report to support it.

This document introduces data transformation methods provided by the dlookr package. You will learn how to transform of tbl_df data that inherits from data.frame and data.frame with functions provided by dlookr.

dlookr synergy with dplyr increases. Particularly in data transformation and data wrangle, it increases the efficiency of the tidyverse package group.

## datasets

To illustrate the basic use of EDA in the dlookr package, I use a Carseats datasets. Carseats in the ISLR package is simulation dataset that sells children’s car seats at 400 stores. This data is a data.frame created for the purpose of predicting sales volume.

library(ISLR)
str(Carseats)
'data.frame':   400 obs. of  11 variables:
$Sales : num 9.5 11.22 10.06 7.4 4.15 ...$ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
$Income : num 73 48 35 100 64 113 105 81 110 113 ...$ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
$Population : num 276 260 269 466 340 501 45 425 108 131 ...$ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
$ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...$ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
$Education : num 17 10 12 14 13 16 15 10 10 17 ...$ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
Sales_minmax = transform(carseats$Sales, method = "minmax")) %>% select(Income_minmax, Sales_minmax) %>% boxplot() ### Resolving Skewness data with transform() find_skewness() calculates the skewness and finds the skewed data. # find index of skewed variables find_skewness(carseats) [1] 4 # find names of skewed variables find_skewness(carseats, index = FALSE) [1] "Advertising" # compute the skewness find_skewness(carseats, value = TRUE) Sales CompPrice Income Advertising Population Price 0.185 -0.043 NA 0.637 -0.051 -0.125 Age Education -0.077 0.044 # compute the skewness & filtering with threshold find_skewness(carseats, value = TRUE, thres = 0.1) Sales Advertising Price 0.185 0.637 -0.125  The skewness of Advertising is 0.637, which is a little slanted to the left, so I use transformation () to convert it to log. summary() summarizes the transformation information, and plot() visualizes the transformation information. Advertising_log = transform(carseats$Advertising, method = "log")

# result of transformation
[1] 2.397895 2.772589 2.302585 1.386294 1.098612 2.564949
# summary of transformation
Warning: cols is now required.
Please use cols = c(statistic)
* Resolving Skewness with log

* Information of Transformation (before vs after)
Original Transformation
n        400.0000000    400.0000000
na         0.0000000      0.0000000
mean       6.6350000           -Inf
sd         6.6503642            NaN
se_mean    0.3325182            NaN
IQR       12.0000000            Inf
skewness   0.6395858            NaN
kurtosis  -0.5451178            NaN
p00        0.0000000           -Inf
p01        0.0000000           -Inf
p05        0.0000000           -Inf
p10        0.0000000           -Inf
p20        0.0000000           -Inf
p25        0.0000000           -Inf
p30        0.0000000           -Inf
p40        2.0000000      0.6931472
p50        5.0000000      1.6094379
p60        8.4000000      2.1265548
p70       11.0000000      2.3978953
p75       12.0000000      2.4849066
p80       13.0000000      2.5649494
p90       16.0000000      2.7725887
p95       19.0000000      2.9444390
p99       23.0100000      3.1359198
p100      29.0000000      3.3672958
# viz of transformation
plot(Advertising_log)

It seems that the raw data contains 0, as there is a -Inf in the log converted value. So this time we convert it to “log + 1”.

Advertising_log <- transform(carseats$Advertising, method = "log+1") # result of transformation head(Advertising_log) [1] 2.484907 2.833213 2.397895 1.609438 1.386294 2.639057 # summary of transformation summary(Advertising_log) Warning: cols is now required. Please use cols = c(statistic) * Resolving Skewness with log+1 * Information of Transformation (before vs after) Original Transformation n 400.0000000 400.00000000 na 0.0000000 0.00000000 mean 6.6350000 1.46247709 sd 6.6503642 1.19436323 se_mean 0.3325182 0.05971816 IQR 12.0000000 2.56494936 skewness 0.6395858 -0.19852549 kurtosis -0.5451178 -1.66342876 p00 0.0000000 0.00000000 p01 0.0000000 0.00000000 p05 0.0000000 0.00000000 p10 0.0000000 0.00000000 p20 0.0000000 0.00000000 p25 0.0000000 0.00000000 p30 0.0000000 0.00000000 p40 2.0000000 1.09861229 p50 5.0000000 1.79175947 p60 8.4000000 2.23936878 p70 11.0000000 2.48490665 p75 12.0000000 2.56494936 p80 13.0000000 2.63905733 p90 16.0000000 2.83321334 p95 19.0000000 2.99573227 p99 23.0100000 3.17846205 p100 29.0000000 3.40119738 # viz of transformation plot(Advertising_log) ## Binning ### Binning of individual variables using binning() binning() transforms a numeric variable into a categorical variable by binning it. The following types of binning are supported. • “quantile” : categorize using quantile to include the same frequencies • “equal” : categorize to have equal length segments • “pretty” : categorized into moderately good segments • “kmeans” : categorization using K-means clustering • “bclust” : categorization using bagged clustering technique The following example illustrates some ways to Income binning using binning().: # Binning the carat variable. default type argument is "quantile" bin <- binning(carseats$Income)
# Print bins class object
bin
binned type: quantile
number of bins: 10
x
[21,30]     (30,39]     (39,48]     (48,62]     (62,69]     (69,78]
40          37          38          40          42          33
(78,86.6] (86.6,96.6]  (96.6,109]   (109,120]        <NA>
36          38          39          37          20
# Summarise bins class object
summary(bin)
levels freq   rate
1      [21,30]   40 0.1000
2      (30,39]   37 0.0925
3      (39,48]   38 0.0950
4      (48,62]   40 0.1000
5      (62,69]   42 0.1050
6      (69,78]   33 0.0825
7    (78,86.6]   36 0.0900
8  (86.6,96.6]   38 0.0950
9   (96.6,109]   39 0.0975
10   (109,120]   37 0.0925
11        <NA>   20 0.0500
# Plot bins class object
plot(bin)

# Using labels argument
bin <- binning(carseats$Income, nbins = 4, labels = c("LQ1", "UQ1", "LQ3", "UQ3")) bin binned type: quantile number of bins: 4 x LQ1 UQ1 LQ3 UQ3 <NA> 95 102 89 94 20 # Using another type argument binning(carseats$Income, nbins = 5, type = "equal")
binned type: equal
number of bins: 5
x
[21,40.8] (40.8,60.6] (60.6,80.4]  (80.4,100]   (100,120]        <NA>
81          65          94          80          60          20
binning(carseats$Income, nbins = 5, type = "pretty") binned type: pretty number of bins: 5 x [20,40] (40,60] (60,80] (80,100] (100,120] <NA> 81 65 94 80 60 20 binning(carseats$Income, nbins = 5, type = "kmeans")
binned type: kmeans
number of bins: 5
x
[21,49]   (49,70.5] (70.5,86.5]  (86.5,104]   (104,120]        <NA>
115          86          65          65          49          20
binning(carseats$Income, nbins = 5, type = "bclust") binned type: bclust number of bins: 5 x [21,46.5] (46.5,64.5] (64.5,78.5] (78.5,95.5] (95.5,120] <NA> 109 58 63 71 79 20 # ------------------------- # Using pipes & dplyr # ------------------------- library(dplyr) carseats %>% mutate(Income_bin = binning(carseats$Income)) %>%
group_by(ShelveLoc, Income_bin) %>%
summarise(freq = n()) %>%
arrange(desc(freq)) %>%
Warning: Factor Income_bin contains implicit NA, consider using
forcats::fct_explicit_na
# A tibble: 10 x 3
# Groups:   ShelveLoc [1]
ShelveLoc Income_bin  freq
<fct>     <ord>      <int>
1 Medium    [21,30]       25
2 Medium    (62,69]       24
3 Medium    (48,62]       23
4 Medium    (39,48]       21
# … with 6 more rows

### Optimal Binning with binning_by()

binning_by() converts a numeric variable into a categorical variable by optimal binning. This method is often used when developing a scorecard model.

The following binning_by() example optimally binning Advertising if US is a target variable with a binary class.

# optimal binning
Warning in binning_by(carseats, "US", "Advertising"): The factor y has been
changed to a numeric vector consisting of 0 and 1.
bin
binned type: optimal
number of bins: 3
x
[-1,0]  (0,6] (6,29]
144     69    187

# summary optimal_bins class
summary(bin)
levels freq   rate
1 [-1,0]  144 0.3600
2  (0,6]   69 0.1725
3 (6,29]  187 0.4675

# information value
attr(bin, "iv")
[1] 4.8349

# information value table
attr(bin, "ivtable")
1     <= 0    144      19    125       144         19       125 0.3600   0.1319
2     <= 6     69      54     15       213         73       140 0.1725   0.7826
3      > 6    187     185      2       400        258       142 0.4675   0.9893
4  Missing      0       0      0       400        258       142 0.0000      NaN
5    Total    400     258    142        NA         NA        NA 1.0000   0.6450
1  0.8681  0.1520 -1.8839 -2.4810 2.0013
2  0.2174  3.6000  1.2809  0.6838 0.0709
3  0.0107 92.5000  4.5272  3.9301 2.7627
4     NaN     NaN     NaN     NaN    NaN
5  0.3550  1.8169  0.5971  0.0000 4.8349

# visualize optimal_bins class
plot(bin, sub = "bins of Advertising variable")

## Creating a data transformation report using transformation_report()

transformation_report() creates a data transformation report for all the variables in the data frame or objects that inherit the data frame (tbl_df, tbl, etc.).

transformation_report() creates a data transformation report in two forms:

• pdf file based on Latex
• html file

The contents of the report are as follows.:

• Imputation
• Missing Values
• Missing values imputation information
• (variable names)
• Outliers
• Outliers imputation information
• (variable names)
• Resolving Skewness
• Skewed variables information
• (variable names)
• Binning
• Numerical Variables for Binning
• Binning
• (variable names)
• Optimal Binning
• (variable names)

The following creates a data transformation report for carseats. The file format is pdf, and the file name is Transformation_Report.pdf.

carseats %>%
transformation_report(target = US)

The following generates a report in html format called transformation.html.

carseats %>%
transformation_report(target = US, output_format = "html",
output_file = "transformation.html")

Data transformation reports are automated reports to assist in the data transformation process. Design data conversion scenarios by referring to the report results.

### Data transformation report contents

#### Contents of pdf file

• The cover of the report is shown in the following figure.
• The report’s argenda is shown in the following figure.
• Much of the information is displayed in tables and visualization results in reports. An example is shown in the following figure.

#### Contents of html file

• The title and contents of the report are shown in the following figure.
• Much information is represented in tables in the report. An example of a table in an html file is shown in the following figure.
• Binning information in the data transformation report includes visualization results. The result of the html file is shown in the following figure.