# Custom functions with bestNormalize

This vignette will go over the steps required to implement a custom user-defined function within the bestNormalize framework.

There are 3 steps.

1. Create transformation function

2. Create predict method for transformation function (that can be applied to new data)

3. Pass through new function and predict method to bestNormalize

## S3 methods

Here, we start by defining a new function that we’ll call cuberoot_x, which will take an argument a (as does the sqrt_x function) which will try to add a constant if it sees any negative numbers in x. It will also take the argument standardize which will center and scale the transformed data so that it’s centered at 0 with SD = 1.

## Define user-function
cuberoot_x <- function(x, a = NULL, standardize = TRUE, ...) {
stopifnot(is.numeric(x))

min_a <- max(0, -(min(x, na.rm = TRUE)))
if(!length(a))
a <- min_a
if(a < min_a) {
warning("Setting a <  max(0, -(min(x))) can lead to transformation issues",
"Standardize set to FALSE")
standardize <- FALSE
}

x.t <- (x + a)^(1/3)
mu <- mean(x.t, na.rm = TRUE)
sigma <- sd(x.t, na.rm = TRUE)
if (standardize) x.t <- (x.t - mu) / sigma

# Get in-sample normality statistic results
ptest <- nortest::pearson.test(x.t)

val <- list(
x.t = x.t,
x = x,
mean = mu,
sd = sigma,
a = a,
n = length(x.t) - sum(is.na(x)),
norm_stat = unname(ptest$statistic / ptest$df),
standardize = standardize
)

# Assign class
class(val) <- c('cuberoot_x', class(val))
val
}

Note that we assigned a class to the object this returns of the same name; this is necessary for successful implementation within bestNormalize. We’ll also need an associated predict method that is used to apply the transformation to newly observed data. 

predict.cuberoot_x <- function(object, newdata = NULL, inverse = FALSE, ...) {

# If no data supplied and not inverse
if (is.null(newdata) & !inverse)
newdata <- object$x # If no data supplied and inverse if (is.null(newdata) & inverse) newdata <- object$x.t

# Actually performing transformations

# Perform inverse transformation as estimated
if (inverse) {

# Reverse-standardize
if (object$standardize) newdata <- newdata * object$sd + object$mean # Reverse-cube-root (cube) newdata <- newdata^3 - object$a

# Otherwise, perform transformation as estimated
} else if (!inverse) {
# Take cube root
newdata <- (newdata + object$a)^(1/3) # Standardize to mean 0, sd 1 if (object$standardize)
newdata <- (newdata - object$mean) / object$sd
}

# Return transformed data
unname(newdata)
}

## Optional: print method

This will be printed when bestNormalize selects your custom method or when you print an object returned by your new custom function.

print.cuberoot_x <- function(x, ...) {
cat(ifelse(x$standardize, "Standardized", "Non-Standardized"), 'cuberoot(x + a) Transformation with', x$n, 'nonmissing obs.:\n',
'Relevant statistics:\n',
'- a =', x$a, '\n', '- mean (before standardization) =', x$mean, '\n',
'- sd (before standardization) =', x$sd, '\n') } Note: if you can find a similar transformation in the source code, it’s easy to model your code after it. For instance, for cuberoot_x and predict.cuberoot_x, I used sqrt_x.R as a template file. ## Implementing with bestNormalize # Store custom functions into list custom_transform <- list( cuberoot_x = cuberoot_x, predict.cuberoot_x = predict.cuberoot_x, print.cuberoot_x = print.cuberoot_x ) set.seed(123129) x <- rgamma(100, 1, 1) (b <- bestNormalize(x = x, new_transforms = custom_transform, standardize = FALSE)) ## Best Normalizing transformation with 100 Observations ## Estimated Normality Statistics (Pearson P / df, lower => more normal): ## - arcsinh(x): 1.2347 ## - Box-Cox: 1.0267 ## - cuberoot_x: 0.9787 ## - Double Reversed Log_b(x+a): 2.4507 ## - Exp(x): 4.7947 ## - Log_b(x+a): 1.3547 ## - No transform: 2.0027 ## - orderNorm (ORQ): 1.1627 ## - sqrt(x + a): 1.0907 ## - Yeo-Johnson: 1.0987 ## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats ## ## Based off these, bestNormalize chose: ## Non-Standardized cuberoot(x + a) Transformation with 100 nonmissing obs.: ## Relevant statistics: ## - a = 0 ## - mean (before standardization) = 0.9588261 ## - sd (before standardization) = 0.3298665 Evidently, the cube-rooting was the best normalizing transformation! ## Sanity check Is this code actually performing the cube-rooting? all.equal(x^(1/3), b$chosen_transform$x.t) ## [1] TRUE all.equal(x^(1/3), predict(b)) ## [1] TRUE It does indeed. # Using custom normalization statistics The bestNormalize package can estimate any univariate statistic using its CV framework. A user-defined function can be passed in through the norm_stat_fn argument, and this function will then be applied in lieu of the Pearson test statistic divided by its degree of freedom. The user-defined function must take an argument x, which indicates the data on which a user wants to evaluate the statistic. Here is an example using Lilliefors (Kolmogorov-Smirnov) normality test statistic: bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$stat)
## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (using custom normalization statistic)
##  - arcsinh(x): 0.1958
##  - Box-Cox: 0.1785
##  - Center+scale: 0.2219
##  - Double Reversed Log_b(x+a): 0.2624
##  - Exp(x): 0.3299
##  - Log_b(x+a): 0.1959
##  - orderNorm (ORQ): 0.186
##  - sqrt(x + a): 0.1829
##  - Yeo-Johnson: 0.1872
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
##  Estimated statistics:
##  - lambda = 0.3281193
##  - mean (before standardization) = -0.1263882
##  - sd (before standardization) = 0.9913552

Here is an example using Lilliefors (Kolmogorov-Smirnov) normality test’s p-value:

(dont_do_this <- bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$p)) ## Best Normalizing transformation with 100 Observations ## Estimated Normality Statistics (using custom normalization statistic) ## - arcsinh(x): 0.4327 ## - Box-Cox: 0.4831 ## - Center+scale: 0.2958 ## - Double Reversed Log_b(x+a): 0.2028 ## - Exp(x): 0.0675 ## - Log_b(x+a): 0.3589 ## - orderNorm (ORQ): 0.4492 ## - sqrt(x + a): 0.4899 ## - Yeo-Johnson: 0.4531 ## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats ## ## Based off these, bestNormalize chose: ## Standardized exp(x) Transformation with 100 nonmissing obs.: ## Relevant statistics: ## - mean (before standardization) = 6.885396 ## - sd (before standardization) = 13.66084 Note: bestNormalize will attempt to minimize this statistic by default, which is definitely not what you want to do when calculating the p-value. This is seen in the example above, as the WORST normalization transformation is chosen. In this case, a user is advised to either manually select the best one: best_transform <- names(which.max(dont_do_this$norm_stats))
(do_this <- dont_do_this$other_transforms[[best_transform]]) ## Standardized sqrt(x + a) Transformation with 100 nonmissing obs.: ## Relevant statistics: ## - a = 0 ## - mean (before standardization) = 0.9811849 ## - sd (before standardization) = 0.4779252 Or, the user can reverse their defined statistic (in this case by subtracting it from 1): (do_this <- bestNormalize(x, norm_stat_fn = function(x) 1-nortest::lillie.test(x)$p))
## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (using custom normalization statistic)
##  - arcsinh(x): 0.5166
##  - Box-Cox: 0.4191
##  - Center+scale: 0.6521
##  - Double Reversed Log_b(x+a): 0.8005
##  - Exp(x): 0.9601
##  - Log_b(x+a): 0.5338
##  - orderNorm (ORQ): 0.4646
##  - sqrt(x + a): 0.4475
##  - Yeo-Johnson: 0.4773
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
##  Estimated statistics:
##  - lambda = 0.3281193
##  - mean (before standardization) = -0.1263882
##  - sd (before standardization) = 0.9913552`