[R] Help needed! Pre-processing the dataset before splitting - model building - model tuning - performance evaluation

Tue Sep 24 07:04:46 CEST 2024

Below is the link to the dataset in question. I want to split the dataset
into training and test sets, use the training set to build and tune the
model, and use the test set to evaluate performance. Before doing that, I
want to make sure the original dataset has no noise, collinearity, or major
outliers that need addressing. If it does, I would transform the data with
techniques such as Box-Cox and use VIF to eliminate highly correlated
predictors.

https://www.kaggle.com/datasets/joaofilipemarques/google-advanced-data-analytics-waze-user-data
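
As an illustration only, the split and the VIF check described above might
be sketched in Python as follows. This is a minimal sketch on synthetic
data, not the Waze dataset. The helper names train_test_split_idx and vif
are hypothetical, the 80/20 split is an assumption, and the VIF is computed
on the training rows only so that the test set does not influence which
predictors are dropped.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real predictors (NOT the Waze data).
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.05 * rng.normal(size=n)   # nearly a copy of x1, so collinear
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def train_test_split_idx(n_rows, test_frac, rng):
    """Return shuffled (train_idx, test_idx) index arrays (hypothetical helper)."""
    idx = rng.permutation(n_rows)
    n_test = int(round(test_frac * n_rows))
    return idx[n_test:], idx[:n_test]

def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Assumed 80/20 split; the fraction is a choice, not a rule.
train_idx, test_idx = train_test_split_idx(n, 0.2, rng)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Collinearity check on the training predictors only.
vifs = vif(X_train)
print(vifs)
```

In this toy setup the first and third columns should show a very large VIF
and the second a VIF near 1. A common rule of thumb treats values above
roughly 5 to 10 as a warning sign, but the cutoff is a judgment call.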

When I fit the original dataset to a regression model in Minitab, I get the
attached residual plots. The residuals do not look normal. Does this mean
there is high correlation among the predictors, or that the relationship
between the response and the predictors is nonlinear? How should I approach
this? What would my strategy be in Python, Minitab, and R? An explanation
for all three packages would be appreciated, if possible.
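
One way to probe the residual question in Python is sketched below, again
on synthetic data and not on the Waze dataset. Skewed residuals are often a
sign of a skewed response or a nonlinear relationship rather than of
collinearity, which mainly inflates the variance of the coefficients. The
sketch assumes a strictly positive response, fits an ordinary least squares
line, picks a Box-Cox lambda from a grid by profile likelihood, and compares
residual skewness before and after. All function names, and the choice of
grid, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, strictly positive, right-skewed response (NOT the Waze data).
n = 500
x = rng.normal(size=n)
y = np.exp(0.5 + 0.8 * x + 0.3 * rng.normal(size=n))
A = np.column_stack([np.ones(n), x])   # design matrix with intercept

def boxcox(y, lam):
    """Box-Cox transform; requires y > 0. lam = 0 is the log transform."""
    return np.log(y) if abs(lam) < 1e-12 else (y**lam - 1.0) / lam

def ols_residuals(A, y):
    """Residuals of an ordinary least squares fit of y on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

def skewness(r):
    """Sample skewness: third central moment over variance to the power 1.5."""
    r = r - r.mean()
    return (r**3).mean() / (r**2).mean() ** 1.5

def profile_loglik(lam):
    """Box-Cox profile log-likelihood for the regression, up to a constant."""
    resid = ols_residuals(A, boxcox(y, lam))
    return -0.5 * n * np.log((resid**2).mean()) + (lam - 1.0) * np.log(y).sum()

# Assumed grid of candidate lambdas from -1 to 2 in steps of 0.1.
lambdas = np.linspace(-1.0, 2.0, 31)
best_lam = lambdas[np.argmax([profile_loglik(l) for l in lambdas])]

skew_raw = skewness(ols_residuals(A, y))
skew_bc = skewness(ols_residuals(A, boxcox(y, best_lam)))
print(best_lam, skew_raw, skew_bc)
```

Here the data were generated so that log(y) is linear in x, so the selected
lambda should land near 0 and the residual skewness should shrink after the
transform. On real data, a residuals-versus-fitted plot and a normal
probability plot would still be worth checking after any transformation.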

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Residual Plots for Response.png
Type: image/png
Size: 17679 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20240924/8046d3c5/attachment.png>