[R] Help needed! Pre-processing the dataset before splitting - model building - model tuning - performance evaluation

Rui Barradas
Wed Sep 25 10:00:34 CEST 2024


At 06:04 on 24/09/2024, Bekzod Akhmuratov wrote:
> Below is the link to the dataset in question. I want to split the dataset into
> training and test sets, use the training set to build and tune the model, and
> use the test set to evaluate performance. But before doing that, I want to make
> sure the original dataset has no noise, no collinearity to address, and no
> major outliers that would require transforming the data with techniques like
> Box-Cox, and I want to look at VIF to eliminate highly correlated predictors.
> 
> https://www.kaggle.com/datasets/joaofilipemarques/google-advanced-data-analytics-waze-user-data
> 
> When I fit a regression model to the original dataset in Minitab, I get the
> attached result for the residuals. They do not look normal. Does this mean there
> is high correlation, or that the relationship between the response and the
> predictors is nonlinear? How should I approach this? What would my strategy be
> in Python, Minitab, and R? An explanation for all three packages would be
> appreciated, if possible.
> 
> 
Hello,

R-Help is a list for questions and answers about R code, not for suggesting
analysis strategies. That said, I suggest that you read the Python notebook
at the bottom of the Kaggle page; it addresses your main question. If you
have trouble translating the Python code to R, ask us more specific
questions about those difficulties.
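
For orientation only, the workflow described in the question (split, check collinearity, fit, inspect residuals, evaluate on the test set) might be sketched in base R roughly as below. This is an illustration under assumptions, not a recommended analysis: the built-in mtcars data stands in for the Waze data, the response, the predictors and the 70/30 split are arbitrary choices, and the VIF is computed by hand from its definition rather than with a package.

```r
# Stand-in data (assumption): built-in mtcars replaces the Waze CSV, which is not loaded here.
set.seed(42)                                   # reproducible split
dat <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec")]

# 1. Split roughly 70/30 into training and test sets.
n <- nrow(dat)
train_idx <- sample(n, size = floor(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# 2. Variance inflation factors on the training predictors:
#    VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others.
predictors <- setdiff(names(train), "mpg")
vif <- sapply(predictors, function(p) {
  f  <- reformulate(setdiff(predictors, p), response = p)
  r2 <- summary(lm(f, data = train))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))

# 3. Fit on the training set only and inspect the residuals.
fit <- lm(mpg ~ ., data = train)
print(shapiro.test(residuals(fit)))            # formal normality test
# plot(fit, which = 1:2)                       # residuals vs fitted, normal Q-Q plot

# 4. Evaluate on the held-out test set.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
print(rmse)
```

The diagnostics are computed on the training set only, so that nothing learned from the test set leaks into the choice of predictors or transformations.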

Hope this helps,

Rui Barradas

