Using data from a Kaggle competition, let's compare the performance of classic gradient boosting decision trees (GBDTs) against the new generation of implementations.
In this competition, proposed by Santander Bank, Kaggle users are invited to predict which customers will make a specific transaction in the future, regardless of the amount of money involved. The data provided has the same structure as the real data the bank works with to solve this problem, so we face a realistic task with a dataset that is demanding in both number of records and number of features, a good setting in which to test the performance of classic algorithms against next-generation ones.
The data is anonymised: each row contains 200 numeric variables and no categorical variables.
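A minimal sketch of loading and inspecting data with this structure. In the competition download the training file is `train.csv`, with an `ID_code` column, a binary `target`, and features named `var_0` through `var_199`; here a tiny stand-in file with the same layout is generated first so the snippet runs without the real data.

```python
import numpy as np
import pandas as pd

# Generate a small stand-in for train.csv (same column layout as the
# competition file; the values themselves are synthetic).
rng = np.random.default_rng(0)
n_rows, n_vars = 50, 200
stand_in = pd.DataFrame(
    rng.normal(size=(n_rows, n_vars)),
    columns=[f"var_{i}" for i in range(n_vars)],
)
stand_in.insert(0, "target", rng.integers(0, 2, size=n_rows))
stand_in.insert(0, "ID_code", [f"train_{i}" for i in range(n_rows)])
stand_in.to_csv("train.csv", index=False)

# Load and inspect: one ID column, one target, 200 numeric features.
train = pd.read_csv("train.csv")
print(train.shape)                  # (n_rows, 202)
print(train.dtypes.value_counts())  # features are all numeric
```

With the real competition file the same two `print` calls confirm the shape and the absence of categorical columns before any modelling.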
Next we'll explore the data, prepare it for modelling, analyse which algorithms achieve the best performance with low overfitting, and compare the results between them:
- Data extraction
- Data exploration
- Unbalanced Data and Resampling
- Feature selection
- Binary classification models
- Hyperparameter tuning
- Detection of the most influential variables
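One step in the outline above, handling unbalanced data via resampling, can be sketched with random undersampling of the majority class. The toy 90/10 class split below is an assumption for illustration (the real target is also heavily imbalanced), and the approach shown is plain NumPy rather than any specific resampling library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an imbalanced binary target (90 negatives,
# 10 positives) with 5 dummy features.
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(100, 5))

# Random undersampling: keep every minority-class row and draw an
# equal number of majority-class rows without replacement.
pos_idx = np.flatnonzero(y == 1)
neg_idx = rng.choice(np.flatnonzero(y == 0), size=pos_idx.size, replace=False)
keep = np.concatenate([pos_idx, neg_idx])

X_bal, y_bal = X[keep], y[keep]
print(y_bal.mean())  # 0.5: classes are now balanced
```

Undersampling discards majority-class information, so in practice it is worth comparing it against oversampling or class-weighting before committing to one strategy.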