Gradient Boosting for Spending Classification: The Engine Behind Accurate Predictions
Gradient boosting algorithms—XGBoost, LightGBM, and CatBoost—dominate machine learning competitions and production systems alike. Discover how Whistl leverages these powerful algorithms to classify spending behaviour with exceptional accuracy.
Understanding Gradient Boosting
Gradient boosting builds an ensemble of weak learners (typically decision trees) sequentially, with each new tree correcting the errors of its predecessors. The "gradient" refers to fitting each new tree to the negative gradient of the loss function, a form of gradient descent in function space.
Unlike Random Forests, which train trees independently, gradient boosting trains them sequentially:
1. Start with a simple prediction (e.g., the mean of the target)
2. Calculate the residuals (errors) of the current predictions
3. Train a tree to predict these residuals
4. Add the tree's predictions to the current predictions, scaled by the learning rate
5. Repeat until convergence or until the maximum number of trees is reached
The Mathematics of Gradient Boosting
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGradientBoostingClassifier:
    """
    Simplified gradient boosting classifier for educational purposes.
    """
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.initial_prediction = None

    def _log_loss(self, y_true, y_pred_proba):
        """Calculate log loss (cross-entropy)."""
        epsilon = 1e-15
        y_pred_proba = np.clip(y_pred_proba, epsilon, 1 - epsilon)
        return -np.mean(y_true * np.log(y_pred_proba) +
                        (1 - y_true) * np.log(1 - y_pred_proba))

    def _sigmoid(self, x):
        """Sigmoid function for binary classification."""
        return 1 / (1 + np.exp(-x))

    def fit(self, X, y):
        """
        Train the gradient boosting classifier.
        """
        n_samples = len(X)
        # Initial prediction (log-odds of the positive class)
        self.initial_prediction = np.log(y.mean() / (1 - y.mean()))
        F = np.full(n_samples, self.initial_prediction)
        for i in range(self.n_estimators):
            # Convert raw scores to probabilities
            proba = self._sigmoid(F)
            # Negative gradient of the log loss (the "residuals")
            residuals = y - proba
            # Fit a regression tree to the residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            self.trees.append(tree)
            # Update raw scores, scaled by the learning rate
            F += self.learning_rate * tree.predict(X)
        return self

    def predict_proba(self, X):
        """Predict class probabilities."""
        F = np.full(len(X), self.initial_prediction)
        for tree in self.trees:
            F += self.learning_rate * tree.predict(X)
        proba = self._sigmoid(F)
        return np.column_stack([1 - proba, proba])
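The core loop above can be exercised end to end in a few lines. The following standalone sketch (using scikit-learn's `make_classification` for synthetic data, purely as an illustration) fits trees to log-loss residuals and checks that training accuracy improves over the initial constant prediction:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Start from the log-odds of the positive class
F = np.full(len(X), np.log(y.mean() / (1 - y.mean())))
learning_rate, trees = 0.1, []
for _ in range(50):
    p = 1 / (1 + np.exp(-F))                       # current probabilities
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, y - p)                             # fit to negative gradient
    F += learning_rate * tree.predict(X)           # boosting update
    trees.append(tree)

acc = ((1 / (1 + np.exp(-F)) > 0.5) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

Each iteration nudges the raw scores `F` in the direction that reduces log loss, which is exactly the update `fit` performs inside the class.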
XGBoost: Extreme Gradient Boosting
XGBoost is one of the most widely used gradient boosting implementations, known for its speed and performance. Whistl uses XGBoost as a core component of our spending classification pipeline.
Key XGBoost Features
- Regularisation: L1 (Lasso) and L2 (Ridge) penalties prevent overfitting
- Handling missing values: Learns optimal default directions
- Parallel processing: Tree construction parallelised across CPU cores
- Tree pruning: Removes branches with negative gain
- Handling imbalanced data: Scale_pos_weight parameter
XGBoost Configuration for Whistl
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
# Base configuration for spending classification
xgb_params = {
    'objective': 'binary:logistic',  # Binary classification
    'eval_metric': 'auc',            # Optimise for AUC
    'max_depth': 6,                  # Tree depth (controls complexity)
    'learning_rate': 0.05,           # Step-size shrinkage
    'n_estimators': 200,             # Number of trees
    'subsample': 0.8,                # Row subsampling (reduces overfitting)
    'colsample_bytree': 0.8,         # Column subsampling per tree
    'colsample_bylevel': 0.8,        # Column subsampling per level
    'reg_alpha': 0.1,                # L1 regularisation
    'reg_lambda': 1.0,               # L2 regularisation
    'scale_pos_weight': 3.0,         # Handle class imbalance
    'min_child_weight': 3,           # Minimum sum of instance weight (hessian) per child
    'gamma': 0.1,                    # Minimum loss reduction for a split
    'random_state': 42
}

# Create and train the model. In recent XGBoost versions (1.6+),
# early_stopping_rounds is passed to the constructor, not fit()
model = xgb.XGBClassifier(**xgb_params, early_stopping_rounds=20)
model.fit(
    X_train,
    y_train,
    eval_set=[(X_val, y_val)],  # Validation set for early stopping
    verbose=True
)

# The booster at the best validation iteration is retained automatically
best_model = model
Hyperparameter Tuning with XGBoost
# Grid search for optimal hyperparameters
param_grid = {
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 300],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'scale_pos_weight': [2, 3, 4]
}

grid_search = GridSearchCV(
    estimator=xgb.XGBClassifier(objective='binary:logistic', random_state=42),
    param_grid=param_grid,
    scoring='roc_auc',
    cv=5,
    verbose=1,
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best AUC: {grid_search.best_score_:.4f}")
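Note that this grid contains 729 combinations, so 5-fold cross-validation means 3,645 model fits. When that is too expensive, randomized search samples a fixed budget of configurations from distributions instead. A minimal sketch with scikit-learn's RandomizedSearchCV, shown here on synthetic data with sklearn's own GradientBoostingClassifier so the snippet carries no XGBoost dependency (the same pattern applies to xgb.XGBClassifier):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, n_features=10, random_state=42)

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions={
        'max_depth': randint(3, 9),
        'learning_rate': uniform(0.01, 0.09),  # samples from [0.01, 0.10]
        'n_estimators': randint(50, 200),
        'subsample': uniform(0.7, 0.3),        # samples from [0.7, 1.0]
    },
    n_iter=8,            # only 8 sampled configurations instead of a full grid
    scoring='roc_auc',
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(f"Best AUC: {search.best_score_:.4f}")
```

Randomized search often finds near-optimal settings at a fraction of the cost, because only a few hyperparameters usually matter for any given dataset.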
LightGBM: Light Gradient Boosting Machine
LightGBM, developed by Microsoft, offers faster training and lower memory usage than XGBoost while maintaining comparable accuracy. It's particularly well-suited for Whistl's mobile deployment.
LightGBM Innovations
- Gradient-based One-Side Sampling (GOSS): Keeps instances with large gradients
- Exclusive Feature Bundling (EFB): Bundles mutually exclusive features
- Leaf-wise growth: Grows trees leaf-by-leaf (vs. level-wise) for better accuracy
- Categorical feature support: Native handling without one-hot encoding
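Because trees grow leaf-wise, `num_leaves` rather than `max_depth` is LightGBM's primary complexity control. A quick rule of thumb relates the two: a level-wise tree of depth d has at most 2^d leaves, which is why the default `num_leaves=31` used below roughly corresponds to `max_depth=5`:

```python
# Rule of thumb: a level-wise tree of depth d has at most 2**d leaves,
# so leaf budgets map roughly onto equivalent depths.
for d in (4, 5, 6):
    print(f"max_depth={d} -> up to {2 ** d} leaves")
# num_leaves=31 sits just under the depth-5 budget of 32
```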
LightGBM for Spending Classification
import lightgbm as lgb
# LightGBM configuration
lgb_params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',     # Gradient Boosted Decision Trees
    'num_leaves': 31,            # Max leaves (alternative to max_depth)
    'learning_rate': 0.05,
    'feature_fraction': 0.8,     # Similar to colsample_bytree
    'bagging_fraction': 0.8,     # Similar to subsample
    'bagging_freq': 5,           # Perform bagging every 5 iterations
    'min_child_samples': 20,     # Minimum samples per leaf
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'scale_pos_weight': 3.0,
    'verbose': -1
}
# Create datasets
train_data = lgb.Dataset(X_train, label=y_train, feature_name=feature_names)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
# Train model
model = lgb.train(
    lgb_params,
    train_data,
    num_boost_round=500,
    valid_sets=[train_data, val_data],
    valid_names=['train', 'valid'],
    # Since LightGBM 4.0, early stopping and evaluation logging are callbacks
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(50)]
)
# Make predictions
predictions = model.predict(X_test)
CatBoost: Categorical Boosting
CatBoost, developed by Yandex, excels at handling categorical features—common in spending data (merchant categories, location types, etc.). It automatically handles categorical variables without extensive preprocessing.
CatBoost Advantages
- Native categorical support: No one-hot encoding needed
- Ordered boosting: Reduces prediction shift and overfitting
- Automatic handling of missing values: No imputation required
- Symmetric trees: More robust to overfitting
CatBoost for Spending Classification
from catboost import CatBoostClassifier, Pool
# Identify categorical features
categorical_features = [
    'merchant_category',
    'location_type',
    'day_of_week',
    'payment_method',
    'accountability_partner_id'
]

# Create CatBoost pools (handle categorical features natively)
train_pool = Pool(
    X_train,
    y_train,
    cat_features=categorical_features,
    feature_names=feature_names
)
val_pool = Pool(
    X_val,
    y_val,
    cat_features=categorical_features,
    feature_names=feature_names
)

# CatBoost configuration
catboost_params = {
    'iterations': 500,
    'learning_rate': 0.05,
    'depth': 6,
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'task_type': 'GPU',          # Requires a CUDA-capable GPU; use 'CPU' otherwise
    'early_stopping_rounds': 50,
    'use_best_model': True,
    'random_seed': 42,
    'scale_pos_weight': 3.0,
    'l2_leaf_reg': 3.0,
    'bagging_temperature': 0.8,
    'border_count': 254
}

# Train the model
model = CatBoostClassifier(**catboost_params)
model.fit(
    train_pool,
    eval_set=val_pool,
    verbose=50
)

# Feature importance, including categorical features
importance = model.get_feature_importance()
for name, imp in sorted(zip(feature_names, importance), key=lambda x: -x[1])[:10]:
    print(f"{name}: {imp:.4f}")
Comparing Gradient Boosting Implementations
Whistl has benchmarked all three implementations on spending classification tasks:
| Metric | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| AUC-ROC | 0.892 | 0.889 | 0.894 |
| Training Time | 45s | 18s | 62s |
| Prediction Time | 12ms | 8ms | 15ms |
| Memory Usage | Medium | Low | High |
| Categorical Handling | Manual | Manual | Automatic |
| Mobile Deployment | Good | Excellent | Good |
Handling Class Imbalance
Impulse purchases are rare compared to routine transactions. All three implementations offer strategies for handling class imbalance:
Scale Pos Weight
# Calculate scale_pos_weight for imbalanced data
n_negative = (y_train == 0).sum()
n_positive = (y_train == 1).sum()
scale_pos_weight = n_negative / n_positive
# For Whistl data: ~3:1 ratio of non-impulse to impulse
# scale_pos_weight = 3.0
# Apply to XGBoost
xgb_params['scale_pos_weight'] = scale_pos_weight
# Apply to LightGBM
lgb_params['scale_pos_weight'] = scale_pos_weight
# Apply to CatBoost
catboost_params['scale_pos_weight'] = scale_pos_weight
Focal Loss
Focal loss down-weights easy examples and focuses on hard-to-classify cases:
# Custom focal loss for XGBoost
def focal_grad(pred, y, gamma):
    """Analytic gradient of the binary focal loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + np.exp(-pred))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    grad_pos = gamma * p * (1 - p) ** gamma * np.log(p) - (1 - p) ** (gamma + 1)
    grad_neg = p ** (gamma + 1) - gamma * (1 - p) * p ** gamma * np.log(1 - p)
    return y * grad_pos + (1 - y) * grad_neg

def focal_loss(pred, dtrain, gamma=2.0):
    """Focal loss objective for XGBoost (down-weights easy examples)."""
    y = dtrain.get_label()
    grad = focal_grad(pred, y, gamma)
    # Hessian approximated by a central finite difference of the gradient
    eps = 1e-6
    hess = (focal_grad(pred + eps, y, gamma) -
            focal_grad(pred - eps, y, gamma)) / (2 * eps)
    hess = np.maximum(hess, 1e-9)  # keep the Hessian positive for stability
    return grad, hess

# Use the custom objective (train_data must be an xgb.DMatrix here)
model = xgb.train(
    {'eval_metric': 'auc'},
    train_data,
    num_boost_round=500,
    obj=focal_loss
)
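Hand-derived gradients for custom objectives are easy to get wrong, so it is worth sanity-checking them numerically. This standalone snippet defines the binary focal loss and its analytic derivative with respect to the raw score, then verifies the derivative against a central finite difference for both classes (gamma=2):

```python
import numpy as np

def focal_value(x, y, gamma=2.0):
    """Binary focal loss as a function of the raw score x."""
    p = 1.0 / (1.0 + np.exp(-x))
    return -(y * (1 - p) ** gamma * np.log(p)
             + (1 - y) * p ** gamma * np.log(1 - p))

def focal_grad(x, y, gamma=2.0):
    """Analytic derivative of the focal loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + np.exp(-x))
    grad_pos = gamma * p * (1 - p) ** gamma * np.log(p) - (1 - p) ** (gamma + 1)
    grad_neg = p ** (gamma + 1) - gamma * (1 - p) * p ** gamma * np.log(1 - p)
    return y * grad_pos + (1 - y) * grad_neg

# Central finite difference should agree with the analytic gradient
x = np.linspace(-3, 3, 13)
eps = 1e-6
for y in (0.0, 1.0):
    numeric = (focal_value(x + eps, y) - focal_value(x - eps, y)) / (2 * eps)
    assert np.allclose(focal_grad(x, y), numeric, atol=1e-5)
print("gradient check passed")
```

The same finite-difference trick applies to any custom XGBoost or LightGBM objective before it is trusted in training.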
Model Interpretability with Gradient Boosting
Gradient boosting models provide several interpretability features:
Feature Importance
# XGBoost feature importance
xgb.plot_importance(model, max_num_features=15)
# LightGBM feature importance
lgb.plot_importance(model, max_num_features=15)
# CatBoost feature importance (CatBoost has no plot_importance helper,
# so plot the importances directly)
import matplotlib.pyplot as plt
plt.barh(feature_names, model.get_feature_importance())
SHAP Values
import shap
# SHAP for XGBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Summary plot
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
# Dependence plot for specific feature
shap.dependence_plot('stress_level', shap_values, X_test, feature_names=feature_names)
"The accuracy of Whistl's predictions is impressive. As someone who works in data science, I asked how they achieved it. Learning they use gradient boosting—specifically an ensemble of XGBoost and LightGBM—made perfect sense. These are battle-tested algorithms that dominate Kaggle competitions for good reason."
Production Deployment Considerations
Deploying gradient boosting models in production requires attention to:
Model Serialization
# Save XGBoost model
model.save_model('whistl_spending_classifier.json')
# Load model
loaded_model = xgb.XGBClassifier()
loaded_model.load_model('whistl_spending_classifier.json')
# For mobile deployment (Core ML for iOS), coremltools provides a
# dedicated converter for XGBoost tree ensembles
import coremltools as ct
mlmodel = ct.converters.xgboost.convert(
    model.get_booster(),
    mode='classifier'
)
mlmodel.save('WhistlSpendingClassifier.mlmodel')
Monitoring and Retraining
from sklearn.metrics import roc_auc_score

def monitor_model_drift(model, X_new, y_new, threshold=0.05):
    """
    Monitor for model drift in production.
    """
    # Calculate current performance
    predictions = model.predict_proba(X_new)[:, 1]
    current_auc = roc_auc_score(y_new, predictions)
    # Compare to the validation baseline
    baseline_auc = 0.89  # From validation
    drift = baseline_auc - current_auc
    if drift > threshold:
        print(f"Warning: model drift detected! AUC dropped by {drift:.4f}")
        return True   # Trigger retraining
    return False      # Model still performing well
Getting Started with Whistl
Experience the power of gradient boosting-powered spending classification. Whistl's AI accurately identifies impulse risk patterns, enabling timely interventions that help you stay on track with your financial goals.
Accurate AI-Powered Spending Classification
Join thousands of Australians using Whistl's gradient boosting-based prediction system for reliable impulse detection.
Crisis Support Resources
If you're experiencing severe financial distress or gambling-related harm, professional support is available:
- Gambling Help: 1800 858 858 (24/7, free and confidential)
- Lifeline: 13 11 14 (24/7 crisis support)
- Beyond Blue: 1300 22 4636 (mental health support)
- Financial Counselling Australia: 1800 007 007