Ensemble Methods for Risk Prediction: Why Multiple Models Beat Single Models

Just as diverse groups make better decisions than individuals, combining multiple machine learning models produces more accurate and robust predictions. Discover how Whistl uses ensemble methods—Random Forests, Gradient Boosting, and model stacking—to deliver reliable impulse risk predictions.

The Wisdom of Crowds in Machine Learning

In 1906, statistician Francis Galton observed a remarkable phenomenon at a country fair. Visitors were asked to guess the weight of an ox. Individually, guesses varied wildly. But the average of all guesses was 1,197 pounds—remarkably close to the actual weight of 1,198 pounds.

This "wisdom of crowds" effect applies to machine learning. A single model might be brilliant in some situations and blind in others. But combine multiple models, and their individual errors tend to cancel out while their correct predictions reinforce each other.

At Whistl, ensemble methods are fundamental to our risk prediction system. No single algorithm captures all the complexity of human financial behaviour—but together, multiple models achieve remarkable accuracy.

Why Ensembles Work

Ensemble methods reduce the two components of prediction error:

Bias: systematic error from a model too simple to capture the true pattern. Boosting attacks bias, with each new tree correcting the residual errors of the last.

Variance: sensitivity to the particular training sample, typical of deep, flexible models. Bagging attacks variance by averaging many independently trained models.

Different models have different bias-variance profiles. By combining them, ensembles achieve a better balance than any single model could.
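The variance-reduction half of this is easy to see numerically: averaging the outputs of models whose errors are independent shrinks the spread of the combined prediction by roughly the square root of the number of models. A minimal sketch with simulated predictors (the noise level here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 0.5  # the quantity every model is trying to estimate

# Simulate 50 models whose predictions carry independent noise
n_models, n_trials = 50, 10_000
predictions = true_value + rng.normal(0.0, 0.2, size=(n_trials, n_models))

single_model_error = predictions[:, 0].std()     # close to the raw noise, ~0.2
ensemble_error = predictions.mean(axis=1).std()  # shrinks by ~sqrt(50)

print(round(single_model_error, 3), round(ensemble_error, 3))
```

In practice base models are never fully independent, so the gain is smaller, but the direction of the effect is the same.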

Random Forests: Diversity Through Bootstrap Aggregation

Random Forests are among the most popular ensemble methods. They combine many decision trees, each trained on a different subset of data and features.

How Random Forests Work

from sklearn.tree import DecisionTreeClassifier
import numpy as np

class ImpulseRiskRandomForest:
    def __init__(self, n_trees=100, max_depth=15):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.trees = []
    
    def fit(self, X, y):
        """
        Train Random Forest using bootstrap aggregation (bagging).
        Each tree sees a different random subset of data and features.
        """
        n_samples = len(X)
        self.n_features = X.shape[1]  # recorded for feature-importance aggregation
        
        for i in range(self.n_trees):
            # Bootstrap sample (sample with replacement)
            indices = np.random.choice(n_samples, size=n_samples, replace=True)
            X_bootstrap = X[indices]
            y_bootstrap = y[indices]
            
            # Train decision tree with random feature subset
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                max_features='sqrt',  # Random feature subset at each split
                random_state=i
            )
            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)
    
    def predict_proba(self, X):
        """
        Aggregate predictions from all trees.
        Final probability = average of individual tree predictions.
        """
        predictions = np.zeros((len(X), 2))
        
        for tree in self.trees:
            predictions += tree.predict_proba(X)
        
        predictions /= self.n_trees
        return predictions
    
    def get_feature_importance(self):
        """Average feature importance across all trees."""
        importances = np.zeros(self.n_features)
        for tree in self.trees:
            importances += tree.feature_importances_
        return importances / self.n_trees
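The hand-rolled class above mirrors what scikit-learn's built-in RandomForestClassifier does out of the box. A minimal usage sketch on synthetic data (the generated features are illustrative stand-ins for real transaction features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for transaction data (80% routine, 20% risky)
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_depth=15,
                                max_features='sqrt', random_state=42)
forest.fit(X_train, y_train)

# Probability of the positive (risky) class for each test transaction
risk_scores = forest.predict_proba(X_test)[:, 1]
print(risk_scores.shape)
```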

Why Random Forests Excel at Risk Prediction

Because each tree sees a different bootstrap sample and a random feature subset, the trees are largely decorrelated, and averaging them cancels much of any single tree's variance. The result is a model that is robust to noisy features and outlier transactions, resistant to overfitting, and able to report which features drive risk through aggregated feature importances.

Gradient Boosting: Learning from Mistakes

While Random Forests train trees independently, Gradient Boosting trains trees sequentially, with each tree learning to correct the mistakes of its predecessors.

The Boosting Process

from sklearn.tree import DecisionTreeRegressor
import numpy as np

class GradientBoostingRiskPredictor:
    def __init__(self, n_estimators=100, learning_rate=0.1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.trees = []
        self.initial_prediction = None
    
    def fit(self, X, y):
        """
        Train Gradient Boosting classifier.
        Each tree fits the residuals (errors) of previous trees.
        """
        # Initial prediction (log-odds of positive class)
        self.initial_prediction = np.log(y.mean() / (1 - y.mean()))
        current_predictions = np.full(len(X), self.initial_prediction)
        
        for i in range(self.n_estimators):
            # Calculate residuals (negative gradient of loss function)
            probabilities = 1 / (1 + np.exp(-current_predictions))
            residuals = y - probabilities
            
            # Fit tree to residuals
            tree = DecisionTreeRegressor(max_depth=3)  # Shallow trees
            tree.fit(X, residuals)
            self.trees.append(tree)
            
            # Update predictions
            current_predictions += self.learning_rate * tree.predict(X)
    
    def predict_proba(self, X):
        """Aggregate predictions from all trees."""
        predictions = np.full(len(X), self.initial_prediction)
        
        for tree in self.trees:
            predictions += self.learning_rate * tree.predict(X)
        
        probabilities = 1 / (1 + np.exp(-predictions))
        return np.column_stack([1 - probabilities, probabilities])
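The same sequential error-correction is available off the shelf in scikit-learn, whose staged_predict_proba makes the "learning from mistakes" dynamic visible: training loss falls as each tree corrects its predecessors. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=42)
gbm.fit(X, y)

# staged_predict_proba yields predictions after each boosting round,
# so we can watch the training loss fall tree by tree
losses = [log_loss(y, proba) for proba in gbm.staged_predict_proba(X)]
print(round(losses[0], 3), round(losses[-1], 3))
```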

XGBoost: Optimised Gradient Boosting

Whistl uses XGBoost (Extreme Gradient Boosting), an optimised implementation that adds built-in L1/L2 regularisation, row and column subsampling, parallelised tree construction, and native handling of missing values:

import xgboost as xgb

# XGBoost configuration for Whistl risk prediction
xgb_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'learning_rate': 0.05,
    'n_estimators': 200,
    'subsample': 0.8,  # Row subsampling
    'colsample_bytree': 0.8,  # Column subsampling
    'reg_alpha': 0.1,  # L1 regularisation
    'reg_lambda': 1.0,  # L2 regularisation
    'scale_pos_weight': 3.0,  # Handle class imbalance
    'random_state': 42
}

model = xgb.XGBClassifier(**xgb_params)
model.fit(X_train, y_train)
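One detail worth noting: scale_pos_weight is conventionally set to the ratio of negative to positive examples, so the 3.0 above implies roughly three routine transactions per impulse purchase in the training data. A quick sketch with illustrative labels:

```python
import numpy as np

# Illustrative training labels: 25% positive (impulse) class
y_train = np.array([0] * 750 + [1] * 250)

# Conventional rule of thumb: negatives divided by positives
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(scale_pos_weight)  # 3.0
```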

Model Stacking: Learning to Combine Predictions

Model stacking (or stacked generalisation) takes ensembling further: instead of simply averaging predictions, a meta-learner learns the optimal way to combine base model predictions.

Stacking Architecture

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
import numpy as np

class StackedRiskPredictor:
    def __init__(self):
        # Level 0: Base models (diverse algorithms)
        self.base_models = {
            'random_forest': RandomForestClassifier(
                n_estimators=100, max_depth=15, random_state=42
            ),
            'gradient_boosting': GradientBoostingClassifier(
                n_estimators=100, learning_rate=0.1, random_state=42
            ),
            'neural_network': MLPClassifier(
                hidden_layer_sizes=(100, 50), random_state=42
            ),
            'logistic_regression': LogisticRegression(random_state=42)
        }
        
        # Level 1: Meta-learner (combines base model predictions)
        self.meta_learner = LogisticRegression()
    
    def fit(self, X, y):
        """
        Train stacked ensemble using cross-validation.
        Use out-of-fold predictions to train meta-learner (prevents overfitting).
        """
        from sklearn.model_selection import KFold
        
        n_folds = 5
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
        
        # Generate out-of-fold predictions for meta-learner training
        n_samples = len(X)
        n_base_models = len(self.base_models)
        oof_predictions = np.zeros((n_samples, n_base_models))
        
        for fold_idx, (train_idx, val_idx) in enumerate(kf.split(X)):
            X_train_fold, X_val_fold = X[train_idx], X[val_idx]
            y_train_fold = y[train_idx]
            
            for model_idx, (name, model) in enumerate(self.base_models.items()):
                # Train on fold
                model.fit(X_train_fold, y_train_fold)
                
                # Predict on validation fold
                oof_predictions[val_idx, model_idx] = model.predict_proba(X_val_fold)[:, 1]
        
        # Train meta-learner on out-of-fold predictions
        self.meta_learner.fit(oof_predictions, y)
        
        # Retrain all base models on full data
        for name, model in self.base_models.items():
            model.fit(X, y)
    
    def predict_proba(self, X):
        """Generate predictions using stacked ensemble."""
        # Get base model predictions
        base_predictions = np.zeros((len(X), len(self.base_models)))
        
        for model_idx, (name, model) in enumerate(self.base_models.items()):
            base_predictions[:, model_idx] = model.predict_proba(X)[:, 1]
        
        # Meta-learner combines predictions
        final_predictions = self.meta_learner.predict_proba(base_predictions)
        
        return final_predictions
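scikit-learn packages this exact pattern as StackingClassifier, which handles the out-of-fold bookkeeping internally. A minimal sketch on synthetic data (the base-model choices here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner
    cv=5,  # meta-learner is trained on out-of-fold predictions
)
stack.fit(X, y)

probas = stack.predict_proba(X)
print(probas.shape)
```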

Why Stacking Outperforms Simple Averaging

The meta-learner discovers which models are most reliable in different situations. The tree ensembles tend to dominate where feature interactions matter, for instance, while logistic regression contributes most when the signal is close to linear. The meta-learner learns these patterns and weights each model's prediction accordingly.

Ensemble Performance in Whistl

Whistl has extensively benchmarked ensemble methods against individual models:

| Model | Precision | Recall | F1 Score | AUC-ROC |
| --- | --- | --- | --- | --- |
| Logistic Regression | 71.2% | 65.8% | 68.4% | 0.74 |
| Single Decision Tree | 68.5% | 71.2% | 69.8% | 0.71 |
| Random Forest | 84.2% | 79.6% | 81.8% | 0.88 |
| XGBoost | 85.7% | 81.3% | 83.5% | 0.89 |
| Neural Network | 82.1% | 78.9% | 80.5% | 0.86 |
| Stacked Ensemble | 88.4% | 84.7% | 86.5% | 0.92 |
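The F1 scores in the table are simply the harmonic mean of the precision and recall columns, which is easy to verify:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Random Forest row: precision 84.2%, recall 79.6%
print(round(f1(0.842, 0.796) * 100, 1))  # 81.8
```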

Handling Class Imbalance with Ensembles

Impulse purchases are relatively rare compared to routine transactions. This class imbalance challenges all machine learning models. Ensembles offer several solutions:

Balanced Random Forests

from imblearn.ensemble import BalancedRandomForestClassifier

# Balanced Random Forest automatically handles class imbalance
brf = BalancedRandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    sampling_strategy='auto',  # Balance classes in each bootstrap sample
    replacement=True,
    random_state=42
)

brf.fit(X_train, y_train)

Focal Loss for Hard Examples

Focal loss down-weights easy examples and focuses training on hard-to-classify cases:

import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    """
    Focal loss for handling class imbalance.
    Down-weights easy examples, focuses on hard examples.
    """
    epsilon = 1e-7
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    # Calculate cross-entropy
    ce = -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)
    
    # Calculate focal weight
    pt = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    focal_weight = alpha * (1 - pt) ** gamma
    
    return np.mean(focal_weight * ce)
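The focal weight's effect is easy to see numerically: with the default gamma and alpha, a confidently correct example receives a far smaller weight than one the model barely gets right, so training effort concentrates on the hard cases:

```python
gamma, alpha = 2.0, 0.25

def focal_weight(pt):
    """Weight applied to an example predicted with probability pt for its true class."""
    return alpha * (1 - pt) ** gamma

easy = focal_weight(0.95)  # model already confident and correct
hard = focal_weight(0.55)  # model barely better than chance

print(easy, hard, hard / easy)  # the hard example is weighted 81x more heavily
```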

Ensemble Interpretability

While ensembles are more complex than single models, they remain interpretable:

Feature Importance Aggregation

def get_ensemble_feature_importance(ensemble, feature_names):
    """
    Aggregate feature importance across all models in ensemble.
    """
    importance_dict = {}
    
    for name, model in ensemble.base_models.items():
        if hasattr(model, 'feature_importances_'):
            importance_dict[name] = dict(zip(
                feature_names, 
                model.feature_importances_
            ))
    
    # Average across models
    avg_importance = {}
    for feature in feature_names:
        avg_importance[feature] = np.mean([
            importance_dict[model].get(feature, 0) 
            for model in importance_dict
        ])
    
    # Sort by importance
    sorted_importance = sorted(
        avg_importance.items(), 
        key=lambda x: x[1], 
        reverse=True
    )
    
    return sorted_importance

# Example output:
# [
#     ('stress_level', 0.18),
#     ('time_since_payday', 0.15),
#     ('location_risk', 0.12),
#     ('spending_velocity', 0.11),
#     ('category_momentum', 0.09),
#     ...
# ]

SHAP Values for Ensemble Predictions

SHAP values explain individual predictions from ensemble models. For tree-based ensembles such as Random Forests or XGBoost, TreeExplainer computes them efficiently:

import shap

# TreeExplainer supports tree ensembles (Random Forest, XGBoost);
# model-agnostic explainers such as shap.KernelExplainer can cover
# the stacked ensemble as a whole
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_sample)

# Visualise feature contributions
shap.summary_plot(shap_values, X_sample, feature_names=feature_names)

"I was impressed by how consistent Whistl's predictions were. Even when my behaviour was erratic, the app seemed to 'get it'. Later I learned they use ensemble methods—multiple models voting on each prediction. That explains the reliability."
— Rachel P., Whistl user since 2025

The Future of Ensemble Methods

Whistl continues to advance its ensemble techniques. One strength of the stacked design is that it evolves naturally: as stronger base models emerge, they can be added to the stack and the meta-learner retrained, without redesigning the system.

Getting Started with Whistl

Experience the reliability of ensemble-powered risk prediction. Whistl's multi-model approach delivers consistent, accurate predictions that help you stay on track with your financial goals.

Robust AI-Powered Risk Prediction

Join thousands of Australians using Whistl's ensemble-based prediction system for reliable, accurate impulse risk detection.

Crisis Support Resources

If you're experiencing severe financial distress or gambling-related harm, professional support is available.
