Transformer Models for Behavioural Prediction: Attention Mechanisms in Financial AI

Transformer architecture has revolutionised natural language processing—and now it's transforming behavioural finance. Discover how Whistl leverages self-attention mechanisms to identify the most predictive signals in your financial behaviour, enabling interventions that arrive exactly when you need them.

The Transformer Revolution Beyond Language

When Google introduced the Transformer architecture in 2017, it fundamentally changed artificial intelligence. The paper "Attention Is All You Need" demonstrated that self-attention mechanisms could outperform recurrent and convolutional networks on language tasks while being more parallelisable and efficient to train.

But Transformers aren't just for language. The core insight—that relationships between elements matter more than their sequential order—applies brilliantly to financial behaviour. Your spending decisions aren't just a timeline; they're a complex web of interconnected signals where any moment can influence any other.

Why Transformers Excel at Behavioural Analysis

Traditional sequence models like LSTMs process data chronologically, which creates bottlenecks and limits their ability to capture long-range dependencies. Transformers, by contrast, use self-attention to directly connect any two points in a sequence, regardless of distance.

For spending behaviour, this means the model can learn that a stressful Monday predicts a Thursday shopping spree, or that risk climbs in the days after payday, even when hundreds of unrelated events sit in between.

These non-local dependencies are precisely what make behavioural prediction challenging—and where Transformers shine.
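The direct-connection property is easiest to see in a minimal scaled dot-product attention sketch (a toy illustration of the mechanism, not Whistl's production code):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Every position attends to every other position in one step,
    # regardless of how far apart they are in the sequence.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

# A toy sequence of 5 "transaction" embeddings, 8 dimensions each
x = torch.randn(1, 5, 8)
out, weights = scaled_dot_product_attention(x, x, x)

# weights[0, -1, 0] is how much the most recent event attends to the
# very first one: a direct connection, with no recurrence in between.
```

An LSTM would have to carry that first event's influence through every intermediate hidden state; here it is a single attention weight.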

Whistl's Behavioural Transformer Architecture

Whistl's implementation adapts the standard Transformer encoder for temporal behavioural data. Our architecture includes several innovations specific to financial prediction:

Multi-Modal Feature Embedding

Unlike language models that embed tokens, Whistl embeds heterogeneous features: transaction amounts, timestamps, locations, merchant categories, biometric data, and emotional states. Each feature type gets its own embedding layer before being combined:

import torch
import torch.nn as nn

class BehavioralFeatureEmbedding(nn.Module):
    def __init__(self, config):
        super().__init__()
        
        # Separate embeddings for different feature types
        self.amount_embedding = nn.Linear(1, config.d_model)
        self.category_embedding = nn.Embedding(config.num_categories, config.d_model)
        self.time_embedding = nn.Linear(4, config.d_model)  # hour, day, month, season
        self.location_embedding = nn.Linear(2, config.d_model)  # lat, lon encoded
        self.merchant_embedding = nn.Embedding(config.num_merchants, config.d_model)
        
        # Positional encoding for temporal order
        self.positional_encoding = PositionalEncoding(config.d_model, config.max_seq_len)
        
        self.layer_norm = nn.LayerNorm(config.d_model)
        self.dropout = nn.Dropout(config.dropout)
    
    def forward(self, features):
        # Embed each feature type
        amount_emb = self.amount_embedding(features['amount'].unsqueeze(-1))
        category_emb = self.category_embedding(features['category'])
        time_emb = self.time_embedding(features['time_features'])
        location_emb = self.location_embedding(features['location'])
        merchant_emb = self.merchant_embedding(features['merchant_id'])
        
        # Combine embeddings (sum or concatenation + projection)
        combined = amount_emb + category_emb + time_emb + location_emb + merchant_emb
        
        # Add positional encoding (the PositionalEncoding module, defined
        # elsewhere, is assumed to return only the encoding tensor, not x + pe)
        combined = combined + self.positional_encoding(combined)
        
        return self.dropout(self.layer_norm(combined))

Sparse Attention for Long Sequences

Standard self-attention has O(n²) complexity, which becomes prohibitive for long behavioural sequences. Whistl employs sparse attention patterns that focus computation on the most relevant time steps, reducing computational complexity to O(n log n) while maintaining prediction accuracy.
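One common family of sparse patterns combines a local window with periodic long-range connections. The sketch below illustrates the idea with a boolean attention mask; the window and stride values are illustrative, not Whistl's actual configuration:

```python
import torch

def local_plus_strided_mask(seq_len, window=4, stride=8):
    """Boolean mask: True where attention is allowed.

    Each position attends to its local neighbourhood plus a periodic
    set of distant "anchor" positions, instead of all O(n^2) pairs.
    """
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    local = (i - j).abs() < window    # nearby events
    strided = (j % stride == 0)       # periodic long-range anchors
    return local | strided

mask = local_plus_strided_mask(16)
# At inference, disallowed positions get a score of -inf before softmax,
# so their attention weights collapse to zero.
```

For a 16-step sequence this mask allows far fewer than the 256 full-attention pairs, and the saving grows with sequence length.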

Training the Behavioural Transformer

Training Transformers for behavioural prediction presents unique challenges that differ from language modelling:

Irregular Time Intervals

Unlike text tokens that arrive at regular intervals, financial transactions occur at irregular times. Whistl addresses this through time-aware positional encoding:

class TimeAwarePositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        # Learnable time decay parameters
        self.time_decay = nn.Parameter(torch.ones(1))
        self.recency_bias = nn.Parameter(torch.zeros(1))
        
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                            -(torch.log(torch.tensor(10000.0)) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
    
    def forward(self, x, timestamps):
        """
        Apply time-aware positional encoding.
        timestamps: tensor of shape (batch, seq_len) with Unix timestamps
        """
        # Time deltas (in seconds) between consecutive events, cast to float
        # and padded with zero for the first event in each sequence
        time_deltas = (timestamps[:, 1:] - timestamps[:, :-1]).float()
        time_deltas = torch.cat([torch.zeros_like(time_deltas[:, :1]), time_deltas], dim=1)
        
        # Learnable exponential decay: events preceded by long gaps
        # receive a down-weighted positional signal (deltas converted to hours)
        time_weights = torch.exp(-self.time_decay * time_deltas / 3600)
        
        # Scale the sinusoidal encoding by recency and shift by a learnable bias
        pe = self.pe[:, :x.size(1)] * time_weights.unsqueeze(-1) + self.recency_bias
        
        return x + pe

Multi-Task Learning Objective

Whistl's Transformer is trained on multiple related tasks simultaneously, improving generalisation and robustness: impulse-purchase prediction (binary classification), purchase amount (regression), merchant category (multi-class classification), time to purchase (regression), and emotional state (multi-class classification).

The combined loss function weights each task based on its relevance to the primary objective:

class MultiTaskLoss(nn.Module):
    def __init__(self, task_weights=None):
        super().__init__()
        self.task_weights = task_weights or {
            'impulse': 1.0,
            'amount': 0.5,
            'category': 0.3,
            'time_to_purchase': 0.4,
            'emotional_state': 0.2
        }
        
        self.classification_loss = nn.BCEWithLogitsLoss()
        self.regression_loss = nn.MSELoss()
        self.cross_entropy_loss = nn.CrossEntropyLoss()
    
    def forward(self, predictions, targets):
        total_loss = 0
        
        # Impulse prediction (binary classification)
        impulse_loss = self.classification_loss(
            predictions['impulse'], targets['impulse']
        )
        total_loss += self.task_weights['impulse'] * impulse_loss
        
        # Amount prediction (regression)
        amount_loss = self.regression_loss(
            predictions['amount'], targets['amount']
        )
        total_loss += self.task_weights['amount'] * amount_loss
        
        # Category prediction (multi-class)
        category_loss = self.cross_entropy_loss(
            predictions['category'], targets['category']
        )
        total_loss += self.task_weights['category'] * category_loss
        
        # Time to purchase (regression)
        time_loss = self.regression_loss(
            predictions['time_to_purchase'], targets['time_to_purchase']
        )
        total_loss += self.task_weights['time_to_purchase'] * time_loss
        
        # Emotional state (multi-class)
        emotion_loss = self.cross_entropy_loss(
            predictions['emotional_state'], targets['emotional_state']
        )
        total_loss += self.task_weights['emotional_state'] * emotion_loss
        
        return total_loss
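The weighting scheme can be exercised in isolation with a small standalone sketch (toy tensors and three of the task weights; shapes are illustrative):

```python
import torch
import torch.nn as nn

weights = {'impulse': 1.0, 'amount': 0.5, 'category': 0.3}

# Toy batch of 16 examples, 10 merchant categories
preds = {
    'impulse': torch.randn(16, 1),
    'amount': torch.randn(16, 1),
    'category': torch.randn(16, 10),
}
targets = {
    'impulse': torch.randint(0, 2, (16, 1)).float(),
    'amount': torch.randn(16, 1),
    'category': torch.randint(0, 10, (16,)),
}

# Each task uses the loss appropriate to its output type
losses = {
    'impulse': nn.BCEWithLogitsLoss()(preds['impulse'], targets['impulse']),
    'amount': nn.MSELoss()(preds['amount'], targets['amount']),
    'category': nn.CrossEntropyLoss()(preds['category'], targets['category']),
}

# The combined objective is the weighted sum across tasks
total = sum(weights[k] * losses[k] for k in weights)
```

Because all task heads share the same Transformer backbone, gradients from the auxiliary tasks regularise the representation used for the primary impulse prediction.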

Attention Visualisation and Interpretability

One of the Transformer's greatest advantages is interpretability. The attention weights reveal exactly which past events the model considers most predictive:

Attention Heatmaps

Whistl generates visual attention heatmaps showing the relationship between current risk and historical events. Users can see, for example, how early-week stress readings connect to end-of-week purchases, or how attention concentrates on the days just after payday.

Feature Attribution

Beyond temporal attention, Whistl decomposes predictions by feature importance:

def extract_feature_importance(attention_weights, feature_masks):
    """
    Extract per-feature importance from attention weights.
    
    Args:
        attention_weights: Tensor of shape (batch, heads, seq_len, seq_len)
        feature_masks: Tensor indicating which features are present at each position
    
    Returns:
        Dictionary mapping feature types to importance scores
    """
    # Average across attention heads
    avg_attention = attention_weights.mean(dim=1)  # (batch, seq_len, seq_len)
    
    # Get attention flowing to current prediction
    current_attention = avg_attention[:, -1, :]  # Attention to last position
    
    # Weight by feature presence
    feature_importance = {}
    for feature_type, mask in feature_masks.items():
        # Sum attention where this feature is present
        importance = (current_attention * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
        feature_importance[feature_type] = importance.mean().item()
    
    # Normalize to sum to 1
    total = sum(feature_importance.values())
    feature_importance = {k: v/total for k, v in feature_importance.items()}
    
    return feature_importance

# Example output:
# {
#     'stress_level': 0.28,
#     'time_since_payday': 0.22,
#     'location_risk': 0.18,
#     'category_momentum': 0.15,
#     'sleep_quality': 0.10,
#     'social_context': 0.07
# }

Performance Comparison: Transformers vs. Traditional Models

Whistl has extensively benchmarked Transformer models against traditional approaches. Results across 50,000+ users show:

Model Architecture    Precision   Recall   F1 Score   Inference Time
Logistic Regression   71.2%       65.8%    68.4%      0.5ms
Random Forest         76.5%       72.1%    74.2%      2.1ms
LSTM                  82.3%       78.6%    80.4%      8.5ms
Whistl Transformer    89.1%       85.4%    87.2%      12.3ms

"The attention visualisations blew my mind. I could literally see how my Monday stress was predicting my Thursday shopping sprees. Understanding the pattern was the first step to breaking it."
— James T., Whistl user since 2025

Mobile Optimisation and On-Device Inference

Running Transformer models on mobile devices presents significant challenges. Whistl employs several optimisation techniques:

Model Distillation

We train a large "teacher" Transformer on servers, then distill its knowledge into a smaller "student" model that runs on-device. The student learns to mimic the teacher's predictions while using 10x fewer parameters.
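A distillation objective of the standard kind blends a soft term, where the student mimics the teacher's softened output distribution, with the ordinary hard-label loss. The temperature and weighting below are illustrative defaults, not Whistl's published values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.7):
    # Soft targets: KL divergence between softened distributions.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 8 examples, 4 output classes
student = torch.randn(8, 4)
teacher = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
loss = distillation_loss(student, teacher, labels)
```

Raising the temperature exposes more of the teacher's "dark knowledge", the relative probabilities it assigns to incorrect classes, which is where much of the compression benefit comes from.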

Quantisation

Converting model weights from 32-bit floating point to 8-bit integers reduces model size by 75% with minimal accuracy loss. Whistl uses dynamic quantisation that adapts to the distribution of each weight matrix.
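In PyTorch, this kind of post-training conversion can be sketched with the built-in dynamic quantisation API, shown here on a toy feed-forward block rather than Whistl's actual model:

```python
import torch
import torch.nn as nn

# A toy stand-in for a Transformer feed-forward block
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))

# Dynamic quantisation converts Linear weights to int8 ahead of time
# and quantises activations on the fly at inference
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = quantized(torch.randn(1, 64))
```

Only the specified module types (here `nn.Linear`, which dominates Transformer parameter counts) are converted; everything else runs in floating point.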

Pruning

Removing redundant attention heads and neurons that contribute little to predictions further reduces computational requirements. Whistl's pruned models retain 98% of original accuracy while running 3x faster.
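Magnitude-based pruning can be sketched with PyTorch's pruning utilities. The snippet below applies unstructured L1 pruning to a single layer; the head- and neuron-level pruning described above applies the same principle at a coarser, structured granularity:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 128)

# Zero out the 30% of weights with the smallest magnitude (L1 criterion)
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Roughly 30% of the weights are now exactly zero
sparsity = (layer.weight == 0).float().mean().item()
```

The pruning mask is stored alongside the original weights, so a pruned model can be fine-tuned before the mask is made permanent with `prune.remove`.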

Ethical Considerations in Behavioural Prediction

Predicting human behaviour raises important ethical questions, and Whistl is committed to responsible AI development.

The Future of Transformer-Based Behavioural Finance

Transformer architecture continues to evolve rapidly, and Whistl's research team is exploring how these advances can be applied to behavioural finance.

Getting Started with Whistl

Experience the power of Transformer-based behavioural prediction for yourself. Whistl's AI learns your unique patterns and delivers interventions that feel less like restrictions and more like helpful insights from a friend who knows you well.

Experience AI-Powered Behavioural Insights

Join thousands of Australians using Whistl's Transformer-based prediction engine to understand and improve their financial behaviour.

Crisis Support Resources

If you're experiencing severe financial distress or gambling-related harm, professional support is available.
