A/B Testing Infrastructure: Complete Setup Guide
Whistl continuously tests intervention variations to maximise effectiveness. This technical guide explains experiment design, feature flags, statistical analysis, sequential testing, and how Whistl runs hundreds of experiments to optimise user outcomes.
Why A/B Testing Matters
Intervention effectiveness varies by individual:
- Message tone: Tough Love vs. Supportive coaching
- Timing: Immediate vs. delayed intervention
- Step ordering: Which negotiation steps work best
- Visual design: Goal imagery that motivates
- Notification copy: Messages that drive engagement
Whistl runs 50+ concurrent experiments to continuously improve outcomes.
Experiment Architecture
Whistl's experimentation platform supports complex testing:
System Overview
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ Client │ │ Experiment │ │ Analytics │
│ App │───▶│ Service │───▶│ Pipeline │
│ │ │ │ │ │
└─────────────┘ └──────────────┘ └─────────────┘
│ │ │
│ - Request config │ - Assign variant │
│ - Report event │ - Log exposure │
│ │ │
▼ ▼ ▼
Feature Flags Randomization Statistical
(Local Cache) (Hash-based) Analysis
Feature Flag System
Feature flags control experiment variants:
Flag Configuration
{
  "experiment_id": "intervention_tone_v3",
  "name": "Intervention Tone Optimization",
  "status": "running",
  "start_date": "2026-02-01T00:00:00Z",
  "end_date": null,
  "variants": [
    {
      "id": "control",
      "name": "Current Tone",
      "allocation": 0.25,
      "config": {
        "tone": "balanced",
        "emoji_usage": "moderate"
      }
    },
    {
      "id": "tough_love",
      "name": "Direct Approach",
      "allocation": 0.25,
      "config": {
        "tone": "direct",
        "emoji_usage": "minimal"
      }
    },
    {
      "id": "supportive",
      "name": "Empathetic Approach",
      "allocation": 0.25,
      "config": {
        "tone": "supportive",
        "emoji_usage": "frequent"
      }
    },
    {
      "id": "adaptive",
      "name": "ML-Selected Tone",
      "allocation": 0.25,
      "config": {
        "tone": "ml_selected",
        "emoji_usage": "ml_selected"
      }
    }
  ],
  "targeting": {
    "countries": ["AU", "NZ", "US", "UK"],
    "app_versions": ["1.8.0+"],
    "user_segments": ["active_30d"]
  },
  "primary_metric": "intervention_acceptance_rate",
  "guardrail_metrics": ["app_uninstall_rate", "session_duration"]
}
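Client apps can parse this configuration with `Codable`. A minimal decoding sketch (field names follow the JSON above; the `targeting` and metric fields are omitted for brevity, and the struct names are illustrative rather than Whistl's actual types):

```swift
import Foundation

// Mirrors the flag JSON shown above (targeting/metrics omitted for brevity).
struct ExperimentFlag: Codable {
    let experimentId: String
    let name: String
    let status: String
    let variants: [Variant]

    enum CodingKeys: String, CodingKey {
        case experimentId = "experiment_id"
        case name, status, variants
    }
}

struct Variant: Codable {
    let id: String
    let name: String
    let allocation: Double
    let config: [String: String]
}

let json = """
{"experiment_id": "intervention_tone_v3",
 "name": "Intervention Tone Optimization",
 "status": "running",
 "variants": [
   {"id": "control", "name": "Current Tone", "allocation": 0.25,
    "config": {"tone": "balanced", "emoji_usage": "moderate"}}
 ]}
""".data(using: .utf8)!

let flag = try! JSONDecoder().decode(ExperimentFlag.self, from: json)
```

Decoding once into typed structs keeps the rest of the client free of raw dictionary access.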
Client-Side Flag Evaluation
class ExperimentManager {
    private var flags: [String: ExperimentFlag] = [:]
    private var userVariant: [String: String] = [:]
    // Served when no flag config is cached, so the app always has a
    // control-equivalent behaviour to fall back on
    private let defaultVariant = Variant(id: "control", name: "Default",
                                         allocation: 1.0, config: [:])

    func evaluateFlag(experimentId: String) -> Variant {
        guard let flag = flags[experimentId] else {
            return defaultVariant
        }
        // Check if user already assigned
        if let assignedVariant = userVariant[experimentId] {
            return flag.getVariant(assignedVariant)
        }
        // Assign variant based on user ID hash
        let variant = assignVariant(user: currentUser, flag: flag)
        userVariant[experimentId] = variant.id
        // Log exposure so analysis counts only users who saw the experiment
        logExposure(experimentId: experimentId, variant: variant)
        return variant
    }

    private func assignVariant(user: User, flag: ExperimentFlag) -> Variant {
        // Hash user ID + experiment ID for consistent assignment.
        // stableHash must be deterministic across launches; Swift's
        // built-in Hasher is seed-randomized, so it cannot be used here
        let hash = stableHash("\(user.id):\(flag.id)")
        let bucket = Double(hash % 100) / 100.0 // 0.0 ..< 1.0
        // Find the variant whose cumulative allocation covers the bucket
        var cumulative: Double = 0
        for variant in flag.variants {
            cumulative += variant.allocation
            if bucket < cumulative {
                return variant
            }
        }
        // Guard against floating-point rounding in the allocations
        return flag.variants.last!
    }
}
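The `stableHash` call above is assumed rather than provided by the standard library, since Swift's `Hasher` is seed-randomized per launch. A sketch using FNV-1a (an illustrative choice, not necessarily Whistl's actual hash):

```swift
// FNV-1a 64-bit: deterministic across launches and platforms,
// unlike Swift's seed-randomized Hasher.
func stableHash(_ string: String) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325
    for byte in string.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3 // overflow-wrapping multiply
    }
    return hash
}

// Bucket a user into [0.0, 1.0) for allocation lookup.
func bucket(userId: String, experimentId: String) -> Double {
    return Double(stableHash("\(userId):\(experimentId)") % 1000) / 1000.0
}
```

The same user always lands in the same bucket for a given experiment, while different experiments bucket the same user independently.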
Experiment Types
Whistl runs multiple experiment types:
A/B Test (Two Variants)
| Variant | Allocation | Description |
|---|---|---|
| Control (A) | 50% | Current intervention message |
| Treatment (B) | 50% | New message with urgency framing |
Multivariate Test (Multiple Factors)
| Factor | Variants |
|---|---|
| Message Length | Short, Medium, Long |
| Tone | Supportive, Direct, Neutral |
| CTA Button | "Call Sponsor", "Breathe", "I'm OK" |
Tests all combinations: 3 × 3 × 3 = 27 variants
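The full factorial can be generated mechanically as the Cartesian product of the factors. A short sketch (factor values taken from the table above):

```swift
// Cartesian product of the three factors from the table above.
let lengths = ["Short", "Medium", "Long"]
let tones = ["Supportive", "Direct", "Neutral"]
let ctas = ["Call Sponsor", "Breathe", "I'm OK"]

var combinations: [(length: String, tone: String, cta: String)] = []
for length in lengths {
    for tone in tones {
        for cta in ctas {
            combinations.append((length, tone, cta))
        }
    }
}
// 3 x 3 x 3 = 27 variant combinations
```

Each tuple becomes one variant in the flag configuration; note that the required sample size scales with the number of combinations, which is why multivariate tests are reserved for high-traffic surfaces.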
Sequential Test
Traffic is reallocated adaptively as results accumulate, shifting users toward better-performing variants:
class SequentialTest {
    func allocateTraffic(variants: [Variant]) -> [Double] {
        // Thompson Sampling: draw one sample per variant from its
        // Beta(successes + 1, failures + 1) posterior, then allocate
        // traffic in proportion to the sampled values
        var allocations: [Double] = []
        for variant in variants {
            let sample = betaSample(
                alpha: Double(variant.successes) + 1,
                beta: Double(variant.failures) + 1
            )
            allocations.append(sample)
        }
        // Normalize to probabilities
        let sum = allocations.reduce(0, +)
        return allocations.map { $0 / sum }
    }
}
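The `betaSample` function above is assumed rather than provided by Foundation. A self-contained sketch: a Beta(α, β) draw is Gamma(α) / (Gamma(α) + Gamma(β)), with gamma draws via the Marsaglia–Tsang method (valid for shape ≥ 1, which holds here because the posteriors add 1 to each count):

```swift
import Foundation

// Standard normal draw via Box-Muller.
func normalSample() -> Double {
    let u1 = Double.random(in: Double.ulpOfOne..<1)
    let u2 = Double.random(in: 0..<1)
    return sqrt(-2 * log(u1)) * cos(2 * .pi * u2)
}

// Gamma(shape, 1) draw via Marsaglia-Tsang; requires shape >= 1.
func gammaSample(shape: Double) -> Double {
    let d = shape - 1.0 / 3.0
    let c = 1.0 / sqrt(9.0 * d)
    while true {
        let x = normalSample()
        let v = pow(1 + c * x, 3)
        guard v > 0 else { continue }
        let u = Double.random(in: Double.ulpOfOne..<1)
        // Log-space acceptance test
        if log(u) < 0.5 * x * x + d - d * v + d * log(v) {
            return d * v
        }
    }
}

// Beta(alpha, beta) draw as a ratio of gamma draws.
func betaSample(alpha: Double, beta: Double) -> Double {
    let x = gammaSample(shape: alpha)
    let y = gammaSample(shape: beta)
    return x / (x + y)
}
```

For example, a variant with 79 successes and 19 failures draws from Beta(80, 20), which concentrates around 0.8; variants with little data draw from wide posteriors and so keep receiving exploratory traffic.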
Statistical Analysis
Experiments use rigorous statistical methods:
Sample Size Calculation
class SampleSizeCalculator {
    func calculate(
        baselineRate: Double,
        minimumDetectableEffect: Double,
        power: Double = 0.8,
        significance: Double = 0.05
    ) -> Int {
        let p1 = baselineRate
        // Relative MDE: a 5% improvement on 70% means detecting 73.5%
        let p2 = baselineRate * (1 + minimumDetectableEffect)
        let zAlpha = zScore(1 - significance / 2) // 1.96 for 95%
        let zBeta = zScore(power) // 0.84 for 80% power
        let pooledP = (p1 + p2) / 2
        let numerator = pow(zAlpha * sqrt(2 * pooledP * (1 - pooledP)) +
                            zBeta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)), 2)
        let denominator = pow(p1 - p2, 2)
        // The factor of 2 already lives inside the first square root;
        // the quotient is the per-variant sample size
        return Int(ceil(numerator / denominator))
    }
}
// Example: Baseline 70% acceptance, want to detect a 5% relative improvement
// Result: Need roughly 2,600 users per variant
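The worked example can be checked with a standalone version of the two-proportion sample-size formula (a sketch, not the production calculator; the z-values are hard-coded to the familiar 1.96 and 0.8416 rather than computed from an inverse-normal helper):

```swift
import Foundation

// Per-variant sample size for a two-proportion z-test:
// n = (z_a * sqrt(2*pbar*(1-pbar)) + z_b * sqrt(p1*(1-p1) + p2*(1-p2)))^2
//     / (p1 - p2)^2
func sampleSize(baselineRate: Double, minimumDetectableEffect: Double) -> Int {
    let p1 = baselineRate
    let p2 = baselineRate * (1 + minimumDetectableEffect) // relative MDE
    let zAlpha = 1.96    // two-sided, 95% confidence
    let zBeta = 0.8416   // 80% power
    let pooledP = (p1 + p2) / 2
    let numerator = pow(zAlpha * sqrt(2 * pooledP * (1 - pooledP)) +
                        zBeta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)), 2)
    return Int(ceil(numerator / pow(p1 - p2, 2)))
}

let n = sampleSize(baselineRate: 0.70, minimumDetectableEffect: 0.05)
// Roughly 2,600 users per variant
```

Halving the detectable effect roughly quadruples the required sample size, which is the main practical constraint on how many concurrent experiments a given user base can support.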
Significance Testing
class SignificanceTest {
    func test(
        controlSuccesses: Int,
        controlTotal: Int,
        treatmentSuccesses: Int,
        treatmentTotal: Int
    ) -> TestResult {
        let p1 = Double(controlSuccesses) / Double(controlTotal)
        let p2 = Double(treatmentSuccesses) / Double(treatmentTotal)
        // Pooled proportion under the null hypothesis of no difference
        let pooledP = Double(controlSuccesses + treatmentSuccesses) /
                      Double(controlTotal + treatmentTotal)
        let se = sqrt(pooledP * (1 - pooledP) *
                      (1.0 / Double(controlTotal) + 1.0 / Double(treatmentTotal)))
        let zScore = (p2 - p1) / se
        // Two-sided p-value
        let pValue = 2 * (1 - normalCDF(abs(zScore)))
        return TestResult(
            zScore: zScore,
            pValue: pValue,
            significant: pValue < 0.05,
            confidenceInterval: calculateCI(p1, p2, se)
        )
    }
}
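The `normalCDF` call above is assumed; Foundation does not ship one, but it can be built from the C math library's `erf`. A small sketch:

```swift
import Foundation

// Standard normal CDF via the error function:
// Phi(x) = (1 + erf(x / sqrt(2))) / 2
func normalCDF(_ x: Double) -> Double {
    return 0.5 * (1 + erf(x / sqrt(2.0)))
}

// Two-sided p-value for a z-statistic.
func pValue(zScore: Double) -> Double {
    return 2 * (1 - normalCDF(abs(zScore)))
}
```

As a sanity check, a z-score of 1.96 should give a p-value of about 0.05, matching the usual 95% significance threshold.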
Bayesian Analysis
class BayesianAnalysis {
    func analyze(control: VariantData, treatment: VariantData) -> BayesianResult {
        // Beta posteriors for each variant (uniform Beta(1, 1) prior)
        let controlDist = BetaDistribution(
            alpha: control.successes + 1,
            beta: control.failures + 1
        )
        let treatmentDist = BetaDistribution(
            alpha: treatment.successes + 1,
            beta: treatment.failures + 1
        )
        // Monte Carlo estimate of P(treatment > control)
        let samples = 10_000
        var treatmentBetter = 0
        for _ in 0..<samples {
            let controlSample = controlDist.sample()
            let treatmentSample = treatmentDist.sample()
            if treatmentSample > controlSample {
                treatmentBetter += 1
            }
        }
        let probabilityTreatmentBetter = Double(treatmentBetter) / Double(samples)
        return BayesianResult(
            probabilityTreatmentBetter: probabilityTreatmentBetter,
            credibleInterval: calculateCredibleInterval(treatmentDist),
            expectedLoss: calculateExpectedLoss(controlDist, treatmentDist)
        )
    }
}
Experiment Metrics
Whistl tracks multiple metrics per experiment:
Primary Metrics
| Metric | Definition | Target |
|---|---|---|
| Intervention Acceptance Rate | % who engage with intervention | >70% |
| Breathing Completion Rate | % who complete breathing exercise | >50% |
| Partner Contact Rate | % who contact accountability partner | >30% |
| Goal Engagement | Dream board views per week | >5 |
Guardrail Metrics
| Metric | Threshold | Action |
|---|---|---|
| App Uninstall Rate | <2x control | Stop experiment if exceeded |
| Session Duration | >0.8x control | Warning if too low |
| Support Tickets | <3x control | Investigate if elevated |
| Crash Rate | No increase | Immediate stop if increased |
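The multiplier thresholds in the table can be expressed as a simple automated check. A sketch for the uninstall-rate guardrail (the 1.5x warning level and all names are illustrative, not Whistl's actual service API):

```swift
enum GuardrailAction {
    case ok
    case warn
    case stop
}

// Compares a treatment metric against control using the multiplier
// thresholds from the table above.
func checkUninstallGuardrail(control: Double, treatment: Double) -> GuardrailAction {
    // Stop the experiment if uninstalls reach 2x control
    if treatment >= 2.0 * control { return .stop }
    // Flag for review once uninstalls drift past 1.5x control (illustrative)
    if treatment >= 1.5 * control { return .warn }
    return .ok
}
```

Running checks like this on every metrics refresh, rather than only at experiment end, is what allows harmful variants to be stopped quickly.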
Experiment Lifecycle
Experiments follow a structured process:
Stages
- Hypothesis: Define expected outcome
- Design: Specify variants, metrics, sample size
- Review: Ethics and privacy review
- Launch: Deploy to small percentage
- Monitor: Watch guardrail metrics
- Analyze: Statistical analysis when complete
- Decide: Roll out, iterate, or abandon
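The stages above map naturally onto a small state machine, which is one way to enforce that experiments cannot skip review or analysis (a sketch; stage names from the list above, linear ordering assumed):

```swift
enum ExperimentStage: String, CaseIterable {
    case hypothesis, design, review, launch, monitor, analyze, decide

    // Each stage advances to the next in declaration order;
    // `decide` is terminal.
    var next: ExperimentStage? {
        let all = Self.allCases
        guard let i = all.firstIndex(of: self), i + 1 < all.count else {
            return nil
        }
        return all[i + 1]
    }
}
```

An experiment record would hold its current stage and only accept the `next` transition, so a launch without a completed review is unrepresentable.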
Example Experiment
// Experiment: Notification Timing
{
"hypothesis": "Sending notifications at personalized optimal times will increase engagement by 15%",
"variants": [
{"id": "control", "timing": "immediate", "allocation": 0.5},
{"id": "treatment", "timing": "ml_optimized", "allocation": 0.5}
],
"primary_metric": "notification_open_rate",
"guardrail_metrics": ["app_uninstall_rate", "notification_disable_rate"],
"sample_size": 10000, // per variant
"duration": "14 days",
"success_criteria": {
"min_improvement": 0.15,
"significance_level": 0.05,
"power": 0.8
}
}
Results Dashboard
Experiment results are visualized for the team:
Dashboard Metrics
- Daily active users per variant
- Cumulative metric values
- Statistical significance over time
- Segment breakdowns (iOS/Android, region)
- Guardrail metric status
Ethical Considerations
Whistl follows ethical experimentation principles:
Ethics Guidelines
- No harm: Experiments must not increase risk
- Control is valid: Control must be current best practice
- Privacy: No sensitive data used for targeting
- Transparency: Users can opt out of experiments
- Quick stopping: Harmful variants stopped immediately
Conclusion
Whistl's A/B testing infrastructure enables data-driven optimization of intervention effectiveness. Through rigorous statistical analysis, ethical experimentation, and continuous learning, Whistl gets better at helping users every day.
Every experiment is an opportunity to improve outcomes—Whistl tests to learn, not just to win.
Experience Optimized Protection
Whistl continuously tests and improves intervention effectiveness. Download free and benefit from data-driven optimization.
Download Whistl Free
Related: ML Model Updates | Privacy-Compliant Analytics | 8-Step Negotiation Engine