Bayesian A/B Test Workflow¶
This notebook demonstrates a complete Bayesian A/B testing workflow using utility functions.
Workflow Overview¶
- Setup: Define experiment data (control + variants)
- Non-Inferiority Test: Verify variants don't degrade performance
- Visualize Results: Plot prior and posteriors for all variants
- Select Best Variant: Choose the winning variant with probability
- Visualize Selection: Compare all variant posteriors
Key Advantages of Bayesian Approach¶
- ✅ Works with small, unbalanced samples
- ✅ Provides actionable probabilities (not just p-values)
- ✅ Scales to many variants effortlessly
- ✅ Allows continuous monitoring without p-hacking concerns
- ✅ Directly answers: "Which variant is best?"
1. Setup: Import Libraries and Define Data¶
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Import Bayesian utility functions
from bayesian import (
test_non_inferiority_weakly_informative,
select_best_variant
)
# Import plotting utilities
from plotting_utils import (
plot_weakly_informative_prior_with_variants,
plot_multiple_posteriors_comparison
)
2. Define Experiment Data¶
Example: Testing 3 passkey creation UX variants against a control.
- Control: Current experience (~71% completion rate)
- Variants A, B, C: New passkey creation flows
- Goal: Ensure variants don't degrade completion, then pick the best one
# Control group data
n_control = 32106 # Number of users
x_control = 22772 # Number who completed
control_rate = x_control / n_control
print(f"Control Group:")
print(f" Sample size: {n_control:,}")
print(f" Conversions: {x_control:,}")
print(f" Conversion rate: {control_rate:.2%}")
# Variant data
variants_data = {
'A': {'n': 4625, 'x': 3244},
'B': {'n': 2100, 'x': 1433},
'C': {'n': 2022, 'x': 1396}
}
print(f"\nVariants:")
for name, data in variants_data.items():
rate = data['x'] / data['n']
print(f" {name}: n={data['n']:3d}, x={data['x']:3d}, rate={rate:.2%}")
# Test parameters
epsilon = 0.02 # 2% non-inferiority margin (acceptable degradation, business decision)
print(f"\nNon-inferiority margin (ε): {epsilon:.1%}")
print(f"Non-inferiority threshold: {control_rate - epsilon:.2%}")
Control Group:
  Sample size: 32,106
  Conversions: 22,772
  Conversion rate: 70.93%

Variants:
  A: n=4625, x=3244, rate=70.14%
  B: n=2100, x=1433, rate=68.24%
  C: n=2022, x=1396, rate=69.04%

Non-inferiority margin (ε): 2.0%
Non-inferiority threshold: 68.93%
3. Non-Inferiority Test¶
Test whether each variant is non-inferior (not significantly worse) than control.
Key Insight: Domain Knowledge vs Business Tolerance¶
The test separates two important concepts:
- Expected degradation (domain knowledge): "Adding 2 extra clicks will degrade completion by ~1%"
- Business tolerance (epsilon): "We can accept up to 2% degradation"
The prior is centered at your expected performance (domain knowledge), while the test checks against the maximum acceptable threshold (business requirement).
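The prior construction can be sketched in a few lines (the same formula appears in `test_non_inferiority_weakly_informative` later in this notebook): fix alpha as the prior "strength", then solve mean = α / (α + β) for β so the prior mean sits at the expected performance.

```python
# Sketch of the weakly informative prior construction.
control_rate = 22772 / 32106       # observed control completion rate
expected_degradation = 0.01        # domain knowledge (prior center, not a threshold)
target_mean = control_rate - expected_degradation

alpha_prior = 20                   # prior "strength"
beta_prior = alpha_prior / target_mean - alpha_prior  # solve mean = α/(α+β) for β

print(f"Prior: Beta({alpha_prior}, {beta_prior:.2f}), "
      f"mean = {alpha_prior / (alpha_prior + beta_prior):.4f}")
```

By construction the prior mean lands exactly on `target_mean`, while `alpha_prior` alone controls how concentrated the prior is around it.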
# Domain knowledge: a rough estimate that adding passkey creation (2 extra clicks)
# will degrade completion by ~1%. We use it only to center the prior; it is a
# belief that the data will update, not a threshold. We can play with this value,
# but it must be less than epsilon: otherwise we would be asserting, before
# testing, that the variant is inferior, and there would be no point in testing.
expected_degradation = 0.01
# Run non-inferiority test
results = test_non_inferiority_weakly_informative(
n_control=n_control,
x_control=x_control,
variants_data=variants_data,
epsilon=epsilon,  # Business: can tolerate 2% degradation
expected_degradation=expected_degradation,  # Domain: expect 1% degradation
alpha_prior_strength=20,  # Weakly informative prior (wide uncertainty)
threshold=0.95  # Probability required to declare non-inferiority; adjust for risk tolerance
)
print("="*80)
print("PRIOR AND THRESHOLD SETUP")
print("="*80)
print(f"Control rate: {control_rate:.2%}")
print(f"Expected degradation (domain knowledge): {expected_degradation:.1%}")
print(f" → Prior centered at: {control_rate - expected_degradation:.2%}")
print(f"Maximum acceptable degradation (business): {epsilon:.1%}")
print(f" → Test threshold at: {control_rate - epsilon:.2%}")
print(f"\nThis means:")
print(f" • Prior says: 'I expect variant around {control_rate - expected_degradation:.1%}'")
print(f" • Test says: 'Must be above {control_rate - epsilon:.1%} to pass'")
# Display results
print("\n" + "="*80)
print("NON-INFERIORITY TEST RESULTS")
print("="*80)
for variant_name, result in results.items():
status = "✓ NON-INFERIOR" if result['is_non_inferior'] else "✗ NOT NON-INFERIOR"
print(f"\nVariant {variant_name}: {status}")
print(f" P(variant > threshold): {result['probability']:.2%}")
print(f" Posterior mean: {result['variant_rate']:.2%}")
print(f" Prior mean: {result['prior_mean']:.2%}")
print(f" Observed rate: {variants_data[variant_name]['x']/variants_data[variant_name]['n']:.2%}")
# Summary
non_inferior_count = sum(1 for r in results.values() if r['is_non_inferior'])
print(f"\n{'='*80}")
print(f"Summary: {non_inferior_count}/{len(variants_data)} variants are non-inferior")
print(f"{'='*80}")
================================================================================
PRIOR AND THRESHOLD SETUP
================================================================================
Control rate: 70.93%
Expected degradation (domain knowledge): 1.0%
  → Prior centered at: 69.93%
Maximum acceptable degradation (business): 2.0%
  → Test threshold at: 68.93%

This means:
  • Prior says: 'I expect variant around 69.9%'
  • Test says: 'Must be above 68.9% to pass'

================================================================================
NON-INFERIORITY TEST RESULTS
================================================================================

Variant A: ✓ NON-INFERIOR
  P(variant > threshold): 96.38%
  Posterior mean: 70.14%
  Prior mean: 69.93%
  Observed rate: 70.14%

Variant B: ✗ NOT NON-INFERIOR
  P(variant > threshold): 25.54%
  Posterior mean: 68.26%
  Prior mean: 69.93%
  Observed rate: 68.24%

Variant C: ✗ NOT NON-INFERIOR
  P(variant > threshold): 55.12%
  Posterior mean: 69.05%
  Prior mean: 69.93%
  Observed rate: 69.04%

================================================================================
Summary: 1/3 variants are non-inferior
================================================================================
4. Visualize Non-Inferiority Test¶
Plot shows:
- Gray dashed line: Common weakly informative prior
- Colored lines: Posterior distribution for each variant
- Red dotted line: Non-inferiority threshold
- Black dash-dot line: Control conversion rate
- Text box: Probability each variant exceeds threshold
# Create visualization - simplified usage!
fig, ax = plot_weakly_informative_prior_with_variants(results)
plt.show()
5. Select Best Variant¶
Among non-inferior variants, which one is most likely the best?
Uses Monte Carlo simulation to compute P(variant is best) for each variant.
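The Monte Carlo idea behind `select_best_variant` can be sketched independently of the utility function: draw from each variant's posterior, then count how often each variant produces the largest sample.

```python
import numpy as np
from scipy.stats import beta as beta_dist

n_sims = 100_000
data = {'A': {'n': 4625, 'x': 3244},
        'B': {'n': 2100, 'x': 1433},
        'C': {'n': 2022, 'x': 1396}}

# Posterior samples under a uniform Beta(1, 1) prior: Beta(x + 1, n - x + 1)
samples = np.vstack([
    beta_dist.rvs(d['x'] + 1, d['n'] - d['x'] + 1, size=n_sims)
    for d in data.values()
])

# For each simulated draw, which variant had the highest rate?
wins = np.bincount(np.argmax(samples, axis=0), minlength=len(data))
p_best = {name: w / n_sims for name, w in zip(data, wins)}
print(p_best)  # Variant A should dominate, in line with the results below
```

The probabilities necessarily sum to 1, since exactly one variant wins each draw.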
# Select best variant among all (or filter to non-inferior only)
# For this example, we'll analyze all variants
selection_results = select_best_variant(
variants_data=variants_data,
alpha_prior=1, # Non-informative prior for selection
beta_prior=1,
credible_level=0.95,
n_simulations=100000
)
# Display results
print("="*80)
print("BEST VARIANT SELECTION")
print("="*80)
print(f"\nProbability each variant is best:")
for name, prob in selection_results['probabilities'].items():
bar = '█' * int(prob * 60)
print(f" {name}: {prob:.2%} {bar}")
winner = selection_results['best_variant']
winner_prob = selection_results['probabilities'][winner]
print(f"\n{'='*80}")
print(f"WINNER: Variant {winner}")
print(f" Probability of being best: {winner_prob:.2%}")
print(f" Posterior mean: {selection_results['posterior_means'][winner]:.2%}")
print(f" 95% Credible interval: [{selection_results['credible_intervals'][winner][0]:.2%}, {selection_results['credible_intervals'][winner][1]:.2%}]")
print(f" Expected loss: {selection_results['expected_loss'][winner]:.4f}")
print(f"{'='*80}")
================================================================================
BEST VARIANT SELECTION
================================================================================

Probability each variant is best:
  A: 78.08% ██████████████████████████████████████████████
  B: 4.29% ██
  C: 17.63% ██████████

================================================================================
WINNER: Variant A
  Probability of being best: 78.08%
  Posterior mean: 70.13%
  95% Credible interval: [68.81%, 71.44%]
  Expected loss: 0.0014
================================================================================
6. Visualize Variant Comparison¶
Compare posterior distributions of all variants to see overlap and separation.
# Prepare posteriors for plotting
from scipy.stats import beta as beta_dist
posteriors = {}
for name, data in variants_data.items():
# Using non-informative prior Beta(1,1) for fair comparison
alpha_post = data['x'] + 1
beta_post = data['n'] - data['x'] + 1
posteriors[name] = {
'alpha': alpha_post,
'beta': beta_post,
'mean': alpha_post / (alpha_post + beta_post),
'ci_95': (
beta_dist.ppf(0.025, alpha_post, beta_post),
beta_dist.ppf(0.975, alpha_post, beta_post)
)
}
# Create comparison plot
fig, ax = plot_multiple_posteriors_comparison(
posteriors=posteriors,
control_group_conversion_rate=control_rate,
epsilon=epsilon
)
plt.show()
Summary¶
This workflow demonstrates:
✅ Non-Inferiority Testing: Verify variants don't significantly degrade performance
- Uses weakly informative prior based on control data
- Provides direct probability: P(variant > threshold)
- Works with small, unbalanced samples
✅ Variant Selection: Choose the best performing variant
- Direct answer: P(variant A is best), P(variant B is best), etc.
- No multiple comparison corrections needed
- Scales to any number of variants
✅ Actionable Results: Business-friendly outputs
- "Variant A is best with 78% probability"
- "Expected loss from choosing A: 0.0014"
- Clear decision-making support
Key Takeaway¶
Bayesian methods provide:
- Faster decisions (works with small samples)
- Better interpretability (probabilities, not p-values)
- Greater flexibility (any sample sizes, multiple variants)
- Continuous monitoring (no p-hacking issues)
For modern product development with rapid iteration and risk-averse traffic allocation, Bayesian methods are superior to traditional NHST approaches.
Utility Functions Used in the Notebook¶
"""
Bayesian utilities for A/B testing.
This module contains Bayesian methods for non-inferiority testing,
variant selection, and conversion rate analysis using Beta-Bernoulli
conjugate models with Monte Carlo simulation.
"""
import numpy as np
from scipy.stats import beta as beta_dist
def test_non_inferiority(
n_control,
x_control,
variants_data,
epsilon,
alpha_prior,
beta_prior,
threshold=0.95,
):
"""
Test non-inferiority of multiple variants against a control.
Uses Bayesian Beta-Bernoulli conjugate model with Monte Carlo simulation
to compute the probability that each variant's conversion rate is within
an acceptable degradation margin of the control.
Parameters
----------
n_control : int
Number of samples in control group
x_control : int
Number of successes in control group
variants_data : dict
Dictionary with variant names as keys and {'n': samples, 'x': successes} as values
Example: {'A': {'n': 1000, 'x': 200}, 'B': {'n': 1000, 'x': 215}}
epsilon : float
Non-inferiority margin (e.g., 0.03 for 3%)
alpha_prior : float
Alpha parameter for Beta prior
beta_prior : float
Beta parameter for Beta prior
threshold : float, optional
Probability threshold for declaring non-inferiority (default: 0.95)
Returns
-------
dict : Dictionary with results for each variant containing:
- 'is_non_inferior': bool, whether variant is non-inferior
- 'probability': float, P(variant > control - epsilon)
- 'control_rate': float, posterior mean of control
- 'variant_rate': float, posterior mean of variant
- 'posterior_params': tuple, (alpha, beta) of variant posterior
"""
# Control posterior
alpha_control = x_control + alpha_prior
beta_control = n_control - x_control + beta_prior
control_rate = alpha_control / (alpha_control + beta_control)
# Boundary for non-inferiority
boundary = control_rate - epsilon
results = {}
n_simulations = 100000
# Sample from control posterior once (reuse for all variants)
control_samples = beta_dist.rvs(alpha_control, beta_control, size=n_simulations)
for variant_name, data in variants_data.items():
# Variant posterior
alpha_variant = data["x"] + alpha_prior
beta_variant = data["n"] - data["x"] + beta_prior
variant_rate = alpha_variant / (alpha_variant + beta_variant)
# Sample from variant posterior
variant_samples = beta_dist.rvs(alpha_variant, beta_variant, size=n_simulations)
# Compute P(variant > control - epsilon)
prob_non_inferior = np.mean(variant_samples > (control_samples - epsilon))
results[variant_name] = {
"is_non_inferior": prob_non_inferior >= threshold,
"probability": prob_non_inferior,
"control_rate": control_rate,
"variant_rate": variant_rate,
"posterior_params": (alpha_variant, beta_variant),
}
return results
def test_non_inferiority_weakly_informative(
n_control,
x_control,
variants_data,
epsilon,
expected_degradation=None,
alpha_prior_strength=20,
threshold=0.95,
):
"""
Test non-inferiority using weakly informative prior based on domain knowledge.
This function constructs a weakly informative prior centered at your expected
variant performance (based on domain knowledge), then tests against a separate
non-inferiority threshold (based on business requirements).
Key insight: Prior belief (where you expect the variant to be) is SEPARATE from
the decision threshold (worst acceptable performance).
The prior is constructed as:
- α_prior = alpha_prior_strength (default: 20, for high entropy/wide uncertainty)
- β_prior = (α_prior / target_mean) - α_prior
- target_mean = control_rate - expected_degradation
The test threshold is:
- non_inferiority_threshold = control_rate - epsilon
Parameters
----------
n_control : int
Number of samples in control group
x_control : int
Number of successes in control group
variants_data : dict
Dictionary with variant names as keys and {'n': samples, 'x': successes} as values
Example: {'A': {'n': 561, 'x': 381}, 'B': {'n': 285, 'x': 192}}
epsilon : float
Non-inferiority margin - maximum acceptable degradation (business requirement)
Example: 0.05 means "can tolerate up to 5% degradation"
expected_degradation : float, optional
Expected degradation based on domain knowledge (e.g., "adding 2 clicks will
degrade by ~2%"). If None, defaults to epsilon (conservative).
Should typically be LESS than epsilon.
Example: 0.02 means "expect 2% degradation"
alpha_prior_strength : float, optional
Strength parameter for the prior (default: 20). Smaller values give
wider (more uncertain) priors. Typical values: 10-30.
threshold : float, optional
Probability threshold for declaring non-inferiority (default: 0.95)
Variant is non-inferior if P(variant > control - epsilon) >= threshold
Returns
-------
dict : Dictionary with results for each variant containing:
- 'is_non_inferior': bool, whether variant is non-inferior
- 'probability': float, P(variant > control - epsilon)
- 'control_rate': float, observed control conversion rate
- 'variant_rate': float, posterior mean of variant
- 'posterior_params': tuple, (alpha, beta) of variant posterior
- 'prior_params': tuple, (alpha_prior, beta_prior) used
- 'prior_mean': float, mean of the prior distribution
"""
# Compute control conversion rate
control_rate = x_control / n_control
# Determine expected degradation (defaults to epsilon if not specified)
if expected_degradation is None:
expected_degradation = epsilon
# Construct weakly informative prior centered at expected performance
# This reflects domain knowledge, separate from the business threshold
target_prior_mean = control_rate - expected_degradation
alpha_prior = alpha_prior_strength
beta_prior = (alpha_prior / target_prior_mean) - alpha_prior
# Verify prior is valid (must have positive parameters)
if beta_prior <= 0:
raise ValueError(
f"Invalid prior parameters: beta_prior={beta_prior:.4f} <= 0. "
f"This can happen when epsilon is too large relative to control_rate. "
f"Try reducing epsilon or increasing alpha_prior_strength."
)
results = {}
for variant_name, data in variants_data.items():
# Variant posterior: Beta(x + α_prior, n - x + β_prior)
n = data["n"]
x = data["x"]
alpha_posterior = x + alpha_prior
beta_posterior = (n - x) + beta_prior
variant_posterior_mean = alpha_posterior / (alpha_posterior + beta_posterior)
# Non-inferiority threshold
non_inferiority_threshold = control_rate - epsilon
# Direct calculation: P(variant > threshold) using Beta CDF
# P(X > threshold) = 1 - P(X <= threshold) = 1 - CDF(threshold)
prob_non_inferior = 1 - beta_dist.cdf(
non_inferiority_threshold, alpha_posterior, beta_posterior
)
results[variant_name] = {
"is_non_inferior": prob_non_inferior >= threshold,
"probability": prob_non_inferior,
"control_rate": control_rate,
"variant_rate": variant_posterior_mean,
"posterior_params": (alpha_posterior, beta_posterior),
"prior_params": (alpha_prior, beta_prior),
"prior_mean": target_prior_mean,
"threshold": non_inferiority_threshold, # Store the actual test threshold
"epsilon": epsilon, # Store epsilon for reference
"n": n, # Store sample size for plotting
"x": x, # Store successes for plotting
}
return results
def select_best_variant(
variants_data,
alpha_prior=1,
beta_prior=1,
credible_level=0.95,
n_simulations=100000,
):
"""
Select the best variant among multiple options using Bayesian approach.
Uses Monte Carlo simulation to determine which variant has the highest
probability of being the best, along with expected loss calculations
for decision-making under uncertainty.
Parameters
----------
variants_data : dict
Dictionary with variant names as keys and {'n': samples, 'x': successes} as values
Example: {'A': {'n': 800, 'x': 168}, 'B': {'n': 800, 'x': 172}}
alpha_prior : float, optional
Alpha parameter for Beta prior (default: 1 for uniform)
beta_prior : float, optional
Beta parameter for Beta prior (default: 1 for uniform)
credible_level : float, optional
Credible interval level (default: 0.95)
n_simulations : int, optional
Number of Monte Carlo simulations (default: 100000)
Returns
-------
dict : Dictionary containing:
- 'best_variant': str, name of variant most likely to be best
- 'probabilities': dict, P(each variant is best)
- 'posterior_means': dict, posterior mean for each variant
- 'credible_intervals': dict, (lower, upper) credible interval for each variant
- 'expected_loss': dict, expected loss from choosing each variant
Examples
--------
>>> variants = {
... 'A': {'n': 800, 'x': 168},
... 'B': {'n': 800, 'x': 172},
... 'C': {'n': 800, 'x': 165}
... }
>>> result = select_best_variant(variants)
>>> print(result['best_variant'])
'B'
>>> print(result['probabilities'])
{'A': 0.31, 'B': 0.47, 'C': 0.22}
"""
variant_names = list(variants_data.keys())
posteriors = {}
samples = {}
# Compute posteriors and draw samples
for name, data in variants_data.items():
alpha_post = data["x"] + alpha_prior
beta_post = data["n"] - data["x"] + beta_prior
posteriors[name] = {
"alpha": alpha_post,
"beta": beta_post,
"mean": alpha_post / (alpha_post + beta_post),
}
# Draw samples from posterior
samples[name] = beta_dist.rvs(alpha_post, beta_post, size=n_simulations)
# Compute credible interval
ci_lower = beta_dist.ppf((1 - credible_level) / 2, alpha_post, beta_post)
ci_upper = beta_dist.ppf(1 - (1 - credible_level) / 2, alpha_post, beta_post)
posteriors[name]["credible_interval"] = (ci_lower, ci_upper)
# Monte Carlo: find the winner of each simulated draw, then count wins (vectorized)
sample_matrix = np.vstack([samples[name] for name in variant_names])
best_idx = np.argmax(sample_matrix, axis=0)
# Calculate probabilities
probabilities = {name: float(np.mean(best_idx == i)) for i, name in enumerate(variant_names)}
# Expected loss: E[max(all variants) - this variant]
max_samples = np.maximum.reduce([samples[v] for v in variant_names])
expected_loss = {name: float(np.mean(max_samples - samples[name])) for name in variant_names}
# Determine best variant
best_variant = max(probabilities, key=probabilities.get)
return {
"best_variant": best_variant,
"probabilities": probabilities,
"posterior_means": {name: posteriors[name]["mean"] for name in variant_names},
"credible_intervals": {
name: posteriors[name]["credible_interval"] for name in variant_names
},
"expected_loss": expected_loss,
}
Appendix: Mathematical Foundations¶
This appendix provides the rigorous mathematical foundations underlying Bayesian A/B testing with Beta-Binomial conjugacy.
A.1 The Beta Distribution¶
Definition¶
The Beta distribution is a continuous probability distribution defined on the interval $[0,1]$, making it ideal for modeling probabilities and proportions.
A random variable $p$ follows a Beta distribution with shape parameters $\alpha > 0$ and $\beta > 0$:
$$ p \sim \mathrm{Beta}(\alpha, \beta) $$
Probability Density Function (PDF)¶
$$ f(p \mid \alpha, \beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)} \quad \text{for } p \in [0,1] $$
where $B(\alpha, \beta)$ is the Beta function, serving as a normalizing constant:
$$ B(\alpha, \beta) = \int_0^1 t^{\alpha-1} (1-t)^{\beta-1} \, dt $$
The Beta function can also be expressed in terms of the Gamma function:
$$ B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)} $$
where $\Gamma(n) = (n-1)!$ for positive integers.
Properties¶
Mean (Expected Value): $$ E[p] = \frac{\alpha}{\alpha + \beta} $$
Variance: $$ \mathrm{Var}(p) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} $$
Mode (for $\alpha, \beta > 1$): $$ \text{Mode}(p) = \frac{\alpha - 1}{\alpha + \beta - 2} $$
Intuition: Pseudo-Counts¶
The parameters $\alpha$ and $\beta$ can be interpreted as prior pseudo-observations:
- $\alpha - 1$ = "prior successes"
- $\beta - 1$ = "prior failures"
- $\alpha + \beta$ = "prior sample size"
Special cases:
- Beta(1, 1): Uniform distribution — complete ignorance, all values equally likely
- Beta(2, 2): Slightly peaked at 0.5 — weak belief in fairness
- Beta(10, 10): Strongly peaked at 0.5 — strong prior belief
- Beta(10, 90): Peaked at 0.1 — strong belief in low probability
Concentration parameter: $\alpha + \beta$ controls "certainty"
- Small values → wide, flat distribution (high uncertainty)
- Large values → narrow, peaked distribution (high certainty)
A.2 The Bernoulli Likelihood¶
Single Trial¶
A Bernoulli trial is a random experiment with exactly two outcomes: success (1) or failure (0).
For a single observation $x \in \{0, 1\}$ with success probability $p$:
$$ P(x \mid p) = p^x (1-p)^{1-x} $$
This equals:
- $p$ if $x = 1$ (success)
- $1-p$ if $x = 0$ (failure)
Multiple Independent Trials¶
For $n$ independent Bernoulli trials with $k$ successes and $(n-k)$ failures:
$$ P(\text{data} \mid p) = P(k \text{ successes in } n \text{ trials} \mid p) $$
The likelihood function is:
$$ \mathcal{L}(p \mid k, n) = \binom{n}{k} p^k (1-p)^{n-k} $$
where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient.
Key insight: For Bayesian inference, the binomial coefficient is a constant (doesn't depend on $p$), so we can write:
$$ \mathcal{L}(p \mid k, n) \propto p^k (1-p)^{n-k} $$
This form — $p^k (1-p)^{n-k}$ — is crucial for conjugacy.
A.3 Bayesian Inference: From Prior to Posterior¶
Bayes' Theorem¶
Bayes' theorem relates prior beliefs to posterior beliefs after observing data:
$$ \boxed{P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})}} $$
Where:
- $P(\theta \mid \text{data})$ = Posterior (what we want)
- $P(\text{data} \mid \theta)$ = Likelihood (how likely data is given parameter)
- $P(\theta)$ = Prior (our belief before seeing data)
- $P(\text{data})$ = Evidence (normalizing constant)
The Evidence (Marginal Likelihood)¶
The denominator $P(\text{data})$ is computed by integrating over all possible parameter values:
$$ P(\text{data}) = \int_{\text{all } \theta} P(\text{data} \mid \theta) \cdot P(\theta) \, d\theta $$
For our conversion rate problem with $p \in [0,1]$:
$$ P(\text{data}) = \int_0^1 P(\text{data} \mid p) \cdot P(p) \, dp $$
This integral represents averaging the likelihood over all possible conversion rates, weighted by our prior belief.
Interpretation: "How probable is this data, considering all possible values of $p$ and our prior beliefs?"
Proportionality Form¶
Since $P(\text{data})$ doesn't depend on $\theta$, we often write:
$$ P(\theta \mid \text{data}) \propto P(\text{data} \mid \theta) \cdot P(\theta) $$
Posterior ∝ Likelihood × Prior
This is the fundamental equation of Bayesian inference.
A.4 Beta-Binomial Conjugacy: The Closed-Form Miracle¶
Why Conjugacy Matters¶
For most combinations of prior and likelihood, computing the posterior requires numerical integration — expensive and complex.
But for certain "conjugate" pairs, the posterior has a closed-form solution — we can write down the answer directly!
Beta prior + Binomial likelihood = Beta posterior
This is why Beta distributions are ubiquitous in Bayesian A/B testing.
The Derivation¶
Setup:
- Prior: $p \sim \mathrm{Beta}(\alpha_0, \beta_0)$
- Data: $k$ successes in $n$ trials
- Goal: Find posterior $P(p \mid k, n)$
Step 1: Write out Bayes' theorem
$$ P(p \mid k, n) \propto P(k, n \mid p) \cdot P(p) $$
Step 2: Substitute the likelihood (Binomial)
$$ P(k, n \mid p) \propto p^k (1-p)^{n-k} $$
Step 3: Substitute the prior (Beta)
$$ P(p) = \frac{p^{\alpha_0-1}(1-p)^{\beta_0-1}}{B(\alpha_0, \beta_0)} $$
Step 4: Multiply likelihood and prior
$$ \begin{align} P(p \mid k, n) &\propto p^k (1-p)^{n-k} \cdot \frac{p^{\alpha_0-1}(1-p)^{\beta_0-1}}{B(\alpha_0, \beta_0)} \\ &\propto p^k (1-p)^{n-k} \cdot p^{\alpha_0-1}(1-p)^{\beta_0-1} \\ &= p^{k + \alpha_0 - 1} (1-p)^{(n-k) + \beta_0 - 1} \end{align} $$
Step 5: Recognize the form
This has the form $p^{\alpha-1}(1-p)^{\beta-1}$, which is the kernel of a Beta distribution!
Specifically, it's $\mathrm{Beta}(\alpha, \beta)$ where:
$$ \begin{align} \alpha &= k + \alpha_0 \\ \beta &= (n - k) + \beta_0 \end{align} $$
The Result: Bayesian Update Rule¶
$$ \boxed{ \begin{align} \text{Prior: } &p \sim \mathrm{Beta}(\alpha_0, \beta_0) \\ \text{Data: } &k \text{ successes in } n \text{ trials} \\ \text{Posterior: } &p \mid \text{data} \sim \mathrm{Beta}(\alpha_0 + k,\; \beta_0 + (n-k)) \end{align} } $$
Simple update rule: $$ \begin{align} \alpha_{\text{posterior}} &= \alpha_{\text{prior}} + \text{(observed successes)} \\ \beta_{\text{posterior}} &= \beta_{\text{prior}} + \text{(observed failures)} \end{align} $$
No integration needed! Just add counts.
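In code, the update rule really is just two additions; a minimal sketch using the Variant A numbers from this notebook (the prior Beta(20, 8.6) is approximately the weakly informative prior constructed earlier):

```python
# Beta-Binomial conjugate update: posterior parameters = prior parameters + counts
alpha0, beta0 = 20, 8.6        # prior (≈ the weakly informative prior above)
n, k = 4625, 3244              # Variant A: trials and successes

alpha_post = alpha0 + k        # add observed successes
beta_post = beta0 + (n - k)    # add observed failures

print(f"Posterior: Beta({alpha_post}, {beta_post})")
print(f"Posterior mean: {alpha_post / (alpha_post + beta_post):.4f}")  # ≈ 0.7014
```

The posterior mean of ~70.14% matches the value reported for Variant A in the test results above.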
A.5 Why the Integration Works Out¶
The Normalizing Constant¶
We skipped the $P(\text{data})$ term by using proportionality. But where does it come from?
Full Bayes' theorem:
$$ P(p \mid k, n) = \frac{P(k, n \mid p) \cdot P(p)}{P(k, n)} $$
where
$$ P(k, n) = \int_0^1 P(k, n \mid p) \cdot P(p) \, dp $$
Explicit Integration¶
Substituting our Beta prior and Binomial likelihood:
$$ P(k, n) = \int_0^1 \left[\binom{n}{k} p^k (1-p)^{n-k}\right] \cdot \left[\frac{p^{\alpha_0-1}(1-p)^{\beta_0-1}}{B(\alpha_0, \beta_0)}\right] dp $$
Pull out constants:
$$ P(k, n) = \frac{\binom{n}{k}}{B(\alpha_0, \beta_0)} \int_0^1 p^{k + \alpha_0 - 1} (1-p)^{(n-k) + \beta_0 - 1} \, dp $$
Recognize the integral:
The integral is exactly the Beta function:
$$ \int_0^1 p^{\alpha-1} (1-p)^{\beta-1} \, dp = B(\alpha, \beta) $$
with $\alpha = k + \alpha_0$ and $\beta = (n-k) + \beta_0$.
Therefore:
$$ P(k, n) = \frac{\binom{n}{k} \cdot B(k + \alpha_0,\, (n-k) + \beta_0)}{B(\alpha_0, \beta_0)} $$
The Full Posterior¶
Putting it all together:
$$ \begin{align} P(p \mid k, n) &= \frac{\binom{n}{k} p^k (1-p)^{n-k} \cdot \frac{p^{\alpha_0-1}(1-p)^{\beta_0-1}}{B(\alpha_0, \beta_0)}}{\frac{\binom{n}{k} \cdot B(k + \alpha_0,\, (n-k) + \beta_0)}{B(\alpha_0, \beta_0)}} \\ &= \frac{p^{k + \alpha_0 - 1} (1-p)^{(n-k) + \beta_0 - 1}}{B(k + \alpha_0,\, (n-k) + \beta_0)} \end{align} $$
This is exactly the PDF of $\mathrm{Beta}(k + \alpha_0,\, (n-k) + \beta_0)$! ✓
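The closed-form evidence can be sanity-checked against brute-force quadrature; a sketch with small illustrative numbers (not from the experiment), using log-space `betaln` for numerical stability:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import betaln, comb
from scipy.stats import beta as beta_dist, binom

alpha0, beta0 = 2.0, 3.0   # illustrative prior
n, k = 20, 7               # illustrative data

# Closed form: P(k, n) = C(n, k) * B(k + α0, n - k + β0) / B(α0, β0)
closed = comb(n, k) * np.exp(betaln(k + alpha0, n - k + beta0) - betaln(alpha0, beta0))

# Brute force: integrate likelihood × prior over p ∈ [0, 1]
numeric, _ = quad(lambda p: binom.pmf(k, n, p) * beta_dist.pdf(p, alpha0, beta0), 0, 1)

print(f"closed form: {closed:.10f}, quadrature: {numeric:.10f}")
```

The two values agree to numerical precision, confirming that the integral really does have the stated closed form.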
Why This is Remarkable¶
The integral over all possible proportions: $$ \int_0^1 p^{k + \alpha_0 - 1} (1-p)^{(n-k) + \beta_0 - 1} \, dp $$
has a closed-form solution (the Beta function), which means:
- No numerical integration needed — exact answer
- Fast computation — just add and subtract
- Sequential updates — posterior becomes next prior
- Mathematically elegant — form is preserved
This is conjugacy: The posterior is in the same family (Beta) as the prior.
A.6 Example: Updating with Real Data¶
Scenario¶
Suppose we have:
- Prior: Beta(20, 8) — weakly informative, centered at $\frac{20}{28} \approx 0.71$
- Data: Variant C from our experiment
- $n = 2022$ trials
- $k = 1396$ successes
- Observed rate: $\frac{1396}{2022} \approx 0.690$
Bayesian Update¶
Prior: $$ p \sim \mathrm{Beta}(20, 8) $$
Posterior (after observing data): $$ \begin{align} \alpha_{\text{post}} &= 20 + 1396 = 1416 \\ \beta_{\text{post}} &= 8 + (2022 - 1396) = 634 \\ p \mid \text{data} &\sim \mathrm{Beta}(1416, 634) \end{align} $$
Posterior mean: $$ E[p \mid \text{data}] = \frac{1416}{1416 + 634} = \frac{1416}{2050} \approx 0.6907 $$
Posterior variance: $$ \mathrm{Var}(p \mid \text{data}) = \frac{1416 \times 634}{2050^2 \times 2051} \approx 0.000104 $$
Standard deviation: $\sqrt{0.000104} \approx 0.0102$ (about 1%)
Interpretation¶
- Prior belief: $p \approx 0.71$ with high uncertainty (only 28 pseudo-observations)
- Data: $p \approx 0.69$ with 2022 observations
- Posterior: $p \approx 0.69$ with low uncertainty (2050 effective observations)
The data dominates because we have many more real observations than prior pseudo-observations.
If we had used a stronger prior (e.g., Beta(200, 80)), it would have pulled the posterior closer to 0.71, but would require more data to overcome.
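The numbers in this example can be reproduced directly with `scipy.stats.beta`:

```python
from scipy.stats import beta as beta_dist

alpha0, beta0 = 20, 8      # prior
n, k = 2022, 1396          # Variant C data

posterior = beta_dist(alpha0 + k, beta0 + (n - k))   # Beta(1416, 634)
print(f"Posterior mean: {posterior.mean():.4f}")      # ≈ 0.6907
print(f"Posterior sd:   {posterior.std():.4f}")       # ≈ 0.0102
```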
A.7 Sequential Updates: The Bayesian Learning Loop¶
Conjugacy Enables Sequential Learning¶
Because the posterior is also a Beta distribution, we can use it as the prior for the next update:
$$ \begin{align} \text{Prior}_1 &\xrightarrow{\text{Data}_1} \text{Posterior}_1 \\ \text{Posterior}_1 &= \text{Prior}_2 \xrightarrow{\text{Data}_2} \text{Posterior}_2 \\ \text{Posterior}_2 &= \text{Prior}_3 \xrightarrow{\text{Data}_3} \text{Posterior}_3 \\ &\vdots \end{align} $$
This is how Thompson sampling works!
Each user:
- Current belief = Beta($\alpha$, $\beta$)
- Show variant, observe conversion
- Update: Beta($\alpha + \text{conversion}$, $\beta + (1 - \text{conversion})$)
- Repeat
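The per-user loop above can be sketched with simulated traffic. Here `true_rates` is a hypothetical ground truth used only to generate conversions; a real system would observe them:

```python
import numpy as np

rng = np.random.default_rng(42)
true_rates = {'A': 0.70, 'B': 0.60}            # hypothetical ground-truth rates
params = {v: [1.0, 1.0] for v in true_rates}   # start each arm at Beta(1, 1)

for _ in range(5000):
    # 1. Current belief: sample a plausible rate for each variant
    draws = {v: rng.beta(a, b) for v, (a, b) in params.items()}
    # 2. Show the variant with the highest sampled rate
    chosen = max(draws, key=draws.get)
    # 3. Observe a (simulated) conversion, 4. update that variant's Beta
    converted = rng.random() < true_rates[chosen]
    params[chosen][0] += converted
    params[chosen][1] += 1 - converted

shown = {v: int(a + b - 2) for v, (a, b) in params.items()}
print(shown)  # traffic concentrates on the better variant over time
```

Sampling from the posterior (rather than always picking the current best mean) is what gives Thompson sampling its built-in exploration.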
Mathematical Equivalence¶
Batch update (all data at once): $$ \mathrm{Beta}(\alpha_0, \beta_0) \xrightarrow{n \text{ trials, } k \text{ successes}} \mathrm{Beta}(\alpha_0 + k, \beta_0 + (n-k)) $$
Sequential update (one at a time): $$ \begin{align} \mathrm{Beta}(\alpha_0, \beta_0) &\xrightarrow{x_1} \mathrm{Beta}(\alpha_0 + x_1, \beta_0 + (1-x_1)) \\ &\xrightarrow{x_2} \mathrm{Beta}(\alpha_0 + x_1 + x_2, \beta_0 + (2 - x_1 - x_2)) \\ &\quad\vdots \\ &\xrightarrow{x_n} \mathrm{Beta}(\alpha_0 + \sum x_i, \beta_0 + (n - \sum x_i)) \end{align} $$
Result is identical! Order of observations doesn't matter.
This property enables:
- Continuous monitoring (no need to wait for fixed sample size)
- Online learning (update as data arrives)
- Thompson sampling (sample → update → repeat)
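The batch/sequential equivalence is a two-line check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=500)   # 500 simulated Bernoulli outcomes
alpha0, beta0 = 2.0, 5.0           # arbitrary starting prior

# Batch: fold in all counts at once
batch = (alpha0 + x.sum(), beta0 + len(x) - x.sum())

# Sequential: fold in one observation at a time
a, b = alpha0, beta0
for xi in x:
    a, b = a + xi, b + (1 - xi)

print(batch == (a, b))  # True: same posterior either way
```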
A.8 Summary: The Mathematical Beauty of Beta-Binomial Conjugacy¶
The Full Picture¶
$$ \boxed{ \begin{array}{rcl} \text{Prior} & : & p \sim \mathrm{Beta}(\alpha_0, \beta_0) \\ &&\\ \text{Likelihood} & : & P(k \mid p, n) = \binom{n}{k} p^k (1-p)^{n-k} \\ &&\\ \text{Evidence} & : & P(k, n) = \displaystyle\int_0^1 P(k \mid p, n) \cdot P(p) \, dp = \frac{\binom{n}{k} \cdot B(\alpha_0 + k, \beta_0 + n - k)}{B(\alpha_0, \beta_0)} \\ &&\\ \text{Posterior} & : & p \mid k, n \sim \mathrm{Beta}(\alpha_0 + k,\; \beta_0 + (n-k)) \end{array} } $$
Why This Matters for A/B Testing¶
Closed-form solution: No numerical integration required
- Integration over all proportions $\int_0^1$ has exact answer
- Result is another Beta distribution
Simple updates: Just add counts
- $\alpha \leftarrow \alpha + \text{successes}$
- $\beta \leftarrow \beta + \text{failures}$
Sequential learning: Posterior becomes next prior
- No need to recompute from scratch
- Enables online/streaming algorithms
Computational efficiency: $O(1)$ per update
- No matrix inversions
- No iterative optimization
- Scales to millions of updates
Interpretability: Parameters are pseudo-counts
- Easy to set priors
- Easy to explain to stakeholders
Enables Thompson sampling: Natural exploration-exploitation
- Sample from posterior
- Update posterior
- Repeat
The Fundamental Insight¶
Conjugacy transforms an intractable integration problem into trivial arithmetic.
Instead of: $$ \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\int_0^1 \text{Likelihood} \times \text{Prior} \, dp} $$
We get: $$ \text{Posterior parameters} = \text{Prior parameters} + \text{Data counts} $$
This is the mathematical foundation that makes Bayesian A/B testing practical, fast, and elegant.