Bayesian A/B Test Workflow¶

This notebook demonstrates a complete Bayesian A/B testing workflow using utility functions.

Workflow Overview¶

  1. Setup: Define experiment data (control + variants)
  2. Non-Inferiority Test: Verify variants don't degrade performance
  3. Visualize Results: Plot prior and posteriors for all variants
  4. Select Best Variant: Choose the winning variant with probability
  5. Visualize Selection: Compare all variant posteriors

Key Advantages of Bayesian Approach¶

  • ✅ Works with small, unbalanced samples
  • ✅ Provides actionable probabilities (not just p-values)
  • ✅ Scales to many variants effortlessly
  • ✅ Allows continuous monitoring without p-hacking concerns
  • ✅ Directly answers: "Which variant is best?"

1. Setup: Import Libraries and Define Data¶

In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

# Import Bayesian utility functions
from bayesian import (
    test_non_inferiority_weakly_informative,
    select_best_variant
)

# Import plotting utilities
from plotting_utils import (
    plot_weakly_informative_prior_with_variants,
    plot_multiple_posteriors_comparison
)

2. Define Experiment Data¶

Example: Testing 3 passkey creation UX variants against a control.

  • Control: Current experience (~71% completion rate)
  • Variants A, B, C: New passkey creation flows
  • Goal: Ensure variants don't degrade completion, then pick the best one
In [2]:
# Control group data
n_control = 32106  # Number of users
x_control = 22772  # Number who completed
control_rate = x_control / n_control

print(f"Control Group:")
print(f"  Sample size: {n_control:,}")
print(f"  Conversions: {x_control:,}")
print(f"  Conversion rate: {control_rate:.2%}")

# Variant data
variants_data = {
    'A': {'n': 4625, 'x': 3244},
    'B': {'n': 2100, 'x': 1433},
    'C': {'n': 2022, 'x': 1396}
}

print(f"\nVariants:")
for name, data in variants_data.items():
    rate = data['x'] / data['n']
    print(f"  {name}: n={data['n']:3d}, x={data['x']:3d}, rate={rate:.2%}")

# Test parameters
epsilon = 0.02  # 2% non-inferiority margin (acceptable degradation, business decision)
print(f"\nNon-inferiority margin (ε): {epsilon:.1%}")
print(f"Non-inferiority threshold: {control_rate - epsilon:.2%}")
Control Group:
  Sample size: 32,106
  Conversions: 22,772
  Conversion rate: 70.93%

Variants:
  A: n=4625, x=3244, rate=70.14%
  B: n=2100, x=1433, rate=68.24%
  C: n=2022, x=1396, rate=69.04%

Non-inferiority margin (ε): 2.0%
Non-inferiority threshold: 68.93%

3. Non-Inferiority Test¶

Test whether each variant is non-inferior to the control, i.e., not worse than control by more than the margin ε.

Key Insight: Domain Knowledge vs Business Tolerance¶

The test separates two important concepts:

  1. Expected degradation (domain knowledge): "Adding 2 extra clicks will degrade completion by ~1%"
  2. Business tolerance (epsilon): "We can accept up to 2% degradation"

The prior is centered at your expected performance (domain knowledge), while the test checks against the maximum acceptable threshold (business requirement).
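As a minimal sketch, that prior construction amounts to fixing α at the desired strength and solving the mean equation for β (the same formula documented in `test_non_inferiority_weakly_informative` later in this notebook):

```python
# Sketch: center a Beta prior at the expected variant performance.
control_rate = 22772 / 32106        # observed control completion rate (~70.9%)
expected_degradation = 0.01         # domain knowledge: expect ~1% degradation
alpha_prior = 20                    # prior strength (smaller => wider prior)

target_mean = control_rate - expected_degradation
# Solve mean = alpha / (alpha + beta) for beta:
beta_prior = alpha_prior / target_mean - alpha_prior

print(f"Prior: Beta({alpha_prior}, {beta_prior:.2f}) with mean {target_mean:.2%}")
```

The prior mean encodes domain knowledge only; the business threshold `control_rate - epsilon` never enters the prior.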

In [3]:
# Domain knowledge: we estimate that adding passkey creation (2 extra clicks)
# will degrade completion by ~1%. This only centers the prior -- it is a belief
# that the data will update, not a decision threshold. It should be less than
# epsilon; otherwise we would be assuming inferiority before even running the test.
expected_degradation = 0.01

# Run non-inferiority test
results = test_non_inferiority_weakly_informative(
    n_control=n_control,
    x_control=x_control,
    variants_data=variants_data,
    epsilon=epsilon,  # Business: can tolerate up to 2% degradation
    expected_degradation=expected_degradation,  # Domain: expect ~1% degradation
    alpha_prior_strength=20,  # Weak prior (high entropy)
    threshold=0.95  # Require 95% posterior probability; adjust to risk tolerance
)

print("="*80)
print("PRIOR AND THRESHOLD SETUP")
print("="*80)
print(f"Control rate: {control_rate:.2%}")
print(f"Expected degradation (domain knowledge): {expected_degradation:.1%}")
print(f"  → Prior centered at: {control_rate - expected_degradation:.2%}")
print(f"Maximum acceptable degradation (business): {epsilon:.1%}")
print(f"  → Test threshold at: {control_rate - epsilon:.2%}")
print(f"\nThis means:")
print(f"  • Prior says: 'I expect variant around {control_rate - expected_degradation:.1%}'")
print(f"  • Test says: 'Must be above {control_rate - epsilon:.1%} to pass'")

# Display results
print("\n" + "="*80)
print("NON-INFERIORITY TEST RESULTS")
print("="*80)

for variant_name, result in results.items():
    status = "✓ NON-INFERIOR" if result['is_non_inferior'] else "✗ NOT NON-INFERIOR"
    print(f"\nVariant {variant_name}: {status}")
    print(f"  P(variant > threshold): {result['probability']:.2%}")
    print(f"  Posterior mean: {result['variant_rate']:.2%}")
    print(f"  Prior mean: {result['prior_mean']:.2%}")
    print(f"  Observed rate: {variants_data[variant_name]['x']/variants_data[variant_name]['n']:.2%}")

# Summary
non_inferior_count = sum(1 for r in results.values() if r['is_non_inferior'])
print(f"\n{'='*80}")
print(f"Summary: {non_inferior_count}/{len(variants_data)} variants are non-inferior")
print(f"{'='*80}")
================================================================================
PRIOR AND THRESHOLD SETUP
================================================================================
Control rate: 70.93%
Expected degradation (domain knowledge): 1.0%
  → Prior centered at: 69.93%
Maximum acceptable degradation (business): 2.0%
  → Test threshold at: 68.93%

This means:
  • Prior says: 'I expect variant around 69.9%'
  • Test says: 'Must be above 68.9% to pass'

================================================================================
NON-INFERIORITY TEST RESULTS
================================================================================

Variant A: ✓ NON-INFERIOR
  P(variant > threshold): 96.38%
  Posterior mean: 70.14%
  Prior mean: 69.93%
  Observed rate: 70.14%

Variant B: ✗ NOT NON-INFERIOR
  P(variant > threshold): 25.54%
  Posterior mean: 68.26%
  Prior mean: 69.93%
  Observed rate: 68.24%

Variant C: ✗ NOT NON-INFERIOR
  P(variant > threshold): 55.12%
  Posterior mean: 69.05%
  Prior mean: 69.93%
  Observed rate: 69.04%

================================================================================
Summary: 1/3 variants are non-inferior
================================================================================

4. Visualize Non-Inferiority Test¶

Plot shows:

  • Gray dashed line: Common weakly informative prior
  • Colored lines: Posterior distribution for each variant
  • Red dotted line: Non-inferiority threshold
  • Black dash-dot line: Control conversion rate
  • Text box: Probability each variant exceeds threshold
In [4]:
# Plot the shared prior and each variant's posterior
fig, ax = plot_weakly_informative_prior_with_variants(results)
plt.show()

5. Select Best Variant¶

Among non-inferior variants, which one is most likely the best?

Uses Monte Carlo simulation to compute P(variant is best) for each variant.

In [5]:
# Select best variant among all (or filter to non-inferior only)
# For this example, we'll analyze all variants
selection_results = select_best_variant(
    variants_data=variants_data,
    alpha_prior=1,  # Non-informative prior for selection
    beta_prior=1,
    credible_level=0.95,
    n_simulations=100000
)

# Display results
print("="*80)
print("BEST VARIANT SELECTION")
print("="*80)

print(f"\nProbability each variant is best:")
for name, prob in selection_results['probabilities'].items():
    bar = '█' * int(prob * 60)
    print(f"  {name}: {prob:.2%} {bar}")

winner = selection_results['best_variant']
winner_prob = selection_results['probabilities'][winner]
print(f"\n{'='*80}")
print(f"WINNER: Variant {winner}")
print(f"  Probability of being best: {winner_prob:.2%}")
print(f"  Posterior mean: {selection_results['posterior_means'][winner]:.2%}")
print(f"  95% Credible interval: [{selection_results['credible_intervals'][winner][0]:.2%}, {selection_results['credible_intervals'][winner][1]:.2%}]")
print(f"  Expected loss: {selection_results['expected_loss'][winner]:.4f}")
print(f"{'='*80}")
================================================================================
BEST VARIANT SELECTION
================================================================================

Probability each variant is best:
  A: 78.08% ██████████████████████████████████████████████
  B: 4.29% ██
  C: 17.63% ██████████

================================================================================
WINNER: Variant A
  Probability of being best: 78.08%
  Posterior mean: 70.13%
  95% Credible interval: [68.81%, 71.44%]
  Expected loss: 0.0014
================================================================================

6. Visualize Variant Comparison¶

Compare posterior distributions of all variants to see overlap and separation.

In [6]:
# Prepare posteriors for plotting
from scipy.stats import beta as beta_dist

posteriors = {}
for name, data in variants_data.items():
    # Using non-informative prior Beta(1,1) for fair comparison
    alpha_post = data['x'] + 1
    beta_post = data['n'] - data['x'] + 1
    
    posteriors[name] = {
        'alpha': alpha_post,
        'beta': beta_post,
        'mean': alpha_post / (alpha_post + beta_post),
        'ci_95': (
            beta_dist.ppf(0.025, alpha_post, beta_post),
            beta_dist.ppf(0.975, alpha_post, beta_post)
        )
    }

# Create comparison plot
fig, ax = plot_multiple_posteriors_comparison(
    posteriors=posteriors,
    control_group_conversion_rate=control_rate,
    epsilon=epsilon
)
plt.show()

Summary¶

This workflow demonstrates:

  1. ✅ Non-Inferiority Testing: Verify variants don't significantly degrade performance

    • Uses weakly informative prior based on control data
    • Provides direct probability: P(variant > threshold)
    • Works with small, unbalanced samples
  2. ✅ Variant Selection: Choose the best performing variant

    • Direct answer: P(variant A is best), P(variant B is best), etc.
    • No multiple comparison corrections needed
    • Scales to any number of variants
  3. ✅ Actionable Results: Business-friendly outputs

    • "Variant A is best with 78% probability"
    • "Expected loss if we choose A: 0.0014"
    • Clear decision-making support

Key Takeaway¶

Bayesian methods provide:

  • Faster decisions (works with small samples)
  • Better interpretability (probabilities, not p-values)
  • Greater flexibility (any sample sizes, multiple variants)
  • Continuous monitoring (no p-hacking issues)

For modern product development with rapid iteration and risk-averse traffic allocation, Bayesian methods are superior to traditional NHST approaches.

Utility Functions Used in the Notebook¶

In [7]:
"""
Bayesian utilities for A/B testing.

This module contains Bayesian methods for non-inferiority testing,
variant selection, and conversion rate analysis using Beta-Bernoulli
conjugate models with Monte Carlo simulation.
"""

import numpy as np
from scipy.stats import beta as beta_dist


def test_non_inferiority(
    n_control,
    x_control,
    variants_data,
    epsilon,
    alpha_prior,
    beta_prior,
    threshold=0.95,
):
    """
    Test non-inferiority of multiple variants against a control.

    Uses Bayesian Beta-Bernoulli conjugate model with Monte Carlo simulation
    to compute the probability that each variant's conversion rate is within
    an acceptable degradation margin of the control.

    Parameters
    ----------
    n_control : int
        Number of samples in control group
    x_control : int
        Number of successes in control group
    variants_data : dict
        Dictionary with variant names as keys and {'n': samples, 'x': successes} as values
        Example: {'A': {'n': 1000, 'x': 200}, 'B': {'n': 1000, 'x': 215}}
    epsilon : float
        Non-inferiority margin (e.g., 0.03 for 3%)
    alpha_prior : float
        Alpha parameter for Beta prior
    beta_prior : float
        Beta parameter for Beta prior
    threshold : float, optional
        Probability threshold for declaring non-inferiority (default: 0.95)

    Returns
    -------
    dict : Dictionary with results for each variant containing:
        - 'is_non_inferior': bool, whether variant is non-inferior
        - 'probability': float, P(variant > control - epsilon)
        - 'control_rate': float, posterior mean of control
        - 'variant_rate': float, posterior mean of variant
        - 'posterior_params': tuple, (alpha, beta) of variant posterior

    """
    # Control posterior
    alpha_control = x_control + alpha_prior
    beta_control = n_control - x_control + beta_prior
    control_rate = alpha_control / (alpha_control + beta_control)

    # Boundary for non-inferiority
    boundary = control_rate - epsilon

    results = {}
    n_simulations = 100000

    # Sample from control posterior once (reuse for all variants)
    control_samples = beta_dist.rvs(alpha_control, beta_control, size=n_simulations)

    for variant_name, data in variants_data.items():
        # Variant posterior
        alpha_variant = data["x"] + alpha_prior
        beta_variant = data["n"] - data["x"] + beta_prior
        variant_rate = alpha_variant / (alpha_variant + beta_variant)

        # Sample from variant posterior
        variant_samples = beta_dist.rvs(alpha_variant, beta_variant, size=n_simulations)

        # Compute P(variant > control - epsilon)
        prob_non_inferior = np.mean(variant_samples > (control_samples - epsilon))

        results[variant_name] = {
            "is_non_inferior": prob_non_inferior >= threshold,
            "probability": prob_non_inferior,
            "control_rate": control_rate,
            "variant_rate": variant_rate,
            "posterior_params": (alpha_variant, beta_variant),
        }

    return results


def test_non_inferiority_weakly_informative(
    n_control,
    x_control,
    variants_data,
    epsilon,
    expected_degradation=None,
    alpha_prior_strength=20,
    threshold=0.95,
):
    """
    Test non-inferiority using weakly informative prior based on domain knowledge.

    This function constructs a weakly informative prior centered at your expected
    variant performance (based on domain knowledge), then tests against a separate
    non-inferiority threshold (based on business requirements).

    Key insight: Prior belief (where you expect the variant to be) is SEPARATE from
    the decision threshold (worst acceptable performance).

    The prior is constructed as:
    - α_prior = alpha_prior_strength (default: 20, for high entropy/wide uncertainty)
    - β_prior = (α_prior / target_mean) - α_prior
    - target_mean = control_rate - expected_degradation

    The test threshold is:
    - non_inferiority_threshold = control_rate - epsilon

    Parameters
    ----------
    n_control : int
        Number of samples in control group
    x_control : int
        Number of successes in control group
    variants_data : dict
        Dictionary with variant names as keys and {'n': samples, 'x': successes} as values
        Example: {'A': {'n': 561, 'x': 381}, 'B': {'n': 285, 'x': 192}}
    epsilon : float
        Non-inferiority margin - maximum acceptable degradation (business requirement)
        Example: 0.05 means "can tolerate up to 5% degradation"
    expected_degradation : float, optional
        Expected degradation based on domain knowledge (e.g., "adding 2 clicks will
        degrade by ~2%"). If None, defaults to epsilon (conservative).
        Should typically be LESS than epsilon.
        Example: 0.02 means "expect 2% degradation"
    alpha_prior_strength : float, optional
        Strength parameter for the prior (default: 20). Smaller values give
        wider (more uncertain) priors. Typical values: 10-30.
    threshold : float, optional
        Probability threshold for declaring non-inferiority (default: 0.95)
        Variant is non-inferior if P(variant > control - epsilon) >= threshold

    Returns
    -------
    dict : Dictionary with results for each variant containing:
        - 'is_non_inferior': bool, whether variant is non-inferior
        - 'probability': float, P(variant > control - epsilon)
        - 'control_rate': float, observed control conversion rate
        - 'variant_rate': float, posterior mean of variant
        - 'posterior_params': tuple, (alpha, beta) of variant posterior
        - 'prior_params': tuple, (alpha_prior, beta_prior) used
        - 'prior_mean': float, mean of the prior distribution

    """
    # Compute control conversion rate
    control_rate = x_control / n_control

    # Determine expected degradation (defaults to epsilon if not specified)
    if expected_degradation is None:
        expected_degradation = epsilon

    # Construct weakly informative prior centered at expected performance
    # This reflects domain knowledge, separate from the business threshold
    target_prior_mean = control_rate - expected_degradation
    alpha_prior = alpha_prior_strength
    beta_prior = (alpha_prior / target_prior_mean) - alpha_prior

    # Verify prior is valid (must have positive parameters)
    if beta_prior <= 0:
        raise ValueError(
            f"Invalid prior parameters: beta_prior={beta_prior:.4f} <= 0. "
            f"This happens when the target prior mean "
            f"(control_rate - expected_degradation) falls outside (0, 1). "
            f"Check control_rate and expected_degradation."
        )

    results = {}

    for variant_name, data in variants_data.items():
        # Variant posterior: Beta(x + α_prior, n - x + β_prior)
        n = data["n"]
        x = data["x"]
        alpha_posterior = x + alpha_prior
        beta_posterior = (n - x) + beta_prior
        variant_posterior_mean = alpha_posterior / (alpha_posterior + beta_posterior)

        # Non-inferiority threshold
        non_inferiority_threshold = control_rate - epsilon

        # Direct calculation: P(variant > threshold) using Beta CDF
        # P(X > threshold) = 1 - P(X <= threshold) = 1 - CDF(threshold)
        prob_non_inferior = 1 - beta_dist.cdf(
            non_inferiority_threshold, alpha_posterior, beta_posterior
        )

        results[variant_name] = {
            "is_non_inferior": prob_non_inferior >= threshold,
            "probability": prob_non_inferior,
            "control_rate": control_rate,
            "variant_rate": variant_posterior_mean,
            "posterior_params": (alpha_posterior, beta_posterior),
            "prior_params": (alpha_prior, beta_prior),
            "prior_mean": target_prior_mean,
            "threshold": non_inferiority_threshold,  # Store the actual test threshold
            "epsilon": epsilon,  # Store epsilon for reference
            "n": n,  # Store sample size for plotting
            "x": x,  # Store successes for plotting
        }

    return results


def select_best_variant(
    variants_data,
    alpha_prior=1,
    beta_prior=1,
    credible_level=0.95,
    n_simulations=100000,
):
    """
    Select the best variant among multiple options using a Bayesian approach.

    Uses Monte Carlo simulation to determine which variant has the highest
    probability of being the best, along with expected loss calculations
    for decision-making under uncertainty.

    Parameters
    ----------
    variants_data : dict
        Dictionary with variant names as keys and {'n': samples, 'x': successes} as values
        Example: {'A': {'n': 800, 'x': 168}, 'B': {'n': 800, 'x': 172}}
    alpha_prior : float, optional
        Alpha parameter for Beta prior (default: 1 for uniform)
    beta_prior : float, optional
        Beta parameter for Beta prior (default: 1 for uniform)
    credible_level : float, optional
        Credible interval level (default: 0.95)
    n_simulations : int, optional
        Number of Monte Carlo simulations (default: 100000)

    Returns
    -------
    dict : Dictionary containing:
        - 'best_variant': str, name of variant most likely to be best
        - 'probabilities': dict, P(each variant is best)
        - 'posterior_means': dict, posterior mean for each variant
        - 'credible_intervals': dict, (lower, upper) credible interval for each variant
        - 'expected_loss': dict, expected loss from choosing each variant

    Examples
    --------
    >>> variants = {
    ...     'A': {'n': 800, 'x': 168},
    ...     'B': {'n': 800, 'x': 172},
    ...     'C': {'n': 800, 'x': 165}
    ... }
    >>> result = select_best_variant(variants)
    >>> print(result['best_variant'])
    'B'
    >>> print(result['probabilities'])
    {'A': 0.31, 'B': 0.47, 'C': 0.22}
    """
    variant_names = list(variants_data.keys())
    posteriors = {}
    samples = {}

    # Compute posteriors and draw samples
    for name, data in variants_data.items():
        alpha_post = data["x"] + alpha_prior
        beta_post = data["n"] - data["x"] + beta_prior

        posteriors[name] = {
            "alpha": alpha_post,
            "beta": beta_post,
            "mean": alpha_post / (alpha_post + beta_post),
        }

        # Draw samples from posterior
        samples[name] = beta_dist.rvs(alpha_post, beta_post, size=n_simulations)

        # Compute credible interval
        ci_lower = beta_dist.ppf((1 - credible_level) / 2, alpha_post, beta_post)
        ci_upper = beta_dist.ppf(1 - (1 - credible_level) / 2, alpha_post, beta_post)
        posteriors[name]["credible_interval"] = (ci_lower, ci_upper)

    # Monte Carlo: count how often each variant is best (vectorized over draws)
    sample_matrix = np.stack([samples[name] for name in variant_names])
    best_indices = np.argmax(sample_matrix, axis=0)  # winning variant per draw
    best_counts = {
        name: int(np.sum(best_indices == i)) for i, name in enumerate(variant_names)
    }

    # Calculate probabilities
    probabilities = {name: count / n_simulations for name, count in best_counts.items()}

    # Expected loss: E[max over variants - this variant]
    max_samples = np.maximum.reduce([samples[v] for v in variant_names])
    expected_loss = {
        name: float(np.mean(max_samples - samples[name])) for name in variant_names
    }

    # Determine best variant
    best_variant = max(probabilities, key=probabilities.get)

    return {
        "best_variant": best_variant,
        "probabilities": probabilities,
        "posterior_means": {name: posteriors[name]["mean"] for name in variant_names},
        "credible_intervals": {
            name: posteriors[name]["credible_interval"] for name in variant_names
        },
        "expected_loss": expected_loss,
    }

Appendix: Mathematical Foundations¶

This appendix provides the rigorous mathematical foundations underlying Bayesian A/B testing with Beta-Binomial conjugacy.

A.1 The Beta Distribution¶

Definition¶

The Beta distribution is a continuous probability distribution defined on the interval $[0,1]$, making it ideal for modeling probabilities and proportions.

A random variable $p$ follows a Beta distribution with shape parameters $\alpha > 0$ and $\beta > 0$:

$$ p \sim \mathrm{Beta}(\alpha, \beta) $$

Probability Density Function (PDF)¶

$$ f(p \mid \alpha, \beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)} \quad \text{for } p \in [0,1] $$

where $B(\alpha, \beta)$ is the Beta function, serving as a normalizing constant:

$$ B(\alpha, \beta) = \int_0^1 t^{\alpha-1} (1-t)^{\beta-1} \, dt $$

The Beta function can also be expressed in terms of the Gamma function:

$$ B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)} $$

where $\Gamma(n) = (n-1)!$ for positive integers.


Properties¶

Mean (Expected Value): $$ E[p] = \frac{\alpha}{\alpha + \beta} $$

Variance: $$ \mathrm{Var}(p) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)} $$

Mode (for $\alpha, \beta > 1$): $$ \text{Mode}(p) = \frac{\alpha - 1}{\alpha + \beta - 2} $$
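These closed-form moments are easy to sanity-check against `scipy.stats.beta` (a quick verification, not part of the notebook's utilities; Beta(20, 8) is the prior used later in this appendix):

```python
from scipy.stats import beta as beta_dist

a, b = 20, 8  # example shape parameters

mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))
mode = (a - 1) / (a + b - 2)  # formula requires a, b > 1

# scipy agrees with the closed-form mean and variance
dist = beta_dist(a, b)
assert abs(dist.mean() - mean) < 1e-12
assert abs(dist.var() - var) < 1e-12

print(f"mean={mean:.4f}  var={var:.6f}  mode={mode:.4f}")
# → mean=0.7143  var=0.007037  mode=0.7308
```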


Intuition: Pseudo-Counts¶

The parameters $\alpha$ and $\beta$ can be interpreted as prior pseudo-observations:

  • $\alpha - 1$ = "prior successes"
  • $\beta - 1$ = "prior failures"
  • $\alpha + \beta$ = "prior sample size"

Special cases:

  • Beta(1, 1): Uniform distribution — complete ignorance, all values equally likely
  • Beta(2, 2): Slightly peaked at 0.5 — weak belief in fairness
  • Beta(10, 10): Strongly peaked at 0.5 — strong prior belief
  • Beta(10, 90): Peaked at 0.1 — strong belief in low probability

Concentration parameter: $\alpha + \beta$ controls "certainty"

  • Small values → wide, flat distribution (high uncertainty)
  • Large values → narrow, peaked distribution (high certainty)

A.2 The Bernoulli Likelihood¶

Single Trial¶

A Bernoulli trial is a random experiment with exactly two outcomes: success (1) or failure (0).

For a single observation $x \in \{0, 1\}$ with success probability $p$:

$$ P(x \mid p) = p^x (1-p)^{1-x} $$

This equals:

  • $p$ if $x = 1$ (success)
  • $1-p$ if $x = 0$ (failure)

Multiple Independent Trials¶

For $n$ independent Bernoulli trials with $k$ successes and $(n-k)$ failures:

$$ P(\text{data} \mid p) = P(k \text{ successes in } n \text{ trials} \mid p) $$

The likelihood function is:

$$ \mathcal{L}(p \mid k, n) = \binom{n}{k} p^k (1-p)^{n-k} $$

where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient.

Key insight: For Bayesian inference, the binomial coefficient is a constant (doesn't depend on $p$), so we can write:

$$ \mathcal{L}(p \mid k, n) \propto p^k (1-p)^{n-k} $$

This form — $p^k (1-p)^{n-k}$ — is crucial for conjugacy.


A.3 Bayesian Inference: From Prior to Posterior¶

Bayes' Theorem¶

Bayes' theorem relates prior beliefs to posterior beliefs after observing data:

$$ \boxed{P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})}} $$

Where:

  • $P(\theta \mid \text{data})$ = Posterior (what we want)
  • $P(\text{data} \mid \theta)$ = Likelihood (how likely data is given parameter)
  • $P(\theta)$ = Prior (our belief before seeing data)
  • $P(\text{data})$ = Evidence (normalizing constant)

The Evidence (Marginal Likelihood)¶

The denominator $P(\text{data})$ is computed by integrating over all possible parameter values:

$$ P(\text{data}) = \int_{\text{all } \theta} P(\text{data} \mid \theta) \cdot P(\theta) \, d\theta $$

For our conversion rate problem with $p \in [0,1]$:

$$ P(\text{data}) = \int_0^1 P(\text{data} \mid p) \cdot P(p) \, dp $$

This integral represents averaging the likelihood over all possible conversion rates, weighted by our prior belief.

Interpretation: "How probable is this data, considering all possible values of $p$ and our prior beliefs?"


Proportionality Form¶

Since $P(\text{data})$ doesn't depend on $\theta$, we often write:

$$ P(\theta \mid \text{data}) \propto P(\text{data} \mid \theta) \cdot P(\theta) $$

Posterior ∝ Likelihood × Prior

This is the fundamental equation of Bayesian inference.


A.4 Beta-Binomial Conjugacy: The Closed-Form Miracle¶

Why Conjugacy Matters¶

For most combinations of prior and likelihood, computing the posterior requires numerical integration — expensive and complex.

But for certain "conjugate" pairs, the posterior has a closed-form solution — we can write down the answer directly!

Beta prior + Binomial likelihood = Beta posterior

This is why Beta distributions are ubiquitous in Bayesian A/B testing.


The Derivation¶

Setup:

  • Prior: $p \sim \mathrm{Beta}(\alpha_0, \beta_0)$
  • Data: $k$ successes in $n$ trials
  • Goal: Find posterior $P(p \mid k, n)$

Step 1: Write out Bayes' theorem

$$ P(p \mid k, n) \propto P(k, n \mid p) \cdot P(p) $$

Step 2: Substitute the likelihood (Binomial)

$$ P(k, n \mid p) \propto p^k (1-p)^{n-k} $$

Step 3: Substitute the prior (Beta)

$$ P(p) = \frac{p^{\alpha_0-1}(1-p)^{\beta_0-1}}{B(\alpha_0, \beta_0)} $$

Step 4: Multiply likelihood and prior

$$ \begin{align} P(p \mid k, n) &\propto p^k (1-p)^{n-k} \cdot \frac{p^{\alpha_0-1}(1-p)^{\beta_0-1}}{B(\alpha_0, \beta_0)} \\ &\propto p^k (1-p)^{n-k} \cdot p^{\alpha_0-1}(1-p)^{\beta_0-1} \\ &= p^{k + \alpha_0 - 1} (1-p)^{(n-k) + \beta_0 - 1} \end{align} $$

Step 5: Recognize the form

This has the form $p^{\alpha-1}(1-p)^{\beta-1}$, which is the kernel of a Beta distribution!

Specifically, it's $\mathrm{Beta}(\alpha, \beta)$ where:

$$ \begin{align} \alpha &= k + \alpha_0 \\ \beta &= (n - k) + \beta_0 \end{align} $$


The Result: Bayesian Update Rule¶

$$ \boxed{ \begin{align} \text{Prior: } &p \sim \mathrm{Beta}(\alpha_0, \beta_0) \\ \text{Data: } &k \text{ successes in } n \text{ trials} \\ \text{Posterior: } &p \mid \text{data} \sim \mathrm{Beta}(\alpha_0 + k,\; \beta_0 + (n-k)) \end{align} } $$

Simple update rule: $$ \begin{align} \alpha_{\text{posterior}} &= \alpha_{\text{prior}} + \text{(observed successes)} \\ \beta_{\text{posterior}} &= \beta_{\text{prior}} + \text{(observed failures)} \end{align} $$

No integration needed! Just add counts.
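The update rule can be checked numerically: the conjugate posterior matches the posterior obtained by brute-force normalization of likelihood × prior on a grid (a sanity check, using the Beta(20, 8) prior and Variant C data from this notebook):

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import beta as beta_dist, binom

alpha0, beta0 = 20, 8   # prior
n, k = 2022, 1396       # Variant C data

# Closed-form conjugate posterior: Beta(alpha0 + k, beta0 + (n - k))
posterior = beta_dist(alpha0 + k, beta0 + (n - k))

# Brute-force posterior: likelihood x prior, normalized numerically
p = np.linspace(1e-6, 1 - 1e-6, 20001)
unnorm = binom.pmf(k, n, p) * beta_dist.pdf(p, alpha0, beta0)
numeric = unnorm / trapezoid(unnorm, p)

# The two posteriors agree pointwise -- no integration was actually needed
assert np.allclose(numeric, posterior.pdf(p), rtol=1e-4, atol=1e-4)
```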


A.5 Why the Integration Works Out¶

The Normalizing Constant¶

We skipped the $P(\text{data})$ term by using proportionality. But where does it come from?

Full Bayes' theorem:

$$ P(p \mid k, n) = \frac{P(k, n \mid p) \cdot P(p)}{P(k, n)} $$

where

$$ P(k, n) = \int_0^1 P(k, n \mid p) \cdot P(p) \, dp $$


Explicit Integration¶

Substituting our Beta prior and Binomial likelihood:

$$ P(k, n) = \int_0^1 \left[\binom{n}{k} p^k (1-p)^{n-k}\right] \cdot \left[\frac{p^{\alpha_0-1}(1-p)^{\beta_0-1}}{B(\alpha_0, \beta_0)}\right] dp $$

Pull out constants:

$$ P(k, n) = \frac{\binom{n}{k}}{B(\alpha_0, \beta_0)} \int_0^1 p^{k + \alpha_0 - 1} (1-p)^{(n-k) + \beta_0 - 1} \, dp $$

Recognize the integral:

The integral is exactly the Beta function:

$$ \int_0^1 p^{\alpha-1} (1-p)^{\beta-1} \, dp = B(\alpha, \beta) $$

with $\alpha = k + \alpha_0$ and $\beta = (n-k) + \beta_0$.

Therefore:

$$ P(k, n) = \frac{\binom{n}{k} \cdot B(k + \alpha_0,\, (n-k) + \beta_0)}{B(\alpha_0, \beta_0)} $$


The Full Posterior¶

Putting it all together:

$$ \begin{align} P(p \mid k, n) &= \frac{\binom{n}{k} p^k (1-p)^{n-k} \cdot \frac{p^{\alpha_0-1}(1-p)^{\beta_0-1}}{B(\alpha_0, \beta_0)}}{\frac{\binom{n}{k} \cdot B(k + \alpha_0,\, (n-k) + \beta_0)}{B(\alpha_0, \beta_0)}} \\ &= \frac{p^{k + \alpha_0 - 1} (1-p)^{(n-k) + \beta_0 - 1}}{B(k + \alpha_0,\, (n-k) + \beta_0)} \end{align} $$

This is exactly the PDF of $\mathrm{Beta}(k + \alpha_0,\, (n-k) + \beta_0)$! ✓


Why This is Remarkable¶

The integral over all possible proportions: $$ \int_0^1 p^{k + \alpha_0 - 1} (1-p)^{(n-k) + \beta_0 - 1} \, dp $$

has a closed-form solution (the Beta function), which means:

  1. No numerical integration needed — exact answer
  2. Fast computation — just add and subtract
  3. Sequential updates — posterior becomes next prior
  4. Mathematically elegant — form is preserved

This is conjugacy: The posterior is in the same family (Beta) as the prior.
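The closed-form evidence can likewise be verified against direct numerical integration (a sanity check with `scipy`; small counts keep the integrand well-scaled):

```python
from math import comb

from scipy.integrate import quad
from scipy.special import beta as beta_fn
from scipy.stats import beta as beta_dist, binom

alpha0, beta0 = 2, 3   # prior
n, k = 10, 6           # data

# Closed form: P(k, n) = C(n, k) * B(alpha0 + k, beta0 + n - k) / B(alpha0, beta0)
closed_form = comb(n, k) * beta_fn(alpha0 + k, beta0 + n - k) / beta_fn(alpha0, beta0)

# Direct numerical integration of likelihood x prior over p in [0, 1]
numeric, _ = quad(
    lambda p: binom.pmf(k, n, p) * beta_dist.pdf(p, alpha0, beta0), 0, 1
)

assert abs(numeric - closed_form) < 1e-8
print(f"P(data) = {closed_form:.6f}")
```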


A.6 Example: Updating with Real Data¶

Scenario¶

Suppose we have:

  • Prior: Beta(20, 8) — weakly informative, centered at $\frac{20}{28} \approx 0.71$
  • Data: Variant C from our experiment
    • $n = 2022$ trials
    • $k = 1396$ successes
    • Observed rate: $\frac{1396}{2022} \approx 0.690$

Bayesian Update¶

Prior: $$ p \sim \mathrm{Beta}(20, 8) $$

Posterior (after observing data): $$ \begin{align} \alpha_{\text{post}} &= 20 + 1396 = 1416 \\ \beta_{\text{post}} &= 8 + (2022 - 1396) = 634 \\ p \mid \text{data} &\sim \mathrm{Beta}(1416, 634) \end{align} $$

Posterior mean: $$ E[p \mid \text{data}] = \frac{1416}{1416 + 634} = \frac{1416}{2050} \approx 0.6907 $$

Posterior variance: $$ \mathrm{Var}(p \mid \text{data}) = \frac{1416 \times 634}{2050^2 \times 2051} \approx 0.000104 $$

Standard deviation: $\sqrt{0.000104} \approx 0.0102$ (about 1%)

Interpretation¶

  • Prior belief: $p \approx 0.71$ with high uncertainty (only 28 pseudo-observations)
  • Data: $p \approx 0.69$ with 2022 observations
  • Posterior: $p \approx 0.69$ with low uncertainty (2050 effective observations)

The data dominates because we have many more real observations than prior pseudo-observations.

If we had used a stronger prior (e.g., Beta(200, 80)), it would have pulled the posterior closer to 0.71, but would require more data to overcome.
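The arithmetic above can be reproduced in a few lines (same numbers as the worked example):

```python
from scipy.stats import beta as beta_dist

alpha0, beta0 = 20, 8   # prior
n, k = 2022, 1396       # Variant C data

posterior = beta_dist(alpha0 + k, beta0 + (n - k))  # Beta(1416, 634)

print(f"Posterior mean: {posterior.mean():.4f}")  # → 0.6907
print(f"Posterior std:  {posterior.std():.4f}")   # → 0.0102
```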


A.7 Sequential Updates: The Bayesian Learning Loop¶

Conjugacy Enables Sequential Learning¶

Because the posterior is also a Beta distribution, we can use it as the prior for the next update:

$$ \begin{align} \text{Prior}_1 &\xrightarrow{\text{Data}_1} \text{Posterior}_1 \\ \text{Posterior}_1 &= \text{Prior}_2 \xrightarrow{\text{Data}_2} \text{Posterior}_2 \\ \text{Posterior}_2 &= \text{Prior}_3 \xrightarrow{\text{Data}_3} \text{Posterior}_3 \\ &\vdots \end{align} $$

This is how Thompson sampling works!

Each user:

  1. Current belief = Beta($\alpha$, $\beta$)
  2. Show variant, observe conversion
  3. Update: Beta($\alpha + \text{conversion}$, $\beta + (1 - \text{conversion})$)
  4. Repeat

Mathematical Equivalence¶

Batch update (all data at once): $$ \mathrm{Beta}(\alpha_0, \beta_0) \xrightarrow{n \text{ trials, } k \text{ successes}} \mathrm{Beta}(\alpha_0 + k, \beta_0 + (n-k)) $$

Sequential update (one at a time): $$ \begin{align} \mathrm{Beta}(\alpha_0, \beta_0) &\xrightarrow{x_1} \mathrm{Beta}(\alpha_0 + x_1, \beta_0 + (1-x_1)) \\ &\xrightarrow{x_2} \mathrm{Beta}(\alpha_0 + x_1 + x_2, \beta_0 + (2 - x_1 - x_2)) \\ &\quad\vdots \\ &\xrightarrow{x_n} \mathrm{Beta}(\alpha_0 + \sum x_i, \beta_0 + (n - \sum x_i)) \end{align} $$

Result is identical! Order of observations doesn't matter.

This property enables:

  • Continuous monitoring (no need to wait for fixed sample size)
  • Online learning (update as data arrives)
  • Thompson sampling (sample → update → repeat)
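Order-independence is easy to demonstrate: updating one observation at a time, in shuffled order, lands on exactly the same posterior as the batch update (a minimal sketch with simulated outcomes):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, beta0 = 1, 1                    # uniform Beta(1, 1) prior
xs = rng.integers(0, 2, size=500)       # simulated Bernoulli outcomes

# Batch update: add all counts at once
alpha_batch = alpha0 + xs.sum()
beta_batch = beta0 + len(xs) - xs.sum()

# Sequential update, one observation at a time, in shuffled order
alpha_seq, beta_seq = alpha0, beta0
for x in rng.permutation(xs):
    alpha_seq, beta_seq = alpha_seq + x, beta_seq + (1 - x)

# Same posterior either way: observation order does not matter
assert (alpha_seq, beta_seq) == (alpha_batch, beta_batch)
```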

A.8 Summary: The Mathematical Beauty of Beta-Binomial Conjugacy¶

The Full Picture¶

$$ \boxed{ \begin{array}{rcl} \text{Prior} & : & p \sim \mathrm{Beta}(\alpha_0, \beta_0) \\ &&\\ \text{Likelihood} & : & P(k \mid p, n) = \binom{n}{k} p^k (1-p)^{n-k} \\ &&\\ \text{Evidence} & : & P(k, n) = \displaystyle\int_0^1 P(k \mid p, n) \cdot P(p) \, dp = \frac{\binom{n}{k} \cdot B(\alpha_0 + k, \beta_0 + n - k)}{B(\alpha_0, \beta_0)} \\ &&\\ \text{Posterior} & : & p \mid k, n \sim \mathrm{Beta}(\alpha_0 + k,\; \beta_0 + (n-k)) \end{array} } $$


Why This Matters for A/B Testing¶

  1. Closed-form solution: No numerical integration required

    • Integration over all proportions $\int_0^1$ has exact answer
    • Result is another Beta distribution
  2. Simple updates: Just add counts

    • $\alpha \leftarrow \alpha + \text{successes}$
    • $\beta \leftarrow \beta + \text{failures}$
  3. Sequential learning: Posterior becomes next prior

    • No need to recompute from scratch
    • Enables online/streaming algorithms
  4. Computational efficiency: $O(1)$ per update

    • No matrix inversions
    • No iterative optimization
    • Scales to millions of updates
  5. Interpretability: Parameters are pseudo-counts

    • Easy to set priors
    • Easy to explain to stakeholders
  6. Enables Thompson sampling: Natural exploration-exploitation

    • Sample from posterior
    • Update posterior
    • Repeat

The Fundamental Insight¶

Conjugacy transforms an intractable integration problem into trivial arithmetic.

Instead of: $$ \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\int_0^1 \text{Likelihood} \times \text{Prior} \, dp} $$

We get: $$ \text{Posterior parameters} = \text{Prior parameters} + \text{Data counts} $$

This is the mathematical foundation that makes Bayesian A/B testing practical, fast, and elegant.