Exec Summary¶
This notebook presents both Null Hypothesis Significance Testing (NHST) and Bayesian methodologies, and goes into detail about the math and the methodological considerations that make a Bayesian approach a better fit for web/mobile A/B tests. If you want to skip the math and just get the benefits, go to the end of the notebook and use the Python functions:
- test_non_inferiority(): verify new features don't degrade the experience
- select_best_variant(): choose the winning variant with a probability
which should be sufficient to support a CX analysis with 3 variants and a control.
Problem Statement¶
When launching new web or mobile features, engineering teams face a common dilemma:
- Limited traffic allocation: At launch, new features get only 2-5% of traffic to minimize risk
- Multiple variants: Design teams often propose 3-5 different implementations
- Small sample sizes: Because of controlled releases, at launch each variant may only see hundreds or low thousands of users
- Need for speed: We need fast decisions on which variants are best to iterate or scale
- Imperfect launch logistics: Because of bugs or misconfiguration, the allocation between variants and control may not be what was planned, i.e., some samples may be too large, too small, or even missing
Traditional NHST fails here: with small, unbalanced samples (e.g., 7,000 control vs. 150 per variant), statistical tests:
- Fail to reach significance (underpowered, β > 0.8, meaning power < 20%)
- Require weeks of data collection
- Are unwieldy when comparing more than 2 variants at a time
The Bayesian Solution¶
Bayesian methods excel precisely where NHST struggles:
1. Works with Small Samples¶
- Incorporates prior knowledge (historical conversion rates)
- Updates beliefs incrementally as data arrives
- Provides meaningful conclusions even with n=150 per variant
- No arbitrary "minimum sample size" requirement to reach statistical significance.
2. Handles Unbalanced Allocation Naturally¶
- 90% control, 10% variants? No problem.
- Each variant can have different sample sizes
- No need for equal allocation or "balanced designs"
- Protects existing user experience while testing
3. Scales to Many Variants Effortlessly¶
- Compare 3, 5, 10, or 100 variants simultaneously
- Single coherent analysis—no multiple comparison penalties
- Direct answer: P(A is best) = 31%, P(B is best) = 47%, P(C is best) = 22%
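Such "probability of being best" numbers fall out of Monte Carlo draws from each variant's posterior. A minimal sketch with made-up counts (the visitor/conversion numbers and the flat Beta(1, 1) priors are illustrative assumptions, not this notebook's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts per variant: (visitors, conversions) -- illustrative only
data = {"A": (1200, 260), "B": (1150, 270), "C": (1180, 255)}

# One Beta posterior per variant, using a flat Beta(1, 1) prior
samples = np.vstack([
    rng.beta(1 + x, 1 + n - x, size=100_000) for n, x in data.values()
])

# For each simulated draw, which variant had the highest conversion rate?
wins = np.bincount(samples.argmax(axis=0), minlength=len(data))
p_best = wins / wins.sum()

for name, p in zip(data, p_best):
    print(f"P({name} is best) = {p:.1%}")
```

With conjugate Beta posteriors this takes a few lines, and the same posterior draws can feed expected-loss calculations.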
4. Provides Actionable Probabilities¶
- NHST says: "Cannot reject H₀", which is not directly actionable: even if one knows the cost of a bad decision, there is no real way to compute an expected value for what amounts to a bet on being right or wrong.
- Bayesian says: "47% chance B is best, 22% it's worse than control" (actionable, expected value can be computed, can directly feed into a decision algorithm)
- Direct business decision: Deploy B with quantified risk
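As a sketch of the "directly feed into a decision algorithm" claim: with assumed payoffs (say a bad deploy costs $100k and a good one gains $50k), the expected value of deploying B is one line of arithmetic. All numbers below are illustrative:

```python
# Illustrative posterior probabilities and payoffs (all assumed, not real data)
p_b_best = 0.47           # P(B is best)
p_b_worse = 0.22          # P(B is worse than control)
gain_if_best = 50_000     # $ gained if B really is best
loss_if_worse = -100_000  # $ lost if B degrades the experience

# Remaining probability mass (B neither best nor worse) is treated as a $0 payoff
ev = p_b_best * gain_if_best + p_b_worse * loss_if_worse
print(f"Expected value of deploying B: ${ev:,.0f}")
```

No such computation is possible from a bare "reject/fail to reject" verdict.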
5. Allows Continuous Monitoring¶
- Can check results anytime without "p-hacking" concerns
- Smoothly and incrementally update posterior probability as new data arrives
- Stop early if a clear winner emerges, or rebalance incrementally (see multi-armed bandit strategies)
- Continue if more certainty needed—mathematically rigorous
Key Benefits Summary¶
| Aspect | Traditional NHST | Bayesian Approach |
|---|---|---|
| Small samples | Underpowered, inconclusive | Works well with prior knowledge |
| Unbalanced allocation | Loses efficiency | No problem |
| Multiple variants | Complex corrections needed | Natural single analysis |
| Interpretation | p-value (hard to explain) | Probability (intuitive) |
| Decision making | Binary reject/fail | Quantified risk/confidence |
| Continuous monitoring | Forbidden (p-hacking) | Allowed and rigorous |
| Time to decision | Weeks (need larger n) | Days (works with small n) |
Bottom Line¶
For modern product development with:
- Rapid iteration cycles
- Bugs or misconfiguration in the traffic splitters
- Risk-averse traffic allocation
- Multiple design options
- Small initial samples
Bayesian methods provide:
- Faster decisions (days vs. weeks)
- Better use of limited data
- Clear, business-friendly outputs
- Quantified confidence for risk management
This enables product teams to launch confidently, iterate quickly, and scale successful features—all while protecting the existing user experience.
Implementation¶
The two standalone Python functions in this notebook provide:
- test_non_inferiority(): verify new features don't degrade the experience
- select_best_variant(): choose the winning variant with a probability
Both work with any sample sizes and scale to any number of variants.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
from scipy.stats import norm
from statsmodels.stats.proportion import confint_proportions_2indep
from statsmodels.stats.proportion import proportions_ztest
# import the beta function from scipy.special
from scipy.special import beta as beta_function
from scipy.stats import beta as beta_dist
from plotting_utils import plot_gaussian_hypothesis_test
from plotting_utils import plot_type_ii_error_analysis, plot_beta_prior_comparison, plot_prior_vs_posterior
from plotting_utils import plot_informative_prior_posterior_comparison, plot_weakly_informative_prior_with_variants
from plotting_utils import plot_multiple_posteriors_comparison
from nhst import compute_sample_size_non_inferiority
from bayesian import test_non_inferiority, select_best_variant, test_non_inferiority_weakly_informative
In-Depth Analysis of the Two Methodologies¶
When evaluating new user experiences (UX), such as launching passkeys and measuring their impact on abandonment rates, our CX team has provided us with 3 variants $A_1$, $A_2$ and $A_3$ for the passkey creation experience. After launch, we need to make sure that adding passkey creation does not significantly degrade the success rate of our current CX, by comparing each variant to a control group that will remain on our legacy pages; then we have to decide which of the variants performs best.
This notebook compares two major approaches:
- Null Hypothesis Significance Testing (NHST) — the long-standing statistical framework, widely used but often conceptually tricky and difficult to interpret in practical business decision-making.
- Bayesian methods — increasingly popular because they offer more flexibility, work better on small samples, and produce results that are often easier to interpret directly when deciding on actions (e.g., when to shift more traffic from a control to a new variant).
Test Setup: Control Group vs. Variants¶
For the sake of the discussion let's assume an existing digital identity+credentials creation flow with a completion rate of ~20% (meaning ~80% of users abandon).
Our test design:
- Keep x% of traffic on the current experience as the control group C.
- Send the remaining traffic to one or more variants $A_1$, $A_2$, $A_3$.
The test will be conducted in two steps. First, we need to determine that each new experience is no worse than the current one, accepting small degradation as unavoidable since we are adding more pages and clicks. Then, after establishing that we haven't degraded the experience in an unacceptable fashion, we shift more traffic to the variants and decide which variant performs best.
This type of A/B test, where the first goal is to ensure a new design does not degrade the experience, is called a non-inferiority test (explained below).
Null Hypothesis Significance Testing (NHST)¶
At a high level, the NHST workflow is:
- Assume what you don’t want to see — this is the null hypothesis.
- Example in medicine: “the drug has no effect.”
- Example here: “the new experience significantly increases abandonment.”
- Run the experiment and compute a test statistic, which in this case is a proportion: the number of successes over total attempts.
- Ask: If the null hypothesis were true, how likely is it that we would observe a result at least this extreme?
- If that probability (the p-value) is very low — e.g., below a conventional threshold such as 5% — we reject the null.
Two immediate caveats:
- Rejecting the null does not prove the opposite is true; it only says the data would be unlikely if the null were correct. The p-value is quantified as P(data | H₀), but the resulting decision (reject or fail to reject) comes with no probability of being correct. Without P(H₀ | data), we cannot compute expected values for decision-making: if deploying a bad variant costs $100k and a good one gains $50k, NHST provides no framework to quantify the expected value of the decision.
- “Unlikely enough” is also completely arbitrary — thresholds like 5% are conventions, not laws of nature.
A key point: NHST computes $P(\text{data} \mid \text{hypothesis})$.
Later we’ll see that the Bayesian approach instead computes $P(\text{hypothesis} \mid \text{data})$ — a fundamentally different quantity.
Modeling Conversion as Random Variables¶
The abandonment or conversion of a UX flow can be modeled with Bernoulli random variables:
- $X_C$ for the control experience
- $X_A$ for a new variant $A$
A Bernoulli variable takes only two values: success/failure, convert/abandon, etc.
Each user who sees a page gives one draw from one of these variables.
We assume both have the same codomain:
$$ \mathcal{X}_C = \mathcal{X}_A = \{0,1\} $$
where 1 = convert (user finishes the intended action, e.g., creating a passkey) and 0 = abandon.
Technical failures are treated as success here because the user attempted the action.
Sample Proportions¶
NHST usually works with sample proportions, the average of $n$ Bernoulli draws:
$$ \hat{p}_C = \frac{1}{n}\sum_{i=1}^n X_{C_i}, \quad \hat{p}_A = \frac{1}{n}\sum_{i=1}^n X_{A_i}. $$
Each $\hat{p}$:
- Is a random variable taking values $\{0,\tfrac1n,\tfrac2n,\ldots,1\}$.
- Is also an estimator of the true expected value $p = E[X]$.
By the law of large numbers, $\hat{p} \to p$ as $n$ grows.
(Statisticians use a “hat” to denote an estimator.)
Formally, an estimator maps $n$ realizations of $X$ ($\mathcal{X}^n$) to a real number:
$$ \hat{p}: \mathcal{X}^n \to [0,1]. $$
Because $n\hat{p}$, the number of successes, follows a binomial distribution, $\hat{p}$ is a scaled binomial that becomes approximately Gaussian when $n$ is large.
Variance and Standard Deviation of a Sample Proportion¶
For a single Bernoulli $X$:
$$
\mathrm{Var}(X) = p(1-p).
$$
For the sample proportion: $$ \mathrm{Var}\!\left(\tfrac1n \sum_{i=1}^n X_i\right) = \tfrac1{n^2} n p(1-p) = \tfrac{p(1-p)}{n}. $$
$$ \boxed{\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}} $$
The square root of this variance is the standard error — a quantity we’ll use later.
$$ \boxed{SE = SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}} $$
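We can sanity-check the boxed formula by simulation: the empirical standard deviation of many simulated $\hat{p}$ values should match $\sqrt{p(1-p)/n}$. A quick sketch with assumed values $p = 0.2$ and $n = 2000$ (chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
p, n = 0.20, 2_000  # assumed true conversion rate and sample size

# Simulate 100,000 experiments of n Bernoulli draws each; each p_hat is the
# number of successes divided by n
p_hats = rng.binomial(n, p, size=100_000) / n

analytic_se = np.sqrt(p * (1 - p) / n)
print(f"Analytic SE:  {analytic_se:.5f}")
print(f"Empirical SD: {p_hats.std():.5f}")
```

The two numbers agree to several decimal places, and a histogram of `p_hats` would show the approximately Gaussian shape used throughout the NHST derivations below.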
Difference in Proportions¶
For our current problem of deciding "non-inferiority" or "superiority", we will use as our metric the difference between variant and control proportions:
$$ \hat{\Delta} = \hat{p}_A - \hat{p}_C $$
This estimates the true difference
$$ \Delta = p_A - p_C. $$
Hypotheses¶
Null Hypothesis $H_0$ — the “bad” scenario we want to reject:
the new UX degrades conversion by at least some small amount $\epsilon$ we consider unacceptable (e.g., $3\%$):$$ H_0: E[\Delta] \le -\epsilon $$
Alternative Hypothesis $H_1$ — the new UX is not worse than control (possibly better):
$$ H_1: E[\Delta] > -\epsilon $$
Boundary Hypothesis — used in test construction:
assume the difference is exactly at the acceptable degradation limit:$$ E[\Delta] = -\epsilon $$
Numerical Example¶
For a concrete example, we define the following counts and quantities for each experience:
$n_C$ : number of visitors in the control group
$x_C$ : number of conversions observed in the control group
$n_A$ : number of visitors in the variant A group
$x_A$ : number of conversions observed in the variant A group
$\hat{\Delta}_{\mathrm{obs}}$ : the observed difference in conversion proportions between variant and control; it is negative when the difference goes in the "wrong" direction, i.e., a degradation
$-\epsilon$ : the acceptable degradation margin, i.e., the largest decrease in conversion we are willing to tolerate for the new variant
# Actual experiment data
nC = 32106
xC_observed = 22772
control_group_conversion_rate = xC_observed / nC
# Three variants with actual experiment data
variants = {
'A': {'n': 4625, 'x': 3244},
'B': {'n': 2100, 'x': 1433},
'C': {'n': 2022, 'x': 1396}
}
# Variant under test (variant C here; variable names keep the generic A/X convention)
nX = variants['C']['n']
xX_observed = variants['C']['x']
# Test parameters
epsilon = 0.02 # 2% non-inferiority margin
nhst_alpha = 0.05 # 5% significance level
# Derived values
hatpC_observed = xC_observed / nC
hatpA_observed = xX_observed / nX
hatDelta_observed = hatpA_observed - hatpC_observed
print(f"Control group conversion rate: {hatpC_observed:.4f}")
print(f"Treatment group A conversion rate: {hatpA_observed:.4f}")
print(f"Observed difference in conversion rate: {hatDelta_observed:.4f}")
Control group conversion rate: 0.7093 Treatment group A conversion rate: 0.6904 Observed difference in conversion rate: -0.0189
Standard Deviation of the Estimator $\hat{\Delta}$ (a.k.a. Standard Error in Frequentist Statistics)¶
In NHST, the first step is to estimate the standard deviation of the estimator $\hat{\Delta}$ (often called the standard error, SE). Note that this is already a relatively convoluted concept: we must imagine a hypothetical world in which we could repeat the experiment many times and ask, assuming the null hypothesis, what variance we would observe in the results.
We then compare the observed difference in proportions from the experiment to this estimated variability to decide whether the observed effect is "far enough" from what we would expect, still under the null hypothesis $H_0$.
This is a key pain point for NHST:
- We do not know the true standard deviation — it depends on the unknown and unknowable underlying conversion probabilities.
- Frequentist methods therefore use the plug-in principle: estimate the unknown variance by “plugging in” the sample estimates (the data you just observed).
But note the circularity:
- We want to know if the data are unusual under $H_0$.
- To measure “unusual,” we need the standard error assuming $H_0$.
- SE depends on the unknown true rates, so we plug in $\hat{p}$ (from the data!).
- We then use this data-derived SE to judge whether the data are unusual.
It’s like saying: “Use my one measurement to tell me how variable my measurements are, then use that to decide if my measurement is surprising.”
Frequentists accept this because:
- Long-run frequency view: if we repeated the procedure many times in a hypothetical world and everything remained constant, it would have correct average properties over the long run.
- Pragmatism: it’s computable from one experiment.
- Simulation evidence: works “reasonably well” for not-too-small $n$, though “reasonably well” is debatable (see the many articles and books about the replication crisis in medicine and the "soft" sciences).
- Framework limitation: in classical stats, parameters are fixed unknowns, so no natural way to treat them as random.
Common Plug-In Approaches (a.k.a. “Standard Error Hacks”)¶
1. Wald Pooled Standard Error (for a “No Effect” Hypothesis)¶
If the hypothesis is no effect ($p_A = p_C$), we can pool data from control and variant, because under this hypothesis they are assumed to come from the same distribution.
Realizations of sample proportions are:
$$ \hat{p}_A = \frac{x_A}{n_A}, \qquad \hat{p}_C = \frac{x_C}{n_C}. $$
Assuming "no effect" is true (no difference), a pooled estimator is:
$$ \hat{p}_{\text{pool}} = \frac{x_A + x_C}{n_A + n_C}. $$
Variance of the difference between two independent proportions is the sum of their variances:
$$ \mathrm{Var}(\hat{p}_A - \hat{p}_C) = \mathrm{Var}(\hat{p}_A) + \mathrm{Var}(\hat{p}_C). $$
Plugging in the pooled estimate:
$$ \mathrm{Var}(\hat{p}_A - \hat{p}_C) = \hat{p}_{\text{pool}}(1-\hat{p}_{\text{pool}}) \left(\frac{1}{n_A}+\frac{1}{n_C}\right). $$
So the pooled standard error is:
$$ \boxed{\text{WaldPooled SE} = \sqrt{\hat{p}_{\text{pool}}(1-\hat{p}_{\text{pool}}) \left(\frac{1}{n_A}+\frac{1}{n_C}\right)} } $$
This is the classic standard error of the difference between two independent proportions.
Once SE is computed, we calculate a z-score (number of SEs the observed difference is away from the null value 0):
$$ z = \frac{\hat{p}_A - \hat{p}_C}{\text{SE}}. $$
2. Wald Unpooled Standard Error (for Non-Inferiority)¶
For a non-inferiority test, the null sits at the margin $-\epsilon$, so we cannot assume $p_A = p_C$ and we don't pool.
Instead we sum the individual variances (using the plug-in trick separately for each group):
$$ \widehat{\text{WaldUnpooled SE}} = \sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_C(1-\hat{p}_C)}{n_C}}. $$
Ideally, the true $p_A$ and $p_C$ should be used here, but we don’t know them — so we substitute $\hat{p}_A$ and $\hat{p}_C$.
This works but can be inaccurate if sample sizes are too small or the underlying rates are toward extremes.
3. Newcombe / Score-Based (Wilson)¶
- Better coverage than Wald, especially with imbalanced sample sizes or very high/low $p$.
- Still closed-form and relatively easy to compute.
- Still uses plug-in estimates and suffers from the same “single-sample” limitation.
(Not detailed here — but recommended over Wald when possible.)
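For the curious, Newcombe's hybrid interval is simple enough to sketch: build a Wilson score interval for each proportion, then combine them into an interval for the difference. The code below is our own minimal version (not the statsmodels `confint_proportions_2indep` imported earlier), using a one-sided $z$ suitable for non-inferiority use; reject $H_0$ when the lower bound exceeds $-\epsilon$:

```python
import math
from scipy.stats import norm

def wilson_interval(x, n, z):
    """Wilson score interval for a single proportion."""
    p_hat = x / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def newcombe_diff_interval(x_a, n_a, x_c, n_c, alpha=0.05):
    """Newcombe hybrid interval for p_A - p_C (one-sided z at level alpha)."""
    z = norm.ppf(1 - alpha)
    p_a, p_c = x_a / n_a, x_c / n_c
    l_a, u_a = wilson_interval(x_a, n_a, z)
    l_c, u_c = wilson_interval(x_c, n_c, z)
    d = p_a - p_c
    lower = d - math.sqrt((p_a - l_a)**2 + (u_c - p_c)**2)
    upper = d + math.sqrt((u_a - p_a)**2 + (p_c - l_c)**2)
    return lower, upper

# Variant C vs. control, with the counts from this notebook's numerical example
lower, upper = newcombe_diff_interval(1396, 2022, 22772, 32106)
print(f"Newcombe interval for the difference: [{lower:.4f}, {upper:.4f}]")
print(f"Non-inferior at a 2% margin? {lower > -0.02}")
```

On this data the lower bound sits well below $-0.02$, consistent with the fail-to-reject outcome the Wald-based test produces on the same numbers.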
4. Miettinen–Nurminen¶
- Widely used in clinical trials and regulated industries (e.g., FDA guidance).
- Provides improved accuracy for non-inferiority tests.
- However: mathematically complex, still plug-in based, and still fundamentally relies on one sample.
Summary¶
Frequentist methods must estimate variance from the data itself — leading to circularity and potential miscalibration, especially for small samples or edge cases.
Later we’ll see how Bayesian methods avoid this by modeling uncertainty about the true conversion rates directly.
Numerical examples
pooled_proportion = (xC_observed + xX_observed) / (nC + nX)
wald_pooled_SE = (pooled_proportion * (1 - pooled_proportion) * (1/nC + 1/nX))**0.5
print(f"Wald Pooled Standard Error: {wald_pooled_SE:.4f}")
wald_unpooled_SE = ((hatpC_observed * (1 - hatpC_observed) / nC) + (hatpA_observed * (1 - hatpA_observed) / nX))**0.5
print(f"Wald Unpooled Standard Error: {wald_unpooled_SE:.4f}")
Wald Pooled Standard Error: 0.0104 Wald Unpooled Standard Error: 0.0106
Probability of False Positive, p-value, Significance Level $\alpha$ (sometimes called “confidence”), and Critical Value¶
Once we have an estimate of the standard error $SE$ (the standard deviation of $\hat{\Delta}$), NHST assumes a sampling distribution for the estimator under the null hypothesis $H_0$.
The idea is:
- If we know (or assume) the mean and standard deviation of $\hat{\Delta}$ under $H_0$,
- we can model it with a known probability distribution and calculate how likely any observed result is.
Although the true process is discrete (binomial), in practice we often approximate it by a normal (Gaussian) distribution. This is mathematically simpler and is a good approximation for moderate or large sample sizes.
Using the “boundary” value as the mean of the distribution we integrate¶
To decide whether the result we are seeing is “unlikely”, we integrate the right tail of the probability distribution, with lower bound $\hat{\Delta}_{\mathrm{obs}}$ and upper bound $+\infty$; this gives the “probability of the conversion difference being at $\hat{\Delta}_{\mathrm{obs}}$ or better”.
The null hypothesis for "non inferiority" is technically an inequality
$$ H_0: E[\Delta] \le -\epsilon. $$
So we could imagine many probability distributions centered at any value below $-\epsilon$, and all of them would be covered (true) under $H_0$. We don't want to do the computation for all of them, so to get a single distribution to work with, we use the boundary value as the mean of the distribution:
$$ \mu = E[\Delta] = -\epsilon. $$
Why?
- Whatever decision we make with $\mu = E[\Delta] = -\epsilon$ as the mean of the distribution and $\hat{\Delta}_{\mathrm{obs}}$ as the lower bound of the integral, we would make the exact same decision for any distribution centered further left than $-\epsilon$: the lower bound $\hat{\Delta}_{\mathrm{obs}}$ does not move, and moving the mean to the left only makes the right tail smaller/thinner (less area, less probable). The opposite is not true: if we picked a distribution centered above $-\epsilon$ but still integrated from $\hat{\Delta}_{\mathrm{obs}}$, we could fail to reject the null because the tail probability gets higher (say > 5%), while we would still reject it with the mean back at $-\epsilon$. So the boundary is the most conservative choice and the only correct test we can make.
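This monotonicity is easy to check numerically. A quick sketch, with the observed difference and unpooled SE from this notebook's example hardcoded for self-containment:

```python
from scipy.stats import norm

obs = -0.0189   # observed difference (from the example above)
se = 0.0106     # unpooled standard error (from the example above)
eps = 0.02      # non-inferiority margin

# The right-tail probability P(Delta_hat >= obs) shrinks as the assumed
# mean moves further left of the boundary -eps
for mu in (-eps, -0.03, -0.05):
    print(f"mean {mu:+.3f}: tail probability {norm.sf(obs, loc=mu, scale=se):.4f}")
```

Any mean below $-\epsilon$ produces a smaller tail probability than the boundary does, so rejecting at the boundary implies rejecting under every distribution allowed by $H_0$.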
So under $H_0$, we model $\hat{\Delta}$ as:
$$ \hat{\Delta} \sim N(\mu, \sigma) $$
with
$$ \mu = -\epsilon, \qquad \sigma = SE. $$
# Plot the Gaussian N(mu, sigma) and shade the right-tail area beyond x
# Use previously defined values
SE_H0 = wald_unpooled_SE
mu_H0 = -epsilon
sigma_H0 = SE_H0
x0 = hatDelta_observed
# Create the plot using the helper function
fig, ax = plot_gaussian_hypothesis_test(
mu_H0=mu_H0,
sigma_H0=sigma_H0,
observed_value=x0,
alpha=nhst_alpha,
epsilon=epsilon
)
Computing the p-Value¶
The p-value is the probability (under $H_0$) of observing a result as extreme or more extreme than what we got, in the direction of the alternative $H_1$.
In this one-sided non-inferiority test, that means the right tail probability:
$$ p\text{-value} = P_{H_0}\big[\hat{\Delta} \ge \hat{\Delta}_{\text{obs}}\big] = \int_{\hat{\Delta}_{\text{obs}}}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\,dx. $$
This tail integral is the survival function of the normal distribution.
Using the standard normal CDF $\Phi$:
$$ p\text{-value} = 1 - \Phi\!\left(\frac{\hat{\Delta}_{\text{obs}}-\mu}{\sigma}\right). $$
Significance Level $\alpha$ and Critical Value¶
- We choose a significance level $\alpha$ (often $0.05$).
- If $p\text{-value} \le \alpha$, we reject $H_0$ — our result is unlikely under the null.
The critical value $c$ is the smallest observed difference that would lead to rejection at level $\alpha$. It is obtained by inverting the right-tail probability:
$$ c = \mu + \sigma \,\Phi^{-1}(1 - \alpha). $$
Any observed $\hat{\Delta}_{\text{obs}} \ge c$ yields $p\text{-value} \le \alpha$ and thus rejects $H_0$.
Below we’ll compute the p-value and critical value explicitly and visualize how the right tail behaves.
SE_H0 = wald_unpooled_SE
from scipy.stats import norm
mu_H0 = -epsilon # mean
sigma_H0 = SE_H0 # standard deviation
x = hatDelta_observed # value to evaluate
# Survival function P(X > x)
p_value = norm.sf(x, loc=mu_H0, scale=sigma_H0)
print(f'p-value (one-sided) for an observed difference in proportions at {hatDelta_observed:.4f}: {p_value:.4f}')
# inverse survival function to find critical value for given p-value
critical_value = norm.isf(nhst_alpha, loc=mu_H0, scale=sigma_H0)
print(f"Critical value for our alpha cutoff value at {nhst_alpha:.4f}: {critical_value:.4f}")
p-value (one-sided) for an observed difference in proportions at -0.0189: 0.4575 Critical value for our alpha cutoff value at 0.0500: -0.0026
So with the acceptable degradation value we chose for this example, that is a 2% maximum acceptable degradation, the p-value is a bit more than 45%: very far from being enough if we pick the customary 5% as the alpha cutoff for significance. So we would "fail to reject", meaning we can't say much; we don't know whether we can claim that the new CX does not cause some unwanted degradation.
Note that this is because the sample is relatively small. Small samples are a realistic situation when launching a new feature; for various reasons including unresolved bugs you usually just put a tiny bit of traffic on the new feature. But those small samples are often not sufficient to make a decision based on NHST. This is due to the way the standard error is computed; it needs more data to deliver some insights.
Traditional Presentation Using z-Scores¶
Another common way to compute the p-value in NHST is to standardize the observed statistic rather than work directly with $\hat{\Delta}_{\mathrm{obs}}$.
The idea:
- Convert our Gaussian with mean $\mu$ and standard deviation $\sigma$ into a standard normal $N(0,1)$.
- This is done by the familiar $z$-score transformation:
$$ Z = \frac{X - \mu}{\sigma}. $$
For our non-inferiority test:
$$ Z_{\mathrm{NI}} = \frac{\hat{\Delta} - E[\Delta]_{H_{\text{boundary}}}}{SE} = \frac{\hat{\Delta} - (-\epsilon)}{SE} = \frac{\hat{\Delta} + \epsilon}{SE}. $$
Thus $Z_{\mathrm{NI}}$ follows approximately a standard normal $N(0,1)$ under $H_0$.
Tail Probability in Standard Normal Form¶
We want the probability (under $H_0$) of observing something at least as extreme as our sample:
$$ P\big[\hat{\Delta} \ge \hat{\Delta}_{\mathrm{obs}}\big] = P\!\left[\frac{\hat{\Delta} + \epsilon}{SE} \ge \frac{\hat{\Delta}_{\mathrm{obs}}+\epsilon}{SE}\right]. $$
This is simply the right-tail probability of a standard normal:
$$ \int_{Z_{\mathrm{NI}}}^{+\infty} \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz. $$
Using the standard normal survival function gives the same p-value as before — this is just the “classical” z-score framing that many NHST tutorials use.
zni = (hatDelta_observed + epsilon) / SE_H0
p_zni = norm.sf(zni)
print(f"z_NI: {zni:.4f}, p-value: {p_zni:.4f}")
z_NI: 0.1067, p-value: 0.4575
This is the same p-value, just a different way to compute the integral.
False Positive (a.k.a. Type I Error)¶
In this NHST setup, $\alpha$ is the cutoff for the decision rule (we reject $H_0$ when the p-value falls at or below it), and it equals the conditional probability of rejecting $H_0$ when $H_0$ is actually true. In this "non-inferiority" setup a rejection is considered a "positive", since $H_0$ is "what we don't want to see"; a false positive therefore means concluding “no unacceptable degradation” when there actually is a nasty one.
Note the distinction: the p-value is P(data as extreme or more extreme | H₀ is true), while $\alpha$ is the threshold chosen so that P(Reject H₀ | H₀ is true) = $\alpha$; we reject H₀ when p-value ≤ $\alpha$.
By setting the significance level $\alpha = 0.05$, we accept a 5% risk of making this wrong decision when there is a degradation. But note that this is a frequentist definition: if we ran the experiment many times, on average we would incorrectly reject 5% of the time. However, it does not assign any actual probability to our current decision/experiment (Bayesian methods do, see below).
It also says nothing about the "effect size", that is, about how much non-degradation/improvement we may have. For non-inferiority this does not matter too much, but for superiority we would want to know; this too is something the Bayesian approach can do more directly.
False Negative (Type II Error), Power, and Sample Size¶
A false negative is failing to reject $H_0$ when the alternative $H_1$ is actually true.
In the context of a non-inferiority test, a false negative means:
We fail the test (do not reject $H_0$) even though the new UX is truly non-inferior (as good as or better than the old one). Usually this means we would need to keep gathering data until the test has more power to detect the effect (see below).
Choosing an Effect Size Under $H_1$¶
Just as with the Type I error calculation, we need to pick an expected value for the difference $\Delta$ — but this time under $H_1$.
- In practice, we must choose a single reference value to center the alternative distribution.
- A common (and pragmatic) choice is the minimum effect size we care to detect — often set to $E[\Delta] = 0$ (meaning no difference between variant and control).
- If the variant is truly “no worse” (Δ = 0), the test should reject $H_0$ most of the time.
This choice is somewhat arbitrary and reflects a business decision: “How small of a difference do we consider acceptable to detect?”
Modeling Under $H_1$¶
If we assume the variant is truly no worse (Δ = 0), we can pool samples to estimate the standard error (since under $H_1$ we’re treating them as coming from the same distribution):
$$ SE_{H_1} = \text{WaldPooled SE} = \sqrt{\hat{p}_{\mathrm{pool}} (1-\hat{p}_{\mathrm{pool}}) \left(\tfrac{1}{n_C}+\tfrac{1}{n_A}\right)}. $$
We then compare this alternative distribution (mean = 0, std = $SE_{H_1}$) to the critical value $c$ that was already set by the significance level $\alpha$.
Beta and Power¶
$\beta$ (Type II error) = the false negative rate = the probability that the observed statistic falls below the critical value established under $H_0$, computed with a different distribution: its mean is the value we chose under $H_1$ (e.g., Δ = 0) and its standard error is the one established under $H_1$. "Negative" here means "fail to reject" while non-inferiority truly held.
Power = $1-\beta$ = probability of correctly rejecting $H_0$ when the variant is truly non-inferior. Another way to say this: knowing the property we care about is really there (non-inferiority), what is our probability of detecting it? In machine learning and search-query analysis, power is also called "recall": if the property we care about is really there, how often do we find it.
Graphically:
- The null distribution is centered at $-\epsilon$ (our boundary).
- The alternative distribution is centered at $0$ (no degradation).
- $\beta$ is the area of the alternative distribution to the left of the critical value.
SE_H1 = wald_pooled_SE
mu_H1 = 0
sigma_H1 = SE_H1
x = critical_value
beta = norm.cdf(x, loc=mu_H1, scale=sigma_H1)
print(f"Observed probability of false negative a.k.a β a.k.a type 2 errors, at critical value : {beta:.4f}")
power = 1 - beta
print(f"Observed Power (1 - β): {power:.4f}")
Observed probability of false negative a.k.a β a.k.a type 2 errors, at critical value : 0.4022 Observed Power (1 - β): 0.5978
# Plot Type II error analysis
# Create the plot using the helper function
fig, ax = plot_type_ii_error_analysis(
mu_H1=mu_H1,
sigma_H1=sigma_H1,
critical_value=critical_value,
hatDelta_observed=hatDelta_observed,
epsilon=epsilon,
beta=beta,
power=power
)
Designing for Target Power¶
If we want to achieve a target power — commonly 80% (so $\beta = 0.2$) —
we can solve for the required sample size (embedded in $SE$).
- Larger $n$ → smaller $SE$ → distributions separate more clearly → higher power.
- This is the usual sample size calculation step when planning an A/B test.
In practice, one:
- Fixes $\alpha$ (e.g., 0.05).
- Chooses the minimum effect size of interest (e.g., $\Delta=0$ for non-inferiority).
- Sets desired power (e.g., 80%).
- Solves for $n_C$ and $n_A$ to achieve that power given the pooled variance.
The code to solve for a target power is straightforward to develop; since we will favor Bayesian approaches, which can work with small samples, we rely on the compute_sample_size_non_inferiority utility imported above rather than deriving it again here.
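For the equal-allocation case with Δ = 0 under $H_1$, the solve has a closed form: requiring $\Phi(\epsilon/SE - z_{1-\alpha}) = 1 - \beta$ with $SE^2 = 2p(1-p)/n$ gives $n = 2p(1-p)\,(z_{1-\alpha} + z_{1-\beta})^2/\epsilon^2$ per group. A minimal sketch (the function name is ours, not part of the notebook's modules):

```python
import math
from scipy.stats import norm

def n_per_group_equal(p, epsilon, alpha=0.05, power=0.80):
    """Per-group sample size for a one-sided non-inferiority z-test,
    assuming equal allocation and a true difference of 0 under H1."""
    z_total = norm.ppf(1 - alpha) + norm.ppf(power)
    return math.ceil(2 * p * (1 - p) * z_total**2 / epsilon**2)

print(n_per_group_equal(0.7093, 0.02))  # 6375 per group at 80% power
```

Plugging in the control rate and 2% margin from this notebook's example reproduces the 6,375-per-group figure computed by the utility function below.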
# Example: Compute sample size for our passkey example using the utility function
print("="*80)
print("SAMPLE SIZE CALCULATION FOR NON-INFERIORITY TEST")
print("="*80)
# Parameters from our example
p_control = control_group_conversion_rate
epsilon_val = epsilon
alpha_val = nhst_alpha
target_power = 0.80
print(f"\nParameters:")
print(f" Control conversion rate: {p_control:.2%}")
print(f" Non-inferiority margin (ε): {epsilon_val:.2%}")
print(f" Significance level (α): {alpha_val:.2%}")
print(f" Target power: {target_power:.2%}")
print(f" Assumed true difference under H1: 0 (no difference)")
# Equal allocation (1:1)
result_equal = compute_sample_size_non_inferiority(
p_control=p_control,
epsilon=epsilon_val,
alpha=alpha_val,
target_power=target_power,
h1_effect_size=0.0,
allocation_ratio=1.0
)
print(f"\n{'='*80}")
print("EQUAL ALLOCATION (1:1 - Control:Variant)")
print(f"{'='*80}")
print(f"Required sample size per group: {result_equal['n_variant']:,}")
print(f" Control: {result_equal['n_control']:,}")
print(f" Variant: {result_equal['n_variant']:,}")
print(f" Total: {result_equal['n_total']:,}")
print(f"\nAchieved power: {result_equal['power_achieved']:.4f} ({result_equal['power_achieved']*100:.1f}%)")
# Unequal allocation (10:1 - more traffic to control)
result_unequal = compute_sample_size_non_inferiority(
p_control=p_control,
epsilon=epsilon_val,
alpha=alpha_val,
target_power=target_power,
h1_effect_size=0.0,
allocation_ratio=0.1 # Variant gets 10% of control's sample size
)
print(f"\n{'='*80}")
print("UNEQUAL ALLOCATION (10:1 - Control gets 10x more traffic)")
print(f"{'='*80}")
print(f" Control: {result_unequal['n_control']:,}")
print(f" Variant: {result_unequal['n_variant']:,}")
print(f" Total: {result_unequal['n_total']:,}")
print(f"\nAchieved power: {result_unequal['power_achieved']:.4f} ({result_unequal['power_achieved']*100:.1f}%)")
print(f"\n{'='*80}")
print("COMPARISON WITH CURRENT EXAMPLE")
print(f"{'='*80}")
print(f"Current sample sizes:")
print(f" Control: {nC:,}")
print(f" Variant: {nX:,}")
print(f" Observed power: {power:.4f} ({power*100:.1f}%)")
print(f"\nTo achieve 80% power, you would need:")
print(f" Equal allocation: {result_equal['n_variant']:,} per group")
print(f" Increase factor: {result_equal['n_variant'] / nX:.1f}x more samples per group")
print(f"\n💡 KEY INSIGHT:")
print(f" With current n={nX}, power is only {power*100:.1f}%")
print(f" Need n≈{result_equal['n_variant']:,} per group for 80% power")
print(f" This is why NHST struggles with small samples!")
================================================================================
SAMPLE SIZE CALCULATION FOR NON-INFERIORITY TEST
================================================================================

Parameters:
  Control conversion rate: 70.93%
  Non-inferiority margin (ε): 2.00%
  Significance level (α): 5.00%
  Target power: 80.00%
  Assumed true difference under H1: 0 (no difference)

================================================================================
EQUAL ALLOCATION (1:1 - Control:Variant)
================================================================================
Required sample size per group: 6,375
  Control: 6,375
  Variant: 6,375
  Total: 12,750

Achieved power: 0.8000 (80.0%)

================================================================================
UNEQUAL ALLOCATION (10:1 - Control gets 10x more traffic)
================================================================================
  Control: 35,059
  Variant: 3,506
  Total: 38,565

Achieved power: 0.8000 (80.0%)

================================================================================
COMPARISON WITH CURRENT EXAMPLE
================================================================================
Current sample sizes:
  Control: 32,106
  Variant: 2,022
  Observed power: 0.5978 (59.8%)

To achieve 80% power, you would need:
  Equal allocation: 6,375 per group
  Increase factor: 3.2x more samples per group

💡 KEY INSIGHT:
  With current n=2022, power is only 59.8%
  Need n≈6,375 per group for 80% power
  This is why NHST struggles with small samples!
NHST Confidence Interval (CI)¶
NHST also has a notion of a Confidence Interval (CI), but it does not mean what most people think. It says nothing about the hypothesis, about whether the observed value is close to the truth, or about whether rejecting H0 is correct. It is computed without reference to either hypothesis: using only the "plugged-in" standard error SE (derived from the observation, with all the caveats discussed above), it states that if we repeated the experiment many times and constructed an interval each time with this procedure, 95% of those intervals would contain the true parameter value. The parameter is fixed; the interval is random across repeated experiments. This is not directly useful for making decisions. Some people use it indirectly by checking whether the interval overlaps some key value, but that is just the p-value analysis above in another guise (or the "picking the best variant" analysis below): the underlying computation is the same, and it is the mainstream way of doing it for NHST practitioners.
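As a concrete sketch of that procedure, using the control and variant C counts that appear in the notebook's outputs (22,772/32,106 and 1,396/2,022; the control count is inferred from the printed rate, so treat these as illustrative):

```python
import numpy as np

# Counts inferred from the notebook's printed rates
n_c, x_c = 32_106, 22_772   # control
n_x, x_v = 2_022, 1_396     # variant C
p_c, p_v = x_c / n_c, x_v / n_x
delta_hat = p_v - p_c

# "Plugged-in" (Wald) standard error of the difference
se = np.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_x)
ci = (delta_hat - 1.96 * se, delta_hat + 1.96 * se)
print(f"Δ̂ = {delta_hat:.4f}, 95% CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```

Here the interval straddles the −2% margin, so the overlap heuristic is exactly as inconclusive as the p-value analysis it mirrors.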
Bayesian Approach¶
In contrast to NHST, the Bayesian approach is conceptually simpler:
- Instead of fixing two specific expected values for $\Delta$ (one under $H_0$ and one under $H_1$),
- We treat the true conversion difference $E[\Delta]$ itself as an unknown random variable and reason about its entire probability distribution.
This lets us quantify directly how likely any value of $\Delta$ is, given both prior knowledge and the data we observe.
The typical workflow is
- Pick a prior belief, express it as a Beta distribution
- Run the experiment
- Use Bayes theorem to update the prior belief into a posterior belief
- Rinse and repeat: this is one of the strengths of the method. The posterior belief can be used as the new prior before running more experiments that will firm it up, if needed, with full mathematical rigor, unlike NHST, whose sampling framework forbids peeking or stopping early.
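The "rinse and repeat" step is exact for the Beta-Bernoulli pair: feeding the batch-1 posterior in as the batch-2 prior gives the same posterior as a single update on the pooled data. A minimal sketch with made-up counts:

```python
from scipy.stats import beta as beta_dist

a, b = 1, 1  # uninformative Beta(1,1) prior

# Sequential: batch 1 (40/60 conversions), then batch 2 (35/50),
# using the batch-1 posterior as the batch-2 prior
a1, b1 = a + 40, b + (60 - 40)
a2, b2 = a1 + 35, b1 + (50 - 35)

# All-at-once: one update with the pooled counts (75/110)
a_all, b_all = a + 75, b + (110 - 75)

assert (a2, b2) == (a_all, b_all)  # identical posteriors: Beta(76, 36)
print(beta_dist.mean(a2, b2))      # posterior mean 76/112 ≈ 0.679
```

This is why incremental analysis is "free" here: the order in which the data arrives does not matter.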
Using the Beta Distribution for Our Prior Belief¶
For experiments based on Bernoulli trials (success/failure, convert/abandon, etc.), the most convenient way to model our prior belief about a conversion rate is the Beta distribution.
⚠️ Don’t confuse this Beta with the “$\beta$” from NHST (Type II error).
Here, Beta is the name of a probability distribution.
The Beta distribution:
- Is defined on the interval $[0,1]$, making it perfect for modeling a probability.
- Has two shape parameters, $\alpha$ and $\beta$, which control how strongly it reflects our prior knowledge.
Some examples of possible priors:
- Uninformative prior: $ \mathrm{Beta}(1,1) $ — essentially a uniform distribution, expressing “we know nothing.”
- Weakly informative prior: centered roughly around 17–20% but without being very sure
Here are a few graphical examples of the Beta distribution for various values of $\alpha$ and $\beta$:
# Plot comparison of different Beta prior distributions
fig, axes = plot_beta_prior_comparison()
The formal definition of the Beta distribution looks a bit intimidating. The numerator is the same kernel $x^{\alpha-1}(1-x)^{\beta-1}$ that appears in the binomial probability of observing $m$ successes out of $n$ trials when the per-trial probability is $x$; the denominator is just the normalizing constant. Intuition: $\alpha - 1$ and $\beta - 1$ act like prior pseudo-counts of successes ($m$) and failures ($n-m$). After observing data, you add the real counts. Special case (uniform prior): $\mathrm{Beta}(1,1)$ ⇒ posterior $\mathrm{Beta}(m + 1, n - m + 1)$.
$$ f(x, \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)} $$
with B being the Beta function B defined as normalizing constant:
$$ B(\alpha, \beta) = \int_0^1 t^{\alpha-1} (1-t)^{\beta-1} dt $$
There is no easy way to work with this formula "by hand", and that is one of the reasons Bayesian approaches were historically impractical: unlike a Gaussian, it cannot be easily tabulated or manipulated on paper. With modern stats packages in Python, however, it is trivial to use.
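A quick check of that claim with scipy: the identity $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ gives the normalizing constant in closed form, and `scipy.stats.beta` evaluates the pdf for us (the values 20 and 8.2 are just sample shape parameters):

```python
from math import gamma
from scipy.special import beta as beta_fn
from scipy.stats import beta as beta_dist

a, b = 20, 8.2
# Normalizing constant via the Gamma-function identity
print(beta_fn(a, b), gamma(a) * gamma(b) / gamma(a + b))

# pdf at x = 0.7, computed by hand from the formula above
x = 0.7
manual = x**(a - 1) * (1 - x)**(b - 1) / beta_fn(a, b)
print(manual, beta_dist.pdf(x, a, b))  # the two values agree
```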
Conservative Approach: Assuming We Know Nothing (Non-Informative Prior)¶
For the variant A, suppose we start with a non-informative prior — meaning we have no knowledge about the conversion rate $p_A$.
We assume $p_A$ could be anywhere between $0$ and $1$ with equal probability.
This is modeled by the Beta distribution:
$$ \mathrm{Beta}(1,1) $$
— which is just a uniform prior (flat line) on $[0,1]$.
Posterior After Observing Data¶
After running the experiment with:
- $n_A$ = number of trials (users shown variant A),
- $x_A$ = number of successes (conversions),
the posterior distribution for $p_A$ — thanks to the conjugacy of the Beta with the Bernoulli likelihood is:
$$ \mathrm{Beta}(x_A+1,\; n_A - x_A + 1). $$
This follows directly from Bayes’ theorem and the properties of the Beta distribution.
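We can sanity-check the conjugacy claim numerically: brute-force Bayes on a grid (flat prior times binomial likelihood, renormalized) lands on exactly the same curve as the Beta posterior. The counts below are variant C's, taken from the notebook's outputs:

```python
import numpy as np
from scipy.stats import beta as beta_dist, binom

n_a, x_a = 2022, 1396  # variant C's counts from the notebook's outputs

# Brute-force Bayes on a grid: uniform prior × binomial likelihood, renormalized
p = np.linspace(0.0, 1.0, 20_001)
posterior = np.ones_like(p) * binom.pmf(x_a, n_a, p)  # Beta(1,1) prior is flat
posterior /= posterior.sum() * (p[1] - p[0])          # numerical normalization

# Conjugacy says this is exactly Beta(x_a + 1, n_a - x_a + 1)
analytic = beta_dist.pdf(p, x_a + 1, n_a - x_a + 1)
print(np.max(np.abs(posterior - analytic)))  # ~0, up to grid error
```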
Expected Value of the Posterior¶
For any $\mathrm{Beta}(\alpha,\beta)$ distribution the expected value is $\alpha/(\alpha+\beta)$; for our posterior $\mathrm{Beta}(x_A+1,\, n_A-x_A+1)$ this gives $(x_A+1)/(n_A+2)$:
expected_value_posterior = (xX_observed + 1) / (nX + 2)
print(f"Expected value of posterior distribution for p_A: {expected_value_posterior:.4f}")
Expected value of posterior distribution for p_A: 0.6902
Credible Intervals and Visualizing Prior vs. Posterior¶
A credible interval answers: “Given the data and our prior, what range of parameter values has (say) 95% posterior probability?”
Note that this is completely different from the confidence interval of NHST. Although it sounds similar, it is not: a credible interval gives us a direct probability statement about the parameter, so we can say "the true conversion rate is between a and b with x% probability." We can pick whatever x we want and obtain the corresponding interval.
So let's say we use equal-tailed credible intervals (quantiles at 2.5% and 97.5%).
- Prior (uninformative): $\mathrm{Beta}(1,1)$
- Posterior after observing $x_A$ conversions out of $n_A$:
$\alpha_A = x_A+1,\; \beta_A = n_A - x_A + 1$
We can compute these intervals and plot how the posterior updates our belief compared with the prior.
# Posterior parameters (non-informative Beta(1,1) prior)
# Note: use a dedicated name so we don't shadow the NHST significance level
alpha_param = xX_observed + 1
beta_param = nX - xX_observed + 1
# Compute 95% credible interval (2.5th and 97.5th percentiles)
p_L = beta_dist.ppf(0.025, alpha_param, beta_param)
p_U = beta_dist.ppf(0.975, alpha_param, beta_param)
# Output the result
print(f"95% Credible Interval for p: [{p_L:.4f}, {p_U:.4f}]")
95% Credible Interval for p: [0.6699, 0.7102]
So we know, with 95% probability, that the true conversion rate is between 66.99% and 71.02%. Here is a visualization:
# Visualize non-informative prior vs posterior
fig, ax = plot_prior_vs_posterior(
    alpha=alpha_param,
    beta_param=beta_param,
    control_group_conversion_rate=control_group_conversion_rate,
    epsilon=epsilon,
    p_L=p_L,
    p_U=p_U
)
plt.show()
In this first version we assumed that "we know nothing" prior to the experiment, which is not realistic. We know we added a page with an extra click, so unless there is a serious bug the conversion should be "around" the control rate (about 71% in our data), but we want to stay open to small deviations around that value. We can do this with what is known as a weakly informative prior.
Weakly Informative Prior Using the Control Group as the base for the prior¶
Instead of using a completely non-informative prior $\mathrm{Beta}(1,1)$, we can use the control group, since we can see that the variants operate in the same range. That is a sound methodology: in a web/mobile environment with frequent releases, seasonal shifts in usage patterns, marketing campaigns, etc., a historical prior is usually not a good definition of the "legacy" we want to compare the variants with. Only the control group gives us a clean comparable.
Suppose we know the control group conversion rate is in the variable control_group_conversion_rate , and we want to test for non-inferiority with a margin $\epsilon$.
For the variant, we choose a prior centered on what we expect the new mean to be (control_group_conversion_rate) but make it very "flat", meaning we are really not sure where the conversion rate will end up. This is called a weakly informative prior: we want it to have high entropy (wide uncertainty) so it does not dominate the data.
A Beta distribution with a modest $\alpha$ and $\beta$ can be informative about the center while still uncertain.
For a $\mathrm{Beta}(\alpha,\beta)$ distribution:
$$ \mu = \frac{\alpha}{\alpha+\beta} $$
- Smaller values of $\alpha$ and $\beta$ give a wider and flatter (more uncertain) prior.
Let's pick $\alpha = 20$ and solve for $\beta$ so the mean equals control_group_conversion_rate, then recompute the posterior distribution with this prior.
# Informative prior parameters
expected_degradation = 0.0  # This is sound Bayesian prior construction as long as the prior is weakly informative
target_prior_mean = control_group_conversion_rate - expected_degradation
alpha_prior = 20 # Small value for high entropy
beta_prior = (alpha_prior / target_prior_mean) - alpha_prior # Solve for beta given mean
print(f"Prior: Beta({alpha_prior:.2f}, {beta_prior:.2f})")
print(f"Prior mean: {alpha_prior / (alpha_prior + beta_prior):.4f}")
print(f"Prior variance: {(alpha_prior * beta_prior) / ((alpha_prior + beta_prior)**2 * (alpha_prior + beta_prior + 1)):.6f}")
# Posterior parameters after observing data
alpha_posterior = xX_observed + alpha_prior
beta_posterior = (nX - xX_observed) + beta_prior
print(f"\nPosterior: Beta({alpha_posterior:.2f}, {beta_posterior:.2f})")
posterior_mean = alpha_posterior / (alpha_posterior + beta_posterior)
print(f"Posterior mean: {posterior_mean:.4f}")
# Compute probability that variant is non-inferior (p_A > p_C - epsilon)
# This is P(p_A > 0.69) under the posterior
non_inferiority_threshold = control_group_conversion_rate - epsilon
prob_non_inferior = 1 - beta_dist.cdf(non_inferiority_threshold, alpha_posterior, beta_posterior)
print(f"\nProbability that variant is non-inferior: {prob_non_inferior:.4f}")
print(f"This means there's a {prob_non_inferior*100:.2f}% probability that the variant conversion rate is above {non_inferiority_threshold:.2f}")
Prior: Beta(20.00, 8.20)
Prior mean: 0.7093
Prior variance: 0.007062

Posterior: Beta(1416.00, 634.20)
Posterior mean: 0.6907

Probability that variant is non-inferior: 0.5565
This means there's a 55.65% probability that the variant conversion rate is above 0.69
So here, unlike with NHST, thanks to a reasonable prior that uses our understanding of how the passkey CX is built, we end up with something directly actionable: a posterior probability of being above the cutoff. For this variant it is 55.65%, short of a 95% bar, so this particular variant is not demonstrated non-inferior; we will see below that variant A does clear the bar.
Here is a diagram of this prior and the posterior after observing data $(x_A, n_A)$ for the variant, and various credible intervals. For non-inferiority we don't even need the credible interval (see next).
# Probability of beating the unshifted control rate (a stricter check:
# p_A > p_C - expected_degradation, with expected_degradation = 0)
strict_threshold = control_group_conversion_rate - expected_degradation
prob_above_control = 1 - beta_dist.cdf(strict_threshold,
                                       alpha_posterior, beta_posterior)
print(f"Probability that variant beats the control rate: {prob_above_control:.4f}")
print(f"This means there's a {prob_above_control*100:.2f}% probability that the variant conversion rate is above {strict_threshold:.2f}")
Probability that variant beats the control rate: 0.0330
This means there's a 3.30% probability that the variant conversion rate is above 0.71
posterior_data = test_non_inferiority_weakly_informative(
n_control=nC,
x_control=xC_observed,
variants_data=variants,
epsilon=epsilon, # Business: can tolerate x% degradation
expected_degradation=expected_degradation, # 0 because we pick a weakly informative prior centered on control rate
alpha_prior_strength=20, # Weak prior (high entropy)
threshold=0.95 # 95% probability required
)
print(f'probability that variant A is non-inferior: {posterior_data["A"]["probability"]:.4f} ')
posterior_data
probability that variant A is non-inferior: 0.9645
{'A': {'is_non_inferior': np.True_,
'probability': np.float64(0.9645213339830435),
'control_rate': 0.7092755248240205,
'variant_rate': 0.7014530973280955,
'posterior_params': (3264, 1389.1977867556648),
'prior_params': (20, 8.19778675566485),
'prior_mean': 0.7092755248240205,
'threshold': 0.6892755248240204,
'epsilon': 0.02,
'n': 4625,
'x': 3244},
'B': {'is_non_inferior': np.False_,
'probability': np.float64(0.2595423969709174),
'control_rate': 0.7092755248240205,
'variant_rate': 0.6827372949273801,
'posterior_params': (1453, 675.1977867556649),
'prior_params': (20, 8.19778675566485),
'prior_mean': 0.7092755248240205,
'threshold': 0.6892755248240204,
'epsilon': 0.02,
'n': 2100,
'x': 1433},
'C': {'is_non_inferior': np.False_,
'probability': np.float64(0.5564853661754626),
'control_rate': 0.7092755248240205,
'variant_rate': 0.6906650710226104,
'posterior_params': (1416, 634.1977867556649),
'prior_params': (20, 8.19778675566485),
'prior_mean': 0.7092755248240205,
'threshold': 0.6892755248240204,
'epsilon': 0.02,
'n': 2022,
'x': 1396}}
# Prior vs Posterior with non-inferiority tail area (P(p_A > p_C - ε))
threshold = control_group_conversion_rate - epsilon
# Create the plot using the helper function
fig, ax, prob_non_inferior_post, prob_non_inferior_prior = plot_informative_prior_posterior_comparison(
alpha_prior=alpha_prior,
beta_prior=beta_prior,
alpha_posterior=alpha_posterior,
beta_posterior=beta_posterior,
threshold=threshold
)
plt.show()
print(f"Posterior P(p_A > {threshold:.2f}) = {prob_non_inferior_post:.4f} "
f"({prob_non_inferior_post*100:.2f}%)")
print(f"Prior P(p_A > {threshold:.2f}) = {prob_non_inferior_prior:.4f} "
f"({prob_non_inferior_prior*100:.2f}%)")
Posterior P(p_A > 0.69) = 0.5565 (55.65%)
Prior P(p_A > 0.69) = 0.6129 (61.29%)
fig, ax = plot_weakly_informative_prior_with_variants(variants_results=posterior_data)
Picking the Best Variant¶
This is where the difference between NHST and a Bayesian approach becomes dramatic.
Let’s compare the main options.
🧪 NHST Approaches¶
1. Winner-Takes-All¶
- Pick the variant with the highest observed conversion rate.
Problems:
- Ignores uncertainty and sampling noise.
- Easily picks the wrong variant when samples are small.
2. Pairwise t-Tests with Bonferroni Correction¶
- Run one test for every pair (A vs B, A vs C, B vs C).
- Adjust the significance threshold to control false positives:
$$\alpha_\text{corrected} = \frac{0.05}{3} \approx 0.0167.$$
Problems:
- Multiple comparisons inflate Type I error; Bonferroni is very conservative (higher Type II error).
- Only gives “significant / not significant” — no direct probability of being best.
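To make those problems concrete, here is a sketch of the pairwise approach on the notebook's variant counts, using two-proportion z-tests (a common stand-in for t-tests on conversion data) with the Bonferroni-corrected threshold:

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm

# Variant counts (successes, trials) from the notebook's outputs
data = {'A': (3244, 4625), 'B': (1433, 2100), 'C': (1396, 2022)}
alpha_corrected = 0.05 / 3  # Bonferroni over the 3 pairwise comparisons

p_values = {}
for v1, v2 in combinations(data, 2):
    x1, n1 = data[v1]; x2, n2 = data[v2]
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                         # pooled rate under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_values[(v1, v2)] = 2 * norm.sf(abs(z))               # two-sided p-value
    print(f"{v1} vs {v2}: z = {z:+.2f}, p = {p_values[(v1, v2)]:.3f}, "
          f"significant = {p_values[(v1, v2)] < alpha_corrected}")
```

With these counts, none of the three pairs clears the corrected threshold, even though the posterior analysis later gives variant A a ~78% probability of being best: the correction buys Type I control at the cost of inconclusiveness.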
3. ANOVA + Post-Hoc Tests¶
- One-way ANOVA checks if any difference exists, then post-hoc tests (Tukey, Dunnett, etc.) try to find which.
Problems:
- Still needs multiple-comparison corrections.
- ANOVA only says “something differs” — not which is best or by how much.
4. Confidence Interval Overlap¶
- Compute 95% CIs for each variant and check for overlap.
Problems:
- Overlapping CIs don’t mean “no difference.”
- Often inconclusive and gives no probability a variant is best.
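A small numeric illustration of the trap, with made-up counts: two 95% CIs that overlap even though the difference between the groups is statistically significant at the 5% level.

```python
import numpy as np

# Hypothetical counts chosen to expose the fallacy
n1, x1 = 1000, 500   # p̂₁ = 0.50
n2, x2 = 1000, 550   # p̂₂ = 0.55
p1, p2 = x1 / n1, x2 / n2

se1 = np.sqrt(p1 * (1 - p1) / n1)
se2 = np.sqrt(p2 * (1 - p2) / n2)
ci1 = (p1 - 1.96 * se1, p1 + 1.96 * se1)
ci2 = (p2 - 1.96 * se2, p2 + 1.96 * se2)
print("CIs overlap:", ci1[1] > ci2[0])  # True

# The z-test on the difference uses sqrt(se1² + se2²), which is smaller
# than se1 + se2, so the difference can be significant despite the overlap
z = (p2 - p1) / np.sqrt(se1**2 + se2**2)
print("z =", round(z, 2))  # ≈ 2.24 > 1.96 → significant at 5%
```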
Key Takeaway¶
NHST was designed for asymmetric questions:
- "Is this drug better than placebo?" (directional)
- "Does this treatment have an effect?" (vs. no effect)
NHST struggles with symmetric questions:
- "Which variant is better, A or B?" (symmetric)
- Forces arbitrary directionality or multiple-testing corrections
- Cannot directly compute P(A > B | data)
Bayesian methods naturally handle symmetric comparisons:
- Compute posterior for each variant
- Directly calculate P(A > B), P(B > A), P(C is best), etc.
- No need for multiple-testing corrections
- Scales to any number of variants
- Provides actionable probabilities for decision-making
This is why Bayesian methods are superior for A/B testing where the goal is to pick the best variant, not just detect if one exists.
Summary: Why NHST Struggles with Symmetric A vs B Comparisons¶
All NHST approaches share fundamental limitations:
| Limitation | Impact |
|---|---|
| Computes P(data \| hypothesis) | Not what we want: P(hypothesis \| data) |
| Binary decisions | Reject/fail-to-reject; no probability of being better |
| Asymmetric framework | Must pick a direction or waste α budget |
| No direct answer | Cannot directly answer "Which is better?" |
| No expected value | Cannot compute expected value for decision-making |
What we actually want:
- P(A > B | data) — direct probability A is better
- P(B > A | data) — direct probability B is better
- Symmetric treatment of both variants
- Actionable metric for decision-making
The Bayesian approach provides exactly this. Let's demonstrate:
🌟 Bayesian Approach — Probability of Being Best¶
The Bayesian framework answers the question we actually care about:
Which variant is most likely the best?
Method:
- Compute the posterior Beta distribution for each variant using its prior and observed data.
- Draw a large number of samples (e.g., 100k) from each posterior.
- For each simulated draw, identify which variant has the highest conversion rate.
- Report the probabilities:
$$P(A \text{ is best}),\; P(B \text{ is best}),\; P(C \text{ is best}), \ldots$$
Advantages:
- Direct answer: “Variant B is best with 88.8% probability.”
- Single coherent analysis: no need for multiple-comparison corrections.
- Scales naturally: works the same way for 3, 5, 10, or 100 variants.
- Quantifies uncertainty: not just yes/no; can report $P(B>A)$, $P(B>A \;\&\; B>C)$, etc.
- Flexible: easily integrates prior knowledge and business context.
- Business-friendly: simple to factor in risk, cost, and implementation difficulty.
Key point:
Bayesian analysis gives a probability each variant is best — a direct, interpretable metric that scales cleanly and supports real-world decision making.
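Before running the full simulation, note that for just two variants no sampling is needed: conditioning on $p_A = x$ gives the exact one-dimensional integral $P(p_A > p_B) = \int_0^1 f_A(x)\,F_B(x)\,dx$. A quick sketch with the notebook's A and B counts under a Beta(1,1) prior (the grid-integration approach here is our own choice):

```python
import numpy as np
from scipy.stats import beta as beta_dist

# Posteriors under a Beta(1,1) prior (counts from the notebook's outputs)
aA, bA = 3244 + 1, 4625 - 3244 + 1   # variant A → Beta(3245, 1382)
aB, bB = 1433 + 1, 2100 - 1433 + 1   # variant B → Beta(1434, 668)

# P(p_A > p_B) = ∫ f_A(x) · F_B(x) dx: condition on p_A = x, then F_B(x)
# is the probability that p_B falls below it
x = np.linspace(0.0, 1.0, 200_001)
integrand = beta_dist.pdf(x, aA, bA) * beta_dist.cdf(x, aB, bB)
prob_a_beats_b = integrand.mean()  # Riemann average over an interval of length 1
print(f"P(A > B | data) ≈ {prob_a_beats_b:.3f}")
```

The Monte Carlo approach below generalizes this to "probability of being best" across any number of variants, where the exact integral becomes unwieldy.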
print("\nVariant conversion rates:")
for name, data in variants.items():
rate = data['x'] / data['n']
print(f" {name}: {rate:.4f} ({data['x']}/{data['n']})")
Variant conversion rates:
  A: 0.7014 (3244/4625)
  B: 0.6824 (1433/2100)
  C: 0.6904 (1396/2022)
All variants are above the non-inferiority boundary of 0.69. But which one should we choose? Simply comparing the observed means ignores the fact that the samples have different sizes (4,625 vs 2,100 vs 2,022) and therefore different uncertainty. The "clean" way is to run a Monte Carlo simulation over the posteriors to see which variant would win "in all possible universes".
# Compute posterior distributions (using non-informative prior Beta(1,1))
posteriors = {}
print("\nPosterior Distributions:")
print("-" * 80)
for name, data in variants.items():
# Posterior parameters (with non-informative prior Beta(1,1))
alpha_post = data['x'] + 1
beta_post = data['n'] - data['x'] + 1
# Posterior statistics
posterior_mean = alpha_post / (alpha_post + beta_post)
posterior_var = (alpha_post * beta_post) / \
((alpha_post + beta_post)**2 * (alpha_post + beta_post + 1))
posterior_std = np.sqrt(posterior_var)
# Credible intervals
ci_95_lower = beta_dist.ppf(0.025, alpha_post, beta_post)
ci_95_upper = beta_dist.ppf(0.975, alpha_post, beta_post)
posteriors[name] = {
'alpha': alpha_post,
'beta': beta_post,
'mean': posterior_mean,
'std': posterior_std,
'ci_95': (ci_95_lower, ci_95_upper)
}
print(f"\nVariant {name}:")
print(f" Posterior: Beta(α={alpha_post}, β={beta_post})")
print(f" Posterior mean: {posterior_mean:.4f}")
print(f" Posterior std: {posterior_std:.4f}")
print(f" 95% Credible Interval: [{ci_95_lower:.4f}, {ci_95_upper:.4f}]")
Posterior Distributions:
--------------------------------------------------------------------------------

Variant A:
  Posterior: Beta(α=3245, β=1382)
  Posterior mean: 0.7013
  Posterior std: 0.0067
  95% Credible Interval: [0.6881, 0.7144]

Variant B:
  Posterior: Beta(α=1434, β=668)
  Posterior mean: 0.6822
  Posterior std: 0.0102
  95% Credible Interval: [0.6621, 0.7019]

Variant C:
  Posterior: Beta(α=1397, β=627)
  Posterior mean: 0.6902
  Posterior std: 0.0103
  95% Credible Interval: [0.6699, 0.7102]
# Create visualization of the three posterior distributions
fig, ax = plot_multiple_posteriors_comparison(
posteriors=posteriors,
control_group_conversion_rate=control_group_conversion_rate,
epsilon=epsilon
)
plt.show()
print("\n✓ All three posterior distributions overlap significantly")
print(" This shows there's uncertainty about which is truly best")
✓ All three posterior distributions overlap significantly
  This shows there's uncertainty about which is truly best
# Run Monte Carlo simulation
n_simulations = 100000
print(f"Running {n_simulations:,} simulations...\n")
# Draw samples from each posterior
samples = {}
for name in ['A', 'B', 'C']:
alpha_p = posteriors[name]['alpha']
beta_p = posteriors[name]['beta']
samples[name] = beta_dist.rvs(alpha_p, beta_p, size=n_simulations)
# For each simulation, determine which variant is best (vectorized)
names = ['A', 'B', 'C']
stacked = np.vstack([samples[name] for name in names])  # shape: (3, n_simulations)
winner_idx = np.argmax(stacked, axis=0)                 # index of the best variant per draw
best_counts = {name: int(np.sum(winner_idx == i)) for i, name in enumerate(names)}
# Calculate probabilities
probabilities = {name: count / n_simulations for name, count in best_counts.items()}
print("RESULTS: Probability Each Variant is Best")
print("-" * 80)
for name in ['A', 'B', 'C']:
prob = probabilities[name]
bar = '█' * int(prob * 60)
print(f"P({name} is best) = {prob:.4f} ({prob*100:5.2f}%) {bar}")
# Determine the winner
winner = max(probabilities, key=probabilities.get)
winner_prob = probabilities[winner]
print("\n" + "="*80)
print("BAYESIAN CONCLUSION:")
print("="*80)
print(f"✓ Variant {winner} is most likely the best")
print(f" Probability: {winner_prob:.4f} ({winner_prob*100:.1f}%)")
print(f"\nInterpretation:")
print(f" - There's a {winner_prob*100:.1f}% chance that {winner} has the highest true conversion rate")
print(f" - This accounts for uncertainty in all three estimates")
print(f" - Clear, actionable decision with quantified confidence")
Running 100,000 simulations...
RESULTS: Probability Each Variant is Best
--------------------------------------------------------------------------------
P(A is best) = 0.7815 (78.15%) ██████████████████████████████████████████████
P(B is best) = 0.0444 ( 4.44%) ██
P(C is best) = 0.1741 (17.41%) ██████████

================================================================================
BAYESIAN CONCLUSION:
================================================================================
✓ Variant A is most likely the best
  Probability: 0.7815 (78.2%)

Interpretation:
  - There's a 78.2% chance that A has the highest true conversion rate
  - This accounts for uncertainty in all three estimates
  - Clear, actionable decision with quantified confidence