NHST Non-Inferiority Testing with Real Experiment Data¶
Executive Summary: Why NHST Fails with Small Samples¶
This notebook demonstrates Null Hypothesis Significance Testing (NHST) for non-inferiority testing using real experiment data from the passkey creation feature launch.
The ProblemΒΆ
When launching new web/mobile features:
- Limited traffic allocation: New features get only 2-5% of traffic to minimize risk
- Small sample sizes: Each variant may only see hundreds or low thousands of users
- Need for speed: We need fast decisions to iterate or scale
Real Experiment DataΒΆ
Our passkey creation experiment:
- Control group: 32,106 users, 70.9% conversion rate
- Variant A: 4,625 users, 70.2% conversion rate
- Variant B: 2,100 users, 68.2% conversion rate
- Variant C: 2,022 users, 69.0% conversion rate
NHST Results with Real Data¶
Testing Variant C for non-inferiority (margin ε = 2%):
| Metric | Value | Interpretation |
|---|---|---|
| p-value | 45.8% | >> 5% threshold → Cannot reject null |
| Power | 59.8% | Underpowered (below the 80% target) |
| Conclusion | Inconclusive | Cannot determine if variant is non-inferior |
Required Sample Sizes for 80% Power¶
- Current sample: ~2,000 per variant
- Required sample: ~6,375 per variant (3.2× more)
- Result: NHST cannot provide actionable guidance
Bottom LineΒΆ
NHST fails for early-stage product launches:
- β Requires impractically large samples (weeks of data collection)
- β Provides no actionable insights with small samples
- β Binary reject/fail-to-reject offers no guidance
- β Cannot quantify probability of being non-inferior
This notebook demonstrates the mathematical foundations of NHST and why it's unsuitable for modern product development with small, controlled traffic allocations.
Problem StatementΒΆ
When launching new web or mobile features, engineering teams face a common dilemma:
- Limited traffic allocation: At launch, new features get only 2-5% of traffic to minimize risk
- Multiple variants: Design teams often propose 3-5 different implementations
- Small sample sizes: Each variant may only see hundreds or low thousands of users
- Need for speed: We need fast decisions on which variants are best to iterate or scale
- Imperfect logistics: Bugs or misconfiguration may cause unbalanced allocation
Traditional NHST fails here: With small samples, statistical tests either:
- Fail to reach significance (underpowered, β > 0.8, meaning power < 20%)
- Require weeks of data collection
- Provide no actionable guidance
Test Setup: Control Group vs. Variants¶
For our passkey creation feature:
- Existing flow has completion rate of ~71%
- Keep most traffic on the current experience as the control group C
- Send limited traffic to variants A, B, C
Goal: Determine that each new experience is no worse than the current one.
This type of test, where the goal is to ensure a new design does not degrade the experience, is called a non-inferiority test.
# Setup
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.stats import beta as beta_dist
from plotting_utils import plot_gaussian_hypothesis_test
from plotting_utils import plot_type_ii_error_analysis
from nhst import compute_sample_size_non_inferiority
Null Hypothesis Significance Testing (NHST)¶
At a high level, the NHST workflow is:
- Assume what you don't want to see: this is the null hypothesis.
- Example in medicine: "the drug has no effect."
- Example here: "the new experience significantly increases abandonment."
- Run the experiment and compute a test statistic (proportion = successes / total attempts)
- Ask: If the null hypothesis were true, how likely is it that we would observe a result at least this extreme?
- If that probability (the p-value) is very low, e.g., below 5%, we reject the null.
Two Important Caveats¶
- Rejecting the null does not prove the opposite is true; it only says the data would be unlikely if the null were correct
- The p-value is P(data | H₀), but provides no probability of the hypothesis being correct
- Without P(H₀ | data), we cannot compute expected values for decision-making
- "Unlikely enough" (e.g., 5%) is completely arbitrary: a convention, not a law of nature
Key point: NHST computes P(data | hypothesis).
A Bayesian approach instead computes P(hypothesis | data), a fundamentally different quantity.
Modeling Conversion as Random Variables¶
The conversion of a UX flow can be modeled with Bernoulli random variables:
- $X_C$ for the control experience
- $X_A$ for a new variant $A$
A Bernoulli variable takes only two values: success/failure, convert/abandon, etc.
Each user who sees a page gives one draw from one of these variables.
We assume both have the same codomain:
$$ \mathcal{X}_C = \mathcal{X}_A = \{0,1\} $$
where 1 = convert (user finishes the intended action) and 0 = abandon.
Sample ProportionsΒΆ
NHST works with sample proportions, the average of the Bernoulli draws in each group (note the groups have different sizes $n_C$ and $n_A$):
$$ \hat{p}_C = \frac{1}{n_C}\sum_{i=1}^{n_C} X_{C_i}, \quad \hat{p}_A = \frac{1}{n_A}\sum_{i=1}^{n_A} X_{A_i} $$
Each $\hat{p}$:
- Is a random variable taking values $\{0,\tfrac{1}{n},\tfrac{2}{n},\ldots,1\}$
- Is an estimator of the true expected value $p = E[X]$
- By the law of large numbers, $\hat{p} \to p$ as $n$ grows
Because it is the mean of $n$ Bernoulli variables, $n\hat{p}$ follows a binomial distribution, so $\hat{p}$ becomes approximately Gaussian when $n$ is large.
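A quick simulation illustrates this (p and n here are round numbers chosen to mirror variant C's scale, not exact experiment values): the spread of simulated sample proportions matches $\sqrt{p(1-p)/n}$, the standard error derived in the next section.

```python
import numpy as np

rng = np.random.default_rng(42)
p, n = 0.709, 2022            # conversion rate and sample size roughly matching variant C
n_sims = 20_000               # number of simulated experiments

# Each simulated experiment yields one sample proportion hat_p
hat_p = rng.binomial(n, p, size=n_sims) / n

theoretical_se = np.sqrt(p * (1 - p) / n)   # SD predicted by the formula
empirical_se = hat_p.std()

print(f"Theoretical SE: {theoretical_se:.5f}")
print(f"Empirical SE:   {empirical_se:.5f}")
```

The two values agree to within Monte Carlo noise, confirming that the Gaussian approximation with this standard error is reasonable at our sample sizes.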
Variance and Standard Deviation of a Sample Proportion¶
For a single Bernoulli $X$:
$$
\mathrm{Var}(X) = p(1-p)
$$
For the sample proportion: $$ \mathrm{Var}\!\left(\tfrac{1}{n} \sum_{i=1}^n X_i\right) = \tfrac{1}{n^2} n p(1-p) = \tfrac{p(1-p)}{n} $$
$$ \boxed{\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}} $$
The square root of this variance is the standard error:
$$ \boxed{SE = SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}} $$
Difference in ProportionsΒΆ
For deciding "non-inferiority" we use the difference between variant and control proportions:
$$ \hat{\Delta} = \hat{p}_A - \hat{p}_C $$
This estimates the true difference:
$$ \Delta = p_A - p_C $$
HypothesesΒΆ
Null Hypothesis $H_0$ β the "bad" scenario we want to reject:
the new UX degrades conversion by at least $\epsilon$ (e.g., 2%):$$ H_0: E[\Delta] \le -\epsilon $$
Alternative Hypothesis $H_1$ β the new UX is not worse than control:
$$ H_1: E[\Delta] > -\epsilon $$
Boundary Hypothesis β used in test construction:
assume the difference is exactly at the acceptable degradation limit:$$ E[\Delta] = -\epsilon $$
Real Experiment DataΒΆ
Our actual passkey creation experiment data:
$n_C$ : number of visitors in the control group
$x_C$ : number of conversions in the control group
$n_A$ : number of visitors in the variant under test (here, variant C)
$x_A$ : number of conversions in the variant under test (here, variant C)
$\hat{\Delta}_{\mathrm{obs}}$ : observed difference in conversion proportions
$-\epsilon$ : acceptable degradation margin (e.g., -2%)
# Real experiment data from passkey creation launch
nC = 32106
xC_observed = 22772
control_group_conversion_rate = xC_observed / nC
# Three variants with actual experiment data
variants = {
    'A': {'n': 4625, 'x': 3244},
    'B': {'n': 2100, 'x': 1433},
    'C': {'n': 2022, 'x': 1396}
}
# Focus on Variant C for detailed NHST analysis
nX = variants['C']['n']
xX_observed = variants['C']['x']
# Test parameters
epsilon = 0.02 # 2% non-inferiority margin
alpha = 0.05 # 5% significance level
# Derived values
hatpC_observed = xC_observed / nC
hatpA_observed = xX_observed / nX
hatDelta_observed = hatpA_observed - hatpC_observed
print("="*80)
print("REAL EXPERIMENT DATA")
print("="*80)
print(f"\nControl group:")
print(f" Sample size: {nC:,}")
print(f" Conversions: {xC_observed:,}")
print(f" Conversion rate: {hatpC_observed:.4f} ({hatpC_observed*100:.2f}%)")
print(f"\nVariant C:")
print(f" Sample size: {nX:,}")
print(f" Conversions: {xX_observed:,}")
print(f" Conversion rate: {hatpA_observed:.4f} ({hatpA_observed*100:.2f}%)")
print(f"\nObserved difference: {hatDelta_observed:.4f} ({hatDelta_observed*100:.2f}%)")
print(f"Non-inferiority margin (ε): {epsilon:.4f} ({epsilon*100:.2f}%)")
print(f"Non-inferiority threshold: {-epsilon:.4f} ({-epsilon*100:.2f}%)")
print(f"\n{'='*80}")
================================================================================
REAL EXPERIMENT DATA
================================================================================

Control group:
  Sample size: 32,106
  Conversions: 22,772
  Conversion rate: 0.7093 (70.93%)

Variant C:
  Sample size: 2,022
  Conversions: 1,396
  Conversion rate: 0.6904 (69.04%)

Observed difference: -0.0189 (-1.89%)
Non-inferiority margin (ε): 0.0200 (2.00%)
Non-inferiority threshold: -0.0200 (-2.00%)

================================================================================
Standard Error Estimation: The Plug-In Principle Problem¶
In NHST, we must estimate the standard deviation of the estimator $\hat{\Delta}$ (the standard error, SE).
This is a key pain point:
- We do not know the true standard deviation β it depends on unknown conversion probabilities
- Frequentist methods use the plug-in principle: estimate the variance by "plugging in" sample estimates
The circularity problem:
- We want to know if the data are unusual under $H_0$
- To measure "unusual," we need the standard error assuming $H_0$
- SE depends on unknown true rates, so we plug in $\hat{p}$ (from the data!)
- We then use this data-derived SE to judge whether the data are unusual
It's like saying: "Use my one measurement to tell me how variable my measurements are, then use that to decide if my measurement is surprising."
Wald Unpooled Standard Error (for Non-Inferiority)¶
For non-inferiority (allowing a margin $-\epsilon$), we cannot assume $p_A = p_C$, so we don't pool.
We sum the individual variances (using plug-in estimates for each group):
$$ \widehat{\text{SE}} = \sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_C(1-\hat{p}_C)}{n_C}} $$
Ideally, the true $p_A$ and $p_C$ should be used, but we don't know them, so we substitute $\hat{p}_A$ and $\hat{p}_C$.
This works but can be inaccurate if sample sizes are small or rates are at extremes.
# Compute standard errors
pooled_proportion = (xC_observed + xX_observed) / (nC + nX)
wald_pooled_SE = (pooled_proportion * (1 - pooled_proportion) * (1/nC + 1/nX))**0.5
wald_unpooled_SE = ((hatpC_observed * (1 - hatpC_observed) / nC) +
                    (hatpA_observed * (1 - hatpA_observed) / nX))**0.5
print("Standard Error Estimates:")
print(f" Wald Pooled SE: {wald_pooled_SE:.4f}")
print(f" Wald Unpooled SE: {wald_unpooled_SE:.4f}")
print(f"\n  → Using Unpooled SE for non-inferiority test")
Standard Error Estimates:
  Wald Pooled SE: 0.0104
  Wald Unpooled SE: 0.0106

  → Using Unpooled SE for non-inferiority test
Computing the p-ValueΒΆ
Using the "Boundary" as the MeanΒΆ
The null hypothesis for non-inferiority is technically an inequality:
$$ H_0: E[\Delta] \le -\epsilon $$
To get a single distribution to work with, we use the boundary value as the mean:
$$ \mu = E[\Delta] = -\epsilon $$
Why?
- This is the most conservative test
- Any distribution centered lower (more in favor of $H_0$) would give an even smaller right-tail probability
- Any distribution centered higher would be outside $H_0$
Under $H_0$, we model $\hat{\Delta}$ as:
$$ \hat{\Delta} \sim N(\mu, \sigma) $$
with
$$ \mu = -\epsilon, \qquad \sigma = SE $$
The p-ValueΒΆ
The p-value is the probability (under $H_0$) of observing a result as extreme or more extreme than what we got:
$$ p\text{-value} = P_{H_0}\big[\hat{\Delta} \ge \hat{\Delta}_{\text{obs}}\big] = \int_{\hat{\Delta}_{\text{obs}}}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\,dx $$
Using the standard normal CDF $\Phi$:
$$ p\text{-value} = 1 - \Phi\!\left(\frac{\hat{\Delta}_{\text{obs}}-\mu}{\sigma}\right) $$
Critical ValueΒΆ
The critical value $c$ is the smallest observed difference that would lead to rejection at level $\alpha$:
$$ c = \mu + \sigma \,\Phi^{-1}(1 - \alpha) $$
Any observed $\hat{\Delta}_{\text{obs}} \ge c$ yields $p\text{-value} \le \alpha$ and thus rejects $H_0$.
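To see this duality concretely, here is a minimal numeric check: an observed difference sitting exactly at the critical value $c$ yields a p-value of exactly $\alpha$. The SE is hard-coded to the rounded unpooled estimate computed above.

```python
from scipy.stats import norm

epsilon, alpha = 0.02, 0.05
SE = 0.0106                              # unpooled SE, rounded from the estimate above

# Critical value from the formula c = mu + sigma * Phi^{-1}(1 - alpha)
c = -epsilon + SE * norm.ppf(1 - alpha)

# An observation exactly at c has a right-tail probability of exactly alpha
p_at_c = norm.sf(c, loc=-epsilon, scale=SE)
print(f"c = {c:.4f}, p-value at c = {p_at_c:.4f}")
```

This reproduces the critical value of about -0.0026 that the next cell computes with `norm.isf`.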
# Compute p-value and critical value
SE_H0 = wald_unpooled_SE
mu_H0 = -epsilon # mean under boundary hypothesis
sigma_H0 = SE_H0 # standard deviation
# p-value: P(Delta >= Delta_obs | H0)
p_value = norm.sf(hatDelta_observed, loc=mu_H0, scale=sigma_H0)
# Critical value for alpha = 0.05
critical_value = norm.isf(alpha, loc=mu_H0, scale=sigma_H0)
print("="*80)
print("NHST RESULTS")
print("="*80)
print(f"\np-value: {p_value:.4f} ({p_value*100:.2f}%)")
print(f"Significance level (α): {alpha:.4f} ({alpha*100:.2f}%)")
print(f"Critical value: {critical_value:.4f}")
print(f"Observed difference: {hatDelta_observed:.4f}")
if p_value <= alpha:
    print(f"\n✅ REJECT H₀: p-value ({p_value:.4f}) ≤ α ({alpha})")
    print(f"   Conclusion: Variant is non-inferior (at the {(1-alpha)*100:.0f}% confidence level)")
else:
    print(f"\n❌ FAIL TO REJECT H₀: p-value ({p_value:.4f}) > α ({alpha})")
    print(f"   Conclusion: Cannot determine if variant is non-inferior")
    print(f"   → Result is INCONCLUSIVE with current sample size")
    print(f"\n   The p-value of {p_value*100:.1f}% is much larger than the 5% threshold.")
    print(f"   This means the observed data is quite likely under H₀.")
    print(f"   NHST provides no actionable guidance in this situation.")
print(f"\n{'='*80}")
================================================================================
NHST RESULTS
================================================================================

p-value: 0.4575 (45.75%)
Significance level (α): 0.0500 (5.00%)
Critical value: -0.0026
Observed difference: -0.0189

❌ FAIL TO REJECT H₀: p-value (0.4575) > α (0.05)
   Conclusion: Cannot determine if variant is non-inferior
   → Result is INCONCLUSIVE with current sample size

   The p-value of 45.8% is much larger than the 5% threshold.
   This means the observed data is quite likely under H₀.
   NHST provides no actionable guidance in this situation.

================================================================================
# Visualize the hypothesis test
fig, ax = plot_gaussian_hypothesis_test(
    mu_H0=mu_H0,
    sigma_H0=sigma_H0,
    observed_value=hatDelta_observed,
    alpha=alpha,
    epsilon=epsilon
)
plt.show()
print(f"\n📊 The plot shows:")
print(f"   • Null distribution centered at -ε = {mu_H0:.4f}")
print(f"   • Critical value (red line) at {critical_value:.4f}")
print(f"   • Observed difference (blue line) at {hatDelta_observed:.4f}")
print(f"   • Right-tail area (p-value) = {p_value:.4f} ({p_value*100:.1f}%)")
print(f"\n   Since p-value ({p_value*100:.1f}%) >> α ({alpha*100:.0f}%), we cannot reject H₀")
print(f"   The observed difference is not far enough to the right to be convincing.")
📊 The plot shows:
   • Null distribution centered at -ε = -0.0200
   • Critical value (red line) at -0.0026
   • Observed difference (blue line) at -0.0189
   • Right-tail area (p-value) = 0.4575 (45.8%)

   Since p-value (45.8%) >> α (5%), we cannot reject H₀
   The observed difference is not far enough to the right to be convincing.
Alternative z-Score Formulation¶
Another common way to compute the p-value is to standardize the observed statistic:
$$ Z_{\mathrm{NI}} = \frac{\hat{\Delta} - E[\Delta]_{H_{\text{boundary}}}}{SE} = \frac{\hat{\Delta} - (-\epsilon)}{SE} = \frac{\hat{\Delta} + \epsilon}{SE} $$
Under $H_0$, $Z_{\mathrm{NI}}$ follows approximately a standard normal $N(0,1)$.
The p-value is the right-tail probability:
$$ p\text{-value} = P[Z \ge Z_{\mathrm{NI}}] = \int_{Z_{\mathrm{NI}}}^{+\infty} \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz $$
This gives the same p-value, just a different mathematical framing.
# z-score formulation
z_ni = (hatDelta_observed + epsilon) / SE_H0
p_zni = norm.sf(z_ni)
print(f"z-score formulation:")
print(f"  z_NI = (Δ_obs + ε) / SE = {z_ni:.4f}")
print(f"  p-value = {p_zni:.4f}")
print(f"\n  → Same result as before (as expected)")
z-score formulation:
  z_NI = (Δ_obs + ε) / SE = 0.1067
  p-value = 0.4575

  → Same result as before (as expected)
Type I Error (False Positive)¶
In this NHST setup, α represents the false positive rate:
- Type I Error: Rejecting $H_0$ when it is actually true
- In non-inferiority testing: concluding "no unacceptable degradation" when there is degradation
This conditional probability is:
$$ P(\text{Reject } H_0 \mid H_0 \text{ is true}) = \alpha $$
By setting $\alpha = 0.05$, we accept a 5% risk of incorrectly claiming non-inferiority.
Important: This is a frequentist definition:
- If we ran the experiment many times, we would incorrectly reject ~5% of the time
- It does not assign any probability to the current decision
- It says nothing about the "effect size" or how much better/worse the variant is
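This frequentist guarantee can be checked by simulation: generate many experiments whose true difference sits exactly at the boundary $-\epsilon$ and count how often the z-test rejects. A sketch, with sample sizes mirroring this experiment (round numbers, not the exact observed counts):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, epsilon = 0.05, 0.02
pC, nC, nA = 0.709, 32_106, 2_022
pA = pC - epsilon                 # boundary of H0: true degradation exactly epsilon

n_sims = 20_000
hat_pC = rng.binomial(nC, pC, n_sims) / nC
hat_pA = rng.binomial(nA, pA, n_sims) / nA

# Unpooled plug-in SE per simulated experiment, then the one-sided z-test decision
se = np.sqrt(hat_pC * (1 - hat_pC) / nC + hat_pA * (1 - hat_pA) / nA)
z = (hat_pA - hat_pC + epsilon) / se
reject_rate = np.mean(z >= norm.ppf(1 - alpha))

print(f"Empirical false positive rate: {reject_rate:.3f}")  # close to alpha = 0.05
```

The empirical rejection rate lands near 5%, as the frequentist definition of α promises, but note again that this is a long-run frequency, not a statement about any single decision.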
Type II Error (False Negative), Power, and Sample Size¶
The false negative (Type II error, β) is failing to reject $H_0$ when $H_1$ is actually true.
In non-inferiority testing:
- We fail the test even though the new UX is truly non-inferior
- This typically means we need more data to detect the effect
Choosing an Effect Size Under $H_1$¶
To compute Type II error, we must choose an expected value for $\Delta$ under $H_1$.
Common choice: the minimum effect size we care to detect, often $E[\Delta] = 0$ (no difference):
- If the variant is truly "no worse" (Δ = 0), the test should reject $H_0$ most of the time
- This is a business decision: "How small of a difference do we need to detect?"
Modeling Under $H_1$¶
If we assume the variant is truly no worse (Δ = 0), we can pool samples:
$$ SE_{H_1} = \sqrt{\hat{p}_{\mathrm{pool}} (1-\hat{p}_{\mathrm{pool}}) \left(\tfrac{1}{n_C}+\tfrac{1}{n_A}\right)} $$
We compare this alternative distribution (mean = 0, std = $SE_{H_1}$) to the critical value set by α.
Beta and Power¶
β (Type II error) = probability of failing to reject $H_0$ when $H_1$ is true
- Area of $H_1$ distribution to the left of the critical value
Power = $1-\beta$ = probability of correctly rejecting $H_0$ when variant is truly non-inferior
- "If the property we care about is really there, how often can we detect it?"
- In ML terms: recall or sensitivity
Typical target: Power = 80% (so β = 20%)
# Compute power under H1 (assuming true difference = 0)
SE_H1 = wald_pooled_SE
mu_H1 = 0 # Assume no true difference
sigma_H1 = SE_H1
# Beta = P(Delta < critical_value | H1 is true)
beta = norm.cdf(critical_value, loc=mu_H1, scale=sigma_H1)
power = 1 - beta
print("="*80)
print("POWER ANALYSIS")
print("="*80)
print(f"\nAssumption under H₁: True difference = 0 (no degradation)")
print(f"\nType II Error (β): {beta:.4f} ({beta*100:.2f}%)")
print(f"Power (1 - β): {power:.4f} ({power*100:.2f}%)")
print(f"\nInterpretation:")
if power >= 0.80:
    print(f"  ✅ Power ≥ 80%: Test is adequately powered")
else:
    print(f"  ❌ Power < 80%: Test is SEVERELY UNDERPOWERED")
    print(f"  → Only {power*100:.1f}% chance of detecting non-inferiority")
    print(f"  → {beta*100:.1f}% chance of false negative (missing a truly non-inferior variant)")
    print(f"  → Need MUCH larger sample size for reliable conclusions")
print(f"\n{'='*80}")
================================================================================
POWER ANALYSIS
================================================================================

Assumption under H₁: True difference = 0 (no degradation)

Type II Error (β): 0.4022 (40.22%)
Power (1 - β): 0.5978 (59.78%)

Interpretation:
  ❌ Power < 80%: Test is SEVERELY UNDERPOWERED
  → Only 59.8% chance of detecting non-inferiority
  → 40.2% chance of false negative (missing a truly non-inferior variant)
  → Need MUCH larger sample size for reliable conclusions

================================================================================
# Visualize Type II error analysis
fig, ax = plot_type_ii_error_analysis(
    mu_H1=mu_H1,
    sigma_H1=sigma_H1,
    critical_value=critical_value,
    hatDelta_observed=hatDelta_observed,
    epsilon=epsilon,
    beta=beta,
    power=power
)
plt.show()
print(f"\n📊 The plot shows:")
print(f"   • H₀ distribution (red) centered at -ε = {mu_H0:.4f}")
print(f"   • H₁ distribution (green) centered at 0 (no difference)")
print(f"   • Critical value at {critical_value:.4f}")
print(f"   • β (orange area) = {beta:.4f} = probability of missing a non-inferior variant")
print(f"   • Power (green area) = {power:.4f} = probability of correctly detecting non-inferiority")
print(f"\n   The two distributions overlap substantially, showing why the test is underpowered.")
📊 The plot shows:
   • H₀ distribution (red) centered at -ε = -0.0200
   • H₁ distribution (green) centered at 0 (no difference)
   • Critical value at -0.0026
   • β (orange area) = 0.4022 = probability of missing a non-inferior variant
   • Power (green area) = 0.5978 = probability of correctly detecting non-inferiority

   The two distributions overlap substantially, showing why the test is underpowered.
Required Sample Size for Target Power¶
If we want to achieve a target power (commonly 80%, so β = 0.2), we can solve for the required sample size.
The relationship:
- Larger $n$ → smaller $SE$ → distributions separate more → higher power
This is the standard sample size calculation for planning an A/B test:
- Fix α (e.g., 0.05)
- Choose the minimum effect size of interest (e.g., Δ = 0 for non-inferiority)
- Set desired power (e.g., 80%)
- Solve for $n_C$ and $n_A$ to achieve that power
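The helper `compute_sample_size_non_inferiority` used below comes from a local module whose source is not shown here. As a cross-check, the standard closed-form approximation for a one-sided non-inferiority test with true difference 0 and equal allocation can be written directly (this is an assumption about what the helper computes, not its verified source):

```python
import math
from scipy.stats import norm

p = 22772 / 32106              # control conversion rate; also assumed for the variant (Delta = 0)
epsilon, alpha, power = 0.02, 0.05, 0.80

z_alpha = norm.ppf(1 - alpha)  # one-sided critical z
z_beta = norm.ppf(power)

# Per-group n so that the H0 (mean -epsilon) and H1 (mean 0) sampling
# distributions are separated by z_alpha + z_beta standard errors
n_per_group = math.ceil((z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / epsilon ** 2)
print(f"Required per group: {n_per_group:,}")  # → 6,375
```

This closed form reproduces the 6,375-per-group figure printed by the cell below.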
# Compute required sample size for 80% power
print("="*80)
print("SAMPLE SIZE CALCULATION FOR NON-INFERIORITY TEST")
print("="*80)
# Parameters
p_control = control_group_conversion_rate
epsilon_val = epsilon
alpha_val = alpha
target_power = 0.80
print(f"\nParameters:")
print(f" Control conversion rate: {p_control:.2%}")
print(f"  Non-inferiority margin (ε): {epsilon_val:.2%}")
print(f"  Significance level (α): {alpha_val:.2%}")
print(f"  Target power: {target_power:.2%}")
print(f"  Assumed true difference under H₁: 0 (no difference)")
# Equal allocation (1:1)
result_equal = compute_sample_size_non_inferiority(
    p_control=p_control,
    epsilon=epsilon_val,
    alpha=alpha_val,
    target_power=target_power,
    h1_effect_size=0.0,
    allocation_ratio=1.0
)
print(f"\n{'='*80}")
print("EQUAL ALLOCATION (1:1 - Control:Variant)")
print(f"{'='*80}")
print(f"Required sample size per group: {result_equal['n_variant']:,}")
print(f" Control: {result_equal['n_control']:,}")
print(f" Variant: {result_equal['n_variant']:,}")
print(f" Total: {result_equal['n_total']:,}")
print(f"\nAchieved power: {result_equal['power_achieved']:.4f} ({result_equal['power_achieved']*100:.1f}%)")
print(f"\n{'='*80}")
print("COMPARISON WITH CURRENT EXPERIMENT")
print(f"{'='*80}")
print(f"\nCurrent sample sizes:")
print(f" Control: {nC:,}")
print(f" Variant C: {nX:,}")
print(f" Observed power: {power:.4f} ({power*100:.1f}%)")
print(f"\nTo achieve 80% power:")
print(f" Required: {result_equal['n_variant']:,} per group")
print(f" Current: {nX:,} per group")
increase_factor = result_equal['n_variant'] / nX
print(f" Increase needed: {increase_factor:.1f}x more samples")
print(f"\n{'='*80}")
print("💡 KEY INSIGHT: WHY NHST FAILS WITH SMALL SAMPLES")
print(f"{'='*80}")
print(f"\nWith current sample (n={nX:,}):")
print(f"  • Power is only {power*100:.1f}% (severely underpowered)")
print(f"  • p-value = {p_value:.4f} >> α = {alpha} (cannot reject H₀)")
print(f"  • Result: INCONCLUSIVE - no actionable guidance")
print(f"\nNeed n≈{result_equal['n_variant']:,} per group for reliable conclusions:")
print(f"  • That's {increase_factor:.1f}x more data")
print(f"  • Could take weeks or months to collect")
print(f"  • Impractical for rapid product iteration")
print(f"\n👉 This is why NHST is unsuitable for:")
print(f"  ❌ Early-stage feature launches with limited traffic")
print(f"  ❌ Risk-averse traffic allocation (2-5% to variants)")
print(f"  ❌ Fast decision-making in product development")
print(f"\n{'='*80}")
================================================================================
SAMPLE SIZE CALCULATION FOR NON-INFERIORITY TEST
================================================================================

Parameters:
  Control conversion rate: 70.93%
  Non-inferiority margin (ε): 2.00%
  Significance level (α): 5.00%
  Target power: 80.00%
  Assumed true difference under H₁: 0 (no difference)

================================================================================
EQUAL ALLOCATION (1:1 - Control:Variant)
================================================================================
Required sample size per group: 6,375
  Control: 6,375
  Variant: 6,375
  Total: 12,750

Achieved power: 0.8000 (80.0%)

================================================================================
COMPARISON WITH CURRENT EXPERIMENT
================================================================================

Current sample sizes:
  Control: 32,106
  Variant C: 2,022
  Observed power: 0.5978 (59.8%)

To achieve 80% power:
  Required: 6,375 per group
  Current: 2,022 per group
  Increase needed: 3.2x more samples

================================================================================
💡 KEY INSIGHT: WHY NHST FAILS WITH SMALL SAMPLES
================================================================================

With current sample (n=2,022):
  • Power is only 59.8% (severely underpowered)
  • p-value = 0.4575 >> α = 0.05 (cannot reject H₀)
  • Result: INCONCLUSIVE - no actionable guidance

Need n≈6,375 per group for reliable conclusions:
  • That's 3.2x more data
  • Could take weeks or months to collect
  • Impractical for rapid product iteration

👉 This is why NHST is unsuitable for:
  ❌ Early-stage feature launches with limited traffic
  ❌ Risk-averse traffic allocation (2-5% to variants)
  ❌ Fast decision-making in product development

================================================================================
Summary: NHST Limitations with Real Data¶
What NHST Gave Us¶
With our real experiment data (n=2,022 for Variant C):
| Metric | Value | Meaning |
|---|---|---|
| p-value | 45.8% | >> 5% threshold |
| Decision | Fail to reject H₀ | INCONCLUSIVE |
| Power | 59.8% | Severely underpowered |
| Sample size needed | ~6,375 per group | Current 2,022 insufficient |
| Actionable guidance | NONE | Cannot make decision |
What NHST Cannot Tell Us¶
❌ Probability variant is non-inferior: NHST gives P(data | H₀), not P(H₀ | data)
❌ Actionable guidance: "Cannot reject" provides no direction
❌ Quantified confidence: No probability the variant is acceptable
❌ Expected value for decisions: Cannot compute risk-adjusted value
❌ Continuous monitoring: Must wait for predetermined sample size
Why NHST Fails for Modern Product Development¶
The fundamental mismatch:
| Product Reality | NHST Requirement |
|---|---|
| Small samples (2-5% traffic) | Large samples (many multiples more) |
| Fast decisions (days) | Long wait (weeks/months) |
| Multiple variants (3-5) | Complex corrections needed |
| Unbalanced allocation | Loses efficiency |
| Continuous monitoring | Forbidden (p-hacking) |
| Actionable probabilities | Binary reject/fail |
The Core ProblemΒΆ
NHST was designed for:
- Large, controlled experiments (clinical trials with thousands of patients)
- Fixed sample sizes (planned in advance, no peeking)
- Single primary comparison (treatment vs. placebo)
- Asymmetric questions ("Is drug better than nothing?")
Modern product development needs:
- Small, iterative experiments (limited traffic to minimize risk)
- Flexible monitoring (check anytime, stop early if clear)
- Multiple comparisons (3-5 variants simultaneously)
- Symmetric questions ("Which variant is best?")
What We Actually NeedΒΆ
For the question "Is Variant C non-inferior?" we want:
β P(variant is non-inferior | data) β direct probability
β Works with small samples β uses prior knowledge
β Actionable output β quantified confidence for decision-making
β Expected value computation β risk-adjusted decisions
β Continuous monitoring β check anytime without penalties
β Bayesian methods provide exactly this.
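As a preview of that Bayesian alternative, here is a sketch using this experiment's counts with uniform Beta(1, 1) priors. The uniform prior is an illustrative choice (the Bayesian treatment proper would justify the prior), and the non-inferiority probability is estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(7)
epsilon = 0.02

# Beta-Binomial posterior for each conversion rate with a uniform Beta(1, 1) prior:
# Beta(1 + conversions, 1 + failures)
post_C = rng.beta(1 + 22772, 1 + 32106 - 22772, size=100_000)
post_A = rng.beta(1 + 1396, 1 + 2022 - 1396, size=100_000)

# Direct, decision-ready probability of non-inferiority
prob = np.mean(post_A - post_C > -epsilon)
print(f"P(variant C is non-inferior | data) ≈ {prob:.3f}")
```

Even with the same small sample, this yields an actual probability (roughly a coin flip here) that can feed an expected-value calculation, rather than a binary "fail to reject".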
ConclusionΒΆ
With our real experiment data:
- NHST conclusion: "Cannot determine if variant is non-inferior. p-value is 45%, far too high. Need much more data. Come back in a few weeks."
- Business impact: Product team blocked, cannot iterate, cannot scale successful features
- Root cause: NHST's mathematical framework requires large samples to overcome uncertainty
The math in this notebook is correct: NHST faithfully implements its framework.
The framework itself is the problem: it's mismatched to modern product development constraints.
This is why Bayesian methods, which incorporate prior knowledge and provide direct probabilistic answers, are superior for A/B testing in web/mobile applications.