Show the code
library(tibble)
library(ggplot2)
library(dplyr)
library(tidyr)
library(latex2exp)
library(scales)
library(knitr)
Over the past few years working in advertising measurement, I’ve seen that power analysis is one of the most poorly understood topics in testing and measurement. Sometimes it’s misunderstood, and sometimes it isn’t used at all despite its foundational role in test design. This article and the series that follows are my attempt to remedy that.
In this installment, I’ll cover:
- What is statistical power?
- How do we compute it?
- What can influence power?
Power analysis is a statistical topic and, as a consequence, there will be math and statistics (crazy, right?), but I’ll try to tie the technical details back to real-world problems or basic intuition whenever possible.
Without further ado, let’s get to it.
Error types in testing: Type I vs. Type II
In testing, there are two types of error:
- Type I:
- Technical Definition: We erroneously reject the null hypothesis when the null hypothesis is true
- Layman’s Definition: We say there was an effect when there really wasn’t
- Example: A/B testing a new creative and concluding that it performs better than the old design when in reality, both designs perform the same
- Type II:
- Technical Definition: We fail to reject the null hypothesis when the null hypothesis is false
- Layman’s Definition: We say there was no effect when there really was
- Example: A/B testing a new creative and concluding that it performs the same as the old design when in reality, the new design performs better
What is statistical power?
Most people are familiar with Type I error. It’s the error we control by setting a significance level. Power relates to Type II error. More specifically, power is the probability of correctly rejecting the null hypothesis when it is false. It is the complement of the Type II error rate (i.e., 1 − Type II error rate). In other words, power is the probability of detecting a true effect if one exists. It should be clear why this matters:
- Underpowered tests are likely to miss true effects, leading to missed opportunities for improvement
- Underpowered tests can lead to false confidence in the results, as we may conclude that there is no effect when there actually is one
- … and most simply, underpowered tests waste money and resources
The role of α and β
If both are important, why are Type II error and power so misunderstood and ignored while Type I is always considered? It’s because we can easily choose our Type I error rate. In fact, that’s exactly what we’re doing when we set the significance level α (typically α = 0.05) for our tests. We’re stating that we’re comfortable with a certain percentage of Type I error. During test setup, we make a statement, “we’re comfortable with an X% false positive rate,” and then set α = X%. After the test, if our p-value falls below α, we reject the null hypothesis (i.e., “the results are significant”), and if the p-value falls above α, we fail to reject the null hypothesis (i.e., “the results aren’t significant”).
Determining Type II error, β (typically β = 0.20), and thus power, is not as straightforward. It requires us to make assumptions and perform analysis, known as “power analysis.” To understand the process, it’s best to first walk through the process of testing and then backtrack to figure out how power can be computed and influenced. Let’s use a simple A/B creative test as an example.
| Concept | Symbol | Typical Value(s) | Technical Definition | Plain-Language Definition |
|---|---|---|---|---|
| Type I Error | α | 0.05 (5%) | Probability of rejecting the null hypothesis when the null is actually true | Saying there is an effect when in reality there is no difference |
| Type II Error | β | 0.20 (20%) | Probability of failing to reject the null hypothesis when the null is actually false | Saying there is no effect when in reality there is one |
| Power | 1 − β | 0.80 (80%) | Probability of correctly rejecting the null hypothesis when the alternative is true | The chance we detect a true effect if there is one |
Computing power: step-by-step
A couple of notes before we get started:
- I made a few assumptions and approximations to simplify the example. If you can spot them, great. If not, don’t worry about it. The goal is to understand the concepts and process, not the nitty-gritty details.
- I refer to the decision threshold in the z-score domain as the critical value. Critical value typically refers to the threshold in the original domain (e.g., conversion rates), but I’ll use the terms interchangeably so I don’t have to introduce a new one.
- There are code snippets throughout tied to the text and concepts. If you copy the code yourself, you can play around with the parameters to see how things change. Some of the code snippets are hidden to keep the article readable. Click “Show the code” to see the code.
- Try this: Edit the sample size in the test setup so that the test statistic is just below the critical value and then run the power analysis. Are the results what you expected?
Test setup and the test statistic
As stated above, it’s best to walk through the testing process first and then backtrack to figure out how power can be computed. Let’s do just that.
# Set parameters for the A/B test
N_a <- 1000   # Sample size for creative A
N_b <- 1000   # Sample size for creative B
alpha <- 0.05 # Significance level
# Function to compute the critical z-value (one-tailed by default)
critical_z <- function(alpha, two_sided = FALSE) {
  if (two_sided) qnorm(1 - alpha/2) else qnorm(1 - alpha)
}
Our test setup:
- Null hypothesis: The conversion rate of A equals the conversion rate of B.
- Alternative hypothesis: The conversion rate of B is greater than the conversion rate of A.
- Sample size:
- Na = 1,000 — Number of people who receive creative A
- Nb = 1,000 — Number of people who receive creative B
- Significance level: α = 0.05
- Critical value: The critical value is the z-score that corresponds to the significance level α. We call this Z1−α. For a one-tailed test with α = 0.05, this is roughly 1.64 (see the quick check after this list).
- Test type: Two-proportion z-test
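As a quick check, here is the critical value returned by the `critical_z()` helper defined above; `qnorm(1 - 0.05)` gives the same number directly.
critical_z(alpha) # one-tailed critical value at alpha = 0.05
#> [1] 1.644854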
x_a <- 100 # Number of conversions for creative A
x_b <- 150 # Number of conversions for creative B
p_a <- x_a / N_a # Conversion rate for creative A
p_b <- x_b / N_b # Conversion rate for creative B
Our results:
- xa = 100 — Number of conversions from creative A
- xb = 150 — Number of conversions from creative B
- pa = xa / Na = 0.10 — Conversion rate of creative A
- pb = xb / Nb = 0.15 — Conversion rate of creative B
Under the null hypothesis, the difference in conversion rates is approximately normally distributed with:
- Mean: μ = 0 (no difference in conversion rates)
- Standard deviation:
σ = √[ pa(1 − pa)/Na + pb(1 − pb)/Nb ] ≈ 0.015
z_score <- function(p_a, p_b, N_a, N_b) {
  (p_b - p_a) / sqrt((p_a * (1 - p_a) / N_a) + (p_b * (1 - p_b) / N_b))
}
From these values, we can compute the test statistic:
\[
z = \frac{p_b - p_a}{\sqrt{\frac{p_a (1 - p_a)}{N_a} + \frac{p_b (1 - p_b)}{N_b}}} \approx 3.39
\]
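A quick numeric check of that value using the `z_score()` helper defined above:
z_score(p_a, p_b, N_a, N_b) # test statistic; roughly 3.39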
If our test statistic, z, is greater than the critical value, we reject the null hypothesis and conclude that Creative B performs better than Creative A. If z is less than or equal to the critical value, we fail to reject the null hypothesis and conclude that there is no significant difference between the two creatives.
In other words, if our results are unlikely to be observed when the conversion rates of A and B are truly the same, we reject the null hypothesis and state that Creative B performs better than Creative A. Otherwise, we fail to reject the null hypothesis and state that there is no significant difference between the two creatives.
Given our test results, we reject the null hypothesis and conclude that Creative B performs better than Creative A.
z <- z_score(p_a, p_b, N_a, N_b)
critical_value <- critical_z(alpha)
if (z > critical_value) {
  result <- "Reject null hypothesis: Creative B performs better than Creative A"
} else {
  result <- "Fail to reject null hypothesis: No significant difference between creatives"
}
result
#> [1] "Reject null hypothesis: Creative B performs better than Creative A"
The intuition behind power
Now that we have walked through the testing process, where does power come into play? In the process above, we record sample conversion rates, pa and pb, and then compute the test statistic, z. However, if we repeated the test many times, we would get different sample conversion rates and different test statistics, all centering around the true conversion rates of the creatives.
Assume the true conversion rate of Creative B is higher than that of Creative A. Some of these tests will still fail to reject the null hypothesis due to natural variance. Power is the percentage of these tests that reject the null hypothesis. This is the underlying mechanism behind all power analysis and hints at the missing ingredient: the true conversion rates—or more generally, the true effect size.
Intuitively, if the true effect size is higher, our measured effect would typically be higher and we would reject the null hypothesis more often, increasing power.
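To make that concrete, here is a minimal simulation sketch (my own illustration of the idea above): assume the true conversion rates are 10% and 15%, simulate the experiment many times, and count how often the null is rejected. The rejection share is the empirical power.
set.seed(42)
n_sims <- 10000
r_a_true <- 0.10 # assumed true conversion rate for creative A
r_b_true <- 0.15 # assumed true conversion rate for creative B
rejections <- replicate(n_sims, {
  p_a_hat <- rbinom(1, N_a, r_a_true) / N_a # simulated observed rate for A
  p_b_hat <- rbinom(1, N_b, r_b_true) / N_b # simulated observed rate for B
  z_score(p_a_hat, p_b_hat, N_a, N_b) > critical_z(alpha)
})
mean(rejections) # empirical power; should land near the analytical ~96% computed later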
Choosing the true effect size
If we need true conversion rates to compute power, how do we get them? If we had them, we wouldn’t need to perform testing. Therefore, we need to make an assumption. Broadly, there are two approaches:
- Choose the meaningful effect size: In this approach, we assign the true effect size (or true difference in conversion rates) to a level that would be meaningful. If Creative B only increased conversion rates by 0.01%, would we actually care and take action on those results? Probably not. So why would we care about being able to detect that small of an effect? On the other hand, if Creative B increased conversion rates by 50%, we certainly would care. In practice, the meaningful effect size likely falls between these two points.
- Note: This is often referred to as the minimal detectable effect. However, the minimal detectable effect of the study and the minimal detectable effect that we care about (for example, we may only care about 5% or greater effects, but the study is designed to detect 1% or greater effects) may differ. For that reason, I prefer to use the term meaningful effect when referring to this strategy.
- Use prior studies: If we have data from prior studies or models that measure the efficiency of this creative or similar creatives, we can use those values to assign the true effect size.
Both of the above approaches are valid.
If you only care to see meaningful effects and don’t mind if you miss out on detecting smaller effects, go with the first option. If you must see “statistical significance”, go with the second option and be conservative with the values you use (more on that in another article).
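For example, a minimal sketch of the first approach (the numbers here are hypothetical, purely for illustration): start from a baseline rate and the smallest relative lift you would actually act on, and translate that into the assumed true rates.
baseline_rate <- 0.10   # hypothetical baseline conversion rate
meaningful_lift <- 0.20 # hypothetical: we only act on a 20%+ relative improvement
r_a_assumed <- baseline_rate
r_b_assumed <- baseline_rate * (1 + meaningful_lift) # 0.12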
Technical Note
Because we don’t have true conversion rates, we are technically assigning a specific expected distribution to the alternative hypothesis and then computing power based on that. The true mean in the following passages is technically the expected mean under the alternative hypothesis. I will use the term true to keep the language simple and concise.
Computing and visualizing power
Now that we have the missing ingredients, true conversion rates, we can compute power. Instead of the measured pa and pb, we now have true conversion rates ra and rb.
We measure power as:
\[
1 - \beta = 1 - P(z < Z_{1-\alpha} \mid N_a, N_b, r_a, r_b)
\]
This may be confusing at first glance, so let’s break it down.
We are stating that power (1 − β) is computed by subtracting the Type II error rate from one. The Type II error rate is the likelihood that a test results in a z-score below our significance threshold, given our sample size and true conversion rates ra and rb. How do we compute that last part?
In a two-proportion z-score test, we know that:
- Mean: μ = rb − ra
- Standard deviation: σ = √[ ra(1 − ra)/Na + rb(1 − rb)/Nb ]
Now we need to compute:
\[
P(X > Z_{1-\alpha}), \quad X \sim N\!\left(\frac{\mu}{\sigma},\, 1\right)
\]
This is the area under the above distribution that lies to the right of Z1−α and is equivalent to computing:
\[
P\!\left(X < \frac{\mu}{\sigma} - Z_{1-\alpha}\right), \quad X \sim N(0, 1)
\]
If we had a textbook with a z-score table, we could simply look up the cumulative probability associated with
(μ / σ − Z1−α), and that would give us the power.
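In R, that lookup is just a call to `pnorm()`. Here is a minimal check of the expression above, using the observed rates as the assumed true rates (the same substitution the plotting code below makes):
mu_alt <- p_b - p_a
sigma_alt <- sqrt(p_a * (1 - p_a) / N_a + p_b * (1 - p_b) / N_b)
pnorm(mu_alt / sigma_alt - qnorm(1 - alpha)) # roughly 0.96, matching the plot below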
Let’s show this visually:
Show the code
r_a <- p_a # true baseline conversion rate; we are reusing the measured value
r_b <- p_b # true treatment conversion rate; we are reusing the measured value
alpha <- 0.05
two_sided <- FALSE # set TRUE for two-sided test
mu_diff <- function(r_a, r_b) r_b - r_a
sigma_diff <- function(r_a, r_b, N_a, N_b) {
sqrt(r_a*(1 - r_a)/N_a + r_b*(1 - r_b)/N_b)
}
power_value <- function(r_a, r_b, N_a, N_b, alpha, two_sided = FALSE) {
mu <- mu_diff(r_a, r_b)
sd1 <- sigma_diff(r_a, r_b, N_a, N_b)
zc <- critical_z(alpha, two_sided)
thr <- zc * sigma_diff(r_a, r_b, N_a, N_b)
if (!two_sided) {
1 - pnorm(thr, mean = mu, sd = sd1)
} else {
pnorm(-thr, mean = mu, sd = sd1) + (1 - pnorm(thr, mean = mu, sd = sd1))
}
}
# Build plot data
mu <- mu_diff(r_a, r_b)
sd1 <- sigma_diff(r_a, r_b, N_a, N_b)
zc <- critical_z(alpha, two_sided)
thr <- zc * sigma_diff(r_a, r_b, N_a, N_b)
# x-range covering both curves and thresholds
x_min <- min(-4*sd1, mu - 4*sd1, -thr) - 0.1*sd1
x_max <- max( 4*sd1, mu + 4*sd1, thr) + 0.1*sd1
xx <- seq(x_min, x_max, length.out = 2000)
df <- tibble(
x = xx,
H0 = dnorm(xx, mean = 0, sd = sd1), # distribution used by test threshold
H1 = dnorm(xx, mean = mu, sd = sd1) # true (alternative) distribution
)
# Regions to shade for power
if (!two_sided) {
shade <- df %>% filter(x >= thr)
} else {
shade <- bind_rows(
df %>% filter(x >= thr),
df %>% filter(x <= -thr)
)
}
# Numeric power for subtitle
pow <- power_value(r_a, r_b, N_a, N_b, alpha, two_sided)
# Plot
ggplot(df, aes(x = x)) +
# H1 shaded power region
geom_area(
data = shade, aes(y = H1), alpha = 0.25
) +
# Curves
geom_line(aes(y = H0), linewidth = 1) +
geom_line(aes(y = H1), linewidth = 1, linetype = "dashed") +
# Critical line(s)
geom_vline(xintercept = thr, linetype = "dotted", linewidth = 0.8) +
{ if (two_sided) geom_vline(xintercept = -thr, linetype = "dotted", linewidth = 0.8) } +
# Mean markers
geom_vline(xintercept = 0, alpha = 0.3) +
geom_vline(xintercept = mu, alpha = 0.3, linetype = "dashed") +
# Labels
labs(
title = "Power as shaded area under H1 beyond critical threshold",
subtitle = TeX(sprintf(r"($1 - beta$ = %.1f%% | $mu$ = %.4f, $sigma$ = %.4f, $z^*$ = %.3f, threshold = %.4f)",
100*pow, mu, sd1, zc, thr)),
x = TeX(r"(Difference in conversion rates ($D = p_b - p_a$))"),
y = "Density"
) +
annotate("text", x = mu, y = max(df$H1)*0.95, label = TeX(r"(H1: $N(mu, sigma^2)$)"), hjust = -0.05) +
annotate("text", x = 0, y = max(df$H0)*0.95, label = TeX(r"(H0: $N(0, sigma^2)$)"), hjust = 1.05) +
theme_minimal(base_size = 13)
In the plot above, power is the area under the alternative distribution (H1) (where we assume the alternative is distributed according to our true conversion rates) that is beyond the critical threshold (i.e., the area where we reject the null hypothesis). With the parameters we set, the power is 0.96. This means that if we repeated this test many times with the same parameters, we would expect to reject the null hypothesis approximately 96% of the time.
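The shaded area matches what the `power_value()` helper from the code above returns directly:
power_value(r_a, r_b, N_a, N_b, alpha) # roughly 0.96 with these parameters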
Power curves
Now that we have intuition and math behind power, we can explore how power changes based on different parameters. The plots generated from such analysis are called power curves.
Note
Throughout the plots, you’ll notice that 80% power is highlighted. This is a common target for power in testing, as it balances the risk of Type II error with the cost of increasing sample size or adjusting other parameters. You’ll see this value highlighted in many software packages as a consequence.
Relationship with effect size
Earlier, I stated that the larger the effect size, the higher the power. Intuitively, this makes sense. We are essentially shifting the right bell curve in the plot above further to the right, so the area beyond the critical threshold increases. Let’s test that theory.
Show the code
# Function to compute power for varying effect sizes
power_curve <- function(effect_sizes, N_a, N_b, alpha, two_sided = FALSE) {
sapply(effect_sizes, function(e) {
r_a <- p_a
r_b <- p_a + e # Adjust r_b based on effect size
power_value(r_a, r_b, N_a, N_b, alpha, two_sided)
})
}
# Generate effect sizes
effect_sizes <- seq(0, 0.1, length.out = 100) # Effect sizes from 0 to 10%
# Compute power for each effect size
power_values <- power_curve(effect_sizes, N_a, N_b, alpha)
# Create a data frame for plotting
power_df <- tibble(
effect_size = effect_sizes,
power = power_values
)
# Plot the power curve
ggplot(power_df, aes(x = effect_size, y = power)) +
geom_line(color = "blue", size = 1) +
geom_hline(yintercept = 0.80, linetype = "dashed", alpha = 0.6) + # target power guide
labs(
title = "Power vs. Effect Size",
x = TeX(r"(Effect Size ($r_b - r_a$))"),
y = TeX(r'(Power ($1 - \beta$))')
) +
scale_x_continuous(labels = scales::percent_format(accuracy = 0.01)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(NA,1)) +
theme_minimal(base_size = 13)

Theory confirmed: as the effect size increases, power increases, approaching 100% as the decision threshold falls further and further into the lower tail of the alternative distribution.
Relationship with sample size
Unfortunately, we cannot control effect size. It is either the meaningful effect size you wish to detect or based on prior studies. It is what it is. What we can control is sample size. The larger the sample size, the smaller the standard deviation of the distribution and the larger the area under the curve beyond the critical threshold (imagine squeezing the sides to compress the bell curves in the plot earlier). In other words, larger sample sizes should lead to higher power. Let’s test this theory as well.
Show the code
power_sample_size <- function(N_a, N_b, r_a, r_b, alpha, two_sided = FALSE) {
power_value(r_a, r_b, N_a, N_b, alpha, two_sided)
}
# Generate sample sizes
sample_sizes <- seq(100, 5000, by = 100) # Sample sizes from 100 to 5000
# Compute power for each sample size
power_values_sample <- sapply(sample_sizes, function(N) {
power_sample_size(N, N, r_a, r_b, alpha)
})
# Create a data frame for plotting
power_sample_df <- tibble(
sample_size = sample_sizes,
power = power_values_sample
)
# Plot the power curve for varying sample sizes
ggplot(power_sample_df, aes(x = sample_size, y = power)) +
geom_line(color = "blue", size = 1) +
geom_hline(yintercept = 0.80, linetype = "dashed", alpha = 0.6) + # target power guide
labs(
title = "Power vs. Sample Size",
x = TeX(r"(Sample Size ($N$))"),
y = TeX(r"(Power (1 - $beta$))")
) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(NA,1)) +
theme_minimal(base_size = 13)

We again see the expected relationship: as sample size increases, power increases.
Note
In this specific setup, we can increase power by increasing sample size. More generally, this is an increase in precision. In other test setups, precision—and thus power—can be increased through other means. For example, in Geo-testing, we can increase precision by selecting predictable markets or through the inclusion of exogenous features (more on this in a future article).
Relationship with significance level
Does the significance level α influence power? Intuitively, if we are more willing to accept Type I error, we are more likely to reject the null hypothesis and thus (1 − β) should be higher. Let’s test this theory.
Show the code
power_of_alpha <- function(alpha_vec, r_a, r_b, N_a, N_b, two_sided = FALSE) {
sapply(alpha_vec, function(a)
power_value(r_a, r_b, N_a, N_b, a, two_sided)
)
}
alpha_grid <- seq(0.001, 0.20, length.out = 400)
power_grid <- power_of_alpha(alpha_grid, r_a, r_b, N_a, N_b, two_sided)
# Current point
power_now <- power_value(r_a, r_b, N_a, N_b, alpha, two_sided)
df_alpha_power <- tibble(alpha = alpha_grid, power = power_grid)
ggplot(df_alpha_power, aes(x = alpha, y = power)) +
geom_line(color = "blue", size = 1) +
geom_hline(yintercept = 0.80, linetype = "dashed", alpha = 0.6) + # target power guide
geom_vline(xintercept = alpha, linetype = "dashed", alpha = 0.6) + # your alpha
scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(NA,1)) +
labs(
title = TeX(r"(Power vs. Significance Level)"),
subtitle = TeX(sprintf(r"(At $alpha$ = %.1f%%, $1 - beta$ = %.1f%%)",
100*alpha, 100*power_now)),
x = TeX(r"(Significance Level ($alpha$))"),
y = TeX(r"(Power (1 - $beta$))")
) +
theme_minimal(base_size = 13)

Yet again, the results match our intuition. There is no free lunch in statistics. All else equal, if we want to decrease our Type II error rate (β), we must be willing to accept a higher Type I error rate (α).
Power analysis
So what is power analysis? Power analysis is the process of computing power given the parameters of the test. In power analysis, we fix parameters we cannot control and then optimize the parameters we can control to achieve a desired power level. For example, we can fix the true effect size and then compute the sample size needed to achieve a desired power level. Power curves are often used to assist with this decision-making process. Later in the series, I will walk through power analysis in detail with a real-world example.
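As a preview, here is a minimal sketch of that sample-size question using the functions defined earlier; the assumed true rates of 10% and 15% and the 80% power target are simply the running example's values. Base R's power.prop.test() is included as a cross-check; it uses a slightly different approximation, so the two answers won't match exactly.
target_power <- 0.80
# Solve for the per-group sample size N that reaches the target power
n_required <- uniroot(
  function(N) power_value(0.10, 0.15, N, N, alpha = 0.05) - target_power,
  interval = c(10, 100000)
)$root
ceiling(n_required) # per-group sample size needed for ~80% power
# Cross-check with base R (different approximation, similar answer)
power.prop.test(p1 = 0.10, p2 = 0.15, sig.level = 0.05,
                power = target_power, alternative = "one.sided")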
Sources
[1] R. Larsen and M. Marx, An Introduction to Mathematical Statistics and Its Applications
What’s next in the Series?
I haven’t fully decided but I definitely want to cover the following topics:
- Power analysis in Geo Testing
- Detailed guide on setting the true effect size in various contexts
- Real world end-to-end examples
Happy to hear ideas. Feel free to reach out. My contact info is below:
