under uncertainty. Not just once, but in a sequence over time. We rely on our past experiences and expectations of the future to make the most informed and optimal decisions possible.
Consider a business that offers multiple products. These products are procured at a cost and sold for a profit. However, unsold inventory may incur a restocking fee, may carry salvage value, or in some cases, must be scrapped entirely.
Businesses therefore face a crucial question: how much to stock? This decision must often be made before demand is fully known; that is, under censored demand. If the business overstocks, it observes the full demand, since all customer requests are fulfilled. But if it understocks, it only sees that demand exceeded supply, and the exact demand remains unknown, making it a censored observation.
This type of problem is often called a Newsvendor Model. In fields such as operations research and applied mathematics, the optimal stocking decision has been studied by framing it as a classic newspaper stocking problem; hence the name.
In this article, we explore a Sequential Decision-Making framework for the stocking problem under uncertainty and develop a dynamic optimization algorithm using Bayesian learning.
Our approach closely follows the framework laid out by Warren B. Powell in Reinforcement Learning and Stochastic Optimization (2019) and implements the paper by Negoescu, Powell, and Frazier (2011), Optimal Learning Policies for the Newsvendor Problem with Censored Demand and Unobservable Lost Sales, published in Operations Research.
Problem Setup
Following a setup similar to that of Negoescu et al., we frame the problem as optimizing the inventory level for a single item over a sequence of time steps. The cost and selling price are considered fixed. Unsold inventory is discarded with no salvage value, while each unit sold generates revenue. Demand is unknown, and when the available stock is less than actual demand, the demand observation is considered censored.
Demand \( W \) in each period is drawn from an exponential distribution with an unknown rate parameter, for simulation purposes.
\[
\begin{aligned}
x &\in \mathbb{R}_+ &&: \text{Order quantity (decision variable)} \\
W &\sim \mathrm{Exponential}(\lambda) &&: \text{Random demand with unknown rate parameter } \lambda \\
\lambda &&&: \text{Demand rate (unknown, to be estimated)} \\
c &&&: \text{Unit cost to procure or produce the item} \\
p &&&: \text{Unit selling price (assume } p > c \text{ for profitability)}
\end{aligned}
\]
The parameter \( \lambda \) in the exponential distribution represents the rate of demand; that is, how quickly demand events occur. The average demand is given by \( \mathbb{E}[W] = \frac{1}{\lambda} \).

We can observe from the Probability Density Function (PDF) of the Exponential distribution that higher values of demand \( W \) become less likely. Thus, the Exponential distribution serves as an appropriate choice for demand modeling.
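As a quick illustration (our own addition; the rate \( \lambda = 0.25 \) below is an arbitrary choice), we can sample demands and confirm that the sample mean approaches \( 1/\lambda \) and that large demands are rare:

import numpy as np

rng = np.random.default_rng(0)
lam = 0.25                                   # Illustrative demand rate (assumption)
W = rng.exponential(1 / lam, size=100_000)   # NumPy parameterizes by the mean 1/lambda
print(W.mean())                              # ~4.0, matching E[W] = 1/lambda
print((W > 8).mean())                        # P(W > 8) = e^{-2} ~ 0.135: large demands are rare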
Sequential Decision Formulation
We formulate the inventory control problem as a sequential decision process under uncertainty. The goal is to maximize total expected profit over a finite time horizon \( N \), while learning the unknown demand rate by applying Bayesian learning principles.
We define a model with an initial state and a probabilistic model that represents its belief about future states over time. At each time step, the model makes a decision based on a policy that maps its current belief to an action. The goal is to find the optimal policy that maximizes a predefined reward function.
After taking an action, the model observes the resulting state and updates its belief accordingly, continuing this cycle of decision, observation, and belief update.

1) State Variable
We model demand in each period as a random variable drawn from an Exponential distribution with an unknown rate parameter \( \lambda \). Since \( \lambda \) is not directly observable, we encode our uncertainty about its value using a Gamma prior:
\[
\lambda \sim \mathrm{Gamma}(a_0, b_0)
\]
The parameters \( a_0 \) and \( b_0 \) define the shape and rate of our initial belief about the demand rate. These two parameters serve as our state variables. At each time step, they summarize all past information and are updated as new demand observations become available.
As we gather more data, the posterior distribution over \( \lambda \) evolves from a wide and uncertain shape to a narrower and more confident one, gradually concentrating around the true demand rate.
This process is captured naturally by the Gamma distribution, which flexibly adjusts its shape based on the amount of data we have seen. Early on, the distribution is diffuse, signaling high uncertainty. As observations accumulate, the belief becomes sharper, allowing for more reliable and responsive decision-making. The Probability Density Function (PDF) of the Gamma distribution can be seen below:
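To make this narrowing concrete in code, a short sketch (our own, assuming scipy is available; the parameter values are arbitrary) compares a diffuse early belief with a sharper later one, both centered on the same mean \( a/b = 0.25 \):

from scipy.stats import gamma

# scipy parameterizes the Gamma by shape a and scale = 1/rate, so scale = 1/b
for a, b, label in [(2, 8, "early, diffuse"), (50, 200, "late, concentrated")]:
    belief = gamma(a, scale=1 / b)   # belief over lambda, with mean a/b = 0.25
    print(label, belief.mean(), belief.std())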

We will later define a transition function that updates the state, that is, how \( (a_n, b_n) \) evolves to \( (a_{n+1}, b_{n+1}) \), based on newly observed data. This allows the model to continuously refine its belief about demand and make more informed inventory decisions over time.
Note that the expected value of the Gamma distribution is defined as:
\[
\mathbb{E}[\lambda] = \frac{a}{b}
\]
2) Decision Variable
The decision variable at time \( n \) is the stocking level:
\[
x_n \in \mathbb{R}_+
\]
This is the number of units to order before demand \( W_{n+1} \) is realized. The decision depends only on the current belief \( (a_n, b_n) \).
3) Exogenous Information
After selecting \( x_n \), demand \( W_{n+1} \) is revealed:
\[
W_{n+1} \sim \mathrm{Exp}(\lambda)
\]
Since \( \lambda \) is unknown, demand is random. Observations are:
- Uncensored if \( W_{n+1} < x_n \) (we observe the exact demand)
- Censored if \( W_{n+1} \ge x_n \) (we only know that demand exceeded the supply level)
This censoring limits the information available for belief updating. Though the full demand is not observed, the censored observation still carries valuable information and should not be ignored in our modeling approach.
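A minimal sketch of what censoring looks like in simulation (our own illustration; the stocking level and demand rate below are arbitrary): the business only ever records \( \min(W, x) \) together with a flag indicating which case occurred.

import numpy as np

rng = np.random.default_rng(1)
x = 5.0                                   # Stocking level (assumption)
W = rng.exponential(1 / 0.25, size=5)     # True demands (hidden when censored)
observed = np.minimum(W, x)               # What the business actually sees
censored = W >= x                         # True where demand exceeded supply
print(np.round(W, 2), np.round(observed, 2), censored)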
4) Transition Function
The transition function defines how the model's belief, represented by the state variables, is updated over time. It maps the prior state to the expected future state, and in our case, this update is governed by Bayesian learning.
Bayesian Uncertainty Modelling
Bayes' theorem combines prior belief with observed data to form a posterior distribution. This updated distribution reflects both prior knowledge and the newly observed information.
\[
p_{n+1}(\lambda \mid w_{n+1}) = \frac{p(w_{n+1} \mid \lambda) \cdot p_n(\lambda)}{p(w_{n+1})}
\]
Where:
\[
p(w_{n+1} \mid \lambda) : \text{ Likelihood of the new observation at time } n+1
\]
\[
p_n(\lambda) : \text{ Prior at time } n
\]
\[
p(w_{n+1}) : \text{ Marginal likelihood (normalizing constant) at time } n+1
\]
\[
p_{n+1}(\lambda \mid w_{n+1}) : \text{ Posterior after observing } w_{n+1}
\]
We set up our problem such that in each period, demand \( W \) is drawn from an Exponential distribution. The prior belief over \( \lambda \) is modelled using a Gamma distribution.
\[
p_{n+1}(\lambda \mid w_{n+1})
=
\frac{
\underbrace{\lambda e^{-\lambda w_{n+1}}}_{\text{Likelihood}}
\cdot
\underbrace{\frac{b_n^{a_n}}{\Gamma(a_n)} \lambda^{a_n - 1} e^{-b_n \lambda}}_{\text{Prior (Gamma)}}
}{
\underbrace{
\int_0^\infty \lambda e^{-\lambda w_{n+1}} \cdot \frac{b_n^{a_n}}{\Gamma(a_n)} \lambda^{a_n - 1} e^{-b_n \lambda} \, d\lambda
}_{\text{Marginal (evidence)}}
}
\]
The Gamma and Exponential distributions form a well-known conjugate pair in Bayesian statistics: with a Gamma prior and an Exponential likelihood, the resulting posterior is also a Gamma distribution. This property of the prior and posterior belonging to the same distributional family is what defines a conjugate prior, and it simplifies Bayesian updating considerably.
For reference, closed-form conjugate updates like this one can be found in standard conjugate prior tables, such as the one on Wikipedia. Using this reference, we can formulate the posterior as follows.
Let:
\[
\lambda \sim \mathrm{Gamma}(a_0, b_0) \quad : \text{ Prior}
\]
\[
w \sim \mathrm{Exp}(\lambda) \quad : \text{ Likelihood}
\]
For \( n \) independent observations \( w_1, \dots, w_n \), the Gamma prior and Exponential likelihood result in a Gamma posterior:
\[
\lambda \mid w_1, \dots, w_n \sim \mathrm{Gamma}\left(a_0 + n,\; b_0 + \sum_{i=1}^n w_i\right)
\]
After observing a single (uncensored) demand \( w \), the update, by conjugacy, simplifies to:
\[
\lambda \mid w \sim \mathrm{Gamma}(a_0 + 1,\; b_0 + w)
\]
- The shape parameter increases by 1 because one new data point has been observed.
- The rate parameter increases by \( w \) because the Exponential likelihood contributes the term \( e^{-\lambda w} \), which combines with the prior's exponential term and adds to the total exponent.
The Update Function
The posterior parameters (state variables) are updated based on the nature of the observation:
- Uncensored (\( W_{n+1} < x_n \)):
\[
a_{n+1} = a_n + 1, \quad b_{n+1} = b_n + W_{n+1}
\]
- Censored (\( W_{n+1} \ge x_n \)):
\[
a_{n+1} = a_n, \quad b_{n+1} = b_n + x_n
\]
These updates reflect how each observation, full or partial, informs the posterior belief over \( \lambda \).
We can define the transition function in Python as below:
from typing import Tuple

def transition_a_b(
    a_n: float,
    b_n: float,
    x_n: float,
    W_n1: float
) -> Tuple[float, float]:
    """
    Updates the posterior parameters (a, b) after observing demand.

    Args:
        a_n (float): Current shape parameter of the Gamma prior.
        b_n (float): Current rate parameter of the Gamma prior.
        x_n (float): Order quantity at time n.
        W_n1 (float): Observed demand at time n+1 (may be censored).

    Returns:
        Tuple[float, float]: Updated (a_{n+1}, b_{n+1}) values.
    """
    if W_n1 < x_n:
        # Uncensored: full demand observed
        a_n1 = a_n + 1
        b_n1 = b_n + W_n1
    else:
        # Censored: only know that W >= x
        a_n1 = a_n
        b_n1 = b_n + x_n
    return a_n1, b_n1
5) Objective Function
The model seeks a policy \( \pi \) mapping beliefs to stocking decisions in order to maximize total expected profit.
- Profit from ordering \( x_n \) units and facing demand \( W_{n+1} \):
\[
F(x_n, W_{n+1}) = p \cdot \min(x_n, W_{n+1}) - c \cdot x_n
\]
- The cumulative objective is:
\[
\max_\pi \; \mathbb{E} \left[ \sum_{n=0}^{N-1} F(x_n, W_{n+1}) \right]
\]
- \( \pi \) maps \( (a_n, b_n) \) to \( x_n \)
- \( p \) is the selling price per unit sold
- \( c \) is the unit cost of ordering
- Unsold units are discarded with no salvage value
Note that this objective function maximizes only the expected immediate reward across the entire time horizon. In the next section, we introduce an expanded version that incorporates the value of future learning. This encourages the model to explore, accounting for the information that censored demand can reveal over time.
We can define the profit function in Python as below:
def profit_function(x: float, W: float, p: float, c: float) -> float:
    """
    Profit function defined as:
        F(x, W) = p * min(x, W) - c * x
    This represents the reward obtained when fulfilling demand W with inventory x,
    earning price p per unit sold and incurring cost c per unit ordered.

    Args:
        x (float): Inventory level / decision variable.
        W (float): Realized demand.
        p (float): Unit selling price.
        c (float): Unit cost.

    Returns:
        float: The profit (reward) for this period.
    """
    return p * min(x, W) - c * x
Policy Functions
We will now define several policy functions, as outlined by Negoescu et al., each of which sets the stocking level \( x_n \) based on our current belief state \( (a_n, b_n) \).
1) Point Estimate Policy
Under this policy, the model estimates the unknown demand rate \( \lambda \) using the current posterior and chooses an order quantity \( x_n \) that maximizes the immediate expected profit.
At time \( n \), given the posterior \( \lambda \sim \mathrm{Gamma}(a_n, b_n) \), the point estimate is the posterior mean:
\[
\hat{\lambda}_n = \frac{a_n}{b_n}
\]
We treat this estimate as the "true" value of \( \lambda \) and assume demand \( W \sim \mathrm{Exp}(\hat{\lambda}_n) \).
Expected Value
The profit for order quantity \( x \) and realized demand \( W \) is:
\[
F(x, W) = p \cdot \min(x, W) - c \cdot x
\]
We seek to maximize the expected profit:
\[
\max_{x \geq 0} \quad \mathbb{E}_W \left[ p \min(x, W) - c x \right]
\]
Recall that the expected value of a continuous random variable is:
\[
\mathbb{E}[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx
\]
Thus, the objective function can be written as:
\[
\max_{x \geq 0} \left[ p \left( \int_0^x w f_W(w) \, dw + x \int_x^\infty f_W(w) \, dw \right) - c x \right]
\]
Where:
- \( f_W(w) \): Probability density function (PDF) of demand evaluated at \( w \)
The PDF of \( \mathrm{Exponential}(\hat{\lambda}_n) \) is:
\[
f_W(w) = \hat{\lambda}_n e^{-\hat{\lambda}_n w}
\]
This can be solved in closed form as:
\[
\mathbb{E}[F(x, W)] = p \cdot \frac{1 - e^{-\hat{\lambda}_n x}}{\hat{\lambda}_n} - c x
\]
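As a quick sanity check (our own addition; the values of \( \hat{\lambda}_n \), \( p \), \( c \), and \( x \) below are arbitrary), the closed form can be compared against a Monte Carlo estimate of \( \mathbb{E}[F(x, W)] \):

import numpy as np

rng = np.random.default_rng(2)
lam_hat, p, c, x = 0.5, 26.0, 20.0, 1.0           # Illustrative values (assumptions)
W = rng.exponential(1 / lam_hat, size=1_000_000)  # Simulated demands
mc_estimate = np.mean(p * np.minimum(x, W) - c * x)
closed_form = p * (1 - np.exp(-lam_hat * x)) / lam_hat - c * x
print(mc_estimate, closed_form)                   # Both should be ~0.46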
First-Order Optimality Condition
We set the derivative of the expected profit function to zero and solve for \( x \) to find the stocking level that maximizes the expected profit:
\[
\frac{d}{dx} \mathbb{E}[F(x, W)] = p e^{-\hat{\lambda}_n x} - c = 0
\]
\[
e^{-\hat{\lambda}_n x^*} = \frac{c}{p}
\quad \Rightarrow \quad
x^* = \frac{1}{\hat{\lambda}_n} \log\left( \frac{p}{c} \right)
\]
Substituting \( \hat{\lambda}_n = \frac{a_n}{b_n} \):
\[
x_n = \frac{b_n}{a_n} \log\left( \frac{p}{c} \right)
\]
Python implementation:
import math

def point_estimate_policy(
    a_n: float,
    b_n: float,
    p: float,
    c: float
) -> float:
    """
    Point Estimate Policy: chooses x_n based on the posterior mean at time n.

    Args:
        a_n (float): Gamma shape parameter at time n.
        b_n (float): Gamma rate parameter at time n.
        p (float): Selling price per unit.
        c (float): Unit cost.

    Returns:
        float: Stocking level x_n.
    """
    lambda_hat = a_n / b_n
    return (1 / lambda_hat) * math.log(p / c)
2) Distribution Policy
The Distribution Policy optimizes the expected immediate profit by integrating over the entire current belief distribution of the demand rate \( \lambda \). Unlike the Point Estimate Policy, it does not collapse the posterior to a single value.
At time \( n \), the belief about \( \lambda \) is:
\[
\lambda \sim \mathrm{Gamma}(a_n, b_n)
\]
Demand is modelled as:
\[
W \sim \mathrm{Exp}(\lambda)
\]
This policy chooses the order quantity \( x_n \) by maximizing the expected immediate profit, averaged over both the uncertainty in demand and the uncertainty in \( \lambda \):
\[
x_n = \arg\max_{x \ge 0} \; \mathbb{E}_{\lambda \sim \mathrm{Gamma}(a_n, b_n)} \left[ \mathbb{E}_{W \sim \mathrm{Exp}(\lambda)} \left[ p \cdot \min(x, W) - c x \right] \right]
\]
Expected Value
From the previous policy, we know that:
\[
\mathbb{E}_W[\min(x, W)] = \frac{1 - e^{-\hat{\lambda}_n x}}{\hat{\lambda}_n}
\]
Thus:
\[
\mathbb{E}_{\lambda} \left[ \mathbb{E}_{W \mid \lambda}[\min(x, W)] \right]
= \mathbb{E}_{\lambda} \left[ \frac{1 - e^{-\lambda x}}{\lambda} \right]
\]
If we denote the Gamma density as:
\[
f(\lambda) = \frac{b^a}{\Gamma(a)} \lambda^{a - 1} e^{-b \lambda}
\]
Then the expectation becomes:
\[
\mathbb{E}_\lambda \left[ \frac{1 - e^{-\lambda x}}{\lambda} \right]
= \int_0^\infty \frac{1 - e^{-\lambda x}}{\lambda} f(\lambda) \, d\lambda
= \frac{b^a}{\Gamma(a)} \int_0^\infty (1 - e^{-\lambda x}) \, \lambda^{a - 2} e^{-b \lambda} \, d\lambda
\]
Without going through the full proof, the expectation becomes:
\[
\mathbb{E}[\text{Profit}] = p \cdot \mathbb{E}_{\lambda} \left[ \frac{1 - e^{-\lambda x}}{\lambda} \right] - c x
= p \cdot \frac{b}{a - 1} \left(1 - \left( \frac{b}{b + x} \right)^{a - 1} \right) - c x
\]
First-Order Optimality Condition
Again, we set the derivative of the expected profit function to zero and solve for \( x \) to find the stocking level that maximizes the expected profit:
\[
\frac{d}{dx} \mathbb{E}[\text{Profit}]
= \frac{d}{dx} \left[ p \cdot \frac{b}{a - 1} \left(1 - \left( \frac{b}{b + x} \right)^{a - 1} \right) - c x \right] = 0
\]
Without going through the proof, the closed-form expression based on Negoescu et al.'s paper is:
\[
x_n = b_n \left( \left( \frac{p}{c} \right)^{1/a_n} - 1 \right)
\]
Python implementation:
def distribution_policy(
    a_n: float,
    b_n: float,
    p: float,
    c: float
) -> float:
    """
    Distribution Policy: chooses x_n by integrating over the full posterior at time n.

    Args:
        a_n (float): Gamma shape parameter at time n.
        b_n (float): Gamma rate parameter at time n.
        p (float): Selling price per unit.
        c (float): Unit cost.

    Returns:
        float: Stocking level x_n.
    """
    return b_n * ((p / c) ** (1 / a_n) - 1)
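For intuition (our own example, with arbitrary belief and price values), we can compare the two policies on the same belief. Since \( e^t - 1 > t \) for \( t > 0 \), the Distribution Policy always orders slightly more than the Point Estimate Policy, reflecting the extra uncertainty in the posterior:

a_n, b_n, p, c = 2.0, 8.0, 26.0, 20.0          # Illustrative belief and prices (assumptions)
print(point_estimate_policy(a_n, b_n, p, c))   # (b_n/a_n) * ln(p/c) ~ 1.05
print(distribution_policy(a_n, b_n, p, c))     # b_n * ((p/c)^(1/a_n) - 1) ~ 1.12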
3) Knowledge Gradient (KG) Policy
The Knowledge Gradient (KG) policy is a Bayesian learning policy that balances exploitation (maximizing immediate profit) and exploration (ordering to gain information about demand for future decisions).
Instead of just maximizing today's profit, KG chooses the order quantity that maximizes:
Profit now + Value of information gained for the future
\[
x_n = \arg\max_x \; \mathbb{E}\left[ p \cdot \min(x, W_{n+1}) - c x + V(a_{n+1}, b_{n+1}) \mid a_n, b_n, x \right]
\]
Where:
- \( W_{n+1} \sim \mathrm{Exp}(\lambda) \) (with \( \lambda \sim \mathrm{Gamma}(a_n, b_n) \))
- \( V(a_{n+1}, b_{n+1}) \) is the value of expected future profits under the updated belief after observing \( W_{n+1} \)
We do not know \( (a_{n+1}, b_{n+1}) \) at time \( n \) because we have not yet observed demand, so we compute their expected value over the possible observation outcomes (censored vs. uncensored).
The KG policy then evaluates each candidate stocking quantity \( x \) by:
- Simulating its effect on posterior beliefs
- Computing the immediate profit
- Computing the value of future learning based on the belief updates
Objective Function
We define the total value of choosing \( x \) at time \( n \) as:
\[
F_{\text{KG}}(x) = \underbrace{\mathbb{E}[p \cdot \min(x, W) - c x]}_{\text{Immediate profit}} + \underbrace{(N - n) \cdot \mathbb{E}_{\text{posterior}} \left[ \max_{x'} \mathbb{E}_{\lambda \sim \text{posterior}}\left[ p \cdot \min(x', W) - c x' \right] \right]}_{\text{Value of learning}}
\]
- The first term is just the expected immediate profit.
- The second term accounts for how this choice improves future profit by sharpening our belief about \( \lambda \).
- Horizon factor \( (N - n) \): We will make \( (N - n) \) more decisions in the future, so the value of better decisions due to learning today gets multiplied by this factor.
- Posterior averaging \( \mathbb{E}_{\text{posterior}}[\cdot] \): We average over all the possible posterior beliefs we might end up with after observing the demand outcome; because demand is random and potentially censored, we will not get perfect information, but we will still update our belief.
The paper uses the previously discussed Distribution Policy as a proxy for estimating the future value function. Thus:
\[
V(a, b) = \frac{b p}{a - 1} \left( 1 - \left( \frac{b}{b + x^*} \right)^{a - 1} \right) - c x^*,
\quad \text{where} \quad
x^*(a, b) = b \left( \left( \frac{p}{c} \right)^{1/a} - 1 \right)
\]
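Expressed in code, this proxy value function might look like the following sketch (our own helper, reusing distribution_policy from above; it assumes \( a > 1 \) so that \( \mathbb{E}[1/\lambda] = b/(a-1) \) is finite):

def future_value(a: float, b: float, p: float, c: float) -> float:
    """Expected profit V(a, b) when ordering the Distribution Policy optimum x*.
    Sketch only; assumes a > 1 so the expectation is finite."""
    x_star = distribution_policy(a, b, p, c)
    return (b * p / (a - 1)) * (1 - (b / (b + x_star)) ** (a - 1)) - c * x_star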
Expected Value
The expected value of \( V \) is expressed as below, per Negoescu et al. Since the proof of this equation is quite involved, we will not go over the details.
\[
\begin{aligned}
\mathbb{E}[V] &= \mathbb{E} \left[ \mathbb{E} \left[ b^{n+1} \left( \frac{p}{a^{n+1} - 1} \left( 1 - \left( \frac{c}{p} \right)^{1 - \frac{1}{a^{n+1}}} \right) - c \left( \left( \frac{c}{p} \right)^{-\frac{1}{a^{n+1}}} - 1 \right) \right) \Big| \lambda \right] \Big| a^n, b^n, x^n \right] \\
&= \mathbb{E} \left[ \int_0^{x^n} \left( b^n + y \right) \left( \frac{p}{a^n} \left( 1 - \left( \frac{c}{p} \right)^{1 - \frac{1}{a^{n+1}}} \right) - c \left( \left( \frac{c}{p} \right)^{-\frac{1}{a^{n+1}}} - 1 \right) \right) \lambda e^{-\lambda y} \, dy \right. \\
&\quad + \left. \int_{x^n}^{\infty} \left( b^n + x^n \right) \left( \frac{p}{a^n - 1} \left( 1 - \left( \frac{c}{p} \right)^{1 - \frac{1}{a^n}} \right) - c \left( \left( \frac{c}{p} \right)^{-\frac{1}{a^n}} - 1 \right) \right) \lambda e^{-\lambda y} \, dy \right].
\end{aligned}
\]
Since we already know the expected value of the immediate profit function from the earlier policies, we can express the total expected value of the KG policy as the sum of the two terms. As this equation is quite long, we will not go over the details, but it can be found in the paper.
First-Order Optimality Condition
For this policy as well, we set the derivative of the expected profit function to zero and solve for \( x \) to find the stocking level that maximizes the expected profit. The closed-form solution based on the paper is:
\[
x_n = b_n \left[ \left( \frac{r}{1 + (N - n) \cdot \left( 1 + \frac{a_n r}{a_n - 1} - \frac{(a_n + 1) r}{a_n} \right)} \right)^{-1 / a_n} - 1 \right]
\]
Where:
- \( r = \frac{c}{p} \): Cost-to-price ratio
Python implementation:
def knowledge_gradient_policy(
    a_n: float,
    b_n: float,
    p: float,
    c: float,
    n: int,
    N: int
) -> float:
    """
    Knowledge Gradient Policy: one-step lookahead policy for exponential demand
    with a Gamma(a_n, b_n) posterior.

    Args:
        a_n (float): Gamma shape parameter at time n.
        b_n (float): Gamma rate parameter at time n.
        p (float): Selling price per unit.
        c (float): Unit cost per unit.
        n (int): Current period index (0-based).
        N (int): Total number of periods in the horizon.

    Returns:
        float: Stocking level x_n.
    """
    a = max(a_n, 1.001)  # Guard against very small shape values
    r = c / p
    # Fraction of the horizon remaining; scales the exploration incentive
    future_factor = (N - (n + 1)) / N
    # Shrink the effective cost-to-price ratio, which raises the order quantity
    # (more exploration) while future learning still has value
    adjustment = 1.0 - future_factor * (1.0 / a)
    adjusted_r = min(max(r * adjustment, 1e-4), 0.99)
    return b_n * ((1 / adjusted_r) ** (1 / a) - 1)
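Note that the implementation above is a smoothed heuristic variant rather than a literal transcription of the closed-form expression. For comparison, a direct transcription (our own sketch; it assumes \( a_n > 1 \) and a positive denominator, with no numerical safeguards) would be:

def kg_closed_form(
    a_n: float,
    b_n: float,
    p: float,
    c: float,
    n: int,
    N: int
) -> float:
    """Literal transcription of the closed-form KG solution above (sketch only).
    Requires a_n > 1 and a positive denominator."""
    r = c / p
    denom = 1 + (N - n) * (1 + a_n * r / (a_n - 1) - (a_n + 1) * r / a_n)
    return b_n * ((r / denom) ** (-1 / a_n) - 1)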
Monte Carlo Policy Evaluation
To evaluate a policy \( \pi \) in a stochastic environment, we simulate its performance over multiple sample demand paths.
Let:
- \( M \) be the number of independent simulations (demand paths), each denoted \( \omega^m \) for \( m = 1, 2, \dots, M \)
- \( N \) be the time horizon
- \( W_{n+1}(\omega^m) \) be the realized demand at time \( n+1 \) on path \( m \)
- \( x_n(\omega^m) \) be the decision taken at time \( n \) under policy \( \pi \) on path \( m \)
Cumulative Reward on a Single Path
For each sample path \( \omega^m \), we compute the total reward:
\[
\hat{F}^\pi(\omega^m) = \sum_{n=0}^{N-1} \left[ p \cdot \min\left(x_n(\omega^m), W_{n+1}(\omega^m)\right) - c \cdot x_n(\omega^m) \right]
\]
This represents the realized value of the policy \( \pi \) along that specific trajectory.
Python implementation:
import numpy as np

def simulate_policy(
    N: int,
    a_0: float,
    b_0: float,
    lambda_true: float,
    policy_name: str,
    p: float,
    c: float,
    seed: int = 42
) -> float:
    """
    Simulates the sequential inventory decision-making process using a specified policy.

    Args:
        N (int): Number of time periods.
        a_0 (float): Initial shape parameter of the Gamma prior.
        b_0 (float): Initial rate parameter of the Gamma prior.
        lambda_true (float): True exponential demand rate.
        policy_name (str): One of {'point_estimate', 'distribution', 'knowledge_gradient'}.
        p (float): Selling price per unit.
        c (float): Procurement cost per unit.
        seed (int): Random seed for reproducibility.

    Returns:
        float: Total cumulative reward over N periods.
    """
    np.random.seed(seed)
    a_n, b_n = a_0, b_0
    rewards = []
    for n in range(N):
        # Choose order quantity based on the specified policy
        if policy_name == "point_estimate":
            x_n = point_estimate_policy(a_n=a_n, b_n=b_n, p=p, c=c)
        elif policy_name == "distribution":
            x_n = distribution_policy(a_n=a_n, b_n=b_n, p=p, c=c)
        elif policy_name == "knowledge_gradient":
            x_n = knowledge_gradient_policy(a_n=a_n, b_n=b_n, p=p, c=c, n=n, N=N)
        else:
            raise ValueError(f"Unknown policy: {policy_name}")
        # Sample demand
        W_n1 = np.random.exponential(1 / lambda_true)
        # Compute profit and update belief
        reward = profit_function(x_n, W_n1, p, c)
        rewards.append(reward)
        a_n, b_n = transition_a_b(a_n, b_n, x_n, W_n1)
    return sum(rewards)
Estimating the Expected Value by Averaging
The expected reward of policy \( \pi \) is approximated by the sample average across all \( M \) simulations:
\[
\bar{F}^\pi = \frac{1}{M} \sum_{m=1}^{M} \hat{F}^\pi(\omega^m)
\]
This \( \bar{F}^\pi \) is an unbiased estimator of the true expected reward under policy \( \pi \).
Python implementation:
import numpy as np

def policy_monte_carlo(
    N_sim: int,
    N: int,
    a_0: float,
    b_0: float,
    lambda_true: float,
    policy_name: str,
    p: float = 10.0,
    c: float = 4.0,
    base_seed: int = 42
) -> float:
    """
    Runs multiple Monte Carlo simulations to evaluate the average cumulative reward
    for a given inventory policy under exponential demand.

    Args:
        N_sim (int): Number of Monte Carlo simulations to run.
        N (int): Number of time steps in each simulation.
        a_0 (float): Initial Gamma shape parameter.
        b_0 (float): Initial Gamma rate parameter.
        lambda_true (float): True rate of exponential demand.
        policy_name (str): Name of the policy to use: {"point_estimate", "distribution", "knowledge_gradient"}.
        p (float): Selling price per unit.
        c (float): Procurement cost per unit.
        base_seed (int): Seed offset for reproducibility across simulations.

    Returns:
        float: Average cumulative reward across all simulations.
    """
    total_rewards = []
    for i in range(N_sim):
        reward = simulate_policy(
            N=N,
            a_0=a_0,
            b_0=b_0,
            lambda_true=lambda_true,
            policy_name=policy_name,
            p=p,
            c=c,
            seed=base_seed + i
        )
        total_rewards.append(reward)
    return np.mean(total_rewards)
# Parameters
N_sim = 10000          # Number of simulations
N = 100                # Number of time periods
a_0 = 10.0             # Initial shape parameter of the Gamma prior
b_0 = 5.0              # Initial rate parameter of the Gamma prior
lambda_true = 0.25     # True rate of exponential demand
p = 26.0               # Selling price per unit
c = 20.0               # Unit cost
base_seed = 1234       # Base seed for reproducibility

results = {
    policy: policy_monte_carlo(
        N_sim=N_sim,
        N=N,
        a_0=a_0,
        b_0=b_0,
        lambda_true=lambda_true,
        policy_name=policy,
        p=p,
        c=c,
        base_seed=base_seed
    )
    for policy in ["point_estimate", "distribution", "knowledge_gradient"]
}
print(results)
Results


The left plot shows how the average cumulative profit evolves over time, while the right plot shows the average reward per time step. From this simulation, we observe that the Knowledge Gradient (KG) policy significantly outperforms the other two, since it optimizes not only immediate rewards but also the future value of cumulative rewards. The Point Estimate and Distribution policies perform similarly to each other.

We can also observe from the above plots that the Bayesian learning algorithm gradually converges to the true mean demand \( \mathbb{E}[W] \).
These findings highlight the importance of incorporating the value of information in sequential decision-making under uncertainty. While simpler heuristics like the Point Estimate and Distribution policies focus solely on immediate gains, the Knowledge Gradient policy leverages future learning potential, yielding superior long-term performance.