Math graphic
📐 Concept diagram

23-08 — Policy Gradient Methods

Phase: 23 — Reinforcement Learning Mathematics Subject: 23-08 Prerequisites: 23-01 — MDPs, 23-04 — Monte Carlo Methods, 23-06 — Q-Learning Next subject: 23-09 — Proximal Policy Optimization (PPO)


Learning Objectives

By the end of this subject, you will be able to:

  1. Derive the Policy Gradient Theorem: $\nabla_\theta J(\theta) = \mathbb{E}\pi\left[\nabla\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]$
  2. Implement the REINFORCE algorithm using Monte Carlo returns
  3. Add a learned baseline (value function) to reduce gradient variance
  4. Construct the Actor-Critic architecture and derive the TD-based critic update
  5. Connect policy gradients to the score function estimator and the REINFORCE trick

Core Content

Why Policy Gradients?

Value-based methods (Q-learning, DQN) learn $Q(s,a)$ and derive a policy via $\arg\max$. This has limitations: - $\arg\max$ is discontinuous: small changes in $Q$ can cause large policy changes - Cannot represent stochastic policies (essential for games, partially observable environments) - The greedy policy may be hard to compute in continuous action spaces

Policy gradient methods directly parameterize and optimize the policy $\pi_\theta(a|s)$ using gradient ascent on the expected return.

The Objective Function

For episodic tasks, the objective is the expected return from the start state:

$$J(\theta) = \mathbb{E}{\tau \sim \pi\theta}[G_0] = \mathbb{E}{\tau \sim \pi\theta}\left[\sum_{t=0}^{T} \gamma^t R_{t+1}\right]$$

where $\tau = (s_0, a_0, r_1, s_1, a_1, \dots)$ is a trajectory sampled from $\pi_\theta$. For continuing tasks, we use the average reward formulation.

The Policy Gradient Theorem

Theorem (Sutton et al., 1999):

$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]$$

Proof sketch:

The probability of a trajectory under $\pi_\theta$ is:

$$P(\tau | \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t | s_t) P(s_{t+1} | s_t, a_t)$$

The log-probability:

$$\log P(\tau | \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \left[\log \pi_\theta(a_t | s_t) + \log P(s_{t+1} | s_t, a_t)\right]$$

⚠️ CRITICAL: The transition dynamics $P(s_{t+1}|s_t, a_t)$ do NOT depend on $\theta$. Taking the gradient:

$$\nabla_\theta \log P(\tau | \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t | s_t)$$

The expected return gradient uses the log-derivative trick (also called the REINFORCE trick or score function estimator):

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}{\tau \sim \pi\theta}[R(\tau)] = \mathbb{E}{\tau \sim \pi\theta}[R(\tau) \nabla_\theta \log P(\tau | \theta)]$$

Expanding $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$ and applying causality (future rewards don't affect past actions):

$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}\left[\sum_{t=0}^{T-1} \gamma^t G_t \cdot \nabla_\theta \log \pi_\theta(a_t | s_t)\right]$$

Where $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$ is the Monte Carlo return from step $t$. Since $\mathbb{E}[G_t | s_t, a_t] = Q^\pi(s_t, a_t)$, we can replace $G_t$ with $Q^\pi$:

$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]$$

where the expectation is over the state-action visitation distribution $d^\pi(s,a)$.

The REINFORCE Algorithm

REINFORCE (Williams, 1992) is the simplest Monte Carlo policy gradient algorithm:

Algorithm: 1. Initialize policy parameters $\theta$ randomly 2. For each episode: - Generate trajectory $(s_0, a_0, r_1, \dots, s_T)$ using $\pi_\theta$ - For each step $t = 0, \dots, T-1$: - Compute return $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$ - Update parameters: $\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla_\theta \log \pi_\theta(a_t | s_t)$

Intuition: If action $a$ in state $s$ leads to higher return $G_t$, increase its probability ($\nabla_\theta \log \pi_\theta$ points toward higher probability). If lower return, decrease probability.

Variance Reduction with Baseline

REINFORCE has high variance because $G_t$ can vary wildly. The solution: subtract a baseline $b(s)$ that doesn't depend on the action:

$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot (Q^\pi(s,a) - b(s))\right]$$

Why baselines don't bias the gradient:

$$\mathbb{E}{\pi\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = b(s) \cdot \mathbb{E}{\pi\theta}[\nabla_\theta \log \pi_\theta(a|s)] = b(s) \cdot \nabla_\theta 1 = 0$$

The baseline term integrates to zero because $\sum_a \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$.

Optimal baseline: A natural choice is the state value function $V^\pi(s)$, giving the advantage function:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

The advantage represents how much better action $a$ is compared to the average action in state $s$. Using the advantage reduces variance because it centers the returns:

$$\nabla_\theta J(\theta) = \mathbb{E}{\pi\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot A^\pi(s,a)\right]$$

Actor-Critic Architecture

The Actor-Critic combines policy gradient (actor) with value function approximation (critic):

The actor is updated using the policy gradient with the critic's advantage estimate:

$$\theta \leftarrow \theta + \alpha_\theta \cdot \hat{A}(s,a) \cdot \nabla_\theta \log \pi_\theta(a|s)$$

The critic is updated using TD learning (or MC):

$$\phi \leftarrow \phi + \alpha_\phi \cdot \delta \cdot \nabla_\phi V_\phi(s)$$

where $\delta = R + \gamma V_\phi(s') - V_\phi(s)$ is the TD error, which itself is an unbiased estimate of the advantage: $\mathbb{E}[\delta | s,a] = A^\pi(s,a)$.

N-step advantage estimate: For lower bias, use $n$-step returns:

$$\hat{A}t^{(n)} = \sum{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V_\phi(s_{t+n}) - V_\phi(s_t)$$

⚠️ CRITICAL: Actor-Critic combines two learning processes that are coupled: the critic's accuracy affects the actor's gradient quality, and the actor's policy determines the data distribution the critic learns from. This can cause instability.

Policy Parameterizations

Discrete actions (softmax):

$$\pi_\theta(a|s) = \frac{\exp(f_\theta(s,a))}{\sum_{a'} \exp(f_\theta(s,a'))}$$

$$\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s,a) - \sum_{a'} \pi_\theta(a'|s) \nabla_\theta f_\theta(s,a')$$

Continuous actions (Gaussian):

$$\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta^2(s))$$

$$\log \pi_\theta(a|s) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(a - \mu)^2}{2\sigma^2}$$

$$\nabla_\theta \log \pi_\theta(a|s) = \frac{a - \mu_\theta(s)}{\sigma_\theta^2(s)} \nabla_\theta \mu_\theta(s) + \frac{(a - \mu_\theta(s))^2 - \sigma_\theta^2(s)}{\sigma_\theta^3(s)} \nabla_\theta \sigma_\theta(s)$$



Key Terms

Worked Examples

Example 1: REINFORCE on a Simple Bandit

A 2-armed bandit has true values $Q^(a_1) = 1.0$, $Q^(a_2) = 2.0$. The policy is softmax with logits $\theta = [\theta_1, \theta_2] = [0, 0]$ initially. One episode: pick $a_1$, get reward 0.8. Compute the REINFORCE update with $\alpha = 0.1$.

Solution:

Initial probabilities: $\pi(a_1) = \pi(a_2) = \frac{e^0}{e^0 + e^0} = 0.5$

$\nabla_\theta \log \pi(a_1|\theta) = [1 - 0.5, -0.5] = [0.5, -0.5]$ (one-hot minus probabilities)

$G = 0.8$ (episode return = single reward)

Update: $\theta \leftarrow \theta + \alpha G \nabla_\theta \log \pi(a_1|\theta)$ $\theta_1 \leftarrow 0 + 0.1 \cdot 0.8 \cdot 0.5 = 0.04$ $\theta_2 \leftarrow 0 + 0.1 \cdot 0.8 \cdot (-0.5) = -0.04$

Click for answer $\theta = [0.04, -0.04]$. New probabilities: $\pi(a_1) \approx 0.52$, $\pi(a_2) \approx 0.48$. Even though $a_2$ is truly better, the update increases $\pi(a_1)$ because $a_1$ was taken and the return (0.8) was positive. Over many episodes, $a_2$ will be sampled and its higher returns will dominate.

Example 2: Baseline Variance Reduction

Two actions at state $s$: action $a_1$ gives returns $G = [10, 10, 10]$; action $a_2$ gives $G = [-5, 35, 0]$. Both have true $Q(s,a) = 10$. Compute the variance of the REINFORCE gradient (scalar $\nabla_\theta \log \pi = \pm 1$) with and without baseline $b = V(s) = 10$.

Solution:

Without baseline: Updates are proportional to $G_t$ directly. $a_1$ updates: $[+10, +10, +10]$ → mean = 10, variance = 0 $a_2$ updates: $[-5, +35, 0]$ → mean = 10, variance = $(15^2 + 25^2 + 10^2)/3 = (225 + 625 + 100)/3 = 316.7$

Pooled variance (assuming equal state visitation): $(0 + 316.7)/2 = 158.3$

With baseline $b = 10$: Updates proportional to $G_t - 10$. $a_1$: $[0, 0, 0]$ → mean = 0, variance = 0 $a_2$: $[-15, +25, -10]$ → mean = 0, variance = $(225 + 625 + 100)/3 = 316.7$

Wait — the variance is the same! This is because both actions have the same true $Q$. With different $Q$ values, the baseline helps more dramatically.

Click for answer In this example baseline doesn't reduce variance because both actions have equal $Q$. But in general, $V(s) = \mathbb{E}_a[Q(s,a)]$ centers the advantage, and when actions have very different $Q$ values, subtracting $V(s)$ dramatically reduces the gradient magnitude spread.

Example 3: Actor-Critic Update Step

At state $s$, the actor selects action $a$ with probability $\pi_\theta(a|s) = 0.3$. The critic estimates $V_\phi(s) = 5.0$. The agent receives $R = 2$, transitions to $s'$ where $V_\phi(s') = 4.5$, $\gamma = 0.99$. Actor learning rate $\alpha_\theta = 0.01$, critic $\alpha_\phi = 0.1$. Compute both updates.

Solution:

TD error: $\delta = 2 + 0.99(4.5) - 5.0 = 2 + 4.455 - 5.0 = 1.455$

Critic update: $\phi \leftarrow \phi + 0.1 \cdot 1.455 \cdot \nabla_\phi V_\phi(s)$ (The gradient direction depends on the specific parameterization — typically $\nabla_\phi V_\phi(s)$ for linear features is just the feature vector $\mathbf{x}(s)$)

Actor update: $\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta \log 0.3 = \nabla_\theta (-\log 0.3)$ (depends on parameterization) $\theta \leftarrow \theta + 0.01 \cdot 1.455 \cdot \nabla_\theta \log \pi_\theta(a|s)$

Since $\delta > 0$, the action was better than expected — increase its probability.

Click for answer The positive TD error (1.455) means the transition yielded a higher return than the critic predicted. The actor increases $\pi(a|s)$ and the critic increases $V(s)$ toward $R + \gamma V(s') = 6.455$.

Practice Problems

Problem 1: Prove that $\mathbb{E}{\pi\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$ for any state $s$.

Answer (click to expand) $$\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}[\nabla_\theta \log \pi_\theta(a|s)] = \sum_a \pi_\theta(a|s) \cdot \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = \sum_a \nabla_\theta \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$$ This identity is crucial: it proves that subtracting ANY state-dependent baseline $b(s)$ from the return does not bias the policy gradient.

Problem 2: Derive the softmax policy gradient for discrete actions. If $\pi_\theta(a|s) = \frac{e^{f_\theta(s,a)}}{\sum_{a'} e^{f_\theta(s,a')}}$, show that $\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s,a) - \mathbb{E}{a' \sim \pi\theta}[\nabla_\theta f_\theta(s,a')]$.

Answer (click to expand) $$\log \pi_\theta(a|s) = f_\theta(s,a) - \log \sum_{a'} e^{f_\theta(s,a')}$$ $$\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s,a) - \frac{\sum_{a'} e^{f_\theta(s,a')} \nabla_\theta f_\theta(s,a')}{\sum_{a'} e^{f_\theta(s,a')}}$$ $$= \nabla_\theta f_\theta(s,a) - \sum_{a'} \frac{e^{f_\theta(s,a')}}{\sum_k e^{f_\theta(s,k)}} \nabla_\theta f_\theta(s,a')$$ $$= \nabla_\theta f_\theta(s,a) - \sum_{a'} \pi_\theta(a'|s) \nabla_\theta f_\theta(s,a')$$ The gradient is the "score" for the chosen action minus the expected score over all actions — a contrastive signal.

Problem 3: Show that the TD error $\delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is an unbiased estimate of the advantage $A^\pi(s_t, a_t)$ when $V = V^\pi$.

Answer (click to expand) $$\mathbb{E}[\delta_t | s_t, a_t] = \mathbb{E}[R_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) | s_t, a_t]$$ $$= \mathbb{E}[R_{t+1} | s_t, a_t] + \gamma \mathbb{E}[V^\pi(s_{t+1}) | s_t, a_t] - V^\pi(s_t)$$ By definition of the Bellman equation: $\mathbb{E}[R_{t+1} + \gamma V^\pi(s_{t+1}) | s_t, a_t] = Q^\pi(s_t, a_t)$ Therefore: $\mathbb{E}[\delta_t | s_t, a_t] = Q^\pi(s_t, a_t) - V^\pi(s_t) = A^\pi(s_t, a_t)$. This is why actor-critic methods can use the TD error as a low-variance (though biased due to function approximation) advantage estimate.

Problem 4: REINFORCE has high variance. Quantify the variance: if returns $G$ have variance $\sigma_G^2$ and the policy gradient magnitude $|\nabla_\theta \log \pi|$ has variance $\sigma_{\nabla}^2$, what is the variance of the product $G \cdot \nabla_\theta \log \pi$ assuming independence?

Answer (click to expand) Under independence: $\text{Var}[G \cdot \nabla \log \pi] = \mathbb{E}[G^2]\mathbb{E}[(\nabla \log \pi)^2] - (\mathbb{E}[G]\mathbb{E}[\nabla \log \pi])^2$ Since $\mathbb{E}[\nabla \log \pi] = 0$: $\text{Var} = \mathbb{E}[G^2] \cdot \mathbb{E}[(\nabla \log \pi)^2] = (\sigma_G^2 + \mu_G^2) \cdot \sigma_{\nabla}^2$ The variance grows with the squared return — this is why long episodes with large accumulated returns make REINFORCE nearly unusable without variance reduction. With baseline subtraction, $G$ becomes $G - b \approx A$, and $\mu_A \approx 0$, reducing the term to $\sigma_A^2 \cdot \sigma_{\nabla}^2$.

Problem 5: Compare the computational cost of REINFORCE vs. Actor-Critic for an episode of length $T$. REINFORCE stores all $(s_t, a_t, r_t)$ and computes returns at the end. Actor-Critic updates online. What are the memory requirements?

Answer (click to expand) **REINFORCE (MC):** - Memory: $O(T \cdot (|s| + |a| + 1))$ — must store entire trajectory - Time: $O(T \cdot |\theta|)$ — one backward pass per step after episode ends - Cannot update during episode **Actor-Critic (TD):** - Memory: $O(|s| + |a|)$ — only current transition - Time: $O(T \cdot (|\theta| + |\phi|))$ — two backward passes per step - Updates online, can learn from incomplete episodes For $T = 1000$ steps, feature dimension $d = 256$: REINFORCE needs ≈ 256KB for the trajectory buffer; Actor-Critic needs ≈ 1KB. But Actor-Critic introduces bias through the value function approximation. This is the classic bias-variance-memory tradeoff.

Summary


Quiz

Question 1: The Policy Gradient Theorem states that $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)]$. Why does the gradient of the transition dynamics $P(s'|s,a)$ not appear?

A. It integrates to zero B. The transition dynamics do not depend on the policy parameters $\theta$ C. It's approximated away D. The model is deterministic

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. The environment dynamics $P(s'|s,a)$ are independent of the agent's policy parameters $\theta$, so $\nabla_\theta P(s'|s,a) = 0$. - If you chose A: Individual transition gradients are zero, not just their expectation. - If you chose C: No approximation is needed — it's exact. - If you chose D: The theorem works for stochastic dynamics too.

Question 2: REINFORCE updates are applied:

A. At every time step, using the immediate reward B. At the end of each episode, using the full Monte Carlo return C. Only after the policy achieves a target success rate D. Using Q-learning style TD errors

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. REINFORCE is a Monte Carlo method — it waits until episode end to compute the return $G_t$ for each step. - If you chose A: That's TD-based actor-critic, not pure REINFORCE. - If you chose C: REINFORCE doesn't require a success threshold. - If you chose D: REINFORCE uses $G_t$, not TD errors.

Question 3: Why does subtracting a state-dependent baseline $b(s)$ not bias the policy gradient?

A. Because $\mathbb{E}[\nabla_\theta \log \pi_\theta(a|s)] = 0$ B. Because $b(s)$ is always zero C. Because the baseline is learned separately D. Baselines do introduce bias, but it's negligible

Correct Answer: A

Explanation (click to expand) - **If you chose A:** Correct. $\mathbb{E}[b(s)\nabla_\theta \log \pi] = b(s)\mathbb{E}[\nabla_\theta \log \pi] = b(s) \cdot 0 = 0$. The baseline term contributes zero in expectation. - If you chose B: The baseline is typically non-zero (e.g., $V(s)$). - If you chose C: Separate learning doesn't guarantee unbiasedness; the mathematical property does. - If you chose D: The baseline introduces zero bias — it's a free lunch for variance reduction.

Question 4: In Actor-Critic, what does the TD error $\delta$ estimate?

A. The policy gradient directly B. The advantage $A^\pi(s,a)$ C. The value function $V(s)$ D. The immediate reward

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. $\mathbb{E}[\delta | s,a] = Q^\pi(s,a) - V^\pi(s) = A^\pi(s,a)$. The TD error is an unbiased estimate of the advantage. - If you chose A: $\delta$ times the score function gives the policy gradient, but $\delta$ itself estimates advantage. - If you chose C: $\delta$ is the error in $V(s)$, not $V(s)$ itself. - If you chose D: $\delta$ includes the next state value, not just reward.

Question 5: Which has LOWER variance: REINFORCE or Actor-Critic with TD(0)?

A. REINFORCE — because it uses the full return B. Actor-Critic with TD(0) — because it bootstraps from a learned value function C. They have the same variance D. It depends on the discount factor

Correct Answer: B

Explanation (click to expand) - **If you chose B:** Correct. TD(0) replaces the noisy sum of future rewards with a value function estimate $V(s')$, which has lower variance than the Monte Carlo return $G_t$ (at the cost of bias from function approximation error). - If you chose A: REINFORCE has high variance precisely because it uses noisy long-term returns. - If you chose C: Actor-Critic universally has lower variance. - If you chose D: Discount factor affects both similarly; the variance reduction from bootstrapping is the key difference.

Question 6: The "log-derivative trick" refers to:

A. $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$ B. $\nabla_\theta \mathcal{L} = \nabla_\theta Q(s,a)$ C. $\log(\nabla_\theta \pi) = \nabla_\theta \log \pi$ D. Using logarithms to stabilize gradients

Correct Answer: A

Explanation (click to expand) - **If you chose A:** Correct. Since $\nabla_\theta \log \pi = \frac{\nabla_\theta \pi}{\pi}$, multiplying both sides by $\pi$ gives $\nabla_\theta \pi = \pi \nabla_\theta \log \pi$. This lets us write $\nabla_\theta \mathbb{E}_\pi[f] = \mathbb{E}_\pi[f \nabla_\theta \log \pi]$, converting a gradient of an expectation into an expectation of a gradient. - If you chose B: That's the Q-learning gradient, not the log-derivative trick. - If you chose C: The log of a gradient is not the gradient of a log. - If you chose D: While logarithms do stabilize numerically, that's not what the "log-derivative trick" refers to.

Pitfalls

  1. Forgetting causality: Using $R(\tau)$ instead of $G_t$ — the full trajectory return shouldn't affect early action probabilities. The correct update uses per-step returns.
  2. No baseline: Raw REINFORCE without baseline has extreme variance; even a simple running-average baseline helps enormously.
  3. Learning rate too high: Policy gradient steps are noisier than supervised learning steps. Use smaller learning rates.
  4. Collapsing entropy: The policy can become deterministic too quickly (low entropy). Entropy regularization $\mathcal{H}(\pi)$ encourages exploration.
  5. Confusing on-policy vs off-policy: Policy gradient methods are typically on-policy — data must be collected under the current policy. Reusing old data introduces bias.

Next Steps

Next up: 23-09 — Proximal Policy Optimization (PPO) — where we learn to take policy gradient steps that are large enough to make progress but small enough to avoid destructive policy collapse.




Q8: Why are policy gradient methods typically on-policy?

A) The gradient $$\mathbb{E}{\pi\theta}[\nabla_\theta \log \pi_\theta \cdot Q^\pi]$$ is an expectation under the current policy — using data from a different policy would require importance sampling corrections B) Off-policy algorithms are always unstable C) Neural networks can't handle off-policy data D) On-policy methods are legally required

Answer and Explanations **Correct: A)** The policy gradient is an expectation over the state-action distribution induced by $$\pi_\theta$$. If data comes from a different policy $$\pi_{\text{old}}$$, the gradient estimate is biased unless corrected by importance sampling ratios $$\frac{\pi_\theta}{\pi_{\text{old}}}$$, which can cause variance issues (same problems as off-policy MC). PPO addresses this by using clipping on the importance ratio. - B) Off-policy algorithms like Q-learning can be stable when well-designed. - C) DQN is an off-policy method that uses neural networks successfully. - D) There are no legal requirements for RL algorithm design.