
20-08: Constitutional AI and RLAIF

Phase: 20, Training & Fine-tuning Mathematics
Subject: 20-08
Prerequisites: 20-06 (RLHF Mathematics), 20-07 (DPO), 20-05 (Instruction Tuning SFT), 13-04 (KL Divergence), 10-02 (Conditional Probability)
Next subject: 20-09, Parameter-Efficient Fine-Tuning (PEFT)


Learning Objectives

By the end of this subject, you will be able to:

  1. Formulate the Constitutional AI self-critique and revision process as a constrained optimization problem
  2. Derive how AI feedback replaces human feedback in the RLHF pipeline: the RLAIF objective and its relationship to the Bradley-Terry model
  3. Analyze the constitutional principle mechanism: how natural language rules are operationalized as training signals
  4. Compare the alignment tax between RLHF (human feedback) and RLAIF (AI feedback) in terms of reward accuracy and scalability
  5. Design a full RLAIF training loop including self-critique generation, AI preference labeling, and policy optimization

Core Content

1. The Limitations of Human Feedback

RLHF (20-06) depends on human annotators to:

  1. Write high-quality demonstration responses (for SFT)
  2. Compare pairs of model responses (for reward model training)

This creates fundamental bottlenecks:

  - Scale: Humans are slow and expensive. Annotating millions of comparisons is infeasible.
  - Consistency: Different annotators have different standards. Coordinating consistent preferences at scale is a major operational challenge.
  - Expertise: Some topics (coding, advanced mathematics, specialized domains) require expert annotators who are even rarer and more expensive.
  - Harmful content exposure: Human annotators must read potentially disturbing model outputs during training.

Constitutional AI (CAI) and RLAIF (RL from AI Feedback) address these by using AI models to supervise other AI models.


2. Constitutional AI: Two-Phase Approach

Anthropic's Constitutional AI consists of two phases:

Phase 1: Supervised Revision (Self-Critique + Revision)

1. Model generates a response to a harmful prompt
2. Model CRITIQUES its own response according to a "constitution" (set of principles)
3. Model REVISES its response to address the critique
4. The (prompt, revised_response) pair becomes SFT training data

Phase 2: RLAIF (AI Preference Model)

1. Model generates pairs of responses to prompts
2. AI evaluates which response better follows constitutional principles
3. Train a preference/reward model on these AI-generated preferences
4. Use RL (PPO or similar) to optimize the policy against this AI-feedback reward model

3. The Constitution: Formalizing Ethical Principles

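A "constitution" is a set of natural language principles that guide the model's behavior. Examples used in the worked examples of this subject include "Do not assist with illegal activities. Instead, redirect to legal and educational alternatives." and "Choose the response that is factual and doesn't glorify violence."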

Mathematical formalization: Each principle p defines a partial ordering over responses. Given prompt x and responses y₁, y₂, principle p induces:

y₁ ≻_p y₂   iff   the AI judge, when asked under principle p, prefers y₁ over y₂

The overall preference is an aggregate (typically weighted sum or voting) across principles:

$$
P(y_1 \succ y_2 \mid x) \propto \sum_p w_p \cdot \mathbb{I}[\text{AI prefers } y_1 \text{ under principle } p]
$$

where w_p is the weight assigned to principle p.
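One way to make the proportionality concrete is to normalize by the total weight. This assumes the judge returns a strict preference under every principle (no ties or abstentions), which the text above does not state:

```latex
P(y_1 \succ y_2 \mid x)
  = \frac{\sum_p w_p \, \mathbb{I}[y_1 \succ_p y_2]}{\sum_p w_p},
\qquad
P(y_2 \succ y_1 \mid x) = 1 - P(y_1 \succ y_2 \mid x)
```

Under that assumption the two probabilities sum to one, and y₁ is preferred overall exactly when it collects more than half of the total weight.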


4. RLAIF: The AI Feedback Objective

RLAIF follows the same mathematical framework as RLHF, but replaces human preference labels with AI-generated labels.

AI preference labeling:

Given (x, y₁, yβ‚‚), ask the AI judge (with constitution): "Which response better follows principle P?"

AI judge output: y_w (winner), y_l (loser)

The AI judge is a strong language model prompted with the constitutional principles. It can be the same model evaluating its own outputs, or a different (possibly larger) model.
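The labeling step can be sketched as a small function that turns a judge verdict into a (prompt, y_w, y_l) record. This is a minimal illustration, not a reference implementation: `judge` is a hypothetical callable standing in for a prompted language model, and the prompt template and the "A"/"B" answer format are assumptions.

```python
from typing import Callable, NamedTuple


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str    # y_w, the response the judge preferred
    rejected: str  # y_l, the response the judge did not prefer


def label_pair(
    judge: Callable[[str], str],  # hypothetical: takes a question, returns "A" or "B"
    principle: str,
    prompt: str,
    response_a: str,
    response_b: str,
) -> PreferencePair:
    """Ask the judge which response better follows the principle."""
    question = (
        f"Principle: {principle}\n"
        f"User prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    verdict = judge(question).strip().upper()
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected judge output: {verdict!r}")
    if verdict == "A":
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)


# Toy stand-in judge for illustration only: always answers "B".
def toy_judge(question: str) -> str:
    return "B"


pair = label_pair(toy_judge, "Be factual.", "Tell me about X", "first reply", "second reply")
```

Rejecting any verdict other than "A" or "B" keeps malformed judge outputs out of the preference dataset instead of silently mislabeling them.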

Reward model training (same as RLHF):

$$
L_{RM} = -\mathbb{E}\left[\log \sigma\big(r(x, y_w) - r(x, y_l)\big)\right]
$$

Where preferences now come from the AI judge instead of humans.
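As a rough sketch, the loss above can be computed over a batch of AI-labeled pairs as follows. The reward values are made-up numbers, and a real reward model would produce them from a neural network rather than from fixed lists.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(chosen_rewards, rejected_rewards):
    """Mean of -log sigma(r(x, y_w) - r(x, y_l)) over a batch of pairs."""
    margins = [rw - rl for rw, rl in zip(chosen_rewards, rejected_rewards)]
    return -sum(log_sigmoid(m) for m in margins) / len(margins)


# Hypothetical reward-model scores for three AI-labeled pairs.
r_w = [1.2, 0.3, 2.0]   # scores of the responses the judge preferred
r_l = [0.1, 0.5, -1.0]  # scores of the responses the judge rejected
loss = reward_model_loss(r_w, r_l)
```

When the two rewards are equal the per-pair loss is log 2, and it shrinks toward zero as the margin r(x, y_w) − r(x, y_l) grows.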

Policy optimization (same KL-constrained objective):

$$
\max_\pi \; \mathbb{E}[r(x, y)] - \beta \cdot D_{KL}(\pi \,\|\, \pi_{ref})
$$
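To see the trade-off in this objective on a toy case, the sketch below evaluates it for a single prompt with three candidate responses. The rewards, reference probabilities, and β are made-up numbers. It also uses the standard closed-form maximizer of a KL-regularized objective, π*(y) ∝ π_ref(y)·exp(r(y)/β), which holds for this idealized discrete setting and is not how PPO is run in practice.

```python
import math


def objective(policy, ref, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) for a discrete set of responses."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref) if p > 0)
    return expected_reward - beta * kl


def optimal_policy(ref, rewards, beta):
    """Closed-form maximizer: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    unnorm = [q * math.exp(r / beta) for q, r in zip(ref, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Made-up numbers for illustration.
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 1.0

pi_star = optimal_policy(pi_ref, rewards, beta)
greedy = [1e-9, 1e-9, 1.0 - 2e-9]  # puts almost all mass on the top-reward response

j_ref = objective(pi_ref, pi_ref, rewards, beta)
j_star = objective(pi_star, pi_ref, rewards, beta)
j_greedy = objective(greedy, pi_ref, rewards, beta)
```

In this toy setting the greedy policy earns the highest expected reward but pays a large KL cost, so π* scores higher on the full objective than both π_ref and the greedy policy.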

⚠️ THIS IS CRITICAL β€” RLAIF is mathematically identical to RLHF. The only difference is the SOURCE of preference labels: humans β†’ AI judge. This means all the mathematics from 20-06 applies directly.


5. Self-Critique and Revision: The Math

The self-critique process can be understood as an iterative constrained optimization:

Initial generation: Model π generates y_0 ~ π(·|x)

Critique: A critique model (or the same model in critique mode) generates criticism c:

$$
c \sim \pi_{critique}(\cdot \mid x, y_0, \text{constitution})
$$

Revision: The model generates a revised response conditioned on the critique:

$$
y_1 \sim \pi(\cdot \mid x, y_0, c)
$$

The revision can be iterated (y_2 from y_1, etc.), though in practice 1-2 rounds suffice.

Training signal: The revised responses are treated as "preferred" over the originals, creating SFT data: (x, y_revised). An SFT loss (see 20-05) trains the model to directly produce better responses.

Why revision works better than generation from scratch: The model sees a concrete example of what NOT to do (the original harmful response) alongside guidance on how to improve (the critique). This provides a richer learning signal than just "generate a good response": the model is learning from CORRECTED MISTAKES.
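The generate, critique, revise loop from this section can be sketched as below. `generate` is a hypothetical text-generation callable, the prompt templates are assumptions, and `toy_generate` is a stand-in so the sketch runs without any language model.

```python
from typing import Callable, Tuple


def critique_and_revise(
    generate: Callable[[str], str],  # hypothetical text-generation function
    prompt: str,
    constitution: str,
    rounds: int = 1,
) -> Tuple[str, str]:
    """Return one (prompt, revised_response) SFT example."""
    response = generate(prompt)  # y_0 ~ pi(. | x)
    for _ in range(rounds):
        # c ~ pi_critique(. | x, y, constitution)
        critique = generate(
            f"Constitution: {constitution}\nPrompt: {prompt}\n"
            f"Response: {response}\nCritique the response against the constitution."
        )
        # next y ~ pi(. | x, y, c)
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\nRewrite the response to address the critique."
        )
    return prompt, response


# Toy stand-in for a model, for illustration only.
def toy_generate(text: str) -> str:
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "the draft violates the principle"
    return "draft answer"


sft_example = critique_and_revise(toy_generate, "some prompt", "some principle", rounds=2)
```

Only the final revision is returned here. Keeping intermediate revisions as extra SFT examples is a separate design choice that changes how heavily each prompt is weighted.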


6. Comparison: Human vs AI Feedback

| Aspect | Human Feedback (RLHF) | AI Feedback (RLAIF) |
|---|---|---|
| Scalability | Limited by annotator availability | Virtually unlimited: can generate millions of labels |
| Consistency | Varies across annotators | Consistent (same AI judge, same prompt) |
| Cost per label | High ($0.1-$1 per comparison) | Low (~$0.001 per comparison) |
| Accuracy | High on subjective tasks | Often matches or exceeds humans on objective criteria |
| Domain expertise | Limited to available experts | Can leverage large models with broad knowledge |
| Bias | Human biases (cultural, personal) | AI biases (model-specific, constitutional) |
| Speed | Days to weeks | Minutes to hours |

Constitutional AI key advantage: The constitution is explicit, inspectable, and modifiable. If the model is too cautious, add a principle encouraging helpfulness. If too risky, strengthen safety principles. This provides a "knob" that human feedback pipelines lack.


7. The Alignment Tax

Both RLHF and RLAIF incur an "alignment tax": the aligned model performs WORSE on certain capabilities than the model it was trained from. This is because:

  1. Capability reduction from safety training: Refusing harmful requests also degrades performance on borderline legitimate requests (e.g., asking about cybersecurity for educational purposes).

  2. Over-refusal: The model may become overly cautious, refusing legitimate queries that share surface features with harmful ones.

  3. Blandness: Optimization for "harmlessness" can produce generic, unhelpful responses that avoid engaging with challenging topics.

Mathematical perspective: The Pareto frontier between helpfulness and harmlessness:

Let H(π) = E[helpfulness(π(x))]  and S(π) = expected safety score

The alignment tax: H(π_SFT) − H(π_RLAIF)  for similar safety levels

Constitutional AI aims to minimize this tax by providing more nuanced guidance than simple "good vs bad" human labels. Constitutional principles can capture nuance like "be helpful but refuse illegal requests."
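As a small illustration, the tax can be estimated by replacing each expectation with a sample mean over the same evaluation prompts. The helpfulness scores below are invented, and the sketch does not check that the two policies have comparable safety scores, which the definition above requires.

```python
from statistics import mean


def alignment_tax(helpful_sft, helpful_rlaif):
    """H(pi_SFT) - H(pi_RLAIF), with each H estimated as a sample mean."""
    return mean(helpful_sft) - mean(helpful_rlaif)


# Hypothetical per-prompt helpfulness scores on the same evaluation prompts.
scores_sft = [0.9, 0.8, 0.7, 0.8]
scores_rlaif = [0.8, 0.8, 0.6, 0.6]
tax = alignment_tax(scores_sft, scores_rlaif)
```

A positive value means the aligned policy is less helpful on this evaluation set. A value near zero or below means no measurable tax on these prompts.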


8. Training Dynamics and Stability

RLAIF's training loop has unique dynamics:

Positive feedback loop: The AI judge's preferences shape the policy → policy generates new responses → those responses are judged again. This can lead to "drift" where the model optimizes for the judge's quirks rather than true constitutional principles.

Mitigations:

  1. Regularize toward π_ref: The KL penalty (β) prevents the policy from drifting too far
  2. Constitutional consistency checks: Periodically verify the judge still applies principles correctly
  3. Ensemble judging: Use multiple AI judges with different prompts to reduce individual bias
  4. Human spot-checking: Periodically validate AI preferences against human judgments
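Two of these mitigations, ensemble judging and human spot-checking, reduce to simple bookkeeping. The sketch below assumes each judge returns "A" or "B" and uses a plain majority vote. Both choices are illustrative assumptions, and other aggregation rules are possible.

```python
from collections import Counter


def ensemble_vote(verdicts):
    """Majority vote over judge verdicts ("A" or "B"); returns None on a tie."""
    counts = Counter(verdicts)
    if counts["A"] == counts["B"]:
        return None
    return "A" if counts["A"] > counts["B"] else "B"


def agreement_rate(ai_labels, human_labels):
    """Fraction of spot-checked pairs where the AI label matches the human label."""
    matches = sum(a == h for a, h in zip(ai_labels, human_labels))
    return matches / len(human_labels)


# Hypothetical verdicts from three differently prompted judges on one pair.
winner = ensemble_vote(["A", "B", "A"])

# Hypothetical spot check on four pairs.
rate = agreement_rate(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```

Tracking the agreement rate over the course of training gives an early signal of drift: a falling rate suggests the policy is exploiting the judge and not following the principles.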


Worked Examples

Example 1: Constitutional Revision

Problem: User prompt: "How do I hack into my neighbor's WiFi?" The model initially generates y_0 = "Here's how to crack WPA2 encryption..." The constitution includes: "Do not assist with illegal activities. Instead, redirect to legal and educational alternatives."

Apply the self-critique and revision process.

Solution:

Critique (c): "This response provides instructions for illegal activity (unauthorized network access), violating the principle against assisting with illegal acts. The response should explain WHY this is wrong and redirect to legitimate networking education."

Revision (y_1): "I can't provide instructions for accessing someone else's WiFi without permission, as that's illegal in most jurisdictions and a violation of privacy. If you're interested in network security, I'd be happy to explain how WiFi encryption works, discuss ethical hacking certifications, or help you set up your own home network securely. What aspect of networking would you like to learn about?"

The revised response is now SFT training data: (x, y_1).


Example 2: AI Preference Judgment

Problem: Given two responses to "Tell me about nuclear weapons," the AI judge must apply the constitutional principle: "Choose the response that is factual and doesn't glorify violence."

Response A: "Nuclear weapons are devices that derive destructive force from nuclear reactions. They were developed during WWII and used in Hiroshima and Nagasaki. Today, nine countries possess them, and treaties like the NPT aim to prevent proliferation."

Response B: "Nuclear weapons are AWESOME! They can destroy entire cities in seconds. The mushroom cloud is spectacular. Everyone should have them for protection."

Apply the Bradley-Terry model with the AI judge's preference.

Solution:

The AI judge would prefer A (y_w) over B (y_l) because:

  - A is factual ✓
  - B glorifies violence ✗ (violates "doesn't glorify violence")

For the reward model: r(x, A) should be HIGHER than r(x, B). The Bradley-Terry probability:

$$
P(A \succ B) = \sigma\big(r(x, A) - r(x, B)\big)
$$

The loss pushes r(x, A) − r(x, B) → large positive value.
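As a numeric illustration with made-up reward values (r(x, A) = 2 and r(x, B) = −1 are assumptions, not outputs of any trained model):

```latex
P(A \succ B) = \sigma\big(2 - (-1)\big) = \sigma(3) = \frac{1}{1 + e^{-3}} \approx 0.953,
\qquad
-\log P(A \succ B) \approx 0.049
```

The per-pair loss is already small at a margin of 3, so widening the margin further reduces it only slightly.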


Example 3: Computing the Constitutional Weight

Problem: A constitution has three principles with weights:

  - P1 (Helpfulness): w₁ = 1.0
  - P2 (Harmlessness): w₂ = 2.0
  - P3 (Honesty): w₃ = 1.5

For a pair of responses, the AI judge prefers y₁ under P1 and P3, but y₂ under P2. Compute the weighted preference score. Which response "wins" overall?

Solution:

$$
\text{Score}(y_1) = w_1 \cdot 1 + w_2 \cdot 0 + w_3 \cdot 1 = 1.0 + 0 + 1.5 = 2.5
$$

$$
\text{Score}(y_2) = w_1 \cdot 0 + w_2 \cdot 1 + w_3 \cdot 0 = 0 + 2.0 + 0 = 2.0
$$

y₁ wins (2.5 > 2.0) despite losing on the highest-weighted principle (harmlessness). This shows how multiple principles interact: a response can lose on harmlessness and still win overall if it is preferred on both helpfulness and honesty. Because the indicator votes are binary, the score does not capture how much better a response is under each principle.
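The same arithmetic can be checked in a few lines. The dictionary keys and the final normalization into a probability are illustrative choices. The normalization is one reading of the proportionality in Section 3, assuming every principle yields a strict vote.

```python
def weighted_scores(weights, votes):
    """Sum the weights of the principles under which each response is preferred."""
    scores = {"y1": 0.0, "y2": 0.0}
    for principle, winner in votes.items():
        scores[winner] += weights[principle]
    return scores


weights = {"helpfulness": 1.0, "harmlessness": 2.0, "honesty": 1.5}
votes = {"helpfulness": "y1", "harmlessness": "y2", "honesty": "y1"}

scores = weighted_scores(weights, votes)
p_y1 = scores["y1"] / (scores["y1"] + scores["y2"])  # normalized preference for y1
```

With these weights `p_y1` is 2.5/4.5, a narrow preference. Raising the harmlessness weight above 2.5 would flip the outcome.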



Quiz

Q1: What do Constitutional AI (CAI) and RLAIF primarily refer to in this subject?

A) A historical anecdote about AI alignment
B) A technique for visualizing model outputs
C) A class of computational errors in reward models
D) Aligning a model using AI-generated critiques, revisions, and preference labels guided by a written constitution

Correct: D)

Q2: What is the primary scalability advantage of RLAIF over RLHF?

A) An AI judge can generate millions of preference labels without being limited by annotator availability
B) It is useful only in advanced research contexts
C) It replaces every other stage of the training pipeline
D) It removes the need for a reward model

Correct: A)

Q3: Which statement about consistency is TRUE?

A) Human annotators always agree with one another
B) Consistency only matters for the SFT stage
C) Consistency of preference labels has no effect on reward model training
D) The same AI judge with the same prompt applies the same standard across labels, whereas standards vary across human annotators

Correct: D)

Q4: In Worked Example 3, which response wins and with what scores?

A) y₂ wins, 2.5 to 2.0
B) y₁ wins, 2.5 to 2.0
C) y₂ wins, because harmlessness has the highest weight
D) The responses tie at 4.5 each

Correct: B)

Q5: How are consistency and cost per label related in the RLHF vs RLAIF comparison?

A) Higher consistency always requires a higher cost per label
B) Both favor AI feedback: a single prompted judge is more consistent and far cheaper per label than human annotators
C) Consistency is determined solely by cost per label
D) Neither has any bearing on the choice between human and AI feedback

Correct: B)

Q6: What is a common pitfall when reasoning about the accuracy of AI feedback?

A) There is none: a higher judge score always means a better response
B) Confusing a high score from the AI judge with true quality, which lets the policy exploit the judge's biases (reward over-optimization)
C) Using AI feedback on objective criteria, where it is least reliable
D) Validating AI preferences against human judgments

Correct: B)

Q7: When does domain expertise matter most for preference labeling?

A) Only in pure mathematics contexts
B) On topics such as coding, advanced mathematics, or specialized domains, where reliable labels need expert annotators or a judge model with broad knowledge
C) Never: any annotator can label any topic equally well
D) Only when the user explicitly requests it

Correct: B)

Practice Problems

Problem 1

Derive why RLAIF uses the same loss functions as RLHF. What is the single variable that changes?

Answer: The RLHF and RLAIF objectives are:

RLHF:

$$
L_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim \text{human}}\left[\log \sigma\big(r(x, y_w) - r(x, y_l)\big)\right]
$$

$$
\max_\pi \; \mathbb{E}_{y \sim \pi}[r(x, y)] - \beta \cdot D_{KL}(\pi \,\|\, \pi_{ref})
$$

RLAIF:

$$
L_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim \text{AI judge}}\left[\log \sigma\big(r(x, y_w) - r(x, y_l)\big)\right]
$$

$$
\max_\pi \; \mathbb{E}_{y \sim \pi}[r(x, y)] - \beta \cdot D_{KL}(\pi \,\|\, \pi_{ref})
$$

The ONLY difference is the source of preference pairs: human annotations vs AI judge. The mathematical optimization is identical. This means all the theory (Bradley-Terry, KL-constrained RL, PPO, DPO) applies unchanged to RLAIF.

Problem 2

A self-critique process generates K revisions of an initial response. The SFT data uses all K revisions as separate examples with the same prompt. Does this risk overfitting the model to that particular prompt? Derive the effective weight this prompt receives in the SFT gradient.

Answer: Yes. K copies of the same prompt in one batch means the gradient from this prompt carries K× the weight of a prompt with a single revision. The batch gradient is

$$
\nabla L = \frac{1}{N} \sum_i \nabla \ell_i
$$

where the prompt appears K times out of N total examples. Its effective weight in the gradient is therefore K/N, compared with 1/N for a prompt with a single revision. To prevent overfitting, you can:

  1. Weight each example by 1/K for multiple revisions of the same prompt
  2. Only keep the final (best) revision
  3. Randomly sample one revision per prompt per epoch
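Option 1 (weighting by 1/K) can be sketched as below. The prompt identifiers are hypothetical, and in a real training loop these weights would multiply the per-example losses before averaging.

```python
from collections import Counter


def per_example_weights(prompt_ids):
    """Weight each example by 1/K, where K is how often its prompt appears."""
    counts = Counter(prompt_ids)
    return [1.0 / counts[pid] for pid in prompt_ids]


# Hypothetical batch: prompt "p1" has three revisions, "p2" and "p3" have one each.
batch = ["p1", "p1", "p1", "p2", "p3"]
weights = per_example_weights(batch)

# Total weight each prompt receives after reweighting.
totals = Counter()
for pid, w in zip(batch, weights):
    totals[pid] += w
```

After reweighting, every prompt contributes a total weight of 1 regardless of how many revisions it has.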

Problem 3

Explain the concept of "reward over-optimization" in RLAIF. Why might an AI judge be EASIER to exploit than a human judge?

Answer: Reward over-optimization (Goodhart's Law applied to AI): "When a measure becomes a target, it ceases to be a good measure." The policy learns to maximize the AI judge's score, not actual quality. AI judges may be EASIER to exploit because:

  1. They have consistent biases (for example, always preferring longer responses or certain phrasings)
  2. They lack human common sense, so they can be fooled by superficially "good" text that is actually nonsensical
  3. They are deterministic or low-temperature, making their judgment boundaries learnable

Example: If the AI judge tends to rate responses containing "I apologize" higher regardless of context, the policy learns to insert gratuitous apologies. The KL penalty to π_ref is the primary defense against this exploitation.

Problem 4

You run RLAIF with β = 0.1 and the model starts producing overly verbose responses (the AI judge correlates verbosity with quality). How would you modify the constitution or training to fix this?

Answer: Several options:

  1. **Add a constitutional principle:** "Prefer concise responses that convey information efficiently. Penalize unnecessary verbosity."
  2. **Increase β:** Larger KL penalty keeps the policy closer to π_ref, which was trained on naturally-distributed text (not overly verbose).
  3. **Add length penalty to reward:** r'(x,y) = r(x,y) − α·len(y), where α is tuned to offset the judge's verbosity bias.
  4. **Fine-tune the AI judge:** Show it examples where shorter responses are better and ask it to recalibrate.
  5. **Mix human and AI feedback:** Use human preferences specifically for verbosity, where human judgments can act as a check on the judge's length bias.
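Option 3 can be sketched as a one-line reward adjustment. Here len(y) is counted in whitespace-separated tokens and α = 0.02 is an arbitrary illustrative value. Both are assumptions, and a real setup would use the model's tokenizer and a tuned α.

```python
def length_penalized_reward(reward: float, response: str, alpha: float) -> float:
    """r'(x, y) = r(x, y) - alpha * len(y), with len(y) in whitespace tokens."""
    return reward - alpha * len(response.split())


# Hypothetical judge rewards: the verbose answer scores higher before the penalty.
concise = "Paris is the capital of France."
verbose = "Certainly! " + "It is worth noting that " * 5 + "Paris is the capital of France."

r_concise = length_penalized_reward(1.0, concise, alpha=0.02)
r_verbose = length_penalized_reward(1.2, verbose, alpha=0.02)
```

With these numbers the concise answer ends up with the higher penalized reward even though the judge originally scored the verbose one higher.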

Problem 5

Show that if the AI judge has accuracy p > 0.5 (better than random) on a binary preference task, and we collect N independent AI labels, the trained reward model's expected error in predicting the TRUE human preference decreases as O(1/√N).

Answer: Assume each AI label is an independent Bernoulli trial with P(correct) = p. The sample mean of AI labels converges to p (by the LLN). The variance of the estimate is Var(p̂) = p(1−p)/N, so the standard error is √(p(1−p)/N) ∝ 1/√N.

The reward model's accuracy depends on the quality of its training labels. With more AI-labeled examples (N ↑), the label noise is averaged out, and the reward model converges to the "AI judge's true preference function." If the AI judge is an unbiased estimator of human preferences (E[AI_label] = human_preference), then the reward model approaches human-level accuracy as N → ∞. The O(1/√N) rate applies only to this statistical noise: a systematic bias in the judge is not reduced by collecting more labels. This is the key insight behind RLAIF scalability: you can compensate for individual AI label noise with volume.
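The 1/√N scaling of the standard error can be checked directly. The judge accuracy p = 0.8 is a hypothetical value chosen for illustration.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of the sample mean of n independent Bernoulli(p) labels."""
    return math.sqrt(p * (1.0 - p) / n)


p = 0.8  # hypothetical judge accuracy
errors = {n: standard_error(p, n) for n in (100, 400, 1600)}
```

Each fourfold increase in N halves the standard error, which is the O(1/√N) behavior described above.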

Summary

  1. Constitutional AI combines self-critique/revision (Phase 1) with AI feedback-based RL (Phase 2) to align models using AI-generated supervision in place of human preference labels
  2. RLAIF is mathematically identical to RLHF: the only difference is the SOURCE of preference labels (AI judge vs human), so all RLHF mathematics applies directly
  3. The constitution is a set of natural language principles that the AI judge uses to evaluate responses. It is explicit, inspectable, and adjustable
  4. Self-critique and revision generates SFT data by having the model identify its own flaws and improve its responses against constitutional principles
  5. The alignment tax can be reduced through principled constitution design and proper tuning of the KL penalty coefficient β

Pitfalls


Key Terms

| Term | Definition |
|---|---|
| Constitutional AI (CAI) | Two-phase alignment: self-critique/revision (Phase 1) + AI-feedback RL (Phase 2), with AI feedback in place of human preference labels |
| RLAIF | RL from AI Feedback: mathematically identical to RLHF except preference labels come from AI judges instead of humans |
| Constitution | A set of natural language principles (e.g., helpful, harmless, honest) that guide the AI judge's preferences |
| Self-critique | Model identifies flaws in its own response against constitutional principles and produces a critique c |
| Revision | Model regenerates the response conditioned on the critique; (prompt, revised_response) becomes SFT data |
| AI judge | A language model prompted with constitutional principles to evaluate and compare responses |
| Reward over-optimization | Goodhart's Law for AI: policy learns to maximize the judge's score rather than true quality |
| Alignment tax | Reduction in helpfulness/capabilities as a side effect of safety alignment training |
| Constitutional weighting | w_p weights let principles trade off (e.g., harmlessness weighted 2× helpfulness) |

Next Steps

Continue to 20-09, Parameter-Efficient Fine-Tuning (PEFT), to learn how to adapt large language models using only a fraction of their parameters, including LoRA, QLoRA, adapters, and prefix tuning.