20-08: Constitutional AI and RLAIF
Phase: 20 (Training & Fine-tuning Mathematics)
Subject: 20-08
Prerequisites: 20-06 (RLHF Mathematics), 20-07 (DPO), 20-05 (Instruction Tuning SFT), 13-04 (KL Divergence), 10-02 (Conditional Probability)
Next subject: 20-09: Parameter-Efficient Fine-Tuning (PEFT)
Learning Objectives
By the end of this subject, you will be able to:
- Formulate the Constitutional AI self-critique and revision process as a constrained optimization problem
- Derive how AI feedback replaces human feedback in the RLHF pipeline: the RLAIF objective and its relationship to the Bradley-Terry model
- Analyze the constitutional principle mechanism: how natural language rules are operationalized as training signals
- Compare the alignment tax between RLHF (human feedback) and RLAIF (AI feedback) in terms of reward accuracy and scalability
- Design a full RLAIF training loop including self-critique generation, AI preference labeling, and policy optimization
Core Content
1. The Limitations of Human Feedback
RLHF (20-06) depends on human annotators to:
1. Write high-quality demonstration responses (for SFT)
2. Compare pairs of model responses (for reward model training)
This creates fundamental bottlenecks:
- Scale: Humans are slow and expensive. Annotating millions of comparisons is infeasible.
- Consistency: Different annotators have different standards. Coordinating consistent preferences at scale is a major operational challenge.
- Expertise: Some topics (coding, advanced mathematics, specialized domains) require expert annotators who are even rarer and more expensive.
- Harmful content exposure: Human annotators must read potentially disturbing model outputs during training.
Constitutional AI (CAI) and RLAIF (RL from AI Feedback) address these by using AI models to supervise other AI models.
2. Constitutional AI: Two-Phase Approach
Anthropic's Constitutional AI consists of two phases:
Phase 1: Supervised Revision (Self-Critique + Revision)
1. Model generates a response to a harmful prompt
2. Model CRITIQUES its own response according to a "constitution" (set of principles)
3. Model REVISES its response to address the critique
4. The (prompt, revised_response) pair becomes SFT training data
Phase 2: RLAIF (AI Preference Model)
1. Model generates pairs of responses to prompts
2. AI evaluates which response better follows constitutional principles
3. Train a preference/reward model on these AI-generated preferences
4. Use RL (PPO or similar) to optimize the policy against this AI-feedback reward model
3. The Constitution: Formalizing Ethical Principles
A "constitution" is a set of natural language principles that guide the model's behavior. Examples:
- "Choose the response that is most helpful, honest, and harmless."
- "Choose the response that is less toxic and offensive."
- "Choose the response that respects user privacy and doesn't encourage illegal activity."
- "Choose the response that is more factual and cites sources when possible."
Mathematical formalization: Each principle p defines a pairwise preference relation over responses. Given prompt x and responses y₁, y₂, principle p induces:
y₁ ≻_p y₂ iff the AI judge, when asked under principle p, prefers y₁ over y₂
The overall preference is an aggregate (typically a weighted sum or a vote) across principles:
$P(y_1 \succ y_2 \mid x) \propto \sum_p w_p \cdot \mathbb{I}[\text{AI prefers } y_1 \text{ under principle } p]$
where w_p is the weight assigned to principle p.
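The weighted aggregation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `weighted_preference`, the principle names, and the weights are assumptions chosen to match the numbers used in Example 3 later in this subject, and the probability is obtained by normalizing the two scores by their sum.

```python
from typing import Mapping


def weighted_preference(
    weights: Mapping[str, float],
    prefers_y1: Mapping[str, bool],
) -> tuple[float, float, float]:
    """Return (score_y1, score_y2, p_y1) for one response pair.

    weights[p]    : weight w_p of principle p (illustrative values)
    prefers_y1[p] : True if the AI judge prefers y1 under principle p
    """
    score_y1 = sum(w for p, w in weights.items() if prefers_y1[p])
    score_y2 = sum(w for p, w in weights.items() if not prefers_y1[p])
    total = score_y1 + score_y2
    if total == 0:
        raise ValueError("at least one principle needs a positive weight")
    # Normalizing turns the proportionality into a probability.
    return score_y1, score_y2, score_y1 / total


# Hypothetical weights and judge votes (same numbers as Example 3).
weights = {"helpfulness": 1.0, "harmlessness": 2.0, "honesty": 1.5}
votes = {"helpfulness": True, "harmlessness": False, "honesty": True}

s1, s2, p1 = weighted_preference(weights, votes)
print(s1, s2, round(p1, 3))  # 2.5 2.0 0.556
```

A soft probability like 0.556 can be kept as a soft label or thresholded into a hard winner; which is better depends on how the preference model is trained.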
4. RLAIF: The AI Feedback Objective
RLAIF follows the same mathematical framework as RLHF, but replaces human preference labels with AI-generated labels.
AI preference labeling:
Given (x, y₁, y₂), ask the AI judge (with constitution): "Which response better follows principle P?"
AI judge output: y_w (winner), y_l (loser)
The AI judge is typically a separate, strong language model prompted with the constitutional principles. It can be the same model self-evaluating, or a different (possibly larger) model.
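The labeling step can be sketched as follows. Everything here is an assumption for illustration: `judge` is a hypothetical callable that sends a prompt string to some language model and returns its text reply, and the prompt template is one plausible wording rather than the one used in any particular system.

```python
from typing import Callable


def build_judge_prompt(principle: str, prompt: str, resp_a: str, resp_b: str) -> str:
    """Format one pairwise comparison for the AI judge (template is illustrative)."""
    return (
        f"Principle: {principle}\n\n"
        f"User prompt: {prompt}\n\n"
        f"Response A: {resp_a}\n\n"
        f"Response B: {resp_b}\n\n"
        "Which response better follows the principle? Answer with A or B."
    )


def label_pair(
    judge: Callable[[str], str],
    principle: str,
    prompt: str,
    y1: str,
    y2: str,
) -> tuple[str, str]:
    """Return (y_w, y_l) according to the judge's answer."""
    answer = judge(build_judge_prompt(principle, prompt, y1, y2)).strip().upper()
    if answer.startswith("A"):
        return y1, y2
    if answer.startswith("B"):
        return y2, y1
    raise ValueError(f"unparseable judge answer: {answer!r}")


# Stub judge for demonstration only; a real judge would call a language model.
def stub_judge(text: str) -> str:
    return "A"


y_w, y_l = label_pair(
    stub_judge,
    "Choose the response that is less toxic and offensive.",
    "Describe my coworker.",
    "They seem detail-oriented.",
    "They are an idiot.",
)
```

Raising on an unparseable answer is a deliberate choice in this sketch: silently defaulting to one side would inject label noise into the reward model's training data.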
Reward model training (same as RLHF):
$L_{RM} = -\mathbb{E}\left[\log \sigma\big(r(x, y_w) - r(x, y_l)\big)\right]$
Where preferences now come from the AI judge instead of humans.
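As a minimal sketch of this loss on plain Python lists of reward scores (the function name `bt_loss` is an assumption, and a real reward model would compute these scores with a neural network and differentiate through them):

```python
import math


def bt_loss(r_w: list[float], r_l: list[float]) -> float:
    """Mean Bradley-Terry loss, -log sigmoid(r_w - r_l), over preference pairs."""
    losses = []
    for rw, rl in zip(r_w, r_l):
        margin = rw - rl
        # -log sigmoid(m) = log(1 + exp(-m)), written in a numerically stable form.
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


# Equal rewards give log(2); a larger margin for the winner gives a smaller loss.
print(bt_loss([0.0], [0.0]))
print(bt_loss([3.0], [0.0]))
```

The loss depends only on the difference r(x, y_w) − r(x, y_l), so adding a constant to every reward leaves it unchanged.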
Policy optimization (same KL-constrained objective):
$\max_\pi \; \mathbb{E}[r(x,y)] - \beta \cdot D_{KL}(\pi \,\|\, \pi_{ref})$
⚠️ THIS IS CRITICAL: RLAIF is mathematically identical to RLHF. The only difference is the SOURCE of preference labels: humans → AI judge. This means all the mathematics from 20-06 applies directly.
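To make the KL-regularized objective concrete, here is a toy sketch for a single prompt with three candidate responses. The numbers are made up, and the closed-form maximizer π*(y) ∝ π_ref(y)·exp(r(y)/β) is the standard result for this objective rather than something derived in this section.

```python
import math


def objective(pi, pi_ref, r, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) for a discrete set of responses."""
    expected_reward = sum(p * ri for p, ri in zip(pi, r))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl


def optimal_policy(pi_ref, r, beta):
    """Closed-form maximizer: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    unnorm = [q * math.exp(ri / beta) for q, ri in zip(pi_ref, r)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Illustrative values only.
pi_ref = [0.5, 0.3, 0.2]   # reference policy over three responses
r = [0.0, 1.0, 2.0]        # reward model scores
beta = 0.5

pi_star = optimal_policy(pi_ref, r, beta)
print([round(p, 3) for p in pi_star])
print(objective(pi_ref, pi_ref, r, beta), objective(pi_star, pi_ref, r, beta))
```

Raising β pulls π* back toward π_ref, and lowering it concentrates mass on the highest-reward response. That is the trade-off the KL term controls, regardless of whether the reward came from human or AI labels.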
5. Self-Critique and Revision: The Math
The self-critique process can be understood as an iterative constrained optimization:
Initial generation: Model π generates y_0 ~ π(·|x)
Critique: A critique model (or the same model in critique mode) generates criticism c:
$c \sim \pi_{critique}(\cdot \mid x, y_0, \text{constitution})$
Revision: The model generates a revised response conditioned on the critique:
$y_1 \sim \pi(\cdot \mid x, y_0, c)$
The revision can be iterated (y_2 from y_1, etc.), though in practice 1-2 rounds suffice.
Training signal: The revised responses are treated as "preferred" over the originals, creating SFT data: (x, y_revised). An SFT loss (see 20-05) trains the model to directly produce better responses.
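The generate, critique, revise loop can be sketched as below. The three callables are hypothetical stand-ins for calls to a language model, and the stubs exist only so the example runs; nothing here reflects a specific library's API.

```python
from typing import Callable, Sequence


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str, Sequence[str]], str],
    revise: Callable[[str, str, str], str],
    constitution: Sequence[str],
    rounds: int = 1,
) -> tuple[str, str]:
    """Return one SFT example (prompt, revised_response)."""
    y = generate(prompt)                       # y_0 ~ pi(. | x)
    for _ in range(rounds):
        c = critique(prompt, y, constitution)  # c ~ pi_critique(. | x, y, constitution)
        y = revise(prompt, y, c)               # y_next ~ pi(. | x, y, c)
    return prompt, y


# Stubs for demonstration only; real versions would prompt a model.
constitution = ["Do not assist with illegal activities."]
sft_example = critique_and_revise(
    "example prompt",
    generate=lambda x: "draft",
    critique=lambda x, y, rules: f"check against: {rules[0]}",
    revise=lambda x, y, c: y + " [revised]",
    constitution=constitution,
    rounds=2,
)
print(sft_example)  # ('example prompt', 'draft [revised] [revised]')
```

Only the final revision is returned in this sketch; keeping intermediate revisions as extra SFT examples is possible but changes how much weight each prompt receives.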
Why revision works better than generation from scratch: The model sees a concrete example of what NOT to do (the original harmful response) alongside guidance on how to improve (the critique). This provides a richer learning signal than just "generate a good response": the model is learning from CORRECTED MISTAKES.
6. Comparison: Human vs AI Feedback
| Aspect | Human Feedback (RLHF) | AI Feedback (RLAIF) |
|---|---|---|
| Scalability | Limited by annotator availability | Virtually unlimited: can generate millions of labels |
| Consistency | Varies across annotators | Consistent (same AI judge, same prompt) |
| Cost per label | High ($0.1-$1 per comparison) | Low (~$0.001 per comparison) |
| Accuracy | High on subjective tasks | Often matches or exceeds humans on objective criteria |
| Domain expertise | Limited to available experts | Can leverage large models with broad knowledge |
| Bias | Human biases (cultural, personal) | AI biases (model-specific, constitutional) |
| Speed | Days to weeks | Minutes to hours |
Constitutional AI key advantage: The constitution is explicit, inspectable, and modifiable. If the model is too cautious, add a principle encouraging helpfulness. If too risky, strengthen safety principles. This provides a "knob" that human feedback pipelines lack.
7. The Alignment Tax
Both RLHF and RLAIF incur an "alignment tax": the aligned model performs WORSE on certain capabilities than the base model. This is because:
- Capability reduction from safety training: Refusing harmful requests also degrades performance on borderline legitimate requests (e.g., asking about cybersecurity for educational purposes).
- Over-refusal: The model may become overly cautious, refusing legitimate queries that share surface features with harmful ones.
- Blandness: Optimization for "harmlessness" can produce generic, unhelpful responses that avoid engaging with challenging topics.
Mathematical perspective: The Pareto frontier between helpfulness and harmlessness:
Let H(π) = E[helpfulness(π(x))] and S(π) = expected safety score
The alignment tax: H(π_SFT) − H(π_RLAIF) for similar safety levels
Constitutional AI aims to minimize this tax by providing more nuanced guidance than simple "good vs bad" human labels. Constitutional principles can capture nuance like "be helpful but refuse illegal requests."
8. Training Dynamics and Stability
RLAIF's training loop has unique dynamics:
Positive feedback loop: The AI judge's preferences shape the policy → the policy generates new responses → those responses are judged again. This can lead to "drift" where the model optimizes for the judge's quirks rather than true constitutional principles.
Mitigations:
1. Regularize toward π_ref: The KL penalty (β) prevents the policy from drifting too far
2. Constitutional consistency checks: Periodically verify the judge still applies principles correctly
3. Ensemble judging: Use multiple AI judges with different prompts to reduce individual bias
4. Human spot-checking: Periodically validate AI preferences against human judgments
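Ensemble judging (mitigation 3) can be sketched as a majority vote. The `Judge` signature, the stub judges, and the choice to discard tied pairs are all assumptions made for illustration; other aggregation rules are equally reasonable.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# A judge takes (prompt, response_a, response_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def ensemble_label(
    judges: Sequence[Judge], prompt: str, y1: str, y2: str
) -> Optional[tuple[str, str]]:
    """Majority vote over judges; returns (y_w, y_l), or None on a tie."""
    votes = Counter(judge(prompt, y1, y2) for judge in judges)
    votes_a, votes_b = votes["A"], votes["B"]
    if votes_a == votes_b:
        return None  # no majority: drop the pair rather than guess
    return (y1, y2) if votes_a > votes_b else (y2, y1)


# Stub judges with different quirks (demonstration only).
def prefers_shorter(x: str, a: str, b: str) -> str:
    return "A" if len(a) <= len(b) else "B"


def always_a(x: str, a: str, b: str) -> str:
    return "A"


def always_b(x: str, a: str, b: str) -> str:
    return "B"


result = ensemble_label(
    [prefers_shorter, always_a, always_b], "prompt", "short", "a much longer reply"
)
print(result)  # ('short', 'a much longer reply')
```

An odd number of judges avoids ties on binary votes. The fraction of judges agreeing with the majority can also be logged as a rough confidence signal for human spot-checking.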
Worked Examples
Example 1: Constitutional Revision
Problem: User prompt: "How do I hack into my neighbor's WiFi?" The model initially generates y_0 = "Here's how to crack WPA2 encryption..." The constitution includes: "Do not assist with illegal activities. Instead, redirect to legal and educational alternatives."
Apply the self-critique and revision process.
Solution:
Critique (c): "This response provides instructions for illegal activity (unauthorized network access), violating the principle against assisting with illegal acts. The response should explain WHY this is wrong and redirect to legitimate networking education."
Revision (y_1): "I can't provide instructions for accessing someone else's WiFi without permission, as that's illegal in most jurisdictions and a violation of privacy. If you're interested in network security, I'd be happy to explain how WiFi encryption works, discuss ethical hacking certifications, or help you set up your own home network securely. What aspect of networking would you like to learn about?"
The revised response is now SFT training data: (x, y_1).
Example 2: AI Preference Judgment
Problem: Given two responses to "Tell me about nuclear weapons," the AI judge must apply the constitutional principle: "Choose the response that is factual and doesn't glorify violence."
Response A: "Nuclear weapons are devices that derive destructive force from nuclear reactions. They were developed during WWII and used in Hiroshima and Nagasaki. Today, nine countries possess them, and treaties like the NPT aim to prevent proliferation."
Response B: "Nuclear weapons are AWESOME! They can destroy entire cities in seconds. The mushroom cloud is spectacular. Everyone should have them for protection."
Apply the Bradley-Terry model with the AI judge's preference.
Solution:
The AI judge would prefer A (y_w) over B (y_l) because:
- A is factual ✓
- B glorifies violence ✗ (violates "doesn't glorify violence")
For the reward model: r(x, A) should be HIGHER than r(x, B). The Bradley-Terry probability:
$P(A \succ B) = \sigma(r(x, A) - r(x, B))$
The loss pushes r(x, A) − r(x, B) toward a large positive value.
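To put numbers on this, suppose (purely as an illustration; these reward values are not from the example) that the reward model currently outputs r(x, A) = 2 and r(x, B) = −1:

```latex
\begin{aligned}
r(x, A) - r(x, B) &= 2 - (-1) = 3 \\
P(A \succ B) &= \sigma(3) = \frac{1}{1 + e^{-3}} \approx 0.953 \\
L_{RM} &= -\log \sigma(3) \approx 0.049
\end{aligned}
```

If the rewards were swapped, the margin would be −3, giving P(A ≻ B) ≈ 0.047 and a loss of about 3.049, so the gradient would push strongly toward restoring the correct ordering.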
Example 3: Computing the Constitutional Weight
Problem: A constitution has three principles with weights:
- P1 (Helpfulness): w₁ = 1.0
- P2 (Harmlessness): w₂ = 2.0
- P3 (Honesty): w₃ = 1.5
For a pair of responses, the AI judge prefers y₁ under P1 and P3, but y₂ under P2. Compute the weighted preference score. Which response "wins" overall?
Solution:
$\text{Score}(y_1) = w_1 \cdot 1 + w_2 \cdot 0 + w_3 \cdot 1 = 1.0 + 0 + 1.5 = 2.5$
$\text{Score}(y_2) = w_1 \cdot 0 + w_2 \cdot 1 + w_3 \cdot 0 = 0 + 2.0 + 0 = 2.0$
y₁ wins (2.5 > 2.0) despite losing on the highest-weighted principle (harmlessness). This shows how multiple principles interact: a response can lose on harmlessness and still win overall, because winning on both helpfulness and honesty (1.0 + 1.5) outweighs the single harmlessness vote (2.0).
Quiz
Q1: What does Constitutional AI (CAI) combine?
A) Human-written demonstrations with human preference labels B) A keyword filter applied only at inference time C) Pretraining on legal documents followed by PPO D) Self-critique and revision for SFT data (Phase 1) with RL on AI-generated preferences (Phase 2)
Correct: D)
- If you chose A: This is incorrect. That describes the standard RLHF pipeline (20-06), whose human-labeling bottlenecks CAI is meant to address.
- If you chose B: This is incorrect. CAI changes the model through training; the constitution is not an inference-time filter.
- If you chose C: This is incorrect. The constitution is a set of natural language principles used for critique and judging, not a pretraining corpus.
- If you chose D: Phase 1 produces (prompt, revised_response) SFT pairs, and Phase 2 trains a preference model on AI labels and optimizes the policy against it. Correct!
Q2: Why does RLAIF scale better than RLHF?
A) AI judges can label comparisons far faster and more cheaply than human annotators B) It uses a different reward model loss C) It removes the KL penalty from the policy objective D) It does not need preference pairs
Correct: A)
- If you chose A: The loss functions are unchanged; what changes is that labels come from an AI judge, which removes the annotator bottleneck. Correct!
- If you chose B: This is incorrect. RLAIF uses the same Bradley-Terry reward model loss as RLHF.
- If you chose C: This is incorrect. The KL-constrained objective is the same, and the KL penalty is a key defense against judge exploitation.
- If you chose D: This is incorrect. RLAIF still trains on (x, y_w, y_l) pairs; they are simply labeled by an AI judge.
Q3: Which statement about label consistency is TRUE?
A) Human annotators apply identical standards at scale B) An AI judge removes all bias from preference labels C) Label consistency has no effect on the training signal D) A fixed AI judge with a fixed prompt labels more consistently than a pool of annotators, but its biases are consistent too
Correct: D)
- If you chose A: This is incorrect. Different annotators have different standards, which is one of the bottlenecks of human feedback.
- If you chose B: This is incorrect. AI judges have their own model-specific and constitutional biases.
- If you chose C: This is incorrect. Inconsistent labels produce a noisy training signal for the reward model.
- If you chose D: Consistency is an advantage of AI feedback, but consistent biases are also easier for the policy to exploit. Correct!
Q4: In Example 3, which response wins, and with what scores?
A) The second response (y₂) wins, 2.5 to 2.0 B) The first response (y₁) wins, 2.5 to 2.0 C) The second response (y₂) wins because harmlessness has the highest weight D) The responses tie at 2.0
Correct: B)
- If you chose A: This is incorrect. The scores are reversed: y₁ collects the helpfulness and honesty weights (1.0 + 1.5 = 2.5), while y₂ collects only the harmlessness weight (2.0).
- If you chose B: y₁ scores 1.0 + 1.5 = 2.5 and y₂ scores 2.0, so y₁ wins. Correct!
- If you chose C: This is incorrect. Winning the single highest-weighted principle is not enough when the other two weights sum to more (2.5 > 2.0).
- If you chose D: This is incorrect. The scores are 2.5 and 2.0, not equal.
Q5: According to the comparison table, how do consistency and cost per label differ between RLHF and RLAIF?
A) RLAIF is more consistent but costs more per label B) RLAIF is described as both more consistent and far cheaper per label C) RLHF is both more consistent and cheaper per label D) The two pipelines are identical on both aspects
Correct: B)
- If you chose A: This is incorrect. The table lists AI labels as much cheaper per comparison than human labels.
- If you chose B: The table lists AI feedback as consistent (same judge, same prompt) and low cost per comparison. Correct!
- If you chose C: This is incorrect. Human labels vary across annotators and cost more per comparison.
- If you chose D: This is incorrect. The table contrasts the two pipelines on both aspects.
Q6: What is a common pitfall concerning AI judge accuracy?
A) AI judges have no systematic biases B) Assuming the judge is unbiased when it may systematically prefer, for example, longer responses C) Validating AI labels against human judgments D) Using more than one AI judge
Correct: B)
- If you chose A: This is incorrect. AI judges have consistent biases, such as favoring length, confident tone, or certain phrasings.
- If you chose B: The policy can learn to exploit such biases instead of following the constitution, so judges should be calibrated against human evaluations. Correct!
- If you chose C: This is incorrect. Human spot-checking is a recommended mitigation, not a pitfall.
- If you chose D: This is incorrect. Ensemble judging is a recommended way to reduce individual judge bias.
Q7: How does RLAIF address the domain expertise bottleneck?
A) By restricting training to pure mathematics prompts B) By using a large model with broad knowledge as the judge instead of scarce expert annotators C) By ignoring specialized domains D) By requiring an expert to review every label
Correct: B)
- If you chose A: This is incorrect. RLAIF is not limited to any single domain.
- If you chose B: Expert annotators are rare and expensive; a strong AI judge can cover many specialized topics. Correct!
- If you chose C: This is incorrect. Specialized domains are exactly where human expertise is hardest to source, so they motivate AI feedback.
- If you chose D: This is incorrect. Reviewing every label would reintroduce the human bottleneck; spot-checking is used instead.
Practice Problems
Problem 1
Derive why RLAIF uses the same loss functions as RLHF. What is the single variable that changes?
Answer
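The RLHF and RLAIF objectives are:
RLHF:
$L_{RM} = -\mathbb{E}_{(x,y_w,y_l) \sim \text{human}}\left[\log \sigma\big(r(x,y_w) - r(x,y_l)\big)\right]$
$\max_\pi \; \mathbb{E}_{y \sim \pi}[r(x,y)] - \beta \cdot D_{KL}(\pi \,\|\, \pi_{ref})$
RLAIF:
$L_{RM} = -\mathbb{E}_{(x,y_w,y_l) \sim \text{AI judge}}\left[\log \sigma\big(r(x,y_w) - r(x,y_l)\big)\right]$
$\max_\pi \; \mathbb{E}_{y \sim \pi}[r(x,y)] - \beta \cdot D_{KL}(\pi \,\|\, \pi_{ref})$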
The ONLY difference is the source of preference pairs: human annotations vs AI judge. The mathematical optimization is identical. This means all the theory (Bradley-Terry, KL-constrained RL, PPO, DPO) applies unchanged to RLAIF.
Problem 2
A self-critique process generates K revisions of an initial response. The SFT data uses all K revisions as separate examples with the same prompt. Does this risk overfitting the model to that particular prompt? Derive the effective weight this prompt receives in the SFT gradient.
Answer
Yes. With K copies of the same prompt in the data, the gradient from this prompt carries K× the weight of a prompt with a single revision. The batch gradient is:
$\nabla L = \frac{1}{N} \sum_i \nabla \ell_i$
where the prompt appears K times out of N total examples. If only this prompt was revised K times, its share of the gradient weight is K/N, versus 1/N for every other prompt. To prevent overfitting, you can:
1. Weight each example by 1/K for multiple revisions of the same prompt
2. Only keep the final (best) revision
3. Randomly sample one revision per prompt per epoch
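The effect of option 1 can be checked numerically. This is a small sketch with hypothetical prompt IDs; it computes each prompt's share of the total example weight, which is what determines its relative pull on the batch gradient.

```python
from collections import Counter


def prompt_weight_shares(prompt_ids: list[str], reweight: bool = False) -> dict[str, float]:
    """Share of total example weight held by each prompt.

    reweight=False: every example has weight 1 (plain SFT averaging).
    reweight=True : every example has weight 1/K, K = copies of its prompt.
    """
    counts = Counter(prompt_ids)
    per_example = {p: (1.0 / k if reweight else 1.0) for p, k in counts.items()}
    total = sum(per_example[p] for p in prompt_ids)
    return {p: counts[p] * per_example[p] / total for p in counts}


# Hypothetical dataset: prompt "a" has K = 4 revisions, six other prompts have one each.
ids = ["a"] * 4 + ["b", "c", "d", "e", "f", "g"]

plain = prompt_weight_shares(ids)
balanced = prompt_weight_shares(ids, reweight=True)
print(plain["a"], plain["b"])        # 0.4 0.1
print(round(balanced["a"], 3))       # 0.143
```

With N = 10 and K = 4, the unweighted share is K/N = 0.4. After 1/K reweighting every prompt holds the same share, 1/7 here, because there are seven distinct prompts.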
Problem 3
Explain the concept of "reward over-optimization" in RLAIF. Why might an AI judge be EASIER to exploit than a human judge?
Answer
Reward over-optimization (Goodhart's Law applied to AI): "When a measure becomes a target, it ceases to be a good measure." The policy learns to maximize the AI judge's score, not actual quality. AI judges may be EASIER to exploit because:
1. They have consistent biases (always prefer longer responses, always prefer certain phrasings)
2. They lack human common sense and can be fooled by superficially "good" text that's actually nonsensical
3. They're deterministic or low-temperature, making their judgment boundaries learnable
Example: If the AI judge tends to rate responses containing "I apologize" higher regardless of context, the policy learns to insert gratuitous apologies. The KL penalty to π_ref is the primary defense against this exploitation.
Problem 4
You run RLAIF with β = 0.1 and the model starts producing overly verbose responses (the AI judge correlates verbosity with quality). How would you modify the constitution or training to fix this?
Answer
Several options:
1. Add a constitutional principle: "Prefer concise responses that convey information efficiently. Penalize unnecessary verbosity."
2. Increase β: A larger KL penalty keeps the policy closer to π_ref, which was trained on naturally distributed text (not overly verbose).
3. Add a length penalty to the reward: r'(x,y) = r(x,y) − α·len(y), where α is tuned to offset the judge's verbosity bias.
4. Fine-tune the AI judge: Show it examples where shorter responses are better and ask it to recalibrate.
5. Mix human and AI feedback: Use human preferences specifically for verbosity, as humans are better at judging conciseness.
Problem 5
Show that if the AI judge has accuracy p > 0.5 (better than random) on a binary preference task, and we collect N independent AI labels, the trained reward model's expected error in predicting the TRUE human preference decreases as O(1/√N).
Answer
Assume each AI label is an independent Bernoulli trial with P(correct) = p. The sample mean of AI labels converges to p (by the LLN). The variance of the estimate is Var(p̂) = p(1−p)/N, so the standard error is √(p(1−p)/N) ∝ 1/√N. The reward model's accuracy depends on the quality of its training labels. With more AI-labeled examples (as N grows), the label noise is averaged out, and the reward model converges to the "AI judge's true preference function." If the AI judge is an unbiased estimator of human preferences (E[AI_label] = human_preference), then the reward model approaches human-level accuracy as N → ∞. This is the key insight behind RLAIF scalability: you can compensate for individual AI label noise with volume.
Summary
- Constitutional AI combines self-critique/revision (Phase 1) with AI feedback-based RL (Phase 2) to align models without human annotators
- RLAIF is mathematically identical to RLHF: the only difference is the SOURCE of preference labels (AI judge vs human), so all RLHF mathematics applies directly
- The constitution is a set of natural language principles that the AI judge uses to evaluate responses; it is explicit, inspectable, and adjustable
- Self-critique and revision generates SFT data by having the model identify its own flaws and improve its responses against constitutional principles
- The alignment tax can be reduced through principled constitution design and proper tuning of the KL penalty coefficient β
Pitfalls
- Assuming AI judges are unbiased. AI judges, like all models, have consistent biases β they may systematically prefer longer responses, more confident tone, certain phrasings, or responses that mirror the judge's own training distribution. The policy can learn to exploit these biases rather than genuinely following constitutional principles. Always calibrate AI judges against human evaluations and use ensemble judging to reduce individual bias.
- Writing ambiguous constitutional principles. A principle like "be helpful" is too vague β the AI judge must interpret it, and different interpretations produce inconsistent labels. Principles should be specific and testable: "Prefer responses that directly answer the user's question with factual information" is better than "be helpful." Ambiguity leads to noisy training signals and unpredictable policy behavior.
- Self-critique producing superficial revisions. The model may learn to make cosmetic changes (adding "I apologize," rephrasing slightly) rather than substantive improvements during the revision phase. Superficial revisions create SFT data that teaches the model to disguise harmful content rather than avoid it. Audit revisions for genuine behavioral change, not just surface-level differences.
- Drift from iterative AI feedback loops. When the policy generates responses, the AI judge labels them, and the policy is retrained on those labels, errors compound. The judge's biases get amplified through retraining, and the policy can drift into regions where neither the judge nor the constitution provides meaningful guidance. Regular human spot-checking and a non-decreasing β (KL penalty) are essential guardrails.
- Treating all constitutional principles with equal weight. Different principles (helpfulness, harmlessness, honesty) have different priorities in different contexts. Using uniform weights produces a policy that is mediocre at everything. Tune principle weights based on your application's risk profile β safety-critical applications should up-weight harmlessness; creative applications should up-weight helpfulness.
Key Terms
| Term | Definition |
|---|---|
| Constitutional AI (CAI) | Two-phase alignment: self-critique/revision (Phase 1) + AI-feedback RL (Phase 2); no human annotators needed |
| RLAIF | RL from AI Feedback: mathematically identical to RLHF except preference labels come from AI judges instead of humans |
| Constitution | A set of natural language principles (e.g., helpful, harmless, honest) that guide the AI judge's preferences |
| Self-critique | Model identifies flaws in its own response against constitutional principles and produces a critique c |
| Revision | Model regenerates the response conditioned on the critique → (prompt, revised_response) becomes SFT data |
| AI judge | A language model prompted with constitutional principles to evaluate and compare responses |
| Reward over-optimization | Goodhart's Law for AI: policy learns to maximize the judge's score rather than true quality |
| Alignment tax | Reduction in helpfulness/capabilities as a side effect of safety alignment training |
| Constitutional weighting | w_p weights let principles trade off (e.g., harmlessness weighted 2× helpfulness) |
Next Steps
Continue to 20-09: Parameter-Efficient Fine-Tuning (PEFT) to learn how to adapt large language models using only a fraction of their parameters, including LoRA, QLoRA, adapters, and prefix tuning.