This document provides a technical walkthrough of the examples-agent-patterns-llm-as-a-judge.py script. The script demonstrates a powerful agentic pattern known as "LLM as a Judge," where one language model agent generates content and a second agent evaluates and critiques it, creating an iterative refinement loop.
The core concept is to improve the quality of a generated output by using another LLM as a "judge" to provide feedback. This automates a cycle of creation and critique that would otherwise require human intervention.
In this specific example:
1. A Generator Agent creates a short story outline based on a user's prompt.
2. An Evaluator Agent (the "Judge") assesses the outline and provides a score and constructive feedback.
3. The process loops, feeding the feedback back to the Generator Agent, until the Evaluator Agent is satisfied with the outline.
The script is built around two main agents:
The Generator (`story_outline_generator`): This agent produces a very short story outline from the user's input and revises it whenever feedback is provided.

```python
story_outline_generator = Agent(
    name="story_outline_generator",
    instructions=(
        "You generate a very short story outline based on the user's input. "
        "If there is any feedback provided, use it to improve the outline."
    ),
)
```

The Evaluator (`evaluator`): This agent acts as the judge. Its output is constrained to a dataclass called `EvaluationFeedback`, ensuring a consistent and predictable output format.

```python
@dataclass
class EvaluationFeedback:
    feedback: str
    score: Literal["pass", "needs_improvement", "fail"]


evaluator = Agent[None](
    name="evaluator",
    instructions=(
        "You evaluate a story outline and decide if it's good enough. "
        "If it's not good enough, you provide feedback on what needs to be improved. "
        "Never give it a pass on the first try. After 5 attempts, you can give it a pass if the story outline is good enough - do not go for perfection"
    ),
    output_type=EvaluationFeedback,
)
```

The script's logic is orchestrated within the `async def main()` function.
Initialization: The user is prompted for a story idea. This initial prompt becomes the first message in the conversation history (input_items).
The Refinement Loop: The script enters a while True loop that represents the core of the pattern.
- The `story_outline_generator` is run with the current `input_items` (which includes the user prompt and any previous feedback).
- The `evaluator` agent is run with the same conversation history, so it has the context of the generated outline.
- The script extracts the `score` from the evaluator's feedback.
- If the `score` is "pass", the loop breaks, and the process is complete.
- If the `score` is anything else, the feedback from the evaluator is formatted and appended to the `input_items` list as a new user message. The `story_outline_generator` then runs again, but this time it has the judge's feedback to guide its next attempt.

Termination: The loop ends when the evaluator is satisfied. The script then prints the final, approved story outline. A sketch of the full flow is shown below.
Install Dependencies: Make sure you have the required packages installed from requirements.txt.
```bash
pip install -r requirements.txt
```
Set Up Environment: The script uses python-dotenv to load environment variables. Ensure you have a .env file in the directory with your API keys (e.g., OPENAI_API_KEY).
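A minimal `.env` for this example might contain just the API key (the value below is a placeholder):

```bash
# .env — loaded by python-dotenv at startup; replace with your real key
OPENAI_API_KEY=sk-your-key-here
```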
Execute: Run the script from your terminal.
```bash
python examples-agent-patterns-llm-as-a-judge.py
```
The following output demonstrates the script in action:
```
What kind of story would you like to hear? Story outline generated
Evaluator score: needs_improvement
Re-running with feedback
Story outline generated
Evaluator score: needs_improvement
Re-running with feedback
Story outline generated
Evaluator score: needs_improvement
Re-running with feedback
Story outline generated
Evaluator score: pass
Story outline is good enough, exiting.
Final story outline: Title: Roots in the Redwoods
...
```

This output clearly shows the iterative loop:
- An outline is generated.
- The evaluator scores it as "needs_improvement".
- The script announces it is "Re-running with feedback."
- This cycle repeats three times, with the generator refining its output based on the judge's feedback (which is not printed to the console).
- Finally, the evaluator gives a "pass", the loop terminates, and the final high-quality outline is presented.