Technical Explanation: LLM as a Judge Pattern

This document provides a technical walkthrough of the examples-agent-patterns-llm-as-a-judge.py script. The script demonstrates a powerful agentic pattern known as "LLM as a Judge," where one language model agent generates content and a second agent evaluates and critiques it, creating an iterative refinement loop.

Overview

The core concept is to improve the quality of a generated output by using another LLM as a "judge" to provide feedback. This automates a cycle of creation and critique that would otherwise require human intervention.

In this specific example:

  1. A Generator Agent creates a short story outline based on a user's prompt.

  2. An Evaluator Agent (the "Judge") assesses the outline and provides a score and constructive feedback.

  3. The process loops, feeding the feedback back to the Generator Agent, until the Evaluator Agent is satisfied with the outline.

Core Components

The script is built around two main agents:

1. story_outline_generator

# Assumes the openai-agents SDK, which exposes Agent from the `agents` package.
from agents import Agent

story_outline_generator = Agent(
    name="story_outline_generator",
    instructions=(
        "You generate a very short story outline based on the user's input. "
        "If there is any feedback provided, use it to improve the outline."
    ),
)
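
Before looking at the full loop, it helps to see a single agent call. The following is a minimal sketch, assuming the Runner API from the openai-agents SDK; the demo function and the prompt string are illustrative and not taken from the script:

import asyncio

from agents import Runner

async def demo() -> None:
    # Run the generator once with a plain string prompt (illustrative only).
    result = await Runner.run(story_outline_generator, "A mystery set in a lighthouse")
    print(result.final_output)

asyncio.run(demo())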

2. evaluator

from dataclasses import dataclass
from typing import Literal

@dataclass
class EvaluationFeedback:
    feedback: str
    score: Literal["pass", "needs_improvement", "fail"]

evaluator = Agent[None](
    name="evaluator",
    instructions=(
        "You evaluate a story outline and decide if it's good enough. "
        "If it's not good enough, you provide feedback on what needs to be improved. "
        "Never give it a pass on the first try. After 5 attempts, you can give it a pass if the story outline is good enough - do not go for perfection"
    ),
    output_type=EvaluationFeedback,
)
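
Because output_type is set to EvaluationFeedback, the SDK parses the judge's response into that dataclass rather than returning raw text. A small sketch of how that result is read back, using the Runner import shown earlier; judge_once is a hypothetical helper, not a function in the script:

async def judge_once(input_items) -> EvaluationFeedback:
    # output_type=EvaluationFeedback makes final_output a parsed dataclass, not a string.
    eval_result = await Runner.run(evaluator, input_items)
    feedback: EvaluationFeedback = eval_result.final_output
    print(feedback.score)     # "pass", "needs_improvement", or "fail"
    print(feedback.feedback)  # the critique that gets fed back to the generator
    return feedback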

Execution Flow

The script's logic is orchestrated within the async def main() function.

  1. Initialization: The user is prompted for a story idea. This initial prompt becomes the first message in the conversation history (input_items).

  2. The Refinement Loop: The script enters a while True loop that forms the core of the pattern: the generator produces an outline from the full conversation history, the evaluator scores it, and if the score is not "pass", the evaluator's feedback is appended to the history before the next iteration (see the sketch after this list).

  3. Termination: The loop ends when the evaluator returns a score of "pass". The script then prints the final, approved story outline.
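
Putting the three steps together, the loop is likely wired roughly as follows. This is a condensed sketch rather than a verbatim copy of the script, and it assumes the Runner, ItemHelpers, to_input_list, and trace helpers from the openai-agents SDK; it reuses the two agents defined above.

import asyncio

from agents import ItemHelpers, Runner, TResponseInputItem, trace

async def main() -> None:
    msg = input("What kind of story would you like to hear? ")
    input_items: list[TResponseInputItem] = [{"content": msg, "role": "user"}]
    latest_outline: str | None = None

    with trace("LLM as a judge"):
        while True:
            # 1. Generate (or regenerate) the outline from the full history so far.
            story_result = await Runner.run(story_outline_generator, input_items)
            input_items = story_result.to_input_list()
            latest_outline = ItemHelpers.text_message_outputs(story_result.new_items)
            print("Story outline generated")

            # 2. Judge the outline; final_output is an EvaluationFeedback instance.
            eval_result = await Runner.run(evaluator, input_items)
            feedback: EvaluationFeedback = eval_result.final_output
            print(f"Evaluator score: {feedback.score}")

            # 3. Exit on "pass"; otherwise append the critique and loop again.
            if feedback.score == "pass":
                print("Story outline is good enough, exiting.")
                break
            print("Re-running with feedback")
            input_items.append({"content": f"Feedback: {feedback.feedback}", "role": "user"})

    print(f"Final story outline: {latest_outline}")

asyncio.run(main())

Feeding to_input_list() back into the next iteration is what lets the generator see both its earlier drafts and the appended feedback messages when it revises the outline.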

How to Run the Script

  1. Install Dependencies: Make sure you have the required packages installed from requirements.txt.

     pip install -r requirements.txt

  2. Set Up Environment: The script uses python-dotenv to load environment variables. Ensure you have a .env file in the directory with your API keys (e.g., OPENAI_API_KEY); a minimal example follows this list.

  3. Execute: Run the script from your terminal.

     python examples-agent-patterns-llm-as-a-judge.py
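
For step 2, a minimal .env file might look like this (placeholder value, not a real key):

OPENAI_API_KEY=your-api-key-here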

Sample Output Analysis

The following output demonstrates the script in action:

What kind of story would you like to hear? Story outline generated
Evaluator score: needs_improvement
Re-running with feedback
Story outline generated
Evaluator score: needs_improvement
Re-running with feedback
Story outline generated
Evaluator score: needs_improvement
Re-running with feedback
Story outline generated
Evaluator score: pass
Story outline is good enough, exiting.
Final story outline: Title: Roots in the Redwoods
...

This output clearly shows the iterative loop:

- An outline is generated.
- The evaluator scores it as "needs_improvement".
- The script announces it is "Re-running with feedback."
- This cycle repeats three times, showing the generator refining its output based on feedback from the judge (the feedback text itself is not printed to the console).
- Finally, the evaluator gives a "pass", the loop terminates, and the final high-quality outline is presented.