This document provides a technical walkthrough of the examples-agent-patterns-llm-as-a-judge.py script. The script demonstrates a powerful agentic pattern known as "LLM as a Judge," where one language model agent generates content and a second agent evaluates and critiques it, creating an iterative refinement loop.
The core concept is to improve the quality of a generated output by using another LLM as a "judge" to provide feedback. This automates a cycle of creation and critique that would otherwise require human intervention.
In this specific example:
1. A Generator Agent creates a short story outline based on a user's prompt.
2. An Evaluator Agent (the "Judge") assesses the outline and provides a score and constructive feedback.
3. The process loops, feeding the feedback back to the Generator Agent, until the Evaluator Agent is satisfied with the outline.
The script is built around two main agents:
The Generator (`story_outline_generator`): This agent produces a very short story outline from the user's input and revises it whenever feedback is provided.

```python
story_outline_generator = Agent(
    name="story_outline_generator",
    instructions=(
        "You generate a very short story outline based on the user's input. "
        "If there is any feedback provided, use it to improve the outline."
    ),
)
```

The Evaluator (`evaluator`): This agent acts as the judge. Its output is constrained to a dataclass called `EvaluationFeedback`, ensuring a consistent and predictable output format.

```python
@dataclass
class EvaluationFeedback:
    feedback: str
    score: Literal["pass", "needs_improvement", "fail"]


evaluator = Agent[None](
    name="evaluator",
    instructions=(
        "You evaluate a story outline and decide if it's good enough. "
        "If it's not good enough, you provide feedback on what needs to be improved. "
        "Never give it a pass on the first try. After 5 attempts, you can give it a pass if the story outline is good enough - do not go for perfection"
    ),
    output_type=EvaluationFeedback,
)
```

The script's logic is orchestrated within the `async def main()` function.
Initialization: The user is prompted for a story idea. This initial prompt becomes the first message in the conversation history (input_items).
The Refinement Loop: The script enters a while True loop that represents the core of the pattern.
- The `story_outline_generator` is run with the current `input_items` (which includes the user prompt and any previous feedback).
- The `evaluator` agent is run with the same conversation history, so it has the context of the generated outline.
- The script extracts the `score` from the evaluator's feedback.
- If the `score` is "pass", the loop breaks, and the process is complete.
- If the `score` is anything else, the feedback from the evaluator is formatted and appended to the `input_items` list as a new user message. The `story_outline_generator` then runs again, but this time it has the judge's feedback to guide its next attempt.

Termination: The loop ends when the evaluator is satisfied. The script then prints the final, approved story outline. A sketch of the full flow is shown below.
Install Dependencies: Make sure you have the required packages installed from requirements.txt.
```bash
pip install -r requirements.txt
```
Set Up Environment: The script uses python-dotenv to load environment variables. Ensure you have a .env file in the directory with your API keys (e.g., OPENAI_API_KEY).
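A minimal `.env` for this example might contain just the API key (the value below is a placeholder):

```bash
# .env — loaded by python-dotenv at startup; replace with your real key
OPENAI_API_KEY=sk-your-key-here
```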
Execute: Run the script from your terminal.
```bash
python examples-agent-patterns-llm-as-a-judge.py
```
The following output demonstrates the script in action:
```
What kind of story would you like to hear? Story outline generated
Evaluator score: needs_improvement
Re-running with feedback
Story outline generated
Evaluator score: needs_improvement
Re-running with feedback
Story outline generated
Evaluator score: needs_improvement
Re-running with feedback
Story outline generated
Evaluator score: pass
Story outline is good enough, exiting.
Final story outline: Title: Roots in the Redwoods
...
```

This output clearly shows the iterative loop:
- An outline is generated.
- The evaluator scores it as "needs_improvement".
- The script announces it is "Re-running with feedback."
- This cycle repeats three times, with the generator refining its output based on the judge's feedback (which is not printed to the console).
- Finally, the evaluator gives a "pass", the loop terminates, and the final high-quality outline is presented.