Key Components of a Judge Prompt
1. Role (Optional)
Clearly state what the evaluator is assessing.
2. Evaluation Criteria (Essential)
Evaluation criteria are the clear rules or checkpoints you want the judge to apply when reviewing an output. They make your evaluation process fair, repeatable, and easy to understand.
Recommendations:
- Be specific: Avoid vague terms like “good quality.” Instead, define exactly what quality means for your use case.
- Break it down (itemization): Split your criteria into separate, clearly defined points so they are easier to check, score, and understand. Each criterion should focus on one thing at a time.
- Use objective signals: Look for things you can see or measure, not just opinions or feelings.
- Give examples: Show what a good answer looks like, and what a bad answer looks like, for each criterion.
- Number your criteria: This makes them easier to reference and ensures nothing important is missed.
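As a sketch, the recommendations above can be combined into a numbered criteria section. The criteria strings below are hypothetical placeholders, not part of any Galtea API:

```python
# Hypothetical evaluation criteria, each specific and focused on one thing.
criteria = [
    "The response addresses every part of the user's question.",
    "All factual claims are supported by the provided context.",
    "The response contains no contradictions with the context.",
]

def criteria_section(items):
    """Render criteria as a numbered list so each one is easy to reference."""
    lines = ["Evaluation Criteria:"]
    lines += [f"{i}. {text}" for i, text in enumerate(items, start=1)]
    return "\n".join(lines)

print(criteria_section(criteria))
```

Numbering the criteria in the prompt itself makes it easy to ask the judge to report a verdict per number, so nothing is silently skipped.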
3. Scoring Rubric (Essential)
The scoring rubric is how you decide what score to give each output. It sets the rules for what counts as a pass or a fail, and helps keep your evaluation fair and consistent. Here are the main things to keep in mind:
A. Define clear thresholds: Make your scoring rules easy to understand and apply. For example, if you use binary scoring (either a 0 or a 1), specify exactly what counts as a pass and what counts as a fail.
B. Cover edge cases: Think about what happens if criteria conflict, if data is missing, or if something is unclear, and make sure your rubric explains how to handle these situations.
All-or-nothing logic (recommended for direct system validation): Use this when you want a strict pass/fail: score 1 only when every requirement is satisfied; otherwise score 0.
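The all-or-nothing rule is simple enough to express directly in code; here is a minimal sketch, where the per-criterion verdicts and their names are hypothetical:

```python
def all_or_nothing_score(criterion_results: dict) -> int:
    """Strict pass/fail: score 1 only if every criterion passed, else 0.

    Note: an empty dict scores 1, so decide explicitly how your rubric
    should treat "no criteria" rather than relying on this default.
    """
    return 1 if all(criterion_results.values()) else 0

# Hypothetical per-criterion verdicts returned by the judge:
print(all_or_nothing_score({"complete": True, "grounded": True}))   # passes
print(all_or_nothing_score({"complete": True, "grounded": False}))  # fails
```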
Common Pitfalls to Avoid
- Vague criteria: “Check if response is good” → Instead: “Check if response addresses all parts of the user’s question”
- Ambiguous thresholds: “Mostly correct” → Instead: “At least 3 out of 4 criteria met”
- Missing edge cases: Not specifying what happens with partial matches or ambiguous data
- Subjective language: “Natural” or “appropriate” without defining what that means
- Overlapping criteria: Multiple criteria testing the same thing differently
The key is: If you can’t program it as a rule-based system, your criteria aren’t specific enough for an LLM judge either.
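To illustrate that litmus test: the “at least 3 out of 4 criteria met” threshold from the pitfalls above is specific precisely because it can be written as a deterministic rule. The threshold value here is just an example:

```python
def meets_threshold(criteria_met: list, minimum: int = 3) -> bool:
    """An unambiguous threshold: at least `minimum` criteria must be met."""
    return sum(criteria_met) >= minimum

# Four per-criterion verdicts; 3 of 4 met passes, 2 of 4 does not.
print(meets_threshold([True, True, True, False]))
print(meets_threshold([True, True, False, False]))
```

A phrase like “mostly correct” fails this test: there is no function you could write for it, which is exactly why an LLM judge will apply it inconsistently too.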
Template
This template serves as a starting skeleton that you can adapt and modify based on your specific evaluation needs. Feel free to add more criteria, adjust the scoring scale, or restructure sections to better fit your use case.
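One possible skeleton along these lines, combining the three components above, is sketched below. The section wording, placeholder names, and example values are all hypothetical and meant to be adapted:

```python
# A hypothetical judge-prompt skeleton: Role, Criteria, and Scoring Rubric.
JUDGE_PROMPT_TEMPLATE = """\
Role:
You are an evaluator assessing {task_description}.

Evaluation Criteria:
1. {criterion_1}
2. {criterion_2}

Scoring Rubric:
- Score 1 if ALL criteria above are satisfied.
- Score 0 if ANY criterion fails, required data is missing,
  or the output is ambiguous.

Return only the score.
"""

prompt = JUDGE_PROMPT_TEMPLATE.format(
    task_description="customer-support responses",
    criterion_1="The response addresses every part of the user's question.",
    criterion_2="All claims are supported by the provided context.",
)
print(prompt)
```

Because the rubric spells out how missing data and ambiguity are scored, the template already covers the edge cases discussed earlier; extend the numbered list or swap the binary rubric for a scale as your use case requires.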
How to implement it with Galtea?
You can create your custom metric in Galtea using either the Dashboard or the SDK, supplying the judge prompt built from the template shown above as the metric definition.
Next Steps
AI Metric Generation
Let AI generate judge prompts from your specifications automatically.
Evaluate with Custom Scores
Use self-hosted deterministic metrics alongside LLM-as-a-judge.