What is Human Evaluation?
Human Evaluation lets your team manually review and score AI outputs. Instead of being scored by an LLM judge, evaluations enter a `PENDING_HUMAN` status and wait for a human annotator to submit a score through the dashboard.
It is especially useful when you need subjective judgment, domain expertise, or a human-in-the-loop quality gate that automated scoring cannot provide.
How It Works
- Create a user group — organize the team members who will review outputs
- Create a Human Evaluation metric — route evaluations to your group instead of an LLM judge
- Run evaluations — evaluations enter `PENDING_HUMAN` status
- Annotate in the dashboard — group members review outputs and submit scores
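The flow above can be sketched as a tiny state model. Only the `PENDING_HUMAN` status name and the 0–100 → 0–1 normalization come from this page; the field names and the `submit_human_score` helper are illustrative, not the platform's API:

```python
from enum import Enum

class EvaluationStatus(Enum):
    PENDING_HUMAN = "PENDING_HUMAN"
    COMPLETED = "COMPLETED"

def submit_human_score(evaluation, raw_score):
    """Record an annotator's 0-100 score, normalized to 0-1."""
    if not 0 <= raw_score <= 100:
        raise ValueError("score must be between 0 and 100")
    evaluation["score"] = raw_score / 100
    evaluation["status"] = EvaluationStatus.COMPLETED
    return evaluation

# An evaluation waits in PENDING_HUMAN until an annotator scores it.
evaluation = {"status": EvaluationStatus.PENDING_HUMAN, "score": None}
submit_human_score(evaluation, 85)  # normalized score: 0.85
```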
Step 1: Create a User Group
User Groups organize evaluators and control who can score evaluations for specific metrics.
Step 2: Add Users to the Group
Link users by their user IDs. Users in the group will see pending evaluations for linked metrics on their Human Evaluations dashboard page.
Step 3: Create a Human Evaluation Metric
Create a metric with the Human Evaluation type and link it to your user group. When evaluations run against this metric, they enter `PENDING_HUMAN` status and appear in the linked group members' dashboards.
Option A: From the Dashboard
Go to the Metrics creation form and configure:
- Evaluation Type — Select Human Evaluation
- User Groups — Assign one or more user groups
- Evaluation Parameters — Choose which parameters (Input, Expected Output, Actual Output, etc.) annotators will see
- Evaluation Guidelines — Write the criteria annotators should follow when scoring
Option B: From the SDK
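The exact call depends on your SDK client, so treat the following as a sketch of the metric definition you would submit. It mirrors the dashboard form above; every field name and value here is an assumption, so check your SDK reference for the real signatures:

```python
# Hypothetical metric payload for a Human Evaluation metric.
# Field names mirror the dashboard form and are NOT the real API schema.
metric = {
    "name": "support-answer-quality",
    "evaluation_type": "HUMAN_EVALUATION",          # routes to annotators, not an LLM judge
    "user_group_ids": ["grp_reviewers"],            # one or more linked user groups
    "evaluation_parameters": ["INPUT", "EXPECTED_OUTPUT", "ACTUAL_OUTPUT"],
    "evaluation_guidelines": "Score 0-100 for factual accuracy and tone.",
}
```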
Step 4: Run Evaluations
Run evaluations using the SDK or from the dashboard just like any other metric. The only difference is that each evaluation will enter `PENDING_HUMAN` status instead of being scored by an LLM.
See Run Test-Based Evaluations, Evaluating Conversations, or Direct Inferences and Evaluations from the Platform for step-by-step instructions.
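Because scoring waits on a person, any automation that consumes results must tolerate the pending state. A minimal polling helper is sketched below; the status string comes from this page, while `fetch_status` and the timing parameters are invented for illustration:

```python
import time

def wait_for_human_score(fetch_status, timeout_s=60.0, poll_s=5.0):
    """Poll until the evaluation leaves PENDING_HUMAN or the timeout expires.

    fetch_status: any callable returning the evaluation's current status string.
    Returns the last observed status.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status != "PENDING_HUMAN":
            return status
        time.sleep(poll_s)
    return "PENDING_HUMAN"

# Example with a stubbed status source: pending once, then completed.
statuses = iter(["PENDING_HUMAN", "COMPLETED"])
result = wait_for_human_score(lambda: next(statuses), timeout_s=1.0, poll_s=0.0)
```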
Step 5: Annotate in the Dashboard
Navigate to the Human Evaluations page in the sidebar. Click Start Evaluating to open the annotation dialog. For each evaluation, review the conversation and context, submit a score (0–100, normalized to 0–1), and optionally add a reason.
Managing Groups
Update a Group
Remove Users or Metrics
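As a rough sketch of what these two operations amount to, a group can be modeled as plain sets of linked IDs; the real dashboard and SDK fields may differ:

```python
# Illustrative only: a group's links modeled as sets of user and metric IDs.
group = {
    "name": "reviewers",
    "user_ids": {"u_1", "u_2"},
    "metric_ids": {"m_quality"},
}

# Update a group: link an additional user by ID.
group["user_ids"].add("u_3")

# Remove users or metrics: unlink by ID.
group["user_ids"].discard("u_2")
group["metric_ids"].discard("m_quality")
```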
Related
Evaluation Types
Understand AI Evaluation, Human Evaluation, and Self-Hosted scoring.
User Group Concept
Learn more about user groups and their properties.