
What is Human Evaluation?

Human Evaluation lets your team manually review and score AI outputs. Instead of being scored by an LLM judge, evaluations enter a PENDING_HUMAN status and wait for a human annotator to submit a score through the dashboard. It is especially useful when you need subjective judgment, domain expertise, or a human-in-the-loop quality gate that automated scoring cannot provide.

How It Works

  1. Create a user group — organize the team members who will review outputs
  2. Create a Human Evaluation metric — route evaluations to your group instead of an LLM judge
  3. Run evaluations — evaluations enter PENDING_HUMAN status
  4. Annotate in the dashboard — group members review outputs and submit scores

Step 1: Create a User Group

User Groups organize evaluators and control who can score evaluations for specific metrics.

user_group = galtea.user_groups.create(
    name="quality-reviewers-" + run_identifier,
    description="Team responsible for reviewing output quality",
)

You can also create groups from the dashboard: navigate to your organization’s Groups tab.

Step 2: Add Users to the Group

Link users by their user IDs. Users in the group will see pending evaluations for linked metrics on the Human Evaluations page of their dashboard.

galtea.user_groups.link_users(
    user_group_id=user_group_id,
    user_ids=[user_id_1],
)

You can find user IDs in the dashboard under Settings > Members, or by using the organization members API.

Step 3: Create a Human Evaluation Metric

Create a metric with the Human Evaluation type and link it to your user group. When evaluations run against this metric, they enter PENDING_HUMAN status and appear in the linked group members’ dashboards.

Option A: From the Dashboard

Go to the Metrics creation form and configure:
  • Evaluation Type — Select Human Evaluation
  • User Groups — Assign one or more user groups
  • Evaluation Parameters — Choose which parameters (Input, Expected Output, Actual Output, etc.) annotators will see
  • Evaluation Guidelines — Write the criteria annotators should follow when scoring

Option B: From the SDK

metric = galtea.metrics.create(
    name="domain-expert-review-" + run_identifier,
    source="human_evaluation",
    judge_prompt="Review the assistant's response for accuracy and helpfulness. Score 1 if the response is correct and useful, 0 if it contains errors or is unhelpful.",
    evaluation_params=["input", "actual_output", "expected_output"],
    user_group_ids=[user_group_id],
    description="Domain expert review of response quality",
)

# Alternatively, link the metric to the user group after creation
galtea.user_groups.link_metrics(
    user_group_id=user_group_id,
    metric_ids=[metric.id],
)

Step 4: Run Evaluations

Run evaluations using the SDK or from the dashboard just like any other metric. The only difference is that each evaluation will enter PENDING_HUMAN status instead of being scored by an LLM. See Run Test-Based Evaluations, Evaluating Conversations, or Direct Inferences and Evaluations from the Platform for step-by-step instructions.
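If a downstream job needs to wait for the human score, you can poll the evaluation's status until it leaves PENDING_HUMAN. A minimal sketch, under assumptions: the `wait_for_annotation` helper and the `fetch_status` callable are our own names, not part of the SDK, and `fetch_status` should wrap whatever evaluation-retrieval call your Galtea SDK version exposes (returning the evaluation's current status string).

```python
import time
from typing import Callable


def wait_for_annotation(
    fetch_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Poll until the evaluation leaves PENDING_HUMAN, then return its status.

    fetch_status: callable returning the evaluation's current status,
    e.g. a thin wrapper around your SDK's evaluation-retrieval call.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = fetch_status()
        if status != "PENDING_HUMAN":
            # An annotator has submitted a score (or the evaluation failed).
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("evaluation is still PENDING_HUMAN after the timeout")
```

Keeping the polling loop decoupled from the SDK call makes the wait logic easy to test and reuse regardless of how the evaluation is fetched.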

Step 5: Annotate in the Dashboard

Navigate to the Human Evaluations page in the sidebar. Click Start Evaluating to open the annotation dialog. For each evaluation, review the conversation and context, submit a score (0–100 in the dashboard, normalized to 0–1), and optionally add a reason.
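The mapping between the dashboard's 0–100 score and the stored 0–1 value is a straight linear rescale; a quick sketch of the conversion (the helper name is ours, not part of the SDK):

```python
def normalize_score(dashboard_score: float) -> float:
    """Map a 0-100 dashboard score to the 0-1 scale stored on the evaluation."""
    if not 0 <= dashboard_score <= 100:
        raise ValueError("dashboard score must be between 0 and 100")
    return dashboard_score / 100


print(normalize_score(85))  # 0.85
```

So a dashboard score of 85 is stored as 0.85, directly comparable to scores produced by AI Evaluation metrics.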

Managing Groups

Update a Group

user_group = galtea.user_groups.update(
    user_group_id=user_group_id,
    name="senior-quality-reviewers-" + run_identifier,
    description="Senior team for quality reviews",
)

Remove Users or Metrics

galtea.user_groups.unlink_users(
    user_group_id=user_group_id,
    user_ids=[user_id_1],
)

galtea.user_groups.unlink_metrics(
    user_group_id=user_group_id,
    metric_ids=[metric_id_2],
)

Evaluation Types

Understand AI Evaluation, Human Evaluation, and Self-Hosted scoring.

User Group Concept

Learn more about user groups and their properties.