Overview

AI Metric Generation lets you automatically create evaluation metrics from your product’s specifications. Instead of manually crafting judge prompts and configuring evaluation parameters, the AI analyzes your specifications and generates ready-to-use metrics.
Evaluation parameters are automatically selected based on each specification’s description and test type. The generated judge prompt follows a format optimized for reliable LLM-based evaluation across different evaluator models.

Requirements

  • A product with a description
  • At least one specification of type POLICY with a test type assigned (Accuracy, Security, or Behavior)
CAPABILITY and INABILITY specifications cannot be used for AI metric generation because they do not have a test type.
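The eligibility rule above can be summarized as a simple predicate. The sketch below is purely illustrative; the `Spec` record and its field names are assumptions, not the platform's actual data model:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical specification record; field names are assumed for illustration.
@dataclass
class Spec:
    name: str
    spec_type: str            # "POLICY", "CAPABILITY", or "INABILITY"
    test_type: Optional[str]  # "Accuracy", "Security", or "Behavior" (POLICY only)

def eligible_for_generation(spec: Spec) -> bool:
    """Only POLICY specifications with a test type can drive AI metric generation."""
    return spec.spec_type == "POLICY" and spec.test_type in {"Accuracy", "Security", "Behavior"}

specs = [
    Spec("No medical advice", "POLICY", "Behavior"),
    Spec("Answers billing questions", "CAPABILITY", None),
]
print([s.name for s in specs if eligible_for_generation(s)])  # → ['No medical advice']
```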

How to Generate Metrics

There are two ways to trigger AI metric generation from the dashboard:

From the Specifications Page

  1. Navigate to your product’s Specifications tab
  2. Open the dropdown menu on the specification you want to generate metrics for
  3. Click Generate Metrics — this takes you to the generation page with that specification pre-selected
  4. Click Generate and wait for the AI to process
  5. Review the generated candidates — edit, save, or discard each one

From the Metrics Page

  1. Navigate to your product’s Metrics tab
  2. Click Generate Metrics with AI
  3. Select the specifications you want to generate metrics for
  4. Click Generate and wait for the AI to process
  5. Review the generated candidates — edit, save, or discard each one
In both cases, the AI analyzes your product name, description, and the selected specifications to generate tailored metrics.

Evaluation Parameter Selection

The AI automatically selects the evaluation parameters each metric needs based on what the specification describes. For example:
  • A specification about citation accuracy or knowledge-grounded answers will include retrieval_context so the judge can verify answers against retrieved source material.
  • A specification about internal processes, workflows, or tool orchestration will include traces (and often tools_used) so the judge can inspect the execution path — not just the final output.
  • A specification about refusal or safety boundaries typically only needs input and actual_output, since compliance is fully observable from what was asked and answered.
If a metric includes traces or retrieval_context as evaluation parameters, the product under test must capture that data during test execution (e.g., via tracing integrations). If the data is not available at evaluation time, the evaluation will fail.
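That dependency can be made explicit with a pre-flight check before invoking the judge. This is a hypothetical sketch: the parameter names mirror the ones mentioned in this section, but the record layout and check are assumptions, not platform behavior:

```python
# Hypothetical pre-flight check: verify a test record carries every evaluation
# parameter the metric's judge needs before running the evaluation.
required = {"input", "actual_output", "retrieval_context"}

record = {
    "input": "What does clause 4.2 of the policy say?",
    "actual_output": "Clause 4.2 covers refund timelines.",
    # "retrieval_context" was never captured during test execution
}

missing = required - record.keys()
if missing:
    print(f"evaluation would fail; missing parameters: {sorted(missing)}")
```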

Generated Metric Properties

Each AI-generated metric candidate includes:
  • Name: A descriptive name for the metric
  • Description: What the metric evaluates
  • Judge Prompt: The evaluation prompt with placeholders for dynamic data
  • Evaluation Parameters: The data parameters the judge needs for evaluation (automatically selected based on the specification)
  • Tags: Categorization tags
  • Evaluator Model: The LLM model used for evaluation
  • Test Type: Inherited from the source specification
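Putting these properties together, a candidate might look like the following. This is a purely illustrative sketch: the key names, placeholder syntax, and model name are assumptions, not the platform's actual schema:

```python
# Illustrative shape of a generated metric candidate; all names are assumed.
candidate = {
    "name": "Refusal Compliance",
    "description": "Checks that the assistant declines out-of-scope requests.",
    "judge_prompt": (
        "Given the user input {input} and the assistant reply {actual_output}, "
        "decide whether the assistant correctly refused the request."
    ),
    "evaluation_parameters": ["input", "actual_output"],
    "tags": ["safety", "refusal"],
    "evaluator_model": "gpt-4o",   # assumed model name, for illustration only
    "test_type": "Behavior",       # inherited from the source specification
}

# Sanity check: every declared evaluation parameter should appear as a
# placeholder in the judge prompt, so the judge receives all the data it needs.
for param in candidate["evaluation_parameters"]:
    assert "{" + param + "}" in candidate["judge_prompt"]
print("all evaluation parameters appear in the judge prompt")
```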

Specification Linking

When you save a generated metric, it is automatically linked to the specification it was generated from. This creates a traceable connection between your requirements and your evaluation criteria. You can view linked specifications directly from a metric’s detail page, and manage metric-specification links from the Specification Hub.