This method can evaluate either:

- An entire conversation stored in a Session, by creating evaluations for each of its Inference Results
- A single Inference Result, by providing its ID
Returns
Returns a list of Evaluation objects, one for each metric provided.

Usage
This method evaluates inference results using the specified metrics. It supports both Galtea-hosted metrics and self-hosted custom metrics.

You must provide either session_id or inference_result_id, but not both. For single-turn evaluations, you can also use galtea.inference_results.create_and_evaluate(), which combines creating an inference result and evaluating it in one call.

Development Testing

For non-self-hosted (Galtea-hosted) metrics, reference each metric by name or ID and let Galtea compute the score.
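A minimal sketch, assuming the method is exposed as galtea.evaluations.create() on a client built with Galtea(api_key=...); the session ID and metric names are placeholders:

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Evaluate every inference result in the session with Galtea-hosted
# metrics. "coherence" and "answer-relevancy" are placeholder names.
evaluations = galtea.evaluations.create(
    session_id="YOUR_SESSION_ID",
    metrics=[
        {"name": "coherence"},
        {"name": "answer-relevancy"},
    ],
)

# One Evaluation object is returned per metric provided.
for evaluation in evaluations:
    print(evaluation)
```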
Evaluating a Single Inference Result

You can evaluate a specific inference result by providing its ID instead of a session ID:
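A sketch under the same assumptions as above; only the identifying parameter changes:

```python
# Evaluate one specific inference result instead of a whole session.
# Remember: pass session_id or inference_result_id, never both.
evaluations = galtea.evaluations.create(
    inference_result_id="YOUR_INFERENCE_RESULT_ID",
    metrics=[{"name": "coherence"}],
)
```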
Production Monitoring

To monitor your product in a production environment, you can create evaluations that are not linked to a specific test case. To do so, set the is_production flag of the Session to True.
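A sketch of that flow, assuming sessions and inference results are created via galtea.sessions.create() and galtea.inference_results.create(); those signatures, the version_id parameter, and the logged input/output are illustrative rather than confirmed:

```python
# Create a session flagged as production traffic (no test case attached).
session = galtea.sessions.create(
    version_id="YOUR_VERSION_ID",  # illustrative parameter
    is_production=True,
)

# Log a real user interaction (assumed helper; fields are illustrative).
galtea.inference_results.create(
    session_id=session.id,
    input="What is your refund policy?",
    output="You can request a refund within 30 days of purchase.",
)

# Evaluate the production session with the same create() call as above.
evaluations = galtea.evaluations.create(
    session_id=session.id,
    metrics=[{"name": "coherence"}],
)
```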
Advanced Usage
You can also create evaluations using self-hosted metrics with dynamically calculated scores by utilizing the CustomScoreEvaluationMetric class, which allows for more complex evaluation scenarios.
Both options are equally valid for self-hosted metrics. Choose based on your preference: pre-compute for simplicity, or use CustomScoreEvaluationMetric for encapsulation and reusability.
When using CustomScoreEvaluationMetric, your measure() method receives an inference_results parameter containing InferenceResult objects: all turns for session evaluations, or a single-element list for single inference result evaluations. This enables conversation-level custom scoring. See Evaluate with Custom Metrics for examples.
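As a minimal sketch, assuming CustomScoreEvaluationMetric is importable from the galtea package and subclassed with a measure() override, as the note above implies; the InferenceResult attribute name and the scoring rule are placeholders:

```python
from galtea import CustomScoreEvaluationMetric


class ResponseLengthMetric(CustomScoreEvaluationMetric):
    """Placeholder metric: full score when every output stays under a length cap."""

    def measure(self, inference_results):
        # inference_results holds all turns for a session evaluation,
        # or a single-element list for a single inference result.
        if not inference_results:
            return 0.0
        scores = [
            1.0 if len(result.output or "") <= 500 else 0.0  # attribute name assumed
            for result in inference_results
        ]
        return sum(scores) / len(scores)


# The instance itself must carry the metric name or ID.
length_metric = ResponseLengthMetric(name="response-length")

evaluations = galtea.evaluations.create(
    session_id="YOUR_SESSION_ID",
    metrics=[{"score": length_metric}],  # no id/name keys with this option
)
```

Because the instance carries its own name, the metric dict passes only the score key.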
Parameters

session_id
The ID of the session containing the inference results to be evaluated. Either session_id or inference_result_id must be provided, but not both.

inference_result_id
The ID of a specific inference result to evaluate. Either session_id or inference_result_id must be provided, but not both.

specification_ids
A list of Specification IDs. When provided, the evaluation uses the metrics linked to these specifications. Can be combined with metrics; the API merges and deduplicates by metric ID. If neither metrics nor specification_ids is provided, the API falls back to metrics from all specifications linked to the product. This parameter lets you evaluate against specific product specifications without manually listing all their associated metrics, as the sketch below shows.
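For instance, a sketch reusing the assumed client from above; the specification IDs are placeholders:

```python
# Evaluate against the metrics linked to two product specifications.
# Explicit metrics can be passed alongside; the API merges and
# deduplicates the combined set by metric ID.
evaluations = galtea.evaluations.create(
    session_id="YOUR_SESSION_ID",
    specification_ids=["SPEC_ID_1", "SPEC_ID_2"],
    metrics=[{"name": "coherence"}],
)
```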
metrics
A list of metrics to use for the evaluation. Optional when specification_ids is provided or when the product has specifications with linked metrics (in which case those metrics are used as a fallback).

The MetricInput dictionary supports the following keys:

- id (string, optional): The ID of an existing metric.
- name (string, optional): The name of the metric.
- score (float | CustomScoreEvaluationMetric, optional): For self-hosted metrics only.
  - If float: a pre-computed score (0.0 to 1.0). Requires id or name in the dict.
  - If CustomScoreEvaluationMetric: the score will be calculated dynamically. The instance must be initialized with name or id; do NOT provide id or name in the dict when using this option.
For self-hosted metrics, both score options are equally valid: pre-compute as a float, or use CustomScoreEvaluationMetric for dynamic calculation. Galtea-hosted metrics automatically compute scores and should not include a score field. The sketch below shows all three shapes side by side.
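A sketch reusing the assumed client and the placeholder ResponseLengthMetric from above:

```python
metrics = [
    # Galtea-hosted metric: referenced by name (or id); no score field.
    {"name": "coherence"},

    # Self-hosted metric with a pre-computed score: id or name is required.
    {"name": "exact-match", "score": 0.87},

    # Self-hosted metric with a dynamically calculated score: the
    # CustomScoreEvaluationMetric instance carries the name/id itself,
    # so the dict must not repeat id or name.
    {"score": ResponseLengthMetric(name="response-length")},
]

evaluations = galtea.evaluations.create(
    session_id="YOUR_SESSION_ID",
    metrics=metrics,
)
```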