What is an Evaluation?
An evaluation in Galtea represents the assessment of inference results from a session using the evaluation criteria of a metric. Evaluations are linked directly to sessions, allowing comprehensive evaluation of full sessions that contain multiple inference results, as in multi-turn dialogues.

Evaluations do not call your AI product; they score outputs that have already been generated. Run your product first to produce outputs, then trigger the evaluation to score them.
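The two-phase workflow above can be sketched in plain Python. Note that `run_product` and `score_output` are hypothetical stand-ins, not Galtea APIs: the first represents your AI product's inference step, the second a metric's scoring step applied afterward.

```python
def run_product(user_input: str) -> str:
    """Hypothetical stand-in for your AI product's inference step."""
    return f"answer to: {user_input}"

def score_output(output: str) -> float:
    """Hypothetical stand-in for a metric's evaluation criteria
    (here, a trivial non-empty check)."""
    return 1.0 if output.strip() else 0.0

# Phase 1: run your product to produce the inference result.
output = run_product("What is an evaluation?")

# Phase 2: trigger the evaluation on the already-generated output.
score = score_output(output)
```

The point of the split is that evaluation never generates text itself; it only assigns a score to outputs that already exist.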
Evaluation Lifecycle
Evaluations follow a specific lifecycle:

1. Creation: Trigger an evaluation. It will appear on the session’s details page with the status “pending”.
2. Processing: Galtea’s evaluation system processes the evaluation using the evaluation criteria of the selected metric.
3. Completion: Once processed, the status changes to “completed” and the results become available.
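Because evaluations move through these states asynchronously, client code typically polls until a terminal status is reached. Below is a minimal sketch; `fetch_status` is a hypothetical callable you would back with a real status lookup, and the status strings are taken from this page.

```python
import time

# Terminal statuses per the lifecycle and status values described on this
# page; "pending" and "pending human" mean processing is not finished yet.
TERMINAL_STATUSES = {"success", "failed", "skipped"}

def is_terminal(status: str) -> bool:
    return status.lower() in TERMINAL_STATUSES

def wait_for_evaluation(fetch_status, poll_seconds: float = 2.0,
                        max_polls: int = 30) -> str:
    """Poll `fetch_status` (a callable returning the current status string)
    until the evaluation reaches a terminal state."""
    for _ in range(max_polls):
        status = fetch_status()
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("evaluation did not reach a terminal status in time")
```

A fixed poll interval keeps the sketch simple; production code might prefer exponential backoff.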
SDK Integration
The Galtea SDK allows you to create, view, and manage evaluations programmatically. This is particularly useful for organizations that want to automate their evaluation process or integrate it into their CI/CD pipeline.

Evaluation Service SDK
Manage evaluations using the Python SDK
GitHub Actions
Learn how to set up GitHub Actions to automatically evaluate new versions
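To illustrate the create/view pattern the SDK supports, here is a self-contained, in-memory stand-in. `FakeEvaluationClient` and its method names are hypothetical and do not match the real SDK's API; see the Evaluation Service SDK reference for the actual classes and methods.

```python
import itertools

class FakeEvaluationClient:
    """Hypothetical in-memory stand-in for an evaluation client."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._evaluations = {}

    def create(self, session_id: str, metric: str) -> dict:
        """Trigger an evaluation for a session; it starts as 'pending'."""
        evaluation = {
            "id": next(self._ids),
            "session_id": session_id,
            "metric": metric,
            "status": "pending",
        }
        self._evaluations[evaluation["id"]] = evaluation
        return evaluation

    def get(self, evaluation_id: int) -> dict:
        """View a previously created evaluation by its ID."""
        return self._evaluations[evaluation_id]

client = FakeEvaluationClient()
created = client.create(session_id="session-123", metric="accuracy")
```

In a CI/CD pipeline, the same pattern applies: create evaluations for the sessions produced by a test run, then fetch their results once processing completes.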
Evaluation Properties
Result Properties
Once an evaluation completes, the following fields are available:

Status
The current status of the evaluation. Possible values:
- Pending: The evaluation has been created but not yet processed.
- Pending Human: The evaluation is waiting for a human annotator to review and score it. This status is used for metrics with source: "human_evaluation".
- Success: The evaluation was processed successfully.
- Failed: The evaluation encountered an error during processing. Check the canRetry field to determine if this evaluation can be retried.
- Skipped: The evaluation was skipped because the metric validation failed (e.g., missing required parameters) or due to insufficient credits. The error message contains details about what was missing. Check the canRetry field to determine if this evaluation can be retried.

Score
The score assigned to the output by the metric’s evaluation criteria. Example: 0.85

Error
The error message if the evaluation failed during processing.

Retry Count
The number of retry attempts made for this evaluation. Starts at 0 and increments each time the evaluation is retried. Used to track retry history and enforce retry limits.

Can Retry
Indicates whether the evaluation can be retried. When true, the evaluation failed or was skipped due to a temporary condition (e.g., an evaluation processing error or insufficient credits) and can be retried later. When false, the evaluation cannot be retried. Only evaluations with canRetry set to true are eligible for the retry operation. Defaults to false.

Human Evaluator
The ID of the human evaluator assigned to this evaluation, if applicable.

Human Evaluation Started
The timestamp when the human evaluator started reviewing this evaluation, if applicable.
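The retry rules above can be summarized in a small model. This is an illustrative sketch, not the API schema: the field names are descriptive guesses, and `RETRY_LIMIT` is hypothetical (the page only says retry limits are enforced via the retry count).

```python
from dataclasses import dataclass
from typing import Optional

RETRY_LIMIT = 3  # hypothetical cap; the actual limit is not documented here

@dataclass
class EvaluationResult:
    """Illustrative model of the result fields described above."""
    status: str                   # e.g. "pending", "success", "failed", "skipped"
    score: Optional[float] = None
    error: Optional[str] = None
    retry_count: int = 0          # starts at 0, increments on each retry
    can_retry: bool = False       # defaults to false

    def is_retryable(self) -> bool:
        # Only failed or skipped evaluations flagged canRetry are eligible,
        # and the retry count enforces the retry limit.
        return (
            self.can_retry
            and self.status in {"failed", "skipped"}
            and self.retry_count < RETRY_LIMIT
        )
```

For example, a failed evaluation with canRetry set to true is eligible for retry, while a successful one never is.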
Related
Test
A set of test cases for evaluating product performance
Test Case
Each challenge in a test for evaluating product performance
Session
A full conversation between a user and an AI system.
Metric
Ways to evaluate and score product performance