What is an Evaluation?
An evaluation in Galtea represents the assessment of inference results from a session using the evaluation criteria of a metric. Evaluations are directly linked to sessions, allowing for comprehensive evaluation of full sessions containing multiple inference results for multi-turn dialogues.

Evaluations don’t perform inference on the LLM product themselves. Rather, they evaluate outputs that have already been generated. You should perform inference on your product first, then trigger the evaluation.
Evaluation Lifecycle
Evaluations follow a specific lifecycle:

1. Creation: Trigger an evaluation. It will appear in the session’s details page with the status “pending”.
2. Processing: Galtea’s evaluation system processes the evaluation using the evaluation criteria of the selected metric.
3. Completion: Once processed, the status changes to “completed” and the results are available.
Related Concepts
- Test: A set of test cases for evaluating product performance.
- Test Case: Each challenge in a test for evaluating product performance.
- Session: A full conversation between a user and an AI system.
- Metric: Ways to evaluate and score product performance.
SDK Integration
The Galtea SDK allows you to create, view, and manage evaluations programmatically. This is particularly useful for organizations that want to automate their evaluation process or integrate it into their CI/CD pipeline.

- Evaluation Service SDK: Manage evaluations using the Python SDK.
- GitHub Actions: Learn how to set up GitHub Actions to automatically evaluate new versions.
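The typical order of operations is to run inference on your own product first and then trigger the evaluation through the SDK. The following is a minimal sketch of that flow, assuming the client is created as Galtea(api_key=...) and exposes the evaluation service as galtea.evaluations; the run_my_product helper and the metric name are hypothetical stand-ins, so check the SDK reference for the exact names.

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")


def run_my_product(user_query: str) -> str:
    """Hypothetical stand-in for your own product's inference call."""
    return "The iPhone 16 costs $950."


# 1. Perform inference on your product first; Galtea evaluates outputs
#    that have already been generated and does not call your LLM for you.
actual_output = run_my_product("How much does the iPhone 16 cost?")

# 2. Then trigger the evaluation of that output against one or more metrics.
evaluation = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    session_id="YOUR_SESSION_ID",
    metrics=["factual-accuracy"],  # illustrative metric name
    actual_output=actual_output,
)
```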
Evaluation Properties
The properties required for an evaluation depend on which method you use.

For create_single_turn() (Single-Turn Evaluations)
- Version ID: The ID of the version you want to evaluate.
- Session ID: The ID of the session containing the inference results to be evaluated.
- Metrics: A list of the metrics used for the evaluation.
- Actual output: The actual output produced by the product’s version. Example: “The iPhone 16 costs $950.”
- Test case ID: The test case to be evaluated. Required for non-production evaluations.
- Input: The user query that your model needs to answer. Required for production evaluations where no test_case_id is provided.
- Production flag: Set to True to indicate the evaluation is from a production environment. Defaults to False.
- Retrieval context: The context retrieved by your RAG system that was used to generate the actual output. Including retrieval context enables more comprehensive evaluation of RAG systems.
- Latency: Time elapsed (in ms) from the moment the request was sent to the LLM to the moment the response was received.
- Usage info: Token count information of the LLM call. Use snake_case keys: input_tokens, output_tokens, cache_read_input_tokens.
- Cost info: The costs associated with the LLM call. Keys may include cost_per_input_token, cost_per_output_token, etc.
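The sketch below shows how these properties might be passed to create_single_turn(). Apart from test_case_id and the snake_case usage and cost keys, the parameter names (version_id, session_id, metrics, actual_output, input, is_production, retrieval_context, latency, usage_info, cost_info) are assumptions inferred from the descriptions above, and all values are illustrative; consult the SDK reference for the exact signature.

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

evaluation = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    session_id="YOUR_SESSION_ID",
    metrics=["factual-accuracy", "answer-relevancy"],  # illustrative metric names
    actual_output="The iPhone 16 costs $950.",
    test_case_id="YOUR_TEST_CASE_ID",  # required for non-production evaluations
    # For production traffic, omit test_case_id and pass the user query instead:
    # input="How much does the iPhone 16 cost?",
    # is_production=True,  # defaults to False
    retrieval_context="Pricing page: the iPhone 16 starts at $950.",
    latency=412,  # ms between sending the request and receiving the response
    usage_info={
        "input_tokens": 120,
        "output_tokens": 35,
        "cache_read_input_tokens": 0,
    },
    cost_info={
        "cost_per_input_token": 0.000003,
        "cost_per_output_token": 0.000015,
    },
)
```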
Evaluation Results Properties
Once an evaluation is created, you can access the following information:
- Status: The current status of the evaluation. Possible values:
  - Pending: The evaluation has been created but not yet processed.
  - Success: The evaluation was processed successfully.
  - Failed: The evaluation encountered an error during processing.
- Score: The score assigned to the output by the metric’s evaluation criteria. Example: 0.85
- Error: The error message if the evaluation failed during processing.
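As a rough illustration of consuming these results, the snippet below branches on the status and reads the score or error. The get() retrieval call and the attribute names are hypothetical and only mirror the properties listed above; the actual accessors depend on the Evaluation Service SDK.

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Hypothetical retrieval call for a previously created evaluation.
evaluation = galtea.evaluations.get(evaluation_id="YOUR_EVALUATION_ID")

if evaluation.status == "Success":
    print(f"Score: {evaluation.score}")  # e.g. 0.85
elif evaluation.status == "Failed":
    print(f"Evaluation failed: {evaluation.error}")
else:
    print("Evaluation is still pending.")
```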