What is an Evaluation?
An evaluation in Galtea represents the assessment of inference results from a session using the evaluation criteria of a metric. Evaluations are directly linked to sessions, allowing for comprehensive evaluation of full sessions containing multiple inference results for multi-turn dialogues.

Evaluations don’t perform inference on the LLM product themselves. Rather, they evaluate outputs that have already been generated. You should perform inference on your product first, then trigger the evaluation.
Evaluation Lifecycle
Evaluations follow a specific lifecycle:

1. Creation: Trigger an evaluation. It will appear in the session’s details page with the status “pending”.
2. Processing: Galtea’s evaluation system processes the evaluation using the evaluation criteria of the selected metric.
3. Completion: Once processed, the status changes to “completed” and the results are available.
Related Concepts
- Test: A set of test cases for evaluating product performance.
- Test Case: Each challenge in a test for evaluating product performance.
- Session: A full conversation between a user and an AI system.
- Metric: Ways to evaluate and score product performance.
SDK Integration
The Galtea SDK allows you to create, view, and manage evaluations programmatically. This is particularly useful for organizations that want to automate their evaluation process or integrate it into their CI/CD pipeline.

- Evaluation Service SDK: Manage evaluations using the Python SDK.
- GitHub Actions: Learn how to set up GitHub Actions to automatically evaluate new versions.
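The typical order of operations is to run inference on your own product first and then trigger the evaluation through the SDK. The following is a minimal sketch of that flow, assuming the client is created as Galtea(api_key=...) and exposes the evaluation service as galtea.evaluations; the run_my_product helper and the metric name are hypothetical stand-ins, so check the SDK reference for the exact names.

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")


def run_my_product(user_query: str) -> str:
    """Hypothetical stand-in for your own product's inference call."""
    return "The iPhone 16 costs $950."


# 1. Perform inference on your product first; Galtea evaluates outputs
#    that have already been generated and does not call your LLM for you.
actual_output = run_my_product("How much does the iPhone 16 cost?")

# 2. Then trigger the evaluation of that output against one or more metrics.
evaluation = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    session_id="YOUR_SESSION_ID",
    metrics=["factual-accuracy"],  # illustrative metric name
    actual_output=actual_output,
)
```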
Evaluation Properties
The properties required for an evaluation depend on which method you use.

For create_single_turn() (Single-Turn Evaluations)
- Version ID: The ID of the version you want to evaluate.
- Session ID: The ID of the session containing the inference results to be evaluated.
- Metrics: A list of the metrics used for the evaluation.
- Actual output: The actual output produced by the product’s version. Example: “The iPhone 16 costs $950.”
- Test case ID: The test case to be evaluated. Required for non-production evaluations.
- Input: The user query that your model needs to answer. Required for production evaluations where no test_case_id is provided.
- Production flag: Set to True to indicate the evaluation is from a production environment. Defaults to False.
- Retrieval context: The context retrieved by your RAG system that was used to generate the actual output. Including retrieval context enables more comprehensive evaluation of RAG systems.
- Latency: Time elapsed (in ms) from the moment the request was sent to the LLM to the moment the response was received.
- Usage info: Token count information of the LLM call. Use snake_case keys: input_tokens, output_tokens, cache_read_input_tokens.
- Cost info: The costs associated with the LLM call. Keys may include cost_per_input_token, cost_per_output_token, etc.
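The sketch below shows how these properties might be passed to create_single_turn(). Apart from test_case_id and the snake_case usage and cost keys, the parameter names (version_id, session_id, metrics, actual_output, input, is_production, retrieval_context, latency, usage_info, cost_info) are assumptions inferred from the descriptions above, and all values are illustrative; consult the SDK reference for the exact signature.

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

evaluation = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    session_id="YOUR_SESSION_ID",
    metrics=["factual-accuracy", "answer-relevancy"],  # illustrative metric names
    actual_output="The iPhone 16 costs $950.",
    test_case_id="YOUR_TEST_CASE_ID",  # required for non-production evaluations
    # For production traffic, omit test_case_id and pass the user query instead:
    # input="How much does the iPhone 16 cost?",
    # is_production=True,  # defaults to False
    retrieval_context="Pricing page: the iPhone 16 starts at $950.",
    latency=412,  # ms between sending the request and receiving the response
    usage_info={
        "input_tokens": 120,
        "output_tokens": 35,
        "cache_read_input_tokens": 0,
    },
    cost_info={
        "cost_per_input_token": 0.000003,
        "cost_per_output_token": 0.000015,
    },
)
```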
Evaluation Results Properties
Once an evaluation is created, you can access the following information:
- Status: The current status of the evaluation. Possible values:
  - Pending: The evaluation has been created but not yet processed.
  - Success: The evaluation was processed successfully.
  - Failed: The evaluation encountered an error during processing.
- Score: The score assigned to the output by the metric’s evaluation criteria. Example: 0.85
- Error: The error message if the evaluation failed during processing.
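As a rough illustration of consuming these results, the snippet below branches on the status and reads the score or error. The get() retrieval call and the attribute names are hypothetical and only mirror the properties listed above; the actual accessors depend on the Evaluation Service SDK.

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Hypothetical retrieval call for a previously created evaluation.
evaluation = galtea.evaluations.get(evaluation_id="YOUR_EVALUATION_ID")

if evaluation.status == "Success":
    print(f"Score: {evaluation.score}")  # e.g. 0.85
elif evaluation.status == "Failed":
    print(f"Evaluation failed: {evaluation.error}")
else:
    print("Evaluation is still pending.")
```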