Using specifications? If your tests and metrics are linked to Specifications, you can use Specification-Based Evaluation below — it resolves metrics automatically. This tutorial also covers the manual loop approach for full control over each test case.
To evaluate a product version against a predefined Test, you can loop through its Test Cases and run an evaluation for each one. This workflow is ideal for regression testing and standardized quality checks.
A session is created implicitly the first time you run an evaluation for a specific version_id and test_id combination. You do not need to create it manually.

Workflow

1. Select a Test and Version
   Identify the test_id and version_id you want to evaluate.

2. Iterate Through Test Cases
   Fetch all test cases associated with the test using galtea.test_cases.list().

3. Generate and Evaluate Output
   For each test case, call your product to get its output, create a session with galtea.sessions.create(), then use galtea.inference_results.create_and_evaluate() to log and evaluate the result.

Example

This example demonstrates how to run an evaluation on all test cases from a specific test.
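The code below assumes a galtea client is already initialized and that your_product_function and METRICS_TO_EVALUATE exist. Here is a minimal, hypothetical stand-in for both — replace the stub with your real product call and the list with the metric names configured in your Galtea project:

```python
# Hypothetical stand-ins for this tutorial only: replace with your real
# product call and your project's actual metric names.
METRICS_TO_EVALUATE = ["accuracy", "relevance"]  # example names, not real metrics

def your_product_function(prompt: str) -> str:
    # Echo-style stub so the evaluation loop below can run end to end.
    return f"Response to: {prompt}"
```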
# 1. Fetch the test cases for the specified test
test_cases = galtea.test_cases.list(test_id=test_id)
print(f"Found {len(test_cases)} test cases for Test ID: {test_id}")

# 2. Loop through each test case and run an evaluation
for test_case in test_cases:
    # Get your product's actual response to the input
    actual_output = your_product_function(test_case.input)

    # Create a session and evaluate
    session = galtea.sessions.create(version_id=version_id, test_case_id=test_case.id)
    galtea.inference_results.create_and_evaluate(
        session_id=session.id,
        output=actual_output,
        metrics=METRICS_TO_EVALUATE,
    )

print(f"\nAll evaluations submitted for Version ID: {version_id}")
Each session links this version_id and test_id with the provided inference result (the actual_output and the Test Case’s input).
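In a long run, one failing product call should not abort the entire loop. This SDK-free, illustrative sketch shows a collect-and-continue pattern you could wrap around the loop above (flaky_product is a made-up stand-in for a product call that sometimes raises):

```python
# Illustrative only: skip inputs whose product call fails and report
# them at the end, instead of letting one error abort the whole run.
test_inputs = ["ok question", "bad question", "another ok question"]

def flaky_product(prompt: str) -> str:
    # Stand-in for a product call that can fail on some inputs.
    if "bad" in prompt:
        raise RuntimeError("model timeout")
    return f"answer: {prompt}"

outputs, failures = [], []
for prompt in test_inputs:
    try:
        outputs.append(flaky_product(prompt))
    except Exception as exc:
        failures.append((prompt, str(exc)))

print(f"{len(outputs)} evaluated, {len(failures)} skipped")
```

In the real loop, the body of the try block would be your product call, followed by the sessions.create() and create_and_evaluate() calls shown above.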

Specification-Based Evaluation

Instead of passing metrics explicitly, you can evaluate against Specifications. Each specification has linked metrics that the API resolves automatically.
# 1. List the specifications for your product
specifications = galtea.specifications.list(product_id=product_id)
spec_ids = [spec.id for spec in specifications]
print(f"Found {len(spec_ids)} specifications for product {product_id}")

# 2. Create a session and record an inference result
session = galtea.sessions.create(version_id=version_id, test_case_id=test_cases[0].id)
inference_result = galtea.inference_results.create(
    session_id=session.id,
    input=test_cases[0].input,
    output=your_product_function(test_cases[0].input),
)

# 3. Evaluate using specifications — the API resolves linked metrics automatically
evaluations = galtea.evaluations.create(
    inference_result_id=inference_result.id,
    specification_ids=spec_ids,
)
print(f"Created {len(evaluations)} evaluations from specifications")
If you omit both metrics and specification_ids, the API falls back to all metrics from every specification linked to the product.
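Conceptually, the resolution the API performs can be pictured as merging the metric lists of every linked specification into one deduplicated set. This is an illustrative sketch only, not the SDK's actual implementation, and the spec IDs and metric names are invented:

```python
# Illustrative only: merge metrics from several specifications into a
# single deduplicated list, preserving first-seen order.
specs = [
    {"id": "spec-1", "metrics": ["accuracy", "relevance"]},
    {"id": "spec-2", "metrics": ["relevance", "toxicity"]},
]
resolved = list(dict.fromkeys(m for s in specs for m in s["metrics"]))
print(resolved)  # -> ['accuracy', 'relevance', 'toxicity']
```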

Next Steps

Specification-Driven Evaluations

Automate test resolution, agent execution, and evaluation with specifications.

Evaluating Conversations

Evaluate multi-turn conversations and production sessions.