Returns

Returns a tuple containing:
  1. An InferenceResult object
  2. A list of Evaluation objects, one for each metric evaluated

Usage

This method combines creating an inference result with its evaluation in a single convenient call. It’s the recommended approach for single-turn evaluations, replacing the deprecated galtea.evaluations.create_single_turn() method.

Basic Example

# Create inference result and evaluate in a single call
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    metrics=[{"name": "Factual Accuracy"}, {"name": "Answer Relevancy"}],
)
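
Each item in the returned evaluations list corresponds to one of the requested metrics. Here is a minimal sketch of inspecting the results; the attribute names used (id, metric_name, score) are assumptions for illustration, so check the Evaluation and InferenceResult objects in your SDK version for the exact fields.

# Inspect the returned objects (attribute names are illustrative assumptions)
print(inference_result.id)  # ID of the created inference result (assumed attribute)

for evaluation in evaluations:
    # One Evaluation per metric; metric_name and score are assumed attribute names
    print(evaluation.metric_name, evaluation.score)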

With Specification IDs

# Evaluate using metrics linked to specifications (no need to list metrics manually)
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    specification_ids=[specification.id],
)

With Pre-computed Scores

# With pre-computed scores for self-hosted metrics
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=[
        {"name": "Factual Accuracy"},
        {"name": self_hosted_metric.name, "score": 0.95},  # Pre-computed score
    ],
)

With Custom Score Calculation

# With dynamic score calculation using CustomScoreEvaluationMetric
from galtea import CustomScoreEvaluationMetric


class MyMetric(CustomScoreEvaluationMetric):
    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        # Your custom scoring logic
        return 0.95


custom_metric = MyMetric(name=self_hosted_metric.name)

inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=[
        {"name": "Factual Accuracy"},
        {"score": custom_metric},  # Dynamic score calculation
    ],
)

Parameters

session_id
string
required
The session ID to log the inference result to.
output
string
required
The generated output/response from the AI model.
metrics
List[Union[str, CustomScoreEvaluationMetric, Dict]]
A list of metrics to evaluate against. Supports multiple formats:
  • Strings: Metric names (e.g., ["accuracy", "relevance"])
  • CustomScoreEvaluationMetric: Objects with dynamic score calculation; must be initialized with either the ‘name’ or ‘id’ parameter.
  • MetricInput dicts: Format with optional id, name, and score.
    • If score is a float: Pre-calculated score (requires ‘id’ or ‘name’ in the dict).
    • If score is a CustomScoreEvaluationMetric: Dynamic score calculation.
Optional when specification_ids is provided (in which case metrics are resolved from the specifications).
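As a minimal sketch of the accepted formats (the metric names and the metric_id and custom_metric variables are placeholders), a single call can mix all three:

# Mixing the supported metric formats in one call (placeholder names and IDs)
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=[
        "Answer Relevancy",                # string: metric name
        {"name": "Factual Accuracy"},      # dict: metric name
        {"id": metric_id, "score": 0.95},  # dict: pre-computed score
        custom_metric,                     # CustomScoreEvaluationMetric object
    ],
)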
specification_ids
List[str]
A list of Specification IDs. When provided, the evaluation uses the metrics linked to these specifications. Can be combined with metrics; the API merges and deduplicates by metric ID.
This parameter allows you to evaluate against specific product specifications without manually listing all their associated metrics. At least one of metrics or specification_ids must be provided.
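For example, a minimal sketch combining both parameters (the specification and metric shown are placeholders):

# Metrics linked to the specification are merged with the explicitly listed metric
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    specification_ids=[specification.id],
    metrics=[{"name": "Answer Relevancy"}],
)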
input
string
The input text/prompt. If not provided, it will be inferred from the test case linked to the session.
retrieval_context
string
Context retrieved by a RAG system, if applicable.
latency
float
Latency in milliseconds from model invocation to response.
usage_info
dict[str, int]
Token usage information from the model call. Supported keys: input_tokens, output_tokens, cache_read_input_tokens.
cost_info
dict[str, float]
Cost breakdown for the model call. Supported keys: cost_per_input_token, cost_per_output_token, cost_per_cache_read_input_token.
conversation_simulator_version
string
Version of Galtea’s conversation simulator used to generate the input.