Returns
Returns a tuple containing:
- An InferenceResult object
- A list of Evaluation objects, one for each metric evaluated
Usage
This method creates an inference result and evaluates it in a single call. It’s the recommended approach for single-turn evaluations, replacing the deprecated galtea.evaluations.create_single_turn() method.
Basic Example
With Specification IDs
With Pre-computed Scores
With Custom Score Calculation
Parameters
The session ID to log the inference result to.
The generated output/response from the AI model.
A list of metrics to evaluate against. Supports multiple formats:
- Strings: Metric names (e.g., ["accuracy", "relevance"])
- CustomScoreEvaluationMetric: Objects with dynamic score calculation. Must be initialized with either the ‘name’ or ‘id’ parameter.
- MetricInput dicts: Format with optional id, name, and score.
  - If score is a float: Pre-calculated score (requires ‘id’ or ‘name’ in the dict).
  - If score is a CustomScoreEvaluationMetric: Dynamic score calculation.
Can be omitted if specification_ids is provided (in which case metrics are resolved from the specifications).
A list of Specification IDs. When provided, the evaluation uses the metrics linked to these specifications. Can be combined with metrics; the API merges and deduplicates by metric ID. This parameter allows you to evaluate against specific product specifications without manually listing all their associated metrics. At least one of metrics or specification_ids must be provided.
The input text/prompt. If not provided, it is inferred from the test case linked to the session.
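The rules for metrics and specification_ids described above can be sketched as a small client-side check. This is an illustration only and not part of the Galtea SDK: the function names validate_metric_inputs and merge_metric_ids are hypothetical, custom metric objects are not inspected, and the real merge happens on the API side, so the order-preserving deduplication shown here is an assumption about its behavior.

```python
from typing import Any, Optional


def validate_metric_inputs(
    metrics: Optional[list[Any]] = None,
    specification_ids: Optional[list[str]] = None,
) -> None:
    """Raise ValueError if the arguments break the documented rules."""
    # At least one of metrics or specification_ids must be provided.
    if not metrics and not specification_ids:
        raise ValueError("Provide at least one of metrics or specification_ids.")
    for entry in metrics or []:
        if isinstance(entry, str):
            continue  # plain metric name, e.g. "accuracy"
        if isinstance(entry, dict):
            score = entry.get("score")
            # A pre-computed float score must be tied to a metric id or name.
            if isinstance(score, float) and not (entry.get("id") or entry.get("name")):
                raise ValueError("A pre-computed score needs 'id' or 'name'.")
        # Any other object is assumed to be a custom metric instance
        # and is left for the SDK to check.


def merge_metric_ids(explicit_ids: list[str], spec_metric_ids: list[str]) -> list[str]:
    """Merge two lists of metric IDs, dropping duplicates and keeping first-seen order."""
    return list(dict.fromkeys(explicit_ids + spec_metric_ids))
```

For example, merging the hypothetical IDs ["m1", "m2"] with ["m2", "m3"] keeps a single copy of "m2", which mirrors the documented deduplication by metric ID.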
Context retrieved by a RAG system, if applicable.
Latency in milliseconds from model invocation to response.
Token usage information from the model call. Supported keys: input_tokens, output_tokens, cache_read_input_tokens.
Cost breakdown for the model call. Supported keys: cost_per_input_token, cost_per_output_token, cost_per_cache_read_input_token.
Version of Galtea’s conversation simulator used to generate the input.
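The token usage and cost keys listed above pair up naturally, so a rough per-call cost can be estimated by multiplying each token count by its per-token price. The sketch below is a hedged illustration, not something the SDK is known to provide: estimate_call_cost is a hypothetical helper, the dictionary argument names are placeholders, and it assumes cache-read tokens are counted separately from input_tokens.

```python
def estimate_call_cost(usage: dict[str, int], cost: dict[str, float]) -> float:
    """Estimate the cost of one model call from token counts and per-token prices."""
    # Each usage key is paired with the matching cost key from the list above.
    pairs = [
        ("input_tokens", "cost_per_input_token"),
        ("output_tokens", "cost_per_output_token"),
        ("cache_read_input_tokens", "cost_per_cache_read_input_token"),
    ]
    # Missing keys contribute nothing to the total.
    return sum(usage.get(tokens, 0) * cost.get(price, 0.0) for tokens, price in pairs)


# Illustrative values only; real prices depend on the model provider.
usage = {"input_tokens": 1000, "output_tokens": 200, "cache_read_input_tokens": 500}
cost = {
    "cost_per_input_token": 0.001,
    "cost_per_output_token": 0.002,
    "cost_per_cache_read_input_token": 0.0005,
}
total = estimate_call_cost(usage, cost)
```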
Related
- Create Evaluation - Evaluate an existing session or inference result
- Create Inference Result - Create an inference result without immediate evaluation