Returns

Returns a tuple containing:
  1. An InferenceResult object
  2. A list of Evaluation objects, one for each metric evaluated

Usage

This method combines creating an inference result with its evaluation in a single convenient call. It’s the recommended approach for single-turn evaluations, replacing the deprecated galtea.evaluations.create_single_turn() method.

Basic Example

# Create inference result and evaluate in a single call
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    metrics=[{"name": "Factual Accuracy"}, {"name": "Answer Relevancy"}],
)
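
Each item in the returned evaluations list corresponds to one of the requested metrics. Here is a minimal sketch of inspecting the results; the attribute names used (id, metric_name, score) are assumptions for illustration, so check the Evaluation and InferenceResult objects in your SDK version for the exact fields.

# Inspect the returned objects (attribute names are illustrative assumptions)
print(inference_result.id)  # ID of the created inference result (assumed attribute)

for evaluation in evaluations:
    # One Evaluation per metric; metric_name and score are assumed attribute names
    print(evaluation.metric_name, evaluation.score)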

With Specification IDs

# Evaluate using metrics linked to specifications (no need to list metrics manually)
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    specification_ids=[specification.id],
)

With Pre-computed Scores

# With pre-computed scores for self-hosted metrics
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=[
        {"name": "Factual Accuracy"},
        {"name": self_hosted_metric.name, "score": 0.95},  # Pre-computed score
    ],
)

With Custom Score Calculation

# With dynamic score calculation using CustomScoreEvaluationMetric
from galtea import CustomScoreEvaluationMetric


class MyMetric(CustomScoreEvaluationMetric):
    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        # Your custom scoring logic
        return 0.95


custom_metric = MyMetric(name=self_hosted_metric.name)

inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=[
        {"name": "Factual Accuracy"},
        {"score": custom_metric},  # Dynamic score calculation
    ],
)

Parameters

session_id
string
required
The session ID to log the inference result to.
output
string
required
The generated output/response from the AI model.
metrics
List[Union[str, CustomScoreEvaluationMetric, Dict]]
A list of metrics to evaluate against. Supports multiple formats:
  • Strings: Metric names (e.g., ["accuracy", "relevance"])
  • CustomScoreEvaluationMetric: Objects with dynamic score calculation; must be initialized with either the ‘name’ or ‘id’ parameter.
  • MetricInput dicts: Format with optional id, name, and score.
    • If score is a float: Pre-calculated score (requires ‘id’ or ‘name’ in the dict).
    • If score is a CustomScoreEvaluationMetric: Dynamic score calculation.
Optional when specification_ids is provided (in which case metrics are resolved from the specifications).
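As a minimal sketch of the accepted formats (the metric names and the metric_id and custom_metric variables are placeholders), a single call can mix all three:

# Mixing the supported metric formats in one call (placeholder names and IDs)
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=[
        "Answer Relevancy",                # string: metric name
        {"name": "Factual Accuracy"},      # dict: metric name
        {"id": metric_id, "score": 0.95},  # dict: pre-computed score
        custom_metric,                     # CustomScoreEvaluationMetric object
    ],
)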
specification_ids
List[str]
A list of Specification IDs. When provided, the evaluation uses the metrics linked to these specifications. Can be combined with metrics; the API merges and deduplicates by metric ID.
This parameter allows you to evaluate against specific product specifications without manually listing all their associated metrics. At least one of metrics or specification_ids must be provided.
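For example, a minimal sketch combining both parameters (the specification and metric shown are placeholders):

# Metrics linked to the specification are merged with the explicitly listed metric
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    specification_ids=[specification.id],
    metrics=[{"name": "Answer Relevancy"}],
)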
input
string
The input text/prompt. If not provided, it will be inferred from the test case linked to the session.
retrieval_context
string
Context retrieved by a RAG system, if applicable.
latency
float
Latency in milliseconds from model invocation to response.
usage_info
dict[str, int]
Token usage information from the model call. Supported keys: input_tokens, output_tokens, cache_read_input_tokens.
cost_info
dict[str, float]
Cost breakdown for the model call. Supported keys: cost_per_input_token, cost_per_output_token, cost_per_cache_read_input_token.
conversation_simulator_version
string
Version of Galtea’s conversation simulator used to generate the input.