Learn how to evaluate multi-turn conversations using Galtea’s session-based workflow, covering test-based evaluation, past conversations, and production monitoring.
To accurately evaluate interactions within a dialogue, you can use Galtea’s session-based workflow. This approach allows you to log an entire conversation and then run evaluations on all of its turns at once.

Certain metrics are specifically designed for conversational analysis and require the full context:
- **Role Adherence**: Measures how well the AI stays within its defined role.
- **Knowledge Retention**: Assesses the model’s ability to remember and use information from previous turns.
- **Conversation Completeness**: Evaluates whether the conversation has reached a natural and informative conclusion.
- **Conversation Relevancy**: Assesses whether each turn in the conversation is relevant to the ongoing topic.
1
Create a Session
A Session acts as a container for all the turns in a single conversation. You create one at the beginning of an interaction.
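This step amounts to a single SDK call. A minimal sketch (using a stand-in mock client so the snippet runs on its own; in real code, `galtea_client` is your authenticated Galtea client and `version_id` identifies the product version under test):

```python
from unittest.mock import MagicMock

# Stand-in for an authenticated Galtea client so this sketch is self-contained;
# replace with the real client from the Galtea SDK.
galtea_client = MagicMock()

# One session per conversation, created before any turns are logged.
session = galtea_client.sessions.create(version_id="YOUR_VERSION_ID")
```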
2
Log Inference Results
Each user input and model output pair is an Inference Result. You can log these turns individually or in a single batch call after the conversation ends. Using a batch call is more efficient.
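As a rough illustration of the batch shape (plain Python, no SDK calls), an in-memory chat history can be reshaped into the `{"role": ..., "content": ...}` turn list that `inference_results.create_batch` expects in the examples further down:

```python
# Reshape an in-memory (role, message) history into the turn-dict format
# used by inference_results.create_batch.
chat_history = [
    ("user", "Hello!"),
    ("assistant", "Hi! How can I help you today?"),
]

conversation_turns = [
    {"role": role, "content": content} for role, content in chat_history
]

print(conversation_turns[0])  # {'role': 'user', 'content': 'Hello!'}
```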
3
Evaluate the Session
Once the session is logged, you can create evaluations for the entire conversation using the evaluations.create() method.
First, define an agent function that connects Galtea to your product:
**Simple**: The quickest way to get started. Your function receives just the latest user message as a string.
```python
def my_agent(user_message: str) -> str:
    # In a real scenario, call your model here
    return f"Your model output to: {user_message}"
```
**Chat History**: Use this when your agent needs the full conversation context. Your function receives the message list in the OpenAI format ({"role": "...", "content": "..."}).
```python
def my_agent(messages: list[dict]) -> str:
    # messages follows the standard chat format:
    # [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, ...]
    user_message = messages[-1]["content"]
    return f"Your model output to: {user_message}"
```
**Structured**: For full control over input and output, including optional usage and cost tracking and retrieval context for RAG evaluations.
```python
def my_agent(input_data: AgentInput) -> AgentResponse:
    user_message = input_data.last_user_message_str()
    # In a real scenario, call your model here
    model_output = f"Your model output to: {user_message}"
    # Return AgentResponse with optional usage/cost tracking
    return AgentResponse(
        content=model_output,
        usage_info={"input_tokens": 100, "output_tokens": 50},
    )
```
All three signatures work with evaluations.run(), inference_results.generate(), and simulator.simulate(). Both sync and async functions are supported. The SDK auto-detects which signature you’re using from the type hint on the first parameter.
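The SDK’s detection internals aren’t shown here, but the idea of dispatching on the first parameter’s type hint can be sketched in plain Python (`detect_signature` is a hypothetical helper for illustration, not part of the SDK):

```python
import inspect

def detect_signature(agent) -> str:
    # Hypothetical illustration of type-hint-based dispatch; the SDK's
    # actual detection logic may differ.
    first_param = next(iter(inspect.signature(agent).parameters.values()))
    hint = first_param.annotation
    if hint is str:
        return "simple"
    if hint == list[dict]:
        return "chat_history"
    return "structured"

def my_agent(user_message: str) -> str:
    return f"Your model output to: {user_message}"

print(detect_signature(my_agent))  # simple
```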
**Test-based**: Use this when you have test cases. It requires a test_case_id and is often combined with the Conversation Simulator to generate turns.
```python
# Fetch your test cases (created from a CSV of behavior tests)
test_cases = galtea_client.test_cases.list(test_id=behavior_test.id)
if test_cases is None or len(test_cases) == 0:
    raise ValueError("No test cases found")

# Define your agent function (connect your product/model)
def my_agent(user_message: str) -> str:
    return f"Response to: {user_message}"

for test_case in test_cases:
    # Create a session linked to the test case
    session = galtea_client.sessions.create(
        version_id=version_id,
        test_case_id=test_case.id,
    )

    # Run the simulator (synthetic user) with your agent function
    galtea_client.simulator.simulate(
        session_id=session.id,
        agent=my_agent,
        max_turns=test_case.max_iterations or 20,
    )

    # Evaluate the full conversation
    galtea_client.evaluations.create(
        session_id=session.id,
        metrics=[
            {"name": "Conversation Relevancy"},
            {"name": "Role Adherence"},
            {"name": "Knowledge Retention"},
        ],
    )
```
**Past conversations**: Use this when the conversation already happened outside Galtea. Ingest the transcript by creating a session (no test_case_id) and logging all turns in a batch, then evaluate.
```python
# Optional: map to your own conversation ID and mark as production if these are real users
session = galtea_client.sessions.create(
    version_id=version_id,
    custom_id="EXTERNAL_CONVERSATION_ID",
    is_production=True,
)

conversation_turns = [
    {"role": "user", "content": "What are some lower-risk investment strategies?"},
    {
        "role": "assistant",
        "content": "For lower-risk investments, consider diversified index funds, bonds, or Treasury securities.",
        "retrieval_context": "Low-risk investment options include index funds, government bonds, and Treasury securities.",
    },
    {"role": "user", "content": "With age, should the investment strategy change?"},
    {
        "role": "assistant",
        "content": "Yes, many advisors recommend shifting to more conservative investments as you approach retirement.",
        "retrieval_context": "Financial advisors typically recommend a more conservative asset allocation as investors near retirement age.",
    },
]

# Log all turns at once
galtea_client.inference_results.create_batch(
    session_id=session.id, conversation_turns=conversation_turns
)

# Evaluate the full session
galtea_client.evaluations.create(
    session_id=session.id,
    metrics=[
        {"name": "Role Adherence"},
        {"name": "Knowledge Retention"},
        {"name": "Conversation Relevancy"},
    ],
)
```
**Production monitoring**: Use this for real-time logging of user interactions from your live product. Create the session with is_production=True and log turns as they happen (or in a batch at the end), then evaluate.
**Log turns individually**:

```python
session = galtea_client.sessions.create(
    version_id=version_id,
    is_production=True,
)

def your_product(user_input: str) -> str:
    return f"This is a simulated response to '{user_input}'"

def handle_turn(user_input: str) -> str:
    model_output = your_product(user_input)
    galtea_client.inference_results.create(
        session_id=session.id, input=user_input, output=model_output
    )
    return model_output

# Simulate production interactions
handle_turn("Hello!")
handle_turn("What services do you offer?")
```
**Log turns in a batch**:

```python
session_batch = galtea_client.sessions.create(
    version_id=version_id,
    is_production=True,
)

conversation_turns = [
    {"role": "user", "content": "What are some lower-risk investment strategies?"},
    {
        "role": "assistant",
        "content": "For lower-risk investments, consider diversified index funds, bonds, or Treasury securities.",
    },
]

galtea_client.inference_results.create_batch(
    session_id=session_batch.id, conversation_turns=conversation_turns
)
```
```python
# Evaluate when the conversation is complete
galtea_client.evaluations.create(
    session_id=session.id,
    metrics=[{"name": "Conversation Relevancy"}, {"name": "Knowledge Retention"}],
)
```
When you use CustomScoreEvaluationMetric, your measure() method always receives an inference_results parameter containing InferenceResult objects. For session evaluations this includes all turns; for single inference result evaluations it contains one item. This enables conversation-level scoring (e.g., consistency checks, cross-turn analysis).
```python
from galtea import CustomScoreEvaluationMetric, InferenceResult

class ConversationConsistency(CustomScoreEvaluationMetric):
    """Scores how consistently the assistant responds across all turns."""

    def __init__(self):
        super().__init__(name="Conversation Consistency")

    def measure(
        self, *args, inference_results: list[InferenceResult] | None = None, **kwargs
    ) -> float:
        if not inference_results:
            return 0.0
        # Access the full conversation for cross-turn analysis
        assistant_outputs = [
            ir.actual_output for ir in inference_results if ir.actual_output
        ]
        if len(assistant_outputs) < 2:
            return 1.0
        # Your custom logic here (e.g., check for contradictions across turns)
        return 0.9

galtea_client.evaluations.create(
    session_id=session.id,
    metrics=[
        {"name": "Role Adherence"},
        {"score": ConversationConsistency()},  # Custom multi-turn metric
    ],
)
```
Each InferenceResult object provides actual_input, actual_output, retrieval_context, latency, index, and other fields. See the custom metrics tutorial for more details.
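As one possible way to fill in the "custom logic" placeholder above, here is a naive lexical-overlap score across consecutive assistant outputs. This is illustrative only and operates on plain strings; inside `measure()`, the same list would come from each InferenceResult’s `actual_output`:

```python
def lexical_consistency(outputs: list[str]) -> float:
    # Naive cross-turn score: average word overlap (Jaccard) between
    # consecutive assistant outputs. Illustrative only, not an SDK metric.
    if len(outputs) < 2:
        return 1.0
    scores = []
    for prev, curr in zip(outputs, outputs[1:]):
        a, b = set(prev.lower().split()), set(curr.lower().split())
        scores.append(len(a & b) / len(a | b) if a | b else 1.0)
    return sum(scores) / len(scores)

outputs = [
    "Index funds are a low-risk option.",
    "Low-risk options include index funds and bonds.",
]
print(round(lexical_consistency(outputs), 2))  # 0.3
```

A real implementation might instead use an LLM judge or embedding similarity to detect contradictions; the point is only that the full turn list is available for cross-turn analysis.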