Evaluation Parameters
To compute the knowledge_retention metric, the following parameters are required in every turn of the conversation:
- input: The user message in the conversation.
- actual_output: The LLM-generated response to the user message.
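As a minimal sketch, a multi-turn test case can be represented as a list of turns, each carrying both required parameters. The turn contents below are invented for illustration and do not reflect Galtea's SDK objects:

```python
# Hypothetical multi-turn conversation: every turn supplies both
# required parameters, `input` and `actual_output`.
turns = [
    {
        "input": "Hi, I'm Ana and I'm vegetarian.",
        "actual_output": "Nice to meet you, Ana! I'll remember that you're vegetarian.",
    },
    {
        "input": "Can you suggest a dinner recipe?",
        "actual_output": "Sure, Ana — how about a vegetarian chickpea curry?",
    },
]

# Both parameters must be present in every turn.
for turn in turns:
    assert "input" in turn and "actual_output" in turn
```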
How Is It Calculated?
The knowledge_retention score is computed using an LLM-as-a-judge approach:
- Identify Knowledge Anchors: The LLM scans user inputs to identify specific facts, preferences, constraints, or context (e.g., names, locations, specific numbers).
- Verify Recall: The LLM checks if the agent recalled and applied this information in subsequent turns.
- Check Consistency: The LLM evaluates whether the agent contradicted previously established information, asked for information already provided, or ignored constraints set earlier.
- Score 1.0 (Good Retention): The agent correctly recalled relevant information or no specific memory recall was required (and no errors were made).
- Score 0.0 (Poor Retention): The agent forgot information, contradicted itself, or asked redundant questions about known facts.
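The steps above can be sketched as a single scoring loop. Here `extract_facts` and `check_response` are illustrative stand-ins for calls to the judge LLM (they are not Galtea's internal API):

```python
def judge_retention(turns, extract_facts, check_response):
    """Illustrative sketch of the LLM-as-a-judge flow described above.

    `extract_facts` and `check_response` are hypothetical placeholders
    for LLM judge calls; real judges analyze free text, not keywords.
    """
    anchors = []
    for turn in turns:
        # Step 1: identify knowledge anchors in the user input.
        anchors.extend(extract_facts(turn["input"]))
        # Steps 2-3: verify recall and consistency against everything
        # the user has established so far.
        verdict = check_response(turn["actual_output"], anchors)
        if verdict in ("forgot", "contradicted", "redundant_question"):
            return 0.0  # poor retention
    return 1.0  # good retention (or no recall was required)


# Toy judge functions for demonstration only.
extract = lambda text: ["Ana"] if "Ana" in text else []
check = lambda resp, anchors: (
    "redundant_question" if "name?" in resp and "Ana" in anchors else "ok"
)

failing_turns = [
    {"input": "My name is Ana.", "actual_output": "Hi Ana!"},
    {"input": "Book a table for two.", "actual_output": "Sure — what's your name?"},
]
score = judge_retention(failing_turns, extract, check)  # asks a redundant question
```

In the failing example the agent asks for a name the user already provided, so the sketch returns 0.0.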
Suggested Test Case Types
The Knowledge Retention metric is effective for evaluating Behavior test cases in Galtea, particularly:
- Long multi-turn conversations where the user shares preferences, constraints, or facts early on.
- Personalized assistant scenarios where the agent must recall user-provided details.
- Complex workflows where information from one step is needed in a later step.
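For the workflow case, a turn sequence might look like the following sketch, where constraints stated in early steps must shape the final answer (the conversation content is invented for illustration):

```python
# Illustrative workflow: the budget from turn 1 and the RAM constraint
# from turn 2 must both be honored in the turn 3 recommendation.
workflow_turns = [
    {
        "input": "I need a laptop under $800.",
        "actual_output": "Got it — I'll look for laptops under $800.",
    },
    {
        "input": "It should also have at least 16 GB of RAM.",
        "actual_output": "Noted: 16 GB of RAM minimum.",
    },
    {
        "input": "What do you recommend?",
        # Good retention: the answer applies both earlier constraints.
        "actual_output": "A model with 16 GB of RAM at $749 fits both "
                         "your budget and your RAM requirement.",
    },
]
```

An agent that instead asked "What's your budget?" in the final turn would score 0.0, since that fact was established in the first step.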