Use the @trace decorator to capture every operation your agent performs. Once captured, these records are passed to evaluations through the traces parameter: the full execution path from user input to system output. Galtea provides built-in metrics like Tool Correctness, but agentic systems are often highly specific, so we recommend creating custom judge metrics that check the exact behaviors you care about. Below are practical examples to get you started.
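
The code sketches in this guide assume a deliberately simplified trace shape so the checks are easy to follow. This is an illustrative format, not Galtea's actual trace schema; adapt the field names to whatever your @trace instrumentation actually emits.

```python
# Hypothetical, simplified trace: a list of events in execution order.
# Real captured traces are richer; only the overall shape matters here.
trace = [
    {"type": "user_input", "content": "Book a table for two"},
    {"type": "tool_call", "tool": "ToolX",
     "inputs": {"attribute_a": 42},
     "output": {"field_d": "confirmed"}},
    {"type": "tool_call", "tool": "ToolY",
     "inputs": {"field_d": "confirmed"},
     "output": {"attribute_b": "window seat"}},
]
```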

Example 1: Was Tool X called before Tool Y?

Example of Judge Prompt:
**Evaluation Criteria:**
Check if the TRACES reflect correct agent behavior by comparing them to what was expected. Focus on:
1. Whether Tool X is invoked at least once in the traces
2. Whether Tool Y is invoked only after Tool X
3. Whether the ordering of tool calls follows the intended system logic

**Rubric:**
Score 1 (Good): Tool X is called before Tool Y, and Tool Y is never called first in the traces.
Score 0 (Bad): Tool Y is called before Tool X, or Tool X is never called.
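
A judge metric handles fuzzy cases well, but ordering is also easy to verify deterministically as a sanity check. A minimal sketch in Python, assuming the simplified trace shape from the introduction and the placeholder names ToolX and ToolY:

```python
def tool_x_before_tool_y(trace: list[dict]) -> int:
    """Score 1 if ToolX is called and ToolY never precedes ToolX's first call."""
    calls = [e["tool"] for e in trace if e.get("type") == "tool_call"]
    if "ToolX" not in calls:
        return 0  # Tool X is never called
    if "ToolY" not in calls:
        return 1  # Tool Y was never called, so it was never called first
    return 1 if calls.index("ToolX") < calls.index("ToolY") else 0
```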

Example 2: Were all attributes collected before calling Tool X?

Example of Judge Prompt:
**Evaluation Criteria:**
Check if the TRACES reflect correct agent behavior by comparing them to what was expected. Focus on:
1. Whether attribute_a, attribute_b, and attribute_c appear in the traces before Tool X is called
2. Whether these attributes are explicitly present in the traces
3. Whether the attributes are derived from user input or tool outputs rather than invented

**Rubric:**
Score 1 (Good): All required attributes are explicitly present and correctly sourced before Tool X is invoked.
Score 0 (Bad): Any attribute is missing, implicit, or invented before calling Tool X.
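
The same check can be approximated in code by scanning every event before the first Tool X call. A minimal sketch, again assuming the simplified trace shape and treating an attribute as collected once its key appears in any inputs or output:

```python
REQUIRED = {"attribute_a", "attribute_b", "attribute_c"}

def attributes_before_tool_x(trace: list[dict]) -> int:
    """Score 1 only if every required attribute is seen before the first ToolX call."""
    seen: set[str] = set()
    for event in trace:
        if event.get("type") == "tool_call" and event.get("tool") == "ToolX":
            return 1 if REQUIRED <= seen else 0
        # Attributes count as collected once they appear in inputs or outputs.
        seen |= set(event.get("inputs") or {}) | set(event.get("output") or {})
    return 0  # ToolX never called; treat as a failure, or adjust to your policy
```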

Example 3: Do attributes come from the correct sources?

Example of Judge Prompt:
**Evaluation Criteria:**
Check if the TRACES reflect correct agent behavior by comparing them to what was expected. Focus on:
1. Whether attribute_b originates from Tool Y output rather than user input
2. Whether attribute_c is recomputed after Tool Z when Tool Z is called
3. Whether attributes are overridden only when justified by a tool output

**Rubric:**
Score 1 (Good): Attributes in the traces originate from the correct tools and are updated appropriately.
Score 0 (Bad): Any attribute is sourced incorrectly, reused improperly, or overridden without justification.
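
Provenance is harder to judge deterministically, but the first criterion (attribute_b must originate from Tool Y) reduces to finding where the value first appears. A minimal sketch under the same assumptions; the other two criteria follow the same pattern:

```python
def attribute_b_from_tool_y(trace: list[dict]) -> int:
    """Score 1 if attribute_b first appears in a ToolY output."""
    for event in trace:
        if "attribute_b" in (event.get("inputs") or {}):
            return 0  # the value existed before any tool produced it
        if "attribute_b" in (event.get("output") or {}):
            return 1 if event.get("tool") == "ToolY" else 0
    return 0  # attribute_b never appears in the trace at all
```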

Example 4: Are tools used under required conditions?

Example of Judge Prompt:
**Evaluation Criteria:**
Check if the TRACES reflect correct agent behavior by comparing them to what was expected. Focus on:
1. Whether Tool Z is called only when attribute_a exceeds the required threshold
2. Whether Tool X is skipped when attribute_c is false
3. Whether conditional branching in the traces matches tool results

**Rubric:**
Score 1 (Good): Tools are invoked or skipped strictly according to the required conditions.
Score 0 (Bad): Any tool is called or skipped in violation of the defined conditions.
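
Conditional rules like these map naturally to a replay of the trace. A minimal sketch for the first condition, with THRESHOLD standing in for whatever limit your system actually defines:

```python
THRESHOLD = 100  # hypothetical: substitute the threshold your system requires

def tool_z_only_above_threshold(trace: list[dict]) -> int:
    """Score 1 if every ToolZ call happens while attribute_a exceeds THRESHOLD."""
    attribute_a = None  # latest observed value, unknown at the start
    for event in trace:
        if event.get("type") != "tool_call":
            continue
        if event.get("tool") == "ToolZ":
            if attribute_a is None or attribute_a <= THRESHOLD:
                return 0  # called without the condition being established
        # Track attribute_a as tools report it in their outputs.
        out = event.get("output") or {}
        if "attribute_a" in out:
            attribute_a = out["attribute_a"]
    return 1  # no violating ToolZ call found
```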

Example 5: Are the tool outputs actually used?

Example of Judge Prompt:
**Evaluation Criteria:**
Check if the TRACES reflect correct agent behavior by comparing them to what was expected. Focus on:
1. Whether field_d from Tool X output is referenced later in the traces
2. Whether Tool Y output influences subsequent reasoning or tool calls
3. Whether tool outputs are respected and not contradicted later in the traces

**Rubric:**
Score 1 (Good): Tool outputs are explicitly used and consistently reflected in later steps.
Score 0 (Bad): Tool outputs are ignored, unused, or contradicted.
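
A crude but useful deterministic proxy is to check whether the produced value ever reappears downstream. A minimal sketch using substring matching over serialized events; a judge metric is still needed for criteria 2 and 3, which require reasoning about influence rather than literal reuse:

```python
import json

def field_d_is_used(trace: list[dict]) -> int:
    """Score 1 if the field_d value produced by ToolX reappears in a later event."""
    value = None
    for event in trace:
        # Substring matching over the serialized event is a heuristic, not proof of use.
        if value is not None and value in json.dumps(event):
            return 1
        out = event.get("output") or {}
        if event.get("tool") == "ToolX" and "field_d" in out:
            value = json.dumps(out["field_d"])
    return 0  # field_d was never produced, or never referenced again
```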

Next Steps

Tracing Agent Operations

Capture traces from your agent — required for these metrics to work.

Create Your Own Judge Prompt

Write and refine custom judge prompts for any evaluation criteria.