Galtea allows you to define and use your own custom metrics for evaluations. This is particularly useful for:
  • Deterministic Metrics: When you have custom, rule-based logic to score outputs (e.g., checking for specific keywords or validating JSON structure; see the sketch after this list).
  • Integrating External Models: When you use your own models for evaluation and want to store their scores in Galtea.
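For example, a deterministic rule-based metric can be a plain function. The sketch below is our own illustration (the function name and the 1.0/0.0 scoring convention are choices, not SDK requirements); it scores whether an output parses as valid JSON:

import json

def valid_json_score(text: str) -> float:
    """Deterministic metric: 1.0 if the output parses as JSON, 0.0 otherwise."""
    try:
        json.loads(text)
        return 1.0
    except (json.JSONDecodeError, TypeError):
        return 0.0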
The recommended way to specify metrics in SDK v3.0 is the MetricInput dictionary format. For self-hosted metrics, you have two equally valid options for providing scores, described below.
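As a quick reference, these are the three MetricInput shapes used throughout this guide (the metric name and score value here are placeholders):

metrics = [
    {"name": "Role Adherence"},             # standard Galtea metric, scored by the platform
    {"name": "my-metric", "score": 0.75},   # self-hosted metric with a pre-computed score (Option 1)
    {"score": my_metric_instance},          # self-hosted CustomScoreEvaluationMetric, scored dynamically (Option 2)
]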

Option 1: Pre-Compute the Score

If you want to calculate the score yourself before creating the evaluation, you can provide the score directly as a float:
import os
from galtea import Galtea

galtea = Galtea(api_key=os.getenv("GALTEA_API_KEY"))

# --- Configuration ---
VERSION_ID = "your_version_id"
TEST_CASE_ID = "your_test_case_id"

# Your product's response
actual_output = "This response contains the correct keyword."

# 1. Define your custom scoring logic
def contains_keyword(text: str, keyword: str) -> float:
    """Returns 1.0 if the keyword is in text (case-insensitive), 0.0 otherwise."""
    return 1.0 if keyword.lower() in text.lower() else 0.0

# 2. Compute the score
custom_score = contains_keyword(actual_output, "correct")

# 3. Create the metric if it doesn't exist yet
CUSTOM_METRIC_NAME = "contains-correct"
galtea.metrics.create(
    name=CUSTOM_METRIC_NAME, 
    source='self_hosted', 
    description='Checks for the presence of the keyword "correct" in the output'
)

# 4. Run evaluation with your pre-computed score
galtea.evaluations.create_single_turn(
    version_id=VERSION_ID,
    test_case_id=TEST_CASE_ID,
    actual_output=actual_output,
    metrics=[
        {"name": "Role Adherence"},                           # Standard Galtea metric
        {"name": CUSTOM_METRIC_NAME, "score": custom_score}  # Self-hosted with pre-computed score
    ],
)

print("Evaluation with custom metric submitted.")

Option 2: Use the CustomScoreEvaluationMetric Class

If you prefer to encapsulate your scoring logic in a class that the SDK executes at evaluation time, you can pass a CustomScoreEvaluationMetric instance as the score field of the MetricInput dictionary:
import os
from galtea import Galtea, CustomScoreEvaluationMetric

galtea = Galtea(api_key=os.getenv("GALTEA_API_KEY"))

# --- Configuration ---
VERSION_ID = "your_version_id"
TEST_CASE_ID = "your_test_case_id"

# 1. Define your custom metric class
class ContainsKeyword(CustomScoreEvaluationMetric):
    def __init__(self, keyword: str):
        self.keyword = keyword.lower()
        # Initialize with the metric name or ID
        super().__init__(name=f"contains-{self.keyword}")
        
    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        """
        Returns 1.0 if the keyword is in actual_output (case-insensitive), 0.0 otherwise.
        All other args/kwargs are accepted but ignored.
        """
        if actual_output is None:
            return 0.0
        return 1.0 if self.keyword in actual_output.lower() else 0.0

# 2. Instantiate your metric
accuracy_metric = ContainsKeyword(keyword="correct")

# 3. Create the metric in the platform if it doesn't exist yet
galtea.metrics.create(
    name=accuracy_metric.name, 
    source='self_hosted', 
    description='Checks for the presence of the keyword "correct" in the output'
)

# Your product's response
actual_output = "This response contains the correct keyword."

# 4. Run evaluation with your custom metric class
# Important: Do NOT provide 'id' or 'name' in the dict when using CustomScoreEvaluationMetric
# The metric identifier comes from the CustomScoreEvaluationMetric instance itself
galtea.evaluations.create_single_turn(
    version_id=VERSION_ID,
    test_case_id=TEST_CASE_ID,
    actual_output=actual_output,
    metrics=[
        {"name": "Role Adherence"},  # Standard Galtea metric
        {"score": accuracy_metric}   # Self-hosted with dynamic scoring
    ],
)

print("Evaluation with custom metric submitted.")
When using CustomScoreEvaluationMetric as the score field in a MetricInput dictionary, do NOT provide id or name in the dictionary itself. The metric identifier must be specified when initializing the CustomScoreEvaluationMetric instance (e.g., CustomScoreEvaluationMetric(name="my-metric")).
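Because measure accepts arbitrary keyword arguments, a metric class can also score against other fields of the test case. The sketch below assumes the SDK forwards an expected_output keyword argument to measure; that parameter name is an assumption on our part, so check the SDK reference for the exact fields available:

class ExactMatch(CustomScoreEvaluationMetric):
    def __init__(self):
        super().__init__(name="exact-match")

    def measure(self, *args, actual_output: str | None = None,
                expected_output: str | None = None, **kwargs) -> float:
        """1.0 if actual_output matches expected_output exactly, 0.0 otherwise.

        Assumes the SDK forwards expected_output; all other args/kwargs are ignored.
        """
        if actual_output is None or expected_output is None:
            return 0.0
        return 1.0 if actual_output.strip() == expected_output.strip() else 0.0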

Choosing Between Options

Both approaches are equally valid and current. Choose based on your preference:
  • Use Option 1 (Pre-Computed Score) if:
    • You prefer a simpler, more declarative style
    • Your scoring logic is straightforward and doesn’t require encapsulation
    • You want to separate score calculation from the evaluation submission
  • Use Option 2 (CustomScoreEvaluationMetric Class) if:
    • You prefer object-oriented design
    • Your scoring logic is complex and benefits from encapsulation
    • You want the SDK to handle score calculation automatically
    • You need to reuse the same metric logic across multiple evaluations (see the sketch after this list)
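A minimal sketch of that reuse pattern, using the ContainsKeyword class from Option 2 and placeholder test-case IDs:

keyword_metric = ContainsKeyword(keyword="correct")

for test_case_id in ["test_case_id_1", "test_case_id_2"]:
    galtea.evaluations.create_single_turn(
        version_id=VERSION_ID,
        test_case_id=test_case_id,
        actual_output=actual_output,
        metrics=[{"score": keyword_metric}],  # same instance reused across evaluations
    )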
For backward compatibility, the SDK still supports passing strings or CustomScoreEvaluationMetric instances directly, without the MetricInput dictionary wrapper:
import os
from galtea import Galtea, CustomScoreEvaluationMetric

galtea = Galtea(api_key=os.getenv("GALTEA_API_KEY"))

# --- Configuration ---
VERSION_ID = "your_version_id"
TEST_CASE_ID = "your_test_case_id"
actual_output = "This response contains the correct keyword."

# This format is deprecated; use MetricInput dictionaries instead
class ContainsKeyword(CustomScoreEvaluationMetric):
    def __init__(self, keyword: str):
        self.keyword = keyword.lower()
        super().__init__(name=f"contains-{self.keyword}")
        
    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        if actual_output is None:
            return 0.0
        return 1.0 if self.keyword in actual_output.lower() else 0.0

accuracy_metric = ContainsKeyword(keyword="correct")

# Legacy format - passing CustomScoreEvaluationMetric directly
galtea.evaluations.create_single_turn(
    version_id=VERSION_ID,
    test_case_id=TEST_CASE_ID,
    actual_output=actual_output,
    metrics=[
        "Role Adherence",     # Legacy: string format
        accuracy_metric       # Legacy: CustomScoreEvaluationMetric directly
    ],
)
This legacy format is maintained for backward compatibility only. New code should use the MetricInput dictionary format with either pre-computed scores or CustomScoreEvaluationMetric as the score field.
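If you are migrating existing code, the legacy call above maps one-to-one onto the recommended format:

galtea.evaluations.create_single_turn(
    version_id=VERSION_ID,
    test_case_id=TEST_CASE_ID,
    actual_output=actual_output,
    metrics=[
        {"name": "Role Adherence"},  # string -> {"name": ...}
        {"score": accuracy_metric},  # metric instance -> {"score": ...}
    ],
)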