Create metric
Create a new metric. See Metrics.
Accepted values for source on new metrics are SELF_HOSTED, PARTIAL_PROMPT, and HUMAN_EVALUATION.
FULL_PROMPT is deprecated for creation — requests with source=FULL_PROMPT (or that omit
source and supply a judgePrompt without evaluationParams, which previously inferred FULL_PROMPT)
are rejected with 400. Use PARTIAL_PROMPT instead — it offers the same capabilities with
user-configurable evaluation parameters. Existing FULL_PROMPT metrics continue to be listed,
read, filtered, and evaluated.
Authorizations
API key authorization. Pass your API key in the Authorization header as a Bearer token. Both new (gsk_*) and legacy (gsk-) API keys are accepted, e.g. Authorization: Bearer gsk_... or Authorization: Bearer gsk-....
Body
"metric_123"
Id of the direct parent metric. On create, providing this value turns the new metric into a revision: it joins the parent's family and (if the parent is active) flips the parent to legacy. Omit or null to create a root metric in a fresh group. On responses, this is the recorded parent edge (null for roots).
"metric_122"
"org_123"
"user_123"
"Accuracy"
Ordered list of inference-result fields the evaluator needs (e.g. input, actualOutput, expectedOutput, retrievalContext). Determines which data the evaluation engine extracts from each inference result.
["input", "actualOutput", "expectedOutput"]
Evaluation method for the metric. FULL_PROMPT is deprecated for creation — POST /metrics rejects it with a 400. Use PARTIAL_PROMPT for new AI Evaluation metrics. The value remains in the enum because existing FULL_PROMPT metrics are still returned by reads and filters.
SELF_HOSTED, FULL_PROMPT, PARTIAL_PROMPT, HUMAN_EVALUATION, GEVAL, DEEPEVAL, DETERMINISTIC "PARTIAL_PROMPT"
"Evaluate the accuracy of the response"
["accuracy", "quality"]
"Measures the accuracy of responses"
"https://docs.example.com/metrics/accuracy"
"GPT-4"
When true, evaluationParams are injected at the top level of the evaluator prompt instead of nested inside the conversation context.
["spec_123"]
["ug_123"]
Response
Metric created successfully
"metric_123"
Identifier shared by every metric in the same revision family. Server-managed — derived from parentMetricId on create (or generated for roots). Cannot be set by the caller.
"metric_123"
Id of the direct parent metric. On create, providing this value turns the new metric into a revision: it joins the parent's family and (if the parent is active) flips the parent to legacy. Omit or null to create a root metric in a fresh group. On responses, this is the recorded parent edge (null for roots).
"metric_122"
"org_123"
"user_123"
"Accuracy"
Ordered list of inference-result fields the evaluator needs (e.g. input, actualOutput, expectedOutput, retrievalContext). Determines which data the evaluation engine extracts from each inference result.
["input", "actualOutput", "expectedOutput"]
Evaluation method for the metric. FULL_PROMPT is deprecated for creation — POST /metrics rejects it with a 400. Use PARTIAL_PROMPT for new AI Evaluation metrics. The value remains in the enum because existing FULL_PROMPT metrics are still returned by reads and filters.
SELF_HOSTED, FULL_PROMPT, PARTIAL_PROMPT, HUMAN_EVALUATION, GEVAL, DEEPEVAL, DETERMINISTIC "PARTIAL_PROMPT"
"Evaluate the accuracy of the response"
["accuracy", "quality"]
"Measures the accuracy of responses"
"https://docs.example.com/metrics/accuracy"
"GPT-4"
When true, evaluationParams are injected at the top level of the evaluator prompt instead of nested inside the conversation context.
Whether the metric is currently being optimized.
["spec_123"]
["ug_123"]