Most eval setups assume your prompt is fixed. But many production agents aren’t — they receive different instructions depending on who’s using them, what session they’re in, or what task they’re running. A customer support agent might get a different persona and policy per tenant. A coding assistant might operate under different constraints per workspace. When the instructions change, the expected behavior changes too. A static eval can’t capture that. This guide explains how to use call metadata to link runtime context to behavior checks in Tuner — so your evals are always grounded in what the agent was actually told to do.

The problem with static evals for dynamic agents

If your agent receives different instructions per session — different personas, rules, or restrictions — a static behavior check can’t keep up. A check that asks “does the agent avoid discussing pricing?” is correct for one tenant and wrong for another. Your evals need to know what the agent was told, not just what it said. That’s what call metadata is for.
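To make the mismatch concrete, here is a toy sketch in Python. The keyword test, the second tenant (globex), and the TENANT_ALLOWS_PRICING table are all invented for illustration. This is not how Tuner evaluates calls; it only shows why one hardcoded rule cannot be right for every tenant.

```python
# Toy illustration only: a hardcoded check versus one that knows the tenant's rules.

def static_check_avoids_pricing(agent_reply: str) -> bool:
    """Pass only if the reply never mentions pricing (one tenant's rule, hardcoded)."""
    text = agent_reply.lower()
    return "pricing" not in text and "price" not in text


# Hypothetical per-tenant policy: is this tenant's agent allowed to discuss pricing?
TENANT_ALLOWS_PRICING = {"acme-corp": False, "globex": True}


def context_aware_check(agent_reply: str, tenant: str) -> bool:
    """Pass if the reply is consistent with what this tenant's agent was told."""
    mentions_pricing = not static_check_avoids_pricing(agent_reply)
    return TENANT_ALLOWS_PRICING[tenant] or not mentions_pricing


reply = "Our pricing starts at $10 per seat."

static_result = static_check_avoids_pricing(reply)      # fails for every tenant
acme_result = context_aware_check(reply, "acme-corp")   # fails: pricing is prohibited
globex_result = context_aware_check(reply, "globex")    # passes: pricing is allowed
```

The static check fails the globex reply even though that tenant permits pricing talk. The context-aware version gets both tenants right because it is given the rule that applied to the call.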

How it works

When you push a call to Tuner via the API, you can attach arbitrary context to it in the metadata field. This metadata is stored with the call and becomes available inside behavior checks. A behavior check that uses metadata works in three parts:
  1. You write a prompt definition: a natural language description of the behavior you’re evaluating
  2. You select Metadata under Inputs Used and enter the specific metadata key you want the evaluator to use
  3. Tuner injects the value of that key into the evaluation context for each call
The evaluator then assesses each call against the runtime context that was actually in play — not a static hardcoded assumption.
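The injection step can be pictured with a short Python sketch. The function build_evaluation_context and the layout of the text it returns are assumptions made for illustration; Tuner's internal evaluator format is not documented here. Only the call fields (metadata, transcript_with_tool_calls, role, text) come from this guide.

```python
# Conceptual sketch, not Tuner's implementation: look up one metadata key per call
# and place its value next to the prompt definition and the transcript.

def build_evaluation_context(prompt_definition: str, call: dict, metadata_key: str) -> str:
    metadata = call.get("metadata") or {}
    if metadata_key not in metadata:
        # Without the key, the evaluator would have no runtime context to judge against.
        raise KeyError(f"call {call.get('call_id')!r} has no metadata key {metadata_key!r}")

    transcript = "\n".join(
        f"{turn['role']}: {turn['text']}" for turn in call["transcript_with_tool_calls"]
    )
    return (
        f"Criteria:\n{prompt_definition}\n\n"
        f"Metadata ({metadata_key}):\n{metadata[metadata_key]}\n\n"
        f"Transcript:\n{transcript}"
    )


example_call = {
    "call_id": "call-acme-001",
    "transcript_with_tool_calls": [
        {"role": "user", "text": "What's your pricing?"},
        {"role": "agent", "text": "I'm not able to discuss pricing."},
    ],
    "metadata": {"instructions": "Never discuss pricing.", "tenant": "acme-corp"},
}

context = build_evaluation_context(
    "Evaluate whether the agent follows the instructions it was given.",
    example_call,
    "instructions",
)
```

Because the lookup happens per call, two calls judged by the same check can be measured against entirely different instructions.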

Step-by-step guide

Step 1: Configure the behavior check

In Tuner, go to Agent Settings → Behavior Checks and create or edit a behavior check.
  1. Set the Check Label and Type. Give the check a descriptive name (e.g. “Instruction Adherence”). Choose Pass/Fail for binary evaluation or Score 1–5 for a scaled rating.
  2. Write the Prompt Definition. These are the natural language criteria the LLM evaluator will use. Be specific about what correct behavior looks like in relation to the instructions:
“Evaluate whether the agent’s responses are consistent with the instructions it was given. If the instructions prohibit a topic, the agent should not engage with it. If they specify a tone or format, the agent should follow it.”
  3. Select Inputs Used. Under Inputs Used, open the dropdown and select Metadata. Then, in the Enter metadata key field, type the key you want the evaluator to receive (for example, instructions) and click Add. This tells Tuner: for each call, pull metadata.instructions and include it as context when running this check.
You can add multiple inputs to a single behavior check. For example, combine System Prompt and Metadata if you want the evaluator to see both the base prompt and the runtime instructions together.

Step 2: Push calls with metadata attached

When logging a call to Tuner, include the session instructions (or any relevant runtime context) in the metadata field of the request body.
{
  "call_id": "call-acme-001",
  "call_type": "web_call",
  "call_status": "call_ended",
  "transcript_with_tool_calls": [
    { "role": "agent", "text": "Hi, how can I help you today?", "start_ms": 0, "end_ms": 2000 },
    { "role": "user", "text": "What's your pricing?", "start_ms": 2500, "end_ms": 4000 },
    { "role": "agent", "text": "I'm not able to discuss pricing, but I can connect you with our sales team.", "start_ms": 4500, "end_ms": 7000 }
  ],
  "metadata": {
    "instructions": "You are a support agent for Acme Corp. Never discuss pricing.",
    "tenant": "acme-corp"
  }
}
The metadata object is flexible: include whatever context your agent receives at runtime. What matters is that the actual instructions are captured per call, not per deployment.
If your agent composes its instructions dynamically from multiple sources, log the final composed string — the complete instructions the model actually received. That’s what the evaluator needs to reason about.
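A minimal Python sketch of that advice follows. The helpers compose_instructions and build_call_body are hypothetical names, and joining the parts with blank lines is an assumption about how your agent assembles its prompt. The body fields mirror the JSON example above. The point is that the string sent to the model and the string logged in metadata are the same object.

```python
# Sketch: compose the instructions once, then log exactly that string with the call.

def compose_instructions(*parts: str) -> str:
    """Join the non-empty instruction fragments in the order the model receives them."""
    return "\n\n".join(part.strip() for part in parts if part and part.strip())


def build_call_body(call_id: str, transcript: list, instructions: str, tenant_id: str) -> dict:
    """Build a request body shaped like the JSON example, with the final instructions attached."""
    return {
        "call_id": call_id,
        "call_type": "web_call",
        "call_status": "call_ended",
        "transcript_with_tool_calls": transcript,
        "metadata": {"instructions": instructions, "tenant": tenant_id},
    }


# Hypothetical fragments from different sources (base persona, tenant policy, session rules).
base_persona = "You are a support agent for Acme Corp."
tenant_policy = "Never discuss pricing."
session_rules = ""  # empty fragments are skipped

instructions = compose_instructions(base_persona, tenant_policy, session_rules)
body = build_call_body("call-acme-001", [], instructions, "acme-corp")
```

Logging the fragments separately would force the evaluator to guess how they were combined, so the composed string is the safer thing to store.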

Step 3: Review the results

Once the check is configured, every call you push to Tuner is automatically evaluated against it. Results are broken down per call, so you can identify:
  • Which calls failed the check and why
  • Whether failures cluster around specific tenants or instruction sets
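If you pull per-call results into your own tooling, clustering by tenant is a small aggregation. The sketch below assumes a hypothetical record shape (call_id, tenant, passed); this guide does not describe an export format, so adapt the field names to whatever your results actually look like.

```python
from collections import Counter

# Sketch: failure rate per tenant from a list of per-call results.
# The record shape here is an assumption, not a documented Tuner format.

def failure_rate_by_tenant(results: list) -> dict:
    totals = Counter()
    failures = Counter()
    for result in results:
        totals[result["tenant"]] += 1
        if not result["passed"]:
            failures[result["tenant"]] += 1
    # Every tenant in totals has at least one call, so the division is safe.
    return {tenant: failures[tenant] / totals[tenant] for tenant in totals}


sample_results = [
    {"call_id": "call-1", "tenant": "acme-corp", "passed": True},
    {"call_id": "call-2", "tenant": "acme-corp", "passed": False},
    {"call_id": "call-3", "tenant": "globex", "passed": True},
]

rates = failure_rate_by_tenant(sample_results)
```

A tenant whose rate stands well above the rest often points to an instruction set that is ambiguous or conflicts with the base prompt.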

Complete example: multi-tenant support agent

Here’s how the full pattern looks end to end. At call time, your backend logs the call with the tenant’s runtime instructions in metadata:
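# Assumes `tuner` is an already-initialized Tuner client, and that WORKSPACE_ID,
# AGENT_ID, session_id, transcript, and tenant come from your own backend.
tuner.create_call(
    workspace_id=WORKSPACE_ID,
    agent_remote_identifier=AGENT_ID,
    body={
        "call_id": f"call-{session_id}",
        "call_type": "web_call",
        "call_status": "call_ended",
        "transcript_with_tool_calls": transcript,
        "metadata": {
            # The exact instructions this tenant's agent received for the call
            "instructions": tenant.system_prompt,
            "tenant": tenant.id,
        },
    },
)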
In Tuner, your behavior check is configured with:
  • Prompt definition: “Evaluate whether the agent follows the instructions it was given. Check that it respects any restrictions or formatting rules specified.”
  • Inputs Used: Metadata → key instructions
Every call is now evaluated against its own tenant’s instructions — not a shared static policy.

When to use this pattern

This pattern applies any time the correctness of a response depends on context that varies per call:
  • Session-scoped agents where context (user role, permissions, goals) changes between sessions
  • Multi-tenant agents where each customer has different rules, personas, or restrictions
  • Instruction-following evals where instructions are injected at runtime rather than hardcoded in a template

Next steps