Evaluating Agents with Dynamic Instructions

Most eval setups assume your prompt is fixed. But many production agents aren’t. They receive different instructions depending on who’s using them, what session they’re in, or what task they’re running. A customer support agent might get a different persona and policy per tenant. A coding assistant might operate under different constraints per workspace. When the instructions change, the expected behavior changes too. A static eval can’t capture that. This guide explains how to use call metadata to link runtime context to evals in Tuner, so your evals are always grounded in what the agent was actually told to do.

The problem with static evals for dynamic agents

If your agent receives different instructions per session, such as different personas, rules, or restrictions, a static eval can’t keep up. An eval that asks “does the agent avoid discussing pricing?” is correct for one tenant and wrong for another. Your evals need to know what the agent was told, not just what it said. That’s what call metadata is for.

How it works

When you push a call to Tuner via the API, you can attach arbitrary context to it in the metadata field. This metadata is stored with the call and becomes available inside evals. Using it in an eval takes three steps:

You write a prompt definition, a natural language description of the behavior you’re evaluating
You select Metadata under Inputs Used and enter the specific metadata key you want the evaluator to use
Tuner injects the value of that key into the evaluation context for each call

The evaluator then assesses each call against the runtime context that was actually in play, not a static hardcoded assumption.
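Conceptually, the injection step can be pictured as a lookup on each call's metadata. The sketch below is purely illustrative and is not Tuner's implementation: the function name `build_eval_context`, the layout of the context string, and the second tenant's instructions are all assumptions made for this example.

```python
from typing import Any


def build_eval_context(prompt_definition: str, call: dict[str, Any], metadata_key: str) -> str:
    """Combine an eval's prompt definition with one call's metadata value (illustrative only)."""
    # Look up the configured key on this specific call; calls without it get a visible marker.
    value = call.get("metadata", {}).get(metadata_key, "<missing>")
    transcript = "\n".join(
        f'{turn["role"]}: {turn["text"]}' for turn in call["transcript_with_tool_calls"]
    )
    return (
        f"Criteria:\n{prompt_definition}\n\n"
        f"metadata.{metadata_key}:\n{value}\n\n"
        f"Transcript:\n{transcript}"
    )


# Two calls with the same agent reply but different runtime instructions.
reply = [{"role": "agent", "text": "Our plans start at $20 per month."}]
acme = {"transcript_with_tool_calls": reply, "metadata": {"instructions": "Never discuss pricing."}}
other = {"transcript_with_tool_calls": reply, "metadata": {"instructions": "Share pricing when asked."}}

criteria = "Evaluate whether the agent follows the instructions it was given."
acme_context = build_eval_context(criteria, acme, "instructions")
other_context = build_eval_context(criteria, other, "instructions")
```

The two contexts differ only in the instructions they carry, which is why the same reply can fail for one tenant and pass for the other.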

Step-by-step guide

Step 1: Configure the eval

In Tuner, go to Agent Settings → Evaluation Rules and create or edit an eval.

1. Set the Eval Label and Type. Give the eval a descriptive name (e.g. “Instruction Adherence”). Choose Pass/Fail for binary evaluation or Score 1–5 for a scaled rating.

2. Write the Prompt Definition. These are the natural language criteria the LLM evaluator will use. Be specific about what correct behavior looks like in relation to the instructions:

“Evaluate whether the agent’s responses are consistent with the instructions it was given. If the instructions prohibit a topic, the agent should not engage with it. If they specify a tone or format, the agent should follow it.”

3. Select Inputs Used. Under Inputs Used, open the dropdown and select Metadata. Then, in the Enter metadata key field, type the key you want the evaluator to receive (for example, instructions) and click Add. This tells Tuner: for each call, pull metadata.instructions and include it as context when running this eval.

You can add multiple inputs to a single eval. For example, combine System Prompt and Metadata if you want the evaluator to see both the base prompt and the runtime instructions together.

Step 2: Push calls with metadata attached

When logging a call to Tuner, include the session instructions (or any relevant runtime context) in the metadata field of the request body.

{
  "call_id": "call-acme-001",
  "call_type": "web_call",
  "call_status": "call_ended",
  "transcript_with_tool_calls": [
    { "role": "agent", "text": "Hi, how can I help you today?", "start_ms": 0, "end_ms": 2000 },
    { "role": "user", "text": "What's your pricing?", "start_ms": 2500, "end_ms": 4000 },
    { "role": "agent", "text": "I'm not able to discuss pricing, but I can connect you with our sales team.", "start_ms": 4500, "end_ms": 7000 }
  ],
  "metadata": {
    "instructions": "You are a support agent for Acme Corp. Never discuss pricing.",
    "tenant": "acme-corp"
  }
}
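If your backend is written in Python, the same request body can be assembled from per-session values before it is sent. This is a minimal sketch: `build_call_payload` is a hypothetical helper, and only the field names shown in the JSON above are taken from this guide. How the body is sent (endpoint, authentication) is deliberately left out; consult the API reference for that.

```python
import json
from typing import Any


def build_call_payload(session_id: str, transcript: list[dict[str, Any]],
                       instructions: str, tenant_id: str) -> dict[str, Any]:
    """Assemble a call body with the runtime instructions attached as metadata."""
    return {
        "call_id": f"call-{session_id}",
        "call_type": "web_call",
        "call_status": "call_ended",
        "transcript_with_tool_calls": transcript,
        "metadata": {
            # Captured per call, so the evaluator sees what this session was told.
            "instructions": instructions,
            "tenant": tenant_id,
        },
    }


payload = build_call_payload(
    session_id="acme-001",
    transcript=[
        {"role": "user", "text": "What's your pricing?", "start_ms": 2500, "end_ms": 4000},
    ],
    instructions="You are a support agent for Acme Corp. Never discuss pricing.",
    tenant_id="acme-corp",
)
body = json.dumps(payload, indent=2)  # JSON string usable as the request body
```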

The metadata object is flexible. Include whatever context your agent receives at runtime. What matters is that the actual instructions are captured per call, not per deployment.

If your agent composes its instructions dynamically from multiple sources, log the final composed string, the complete instructions the model actually received. That’s what the evaluator needs to reason about.
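One way to guarantee that the logged string matches what the model saw is to compose the instructions once and reuse that single value for both the model call and the metadata. The sketch below assumes three hypothetical sources (a base prompt, a tenant policy, and session notes); the function name and the way the parts are joined are illustrative choices, not something Tuner requires.

```python
def compose_instructions(base_prompt: str, tenant_policy: str, session_notes: list[str]) -> str:
    """Join every instruction source into the single string the model receives."""
    parts = [base_prompt, tenant_policy, *session_notes]
    # Skip empty sources so the composed string has no blank sections.
    return "\n\n".join(part.strip() for part in parts if part.strip())


final_instructions = compose_instructions(
    base_prompt="You are a support agent for Acme Corp.",
    tenant_policy="Never discuss pricing.",
    session_notes=["The user is on a trial account."],
)

# Pass this same string to the model and log it under the key the eval reads.
metadata = {"instructions": final_instructions, "tenant": "acme-corp"}
```

Logging the composed value, rather than re-deriving it at logging time, avoids drift between what was sent to the model and what the evaluator reads.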

Step 3: Results

Once configured, every call you push to Tuner is automatically evaluated against this eval. Results are broken down per call, so you can identify:

Which calls failed the check and why
Whether failures cluster around specific tenants or instruction sets
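Clustering can also be checked offline if you have per-call results available in your own code. The sketch below assumes a hypothetical list of result records with `tenant` and `passed` fields; this shape is invented for illustration and is not a documented Tuner export format.

```python
from collections import Counter


def failure_rate_by_tenant(results: list[dict]) -> dict[str, float]:
    """Return the fraction of failed calls for each tenant."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for record in results:
        totals[record["tenant"]] += 1
        if not record["passed"]:
            failures[record["tenant"]] += 1
    # Counter returns 0 for tenants with no failures, so the division is safe.
    return {tenant: failures[tenant] / totals[tenant] for tenant in totals}


# Hypothetical per-call results; the field names are assumptions for this example.
results = [
    {"call_id": "call-acme-001", "tenant": "acme-corp", "passed": True},
    {"call_id": "call-acme-002", "tenant": "acme-corp", "passed": False},
    {"call_id": "call-other-001", "tenant": "other-tenant", "passed": True},
]
rates = failure_rate_by_tenant(results)
```

A tenant whose rate stands well above the rest is a hint that its instruction set, rather than the agent as a whole, is where failures concentrate.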

Complete example: multi-tenant support agent

Here’s how the full pattern looks end to end.

At call time, your backend logs the call with the tenant’s runtime instructions in metadata:

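# Assumes `tuner` is an initialized Tuner client and that WORKSPACE_ID, AGENT_ID,
# session_id, transcript, and tenant are defined elsewhere in your backend.
tuner.create_call(
    workspace_id=WORKSPACE_ID,
    agent_remote_identifier=AGENT_ID,
    body={
        "call_id": f"call-{session_id}",
        "call_type": "web_call",
        "call_status": "call_ended",
        "transcript_with_tool_calls": transcript,
        "metadata": {
            # The instructions this tenant's agent actually received for the call
            "instructions": tenant.system_prompt,
            "tenant": tenant.id,
        },
    },
)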

In Tuner, your eval is configured with:

Prompt definition: “Evaluate whether the agent follows the instructions it was given. Check that it respects any restrictions or formatting rules specified.”
Inputs Used: Metadata → key instructions

Every call is now evaluated against its own tenant’s instructions, not a shared static policy.

When to use this pattern

This pattern applies any time the correctness of a response depends on context that varies per call:

Session-scoped agents where context (user role, permissions, goals) changes between sessions
Multi-tenant agents where each customer has different rules, personas, or restrictions
Instruction-following evals where instructions are injected at runtime rather than hardcoded in a template

Next steps

See the API reference for the full call schema and metadata field constraints
Learn how to Create Custom Evals