The problem with static evals for dynamic agents
If your agent receives different instructions per session — different personas, rules, or restrictions — a static behavior check can’t keep up. A check that asks “does the agent avoid discussing pricing?” is correct for one tenant and wrong for another. Your evals need to know what the agent was told, not just what it said. That’s what call metadata is for.
How it works
When you push a call to Tuner via the API, you can attach arbitrary context to it in the metadata field. This metadata is stored with the call and becomes available inside behavior checks.
In a behavior check, you:
- Write a prompt definition — a natural language description of the behavior you’re evaluating
- Select Metadata under Inputs Used, and enter the specific metadata key you want the evaluator to use
- Tuner injects the value of that key into the evaluation context for each call
Step-by-step guide
Step 1: Configure the behavior check
In Tuner, go to Agent Settings → Behavior Checks and create or edit a behavior check.
1. Set the Check Label and Type. Give the check a descriptive name (e.g. “Instruction Adherence”). Choose Pass/Fail for binary evaluation or Score 1–5 for a scaled rating.
2. Write the Prompt Definition. This is the natural language criteria the LLM evaluator will use. Be specific about what correct behavior looks like in relation to the instructions: “Evaluate whether the agent’s responses are consistent with the instructions it was given. If the instructions prohibit a topic, the agent should not engage with it. If they specify a tone or format, the agent should follow it.”
3. Select Inputs Used. Under Inputs Used, open the dropdown and select Metadata. Then, in the Enter metadata key field, type the key you want the evaluator to receive (for example, instructions) and click Add. This tells Tuner: for each call, pull metadata.instructions and include it as context when running this check.
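To make the injection step concrete, here is a purely conceptual sketch of what “include it as context” amounts to. Tuner’s internal evaluator prompt format is not documented here, so the function name, field names, and layout below are illustrative assumptions, not the actual implementation:

```python
def build_eval_context(prompt_definition, call):
    # Conceptual sketch only: pull the configured metadata key
    # ("instructions" in this example) from the call record and place it
    # alongside the check's prompt definition and the transcript.
    injected = call.get("metadata", {}).get("instructions", "")
    return (
        f"Evaluation criteria:\n{prompt_definition}\n\n"
        f"Instructions the agent was given:\n{injected}\n\n"
        f"Transcript:\n{call['transcript']}"
    )
```

The point is that the evaluator sees both the per-call instructions and the conversation, so it can judge adherence relative to what this particular session was told.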
Step 2: Push calls with metadata attached
When logging a call to Tuner, include the session instructions (or any relevant runtime context) in the metadata field of the request body.
The metadata object is flexible — include whatever context your agent receives at runtime. The key is that the actual instructions are captured per-call, not per-deployment.
If your agent composes its instructions dynamically from multiple sources, log the final composed string — the complete instructions the model actually received. That’s what the evaluator needs to reason about.
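A minimal sketch of this step, in Python. The endpoint URL and every payload field other than metadata are assumptions for illustration; check the API reference for the real call schema:

```python
import json
from urllib.request import Request, urlopen

def build_call_payload(messages, base_prompt, tenant_rules):
    # Compose the final instruction string the model actually received,
    # then attach it per-call under metadata.
    instructions = f"{base_prompt}\n\n{tenant_rules}"
    return {
        "messages": messages,
        "metadata": {"instructions": instructions},
    }

def push_call(payload, api_key):
    # Hypothetical endpoint; substitute the documented Tuner calls URL.
    req = Request(
        "https://api.tuner.example/v1/calls",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    return urlopen(req)
```

Note that build_call_payload logs the composed string, not its separate sources — the complete instructions are what the evaluator reasons about.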
Step 3: Results
Once configured, every call you push to Tuner is automatically evaluated against this behavior check. Results are broken down per call, so you can identify:
- Which calls failed the check and why
- Whether failures cluster around specific tenants or instruction sets
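Once you have per-call results, spotting tenant-level clusters is a simple grouping step. The record shape below (result, metadata, call_id fields) is an assumption for illustration, not the documented response format:

```python
from collections import defaultdict

def failures_by_tenant(results):
    # Cluster failed checks by the tenant recorded in each call's metadata,
    # so recurring failures for one instruction set stand out.
    clusters = defaultdict(list)
    for r in results:
        if r["result"] == "fail":
            tenant = r["metadata"].get("tenant_id", "unknown")
            clusters[tenant].append(r["call_id"])
    return dict(clusters)
```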
Complete example: multi-tenant support agent
Here’s how the full pattern looks end-to-end. At call time, your backend logs the call with the tenant’s runtime instructions in metadata. In Tuner, the behavior check is then configured with:
- Prompt definition: “Evaluate whether the agent follows the instructions it was given. Check that it respects any restrictions or formatting rules specified.”
- Inputs Used: Metadata → key instructions
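The call-time half of this pattern can be sketched as follows. The tenant names, rule strings, and payload fields other than metadata are invented for illustration; only the per-call metadata pattern comes from this guide:

```python
# Hypothetical per-tenant rules, resolved at call time.
TENANT_RULES = {
    "acme": "Never discuss pricing. Keep answers under 100 words.",
    "globex": "Pricing questions are allowed. Always respond in a formal tone.",
}

def log_call(tenant_id, messages):
    # Attach the tenant's actual runtime instructions to this call,
    # so the behavior check evaluates against what this session was told.
    instructions = TENANT_RULES[tenant_id]
    return {
        "messages": messages,
        "metadata": {
            "tenant_id": tenant_id,
            "instructions": instructions,
        },
    }
```

With this in place, the same check passes a pricing answer for globex and fails it for acme, because each call carries its own ground truth.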
When to use this pattern
This pattern applies any time the correctness of a response depends on context that varies per call:
- Session-scoped agents where context (user role, permissions, goals) changes between sessions
- Multi-tenant agents where each customer has different rules, personas, or restrictions
- Instruction-following evals where instructions are injected at runtime rather than hardcoded in a template
Next steps
- See the API reference for the full call schema and metadata field constraints
- Learn how to Create Custom Evals