Diagnose Your Agent - Tuner Documentation

Setup gets your agent observed. Diagnosis is the loop that keeps it improving. The Tuner MCP reads your workspace and writes back to it, so your assistant doesn’t just report what’s broken. It finds the root cause and fixes it: updating the prompt, adding the evals, and capturing the data so the problem can’t hide again.

Diagnosis requires two things: your agent must be set up and connected, and calls must be flowing through it. The more calls Tuner has received, the deeper and more precise the analysis.

Run a Diagnosis

Paste this into your chat to run a full diagnosis:

Diagnose my agent and tell me what is broken and how to fix it.
Fetch and follow the 'tuner_analyze_agent' prompt from the Tuner MCP server.

agent_id: 485
workspace_id: 1420
agent_name: agent007

Replace agent_id, workspace_id, and agent_name with your own values from the Tuner dashboard. Don’t know the IDs? Just ask first: “List the agents in workspace 1420” and the assistant will look them up before you run the diagnosis.
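If you send this prompt from a script instead of typing it, the three values can be filled in programmatically. The following is a minimal Python sketch: the helper name `build_diagnosis_prompt` is hypothetical and not part of Tuner, and it only formats the prompt text shown above without calling the MCP server.

```python
def build_diagnosis_prompt(agent_id: int, workspace_id: int, agent_name: str) -> str:
    """Return the diagnosis prompt from this page with your own values filled in.

    Hypothetical helper: it only builds text and does not contact the Tuner MCP.
    """
    return (
        "Diagnose my agent and tell me what is broken and how to fix it.\n"
        "Fetch and follow the 'tuner_analyze_agent' prompt from the Tuner MCP server.\n"
        "\n"
        f"agent_id: {agent_id}\n"
        f"workspace_id: {workspace_id}\n"
        f"agent_name: {agent_name}\n"
    )


# Example values copied from the prompt above; replace them with your own.
prompt = build_diagnosis_prompt(485, 1420, "agent007")
print(prompt)
```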

What the Diagnosis Looks At

Following the tuner_analyze_agent prompt, the assistant works through your data in order, the same path a human analyst would take:

Reads your agent's configuration

Pulls the system prompt, workflow, and description, plus everything you’ve defined: outcomes, intents, data-extraction fields, evals, and red flags. This is the ground truth for what “good” looks like; everything else is measured against it.

Builds the health picture

Reviews success rate, red-flag rate, call volume, and trends over the last 30 days, then breaks calls down across your outcome labels and intent categories, to tell whether the agent is broadly healthy with isolated issues or degraded across the board.
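To make the health picture concrete, here is a rough Python sketch of the same arithmetic over a list of call records. The `Call` record and its field names are assumptions made for illustration; Tuner's actual call-log schema may differ, and the assistant computes these figures for you.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Call:
    # Hypothetical record shape, not Tuner's real schema.
    intent: str
    outcome: str
    success: bool
    red_flagged: bool


def health_summary(calls: list[Call]) -> dict:
    """Compute headline rates plus per-intent and per-outcome call counts."""
    total = len(calls)
    if total == 0:
        # No calls means the rates are undefined rather than zero.
        return {"total": 0, "success_rate": None, "red_flag_rate": None,
                "by_intent": {}, "by_outcome": {}}
    return {
        "total": total,
        "success_rate": sum(c.success for c in calls) / total,
        "red_flag_rate": sum(c.red_flagged for c in calls) / total,
        "by_intent": dict(Counter(c.intent for c in calls)),
        "by_outcome": dict(Counter(c.outcome for c in calls)),
    }
```

A high overall success rate combined with one intent or outcome holding most of the failures suggests isolated issues; low rates everywhere suggest the agent is degraded across the board.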

Finds the priority signals

Looks at recently red-flagged calls and which intents and outcomes they cluster on. A red flag concentrated on one intent with bad outcomes becomes the lead to chase.
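One simple way to picture this clustering step is to count red-flagged calls per intent and outcome pair. The sketch below assumes each call is a dict with the hypothetical keys `intent`, `outcome`, and `red_flagged`; it illustrates the idea and is not how the assistant is implemented.

```python
from collections import Counter


def red_flag_clusters(calls, top_n=3):
    """Count red-flagged calls per (intent, outcome) pair, most frequent first.

    `calls` is a list of dicts with the assumed keys
    "intent", "outcome", and "red_flagged".
    """
    pairs = Counter(
        (c["intent"], c["outcome"]) for c in calls if c["red_flagged"]
    )
    return pairs.most_common(top_n)
```

The pair at the top of the returned list is the lead to chase first.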

Isolates the problem segments

Filters the call logs down to the underperforming intent + outcome combinations and checks whether the problem is chronic or started at a specific agent version. A version-correlated change points to a regression, not a long-standing gap.
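The chronic-versus-regression check can be sketched as a failure rate per agent version, followed by a search for the first version where that rate jumps. The tuple shape, the integer version numbers, and the 0.2 jump threshold below are all assumptions chosen for illustration.

```python
from collections import defaultdict


def failure_rate_by_version(calls):
    """Return {agent_version: failure rate} for the calls in one segment.

    `calls` is a list of (agent_version, success) tuples (assumed shape).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for version, success in calls:
        totals[version] += 1
        if not success:
            failures[version] += 1
    return {v: failures[v] / totals[v] for v in sorted(totals)}


def first_regressed_version(rates, jump=0.2):
    """Return the first version whose failure rate rose by at least `jump`
    over the previous version, or None if the rate never jumps."""
    versions = sorted(rates)
    for prev, cur in zip(versions, versions[1:]):
        if rates[cur] - rates[prev] >= jump:
            return cur
    return None
```

A returned version points to a regression introduced at that version. `None` together with a consistently high failure rate points to a long-standing gap instead.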

Digs into individual calls

Opens the worst calls in detail: which evals are failing, whether STT, LLM, TTS, or end-to-end latency is elevated, and whether the same red flags fire consistently.
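For the latency part of this step, a simple comparison against a baseline shows which component is elevated. The component key names and the 1.5x threshold in this sketch are illustrative assumptions, not values Tuner documents.

```python
def elevated_components(call_latency_ms, baseline_ms, factor=1.5):
    """Return the latency components whose value exceeds factor x baseline.

    Both arguments are dicts keyed by component name, for example
    "stt", "llm", "tts", and "end_to_end" (assumed key names).
    Components missing from the baseline are ignored.
    """
    return sorted(
        name
        for name, value in call_latency_ms.items()
        if name in baseline_ms and value > factor * baseline_ms[name]
    )


# Illustrative numbers only: the LLM step is slow, which also inflates end-to-end.
slow = elevated_components(
    {"stt": 200, "llm": 2400, "tts": 310, "end_to_end": 3000},
    {"stt": 180, "llm": 900, "tts": 300, "end_to_end": 1500},
)
print(slow)
```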

Cross-references against your prompt

Compares what the failing calls show against what your prompt and workflow are actually designed to do. The gap between the two is the root cause.

What You Get Back

The assistant returns a structured diagnosis you can act on immediately:

Agent health: a one-line summary with the headline number that supports it.
Issues found: each with the Signal (the data, with numbers), the Pattern across failing calls, the Root cause hypothesis and confidence, a specific Fix (what to change and where, not “improve your prompt”), and how to Verify it worked.
What’s working: so you know what not to touch.
Data gaps: configuration that, if added, would sharpen future diagnoses.

Example diagnosis output

## Agent health
Broadly healthy: 82% success across 1,240 calls in the last 30 days, but one intent is dragging the average down.

## Issues found

### Booking fails when the requested time is unavailable
**Signal:** The "Offer alternative times" eval fails on 71% of *Make a Reservation* calls that end in the *No Availability* outcome, and those calls carry a red-flag rate ~3x the agent average.
**Pattern:** When the first-choice slot is taken, the agent apologizes and ends the call instead of proposing nearby times.
**Root cause:** The system prompt describes booking a requested slot but has no branch for the unavailable case. (High confidence.)
**Fix:** Add an explicit step to the booking workflow: when a slot is unavailable, offer the two closest available times before closing.
**Verify:** Watch the "Offer alternative times" eval pass-rate and the share of *No Availability* outcomes over the next week.

## What's working
Reservation confirmations and party-size capture pass on 96%+ of calls. Leave them as-is.

## Data gaps
No data-extraction field for the requested time, so how far off the alternatives are can't be measured. Adding it would sharpen this analysis.
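If you want to store or post-process diagnoses, the four sections and the per-issue fields map naturally onto a small data structure. The class and field names in this sketch are hypothetical; the assistant returns the diagnosis as text in the chat, not as these objects.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    title: str
    signal: str       # the data, with numbers
    pattern: str      # what the failing calls have in common
    root_cause: str   # hypothesis
    confidence: str   # for example "High"
    fix: str          # what to change and where
    verify: str       # how to confirm the fix worked


@dataclass
class Diagnosis:
    agent_health: str
    issues: list[Issue] = field(default_factory=list)
    whats_working: list[str] = field(default_factory=list)
    data_gaps: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the diagnosis in the same layout as the example output."""
        lines = ["## Agent health", self.agent_health, "", "## Issues found"]
        for issue in self.issues:
            lines += [
                "",
                f"### {issue.title}",
                f"**Signal:** {issue.signal}",
                f"**Pattern:** {issue.pattern}",
                f"**Root cause:** {issue.root_cause} ({issue.confidence} confidence.)",
                f"**Fix:** {issue.fix}",
                f"**Verify:** {issue.verify}",
            ]
        lines += ["", "## What's working", *self.whats_working]
        lines += ["", "## Data gaps", *self.data_gaps]
        return "\n".join(lines)
```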

From Diagnosis to Fix

A diagnosis that you can’t act on is just a report. The power of doing this over the MCP is that the same assistant can apply the fix in the same conversation, because the MCP’s write tools cover everything in your Agent Settings. After it explains the root cause, just tell it to act. It can:

Update the prompt and workflow guidance to handle the scenario that’s failing.
Create or refine evals so the behavior is scored on every future call.
Add call outcomes and intents so the right calls get categorized instead of falling through.
Add data-extraction fields to capture the values you need to measure the problem.
Add red flags and alerts so you’re warned the moment it happens again.

That turns diagnosis into a loop: diagnose → fix → capture → re-diagnose to confirm. Each pass leaves the agent better instrumented than the last, so the next problem is easier to catch.

Examples

A Spike in Unresolved Calls on One Intent

Suppose your Billing Dispute intent is ending Unresolved far more often than the rest of your calls. Ask:

“Diagnose my agent: why are so many Billing Dispute calls ending unresolved, and how do I fix it?”

The assistant pulls the outcome breakdown and sees Billing Dispute sitting at a 48% Unresolved rate vs. 12% agent-wide. It opens the worst calls and finds the same pattern: the agent can’t move forward because it never collects the account number before trying to act on the dispute. It reports the signal, the pattern, the root cause, and a concrete fix. Then you close the loop in the same chat:

“Apply that. Add an account-verification step to the prompt, create an eval that checks the agent collected and verified the account number, and add a data field to capture it.”

The assistant updates the guidance and, through the MCP’s write tools, creates the eval (“Did the agent verify the account number before discussing the dispute?”) and the data-extraction field. Now every future Billing Dispute call is measured on exactly that behavior, so your next diagnosis can confirm the Unresolved rate is dropping instead of guessing.
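The comparison behind this example is plain arithmetic: the share of calls ending in one outcome, for a single intent versus the whole agent. The sketch below assumes calls are `(intent, outcome)` tuples, a made-up shape for illustration, and uses sample data built to reproduce the 48% and 12% figures from the example.

```python
def outcome_rate(calls, outcome, intent=None):
    """Share of calls ending in `outcome`, optionally restricted to one intent.

    `calls` is a list of (intent, outcome) tuples (assumed shape).
    Returns None when no calls match the intent filter.
    """
    selected = [o for i, o in calls if intent is None or i == intent]
    if not selected:
        return None
    return sum(o == outcome for o in selected) / len(selected)


# Synthetic data: 25 Billing Dispute calls (12 Unresolved) and 75 other calls.
calls = (
    [("Billing Dispute", "Unresolved")] * 12
    + [("Billing Dispute", "Resolved")] * 13
    + [("Other", "Resolved")] * 75
)
dispute_rate = outcome_rate(calls, "Unresolved", intent="Billing Dispute")
overall_rate = outcome_rate(calls, "Unresolved")
print(f"Billing Dispute: {dispute_rate:.0%}, agent-wide: {overall_rate:.0%}")
```

Running the same calculation on calls from before and after the fix is one way to check whether the Unresolved rate is actually dropping.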

Finding the Problems You Didn’t Know to Look For

You can only write evals for problems you already know about. The ones you don’t know about are usually where calls quietly fail. Diagnosis is good at surfacing them. Ask:

“Are there calls that don’t match any of my configured intents? Cluster them and tell me what they’re about.”

The assistant scans your call logs for calls that landed with no matching intent, reads the transcripts, and groups them into themes. For example, it might find 62 calls in the last 30 days asking about international shipping, an intent you never defined. Because there’s no flow for it, those calls fall through to a generic response and abandon at a high rate. The assistant explains the theme, shows you example transcripts, and recommends:

Add an International Shipping intent so these calls are tracked.
Add a prompt/workflow branch that actually handles the request.
Add an eval to check the agent gives accurate shipping information.

And again, you can have it create the intent and eval right away, turning a blind spot into something measured and handled instead of a silent gap you’d never have thought to search for.
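To illustrate the grouping idea, the sketch below buckets unmatched calls by keyword. This is a deliberately simple stand-in: the assistant reads transcripts and groups them by meaning, whereas this code only matches substrings. The dict keys `intent` and `transcript` and the theme keywords are assumptions.

```python
from collections import defaultdict


def group_unmatched_calls(calls, themes):
    """Group transcripts of calls with no intent under the first matching theme.

    `calls` is a list of dicts with the assumed keys "intent" (None when no
    configured intent matched) and "transcript". `themes` maps a theme name
    to a list of lowercase keywords. Calls matching no theme go under "other".
    """
    groups = defaultdict(list)
    for call in calls:
        if call["intent"] is not None:
            continue  # already matched a configured intent
        text = call["transcript"].lower()
        theme = next(
            (name for name, words in themes.items()
             if any(word in text for word in words)),
            "other",
        )
        groups[theme].append(call["transcript"])
    return dict(groups)
```

A theme with many calls and no configured intent, such as the international shipping example above, is a candidate for a new intent, a workflow branch, and an eval.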

Keep Asking

The diagnosis is a starting point, not the end. The assistant still has your whole workspace at its fingertips, so you can drill into anything in plain language:

“Show me the transcripts of the 3 worst calls behind that issue.”

“What’s different about the calls that fail versus the ones that succeed on this intent?”

“Break down latency for those calls. Which component is slow: STT, the LLM, or TTS?”

“Did this start at a specific agent version, or has it always been like this?”

“Draft the exact prompt change you’re recommending, then create the eval to track it.”

“Which eval is failing the most this week, and on what kind of calls?”

Diagnosis needs real calls to analyze. If your agent hasn’t received any calls in the last 30 days, the assistant will ask you to send calls through it first. With fewer than 10 calls it will still run, but it flags that the findings may not yet be statistically reliable.
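The two thresholds above amount to a small gate on the 30-day call count. In this sketch the status strings are made-up labels for illustration; only the thresholds (zero calls, fewer than 10 calls) come from this page.

```python
def sample_size_status(call_count: int) -> str:
    """Classify a 30-day call count using the thresholds described above."""
    if call_count <= 0:
        return "no_calls"        # the assistant asks you to send calls first
    if call_count < 10:
        return "low_confidence"  # runs, but findings may not be reliable yet
    return "ok"
```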

What’s Next?

Set Up Your Agent

Connect the Tuner MCP to your IDE or chatbot and configure your agent.

Manage Tuner with MCP

Examples, best practices, and use cases for managing your agent with AI assistants.

How to Use Red Flags

Understand the red flags the diagnosis prioritizes when triaging issues.

Creating Custom Evaluations

Learn how the evals you create from a diagnosis are defined and scored.