The Agent Kept Lying to Me. Until I Opened Apidog's AI Agent Debugger.

Hands-on guide to Apidog's AI Agent Debugger: Turns, Traces, MCP transports (STDIO/HTTP/SSE), model comparison, and 5 common debugging patterns for production agents.

Ashley Innocent

20 May 2026

A Tuesday afternoon. Twelve turns into a debug session, and the agent was confidently telling me our /users endpoint was responding in forty-seven seconds. The real number was forty-seven milliseconds.

I had been chasing this bug for two days. Every time I added a print statement to the MCP server, the agent’s answer shifted just enough to make me think I was getting somewhere. Every time I rewrote the system prompt, the response sounded more plausible. None of it was right.

What I had not done, until that afternoon, was open the actual execution trace and look at what was being passed between the model and the tool. That is what Apidog’s AI Agent Debugger is for. I had installed it three weeks earlier and forgotten about it. It took twelve minutes to find the bug.

This is what surprised me.

The bug I’d been chasing

The setup was simple. An agent built on GPT-5.5. One MCP server I’d written in a weekend, exposing a get_response_time(endpoint) tool that queried our metrics pipeline. A system prompt of maybe forty words. The user prompt: “How fast is the /users endpoint?”

The agent answered fast. It answered confidently. It answered wrong, every time, in different ways. Sometimes “the endpoint is responding in 47 seconds.” Sometimes “around 0.05 seconds.” Once, memorably, “performance is acceptable.”

I had been doing the things you do. Adding logging to the MCP server. Reading the model’s response token-by-token. Diffing system prompts. Cursing. I had three open terminal windows and a Notion page of failed hypotheses by Tuesday morning.

The thing about debugging agents is that the bug is rarely where you look first. It can live in the system prompt, in the model choice, in the tool definition, in the parameters the model passed to the tool, in the data the tool returned, or in how the model interpreted that data. Six places. A console log shows you one.

What the Traces panel actually shows

The Apidog debugger opens into three columns. Sessions on the left. Turns in the middle. Traces on the right. Click any session and the middle column shows you the dialogue: user message, model response, tool call, tool return, next model response. Click any turn and the right column expands into the full execution tree underneath it.

The execution tree is the part I had been missing: every step, in order, with its inputs and outputs.

I opened the failing session. The tool call looked fine: get_response_time(endpoint="/users"). The model had picked the right tool with the right argument.

Then I expanded the tool result.

{"value": 47, "p95": 89, "samples": 1240}

There it was. The metrics pipeline returned the value in milliseconds. The model assumed seconds. 47 became “47 seconds” via a confident hallucination that did not bother questioning the unit. The tool was correct. The model was wrong. My system prompt had no instruction on units, and the tool response had no unit annotation.
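A check like the following would have flagged the ambiguity before the model ever saw it. This is a minimal sketch, not part of Apidog or of my server: the function name `find_unitless_numbers` and the `UNITLESS_OK` allow-list are hypothetical, and it only inspects top-level fields.

```python
from typing import Any

# Hypothetical allow-list: fields that are plain counts and need no unit.
UNITLESS_OK = {"samples"}


def find_unitless_numbers(result: dict[str, Any]) -> list[str]:
    """Return top-level keys whose value is a bare number with no unit attached."""
    flagged = []
    for key, value in result.items():
        if key in UNITLESS_OK:
            continue
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            flagged.append(key)
    return flagged


# The tool result from the failing session.
raw = {"value": 47, "p95": 89, "samples": 1240}
print(find_unitless_numbers(raw))
```

Any key it reports is a number the model has to guess a unit for.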

Twelve minutes from opening the debugger. Two days I had been blaming the system prompt.

The fix took six lines

I changed two things. In the MCP server, I updated the response shape:

{
  "value": { "amount": 47, "unit": "ms" },
  "p95": { "amount": 89, "unit": "ms" },
  "samples": 1240
}

Then I added one sentence to the system prompt: “Tool results return units explicitly. Read them carefully.”
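On the server side, the reshaping can be sketched roughly like this. The helper name `with_units` is hypothetical, and it assumes the metrics pipeline always reports latency in milliseconds and that `value` and `p95` are the only latency fields.

```python
import json


def with_units(raw: dict, unit: str = "ms") -> dict:
    """Wrap latency fields in {"amount", "unit"}; leave counts untouched."""
    latency_fields = ("value", "p95")  # assumption: the only latency fields
    shaped = {}
    for key, val in raw.items():
        if key in latency_fields:
            shaped[key] = {"amount": val, "unit": unit}
        else:
            shaped[key] = val
    return shaped


old = {"value": 47, "p95": 89, "samples": 1240}
new = with_units(old)
print(json.dumps(new, indent=2))
```

Keeping the unit next to the number means the model never has to infer it from the field name or the prompt.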

I ran the same /users prompt three more times. Three different sessions in the left panel. All three correctly returned “the endpoint is responding at around 47 ms”, and the model’s reasoning broke out the millisecond figures, including the p95. The token cost was eighteen percent lower than my failing runs, probably because the model wasn’t generating recovery prose around its own bad assumptions.

I ran the same prompt on Claude Opus 4.7 in a second session, side by side. Same result, twice the cost, slightly more verbose. I knew which model was going to production.

This is the part of the tool that earned my respect. Not the bug-finding, which any decent debugger should do. The model comparison, run on identical configurations with summary metrics in the left panel: turn count, step count, time, tokens, dollars. I had been doing that comparison in a Google Sheet for six months. Now it was three clicks.

What I had been getting wrong

The cheap take is that the AI Agent Debugger is a logging tool. It is not. Logging tools show you what happened. The debugger shows you what the model and the tool actually exchanged, which is a different layer.

If you write agents and you have been doing what I had been doing, which is reading model output and guessing at the cause of failures, here is what I would push back on. You are not debugging the agent. You are debugging your hypothesis about the agent. Those are different things, and only one of them gets you to a fix.

The thing I had refused to internalize for six months was that the agent is a closed system between the model, the prompt, the tools, and the tool responses. The bug always lives in one of those four. If you can see all four at the same time, you can find the bug in twelve minutes. If you cannot, you can chase it for a week.

The other thing the debugger surfaced, which I had not expected, was non-determinism in my own agent. I ran the same prompt five times after the fix, just to confirm. Three runs called get_response_time once. Two runs called it twice, the second time with the endpoint path in a different case. My tool schema was case-sensitive. I had not noticed because my failing test cases all happened to use lowercase. That was a second bug I would have shipped without seeing.
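One possible fix is to canonicalize the path inside the tool before the lookup. The sketch below assumes endpoint paths in this service are case-insensitive (not true of URLs in general), and the `METRICS` store and `normalize_endpoint` helper are hypothetical stand-ins for the real pipeline.

```python
def normalize_endpoint(endpoint: str) -> str:
    """Canonicalize an endpoint path: trim, lowercase, ensure a leading slash."""
    path = endpoint.strip().lower()
    if not path.startswith("/"):
        path = "/" + path
    return path


# Hypothetical in-memory stand-in for the metrics pipeline, keyed by lowercase path.
METRICS = {"/users": {"value": {"amount": 47, "unit": "ms"}}}


def get_response_time(endpoint: str) -> dict:
    """Look up metrics for an endpoint, tolerating case and whitespace differences."""
    key = normalize_endpoint(endpoint)
    if key not in METRICS:
        raise KeyError(f"no metrics for {key!r}")
    return METRICS[key]


print(get_response_time("/Users"))
```

With this in place, a second call that differs only in case returns the same data as the first.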

Multi-run analysis is the feature I am going to use the most going forward. Click Run five times. Look at the sessions panel. Anything that varies across runs is a place your agent is fragile.
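The same idea can be sketched outside the debugger: reduce each run to its sequence of tool calls, then count how many distinct sequences you got. The run data here is transcribed by hand for illustration. The structure is my own assumption, not an Apidog export format, and the "/Users" spelling is hypothetical.

```python
from collections import Counter


def call_signature(run):
    """Reduce a run to a hashable tuple of (tool name, sorted argument pairs)."""
    return tuple((name, tuple(sorted(args.items()))) for name, args in run)


def summarize_runs(runs):
    """Count how many runs produced each distinct sequence of tool calls."""
    return Counter(call_signature(run) for run in runs)


one_call = [("get_response_time", {"endpoint": "/users"})]
two_calls = [
    ("get_response_time", {"endpoint": "/users"}),
    ("get_response_time", {"endpoint": "/Users"}),  # hypothetical casing
]
runs = [one_call, one_call, one_call, two_calls, two_calls]

shapes = summarize_runs(runs)
for signature, n in shapes.most_common():
    calls = [f"{name}({dict(args)})" for name, args in signature]
    print(f"{n} run(s): {calls}")
```

More than one distinct signature for the same prompt is the signal worth investigating.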

Try it yourself: a full setup walkthrough

If you want the same setup I had open during the bug hunt, here is the path from a fresh install to a running debug session. Six steps, in order.

Step 1: Create a new agent debug session

Open Apidog and click AI Agent Debugger in the top tab bar. The upper section of the page configures the model and run status.

Figure: the AI Agent Debugger tab, with the model provider and model selectors at the top, Base URL auto-populated, and the Run button in the upper right.

Step 2: Configure the prompts

The Prompts tab has two input areas: one for the system prompt and one for the user prompt.

Click Run in the upper right when both are set. If you want the input box to clear automatically after each run, check Clear after Send.

Step 3: Configure the tools

The Tools tab lists everything the agent can call at runtime. The number on the tab is the current count of available or configured tools.

Built-in tools ship with the debugger. Toggle them on or off as needed.

| Tool | What it does |
| --- | --- |
| bash | Execute commands in a persistent shell session |
| web_fetch | Fetch web content and convert it to Markdown, text, or HTML |
| read | Read text, image, or PDF files |
| edit | Apply precise string replacements to files |
| write | Create or overwrite files |
| grep | Search file content with regular expressions |
| glob | Find files using glob patterns |
| kill_shell | Reset the current shell session |

MCP tools add external systems or custom capabilities through MCP Servers. Three connection methods are supported: STDIO, HTTP, and SSE.

MCP Servers that require authentication accept request headers or OAuth 2.0 flows. Once the connection succeeds, pick which tools the server exposes to the agent.
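Whichever transport you pick, MCP messages are JSON-RPC 2.0, so it helps to know roughly what travels over the connection. The sketch below builds a `tools/list` request and a `tools/call` request for my get_response_time tool. It skips the initialization handshake, the `jsonrpc_request` helper is hypothetical, and it makes no claim about how Apidog constructs these messages internally.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests need an id, unique per connection


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request object as a plain dict."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool by name with its arguments.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "get_response_time", "arguments": {"endpoint": "/users"}},
)

print(json.dumps(list_tools))
print(json.dumps(call_tool, indent=2))
```

The `arguments` object is what the Traces panel shows as the tool call input, which is why a wrong parameter is visible there at a glance.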

Step 4: Configure skills, authentication, and model parameters

Three smaller tabs round out the setup: skills, authentication, and model parameters.

Step 5: Read the three panels

After clicking Run, the session you just created appears in the left panel. Each session shows a one-line summary:

Session 3
1 turn · 1 step · 10s · 3.1k tokens · $0.02
gpt-5.5

When a tool call fails or the model returns an exception, the failing step is right there in the Traces panel with its inputs and outputs visible. No log diving.

Step 6: Compare model performance

Same prompt, same tool configuration, different model. Each run creates a new session, and the left panel lets you compare them side by side.

Useful metrics to compare: turn count, step count, time, tokens, and cost.
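To turn two sessions into a quick verdict, express the second as a ratio of the first. In this sketch the baseline numbers come from the Session 3 summary above; the candidate numbers are hypothetical, chosen only so that the cost comes out at twice the baseline.

```python
def compare(baseline: dict, candidate: dict) -> dict:
    """Ratio of candidate to baseline for each metric both sessions report."""
    return {
        metric: candidate[metric] / baseline[metric]
        for metric in baseline
        if metric in candidate and baseline[metric]  # skip zero baselines
    }


# Baseline: the gpt-5.5 session summary shown earlier.
gpt = {"turns": 1, "steps": 1, "seconds": 10, "tokens": 3100, "cost_usd": 0.02}

# Candidate: hypothetical numbers for a second model, for illustration only.
other = {"turns": 1, "steps": 1, "seconds": 12, "tokens": 3600, "cost_usd": 0.04}

ratios = compare(gpt, other)
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.2f}x")
```

A ratio above 1 means the candidate used more of that resource than the baseline.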

The takeaway

Two days of debugging collapsed into an afternoon, and I did not learn the lesson on the bug. I learned it on the tooling. The reason I had been chasing the wrong fix was that the tools I was using did not show me what I needed to see. I had a model output and a tool output, and no shared frame to look at them together. The shared frame is the entire point.

If you have written more than one agent and you have not yet opened Apidog’s AI Agent Debugger, the next agent you ship will have a bug that lives between the model and the tool. You will spend a week on it. You will write a Notion page of failed hypotheses. The bug will be exactly where the debugger would have shown you on day one.

Download Apidog and open it on the next agent that gives you a wrong answer with a confident voice. Twelve minutes. Forty-seven milliseconds, not forty-seven seconds.

The full feature reference, including MCP transport setup and plan availability, lives in “Apidog AI Agent Debugger: availability, coverage, and setup.”
