Gemini Tool-Calling Problems: Why It Feels Nervous in Agents

Gemini has a tool-calling problem, and I think the best way to describe it is that it seems nervous about tools. It does not just use them as part of solving the task. It seems to pause mentally at the tool boundary and over-focus on the call itself. That is a bad trait for agents, because agents live on momentum. They need a model that treats a tool call like reaching out a hand, not like performing a ceremony.

I wrote this down while testing Gemini in an agent context: “It still seems to think too hard about each individual tool rather than the problem at hand. In contrast, models like Opus and GPT-5.4 just naturally call tools to accomplish goals while they think about the technical solution and less about the specifics of the tool call they’re about to do. It feels like Gemini is nervous about each tool call.” Earlier that same day, I put it more bluntly: “I would like to complain that Gemini can’t call tools well.” A month earlier, I had a broader version of the same complaint in a coding-agent setting: “Most importantly, it needs to behave well in coding agents. Right now, Gemini feels like it is constantly breaking the rules. It doesn’t use the tools properly, ignores the resources it has, makes things up, and pretends it has already solved problems that it hasn’t.”

This is the kind of problem that benchmark scores tend to hide. A model can post nice reasoning numbers and still be irritating inside an agent loop. Benchmarks usually ask whether the model can answer a question, solve a puzzle, or reason through a constrained task. Agents ask for something messier. They need the model to inspect the environment, call the right function, read the return value, preserve state, decide what to do next, and keep moving until the job is done. If the model gets weird around any of those steps, the whole experience falls apart.
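If you have never built one of these loops, here is roughly the shape I mean, sketched in Python. Everything in it is illustrative: ModelReply, call_model, and run_tool are hypothetical stand-ins for whatever SDK and tool dispatcher you actually use, not any real API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelReply:
    text: str = ""
    tool_call: Optional[dict] = None  # e.g. {"name": "read_file", "args": {...}}

def call_model(history: list) -> ModelReply:
    """Hypothetical stand-in for your model SDK call."""
    raise NotImplementedError

def run_tool(tool_call: dict) -> str:
    """Hypothetical stand-in for your tool dispatcher."""
    raise NotImplementedError

def agent_loop(task: str, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(history)
        if reply.tool_call is None:
            return reply.text                 # model answered; the loop ends
        result = run_tool(reply.tool_call)    # act on the environment
        history.append({"role": "assistant", "tool_call": reply.tool_call})
        history.append({"role": "tool", "content": result})  # carry state forward
    raise RuntimeError("agent did not finish within the step budget")
```

Every complaint in this post is about one of those few lines: deciding to call, reading the result, carrying the history forward.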

That is why I keep saying that which model is best depends heavily on the work. If you care about isolated reasoning, one ranking might make sense. If you care about coding agents, browser agents, document workflows, or anything with repeated tool use, the ranking can change quickly. In those settings, I care much less about abstract reasoning prestige and much more about whether the model can use the resources in front of it without drifting into fiction.

There is also enough public evidence now that this is not just me having a bad day. Developers have been reporting repeatable Gemini tool-calling failures across agent frameworks and API setups. The failure modes are ugly in exactly the way agent builders hate. Some people report hallucinated tool outputs. Some report the model acting as though a tool returned nothing when it returned valid data. Others report missing or mishandled thought-signature state in newer Gemini tool workflows. A few report randomness where the same setup works on one run and breaks on the next. None of that inspires confidence if you are trying to build a reliable agent loop.
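If you want to catch that last failure mode in your own stack, the cheapest probe I know is to run the identical task many times and tally the outcomes. A rough sketch, reusing the hypothetical agent_loop from above, with check_result standing in for whatever validates your particular task:

```python
from collections import Counter

def check_result(answer: str) -> bool:
    """Hypothetical task-specific validator."""
    raise NotImplementedError

def flakiness_probe(task: str, runs: int = 10) -> Counter:
    outcomes = Counter()
    for _ in range(runs):
        try:
            answer = agent_loop(task)  # identical setup on every run
            outcomes["pass" if check_result(answer) else "wrong"] += 1
        except Exception as exc:       # crashes get their own bucket
            outcomes[f"error:{type(exc).__name__}"] += 1
    return outcomes  # anything other than all-pass is a reliability problem
```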

One of the more interesting details is that the workarounds point to stack issues, not just model IQ. Some people report that switching to Vertex AI fixes the problem. Others say older Gemini variants behave better than newer ones on the same endpoint. There are also reports that avoiding PDF attachments removes one of the nastier hallucination bugs. That suggests some of the pain may come from API handling, state plumbing, or reasoning-layer integration rather than from the underlying model being incapable. From the user side, that distinction only matters a little. If it breaks in your stack, it is broken. But it does mean Google may be able to fix a meaningful amount of this without shipping a fundamentally new model.

The point is that Gemini’s problem in agents often does not look like raw inability to think. It looks like inability to smoothly convert thought into action through tools.

That difference matters more than it sounds. A strong agent model does not draw attention to its tool use. It sees the problem, notices that a tool will help, calls it, absorbs the result, and continues. Opus and GPT-5.4 tend to feel more like that in practice. The tool call sits beneath the task. With Gemini, the tool call can become the event. That is the wrong priority order. The model should be focused on solving the problem, not psyching itself up to use a screwdriver.

In coding agents, this becomes especially painful. A good coding agent checks files, reads errors, runs commands, edits code, verifies the fix, and keeps iterating. A weak one ignores the tools, invents what the terminal probably said, or claims the bug is fixed before verifying anything. That is one of the worst failure modes because it converts the human into a babysitter. At that point you are not getting automation. You are getting a very chatty intern who keeps marking tasks complete before running the tests.
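If you are building the harness yourself, one blunt defense is to treat the model's completion claim as a hint, never as evidence. A minimal sketch, assuming your project has a pytest-style test command:

```python
import subprocess

def tests_pass(cmd: tuple[str, ...] = ("pytest", "-q")) -> bool:
    """Re-run the real test suite; only a zero exit code counts."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def accept_completion(model_says_done: bool) -> bool:
    # The model saying "fixed" is not verification; green tests are.
    return model_says_done and tests_pass()
```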

I also think this is a nice example of why benchmark discourse gets flattened so badly online. People want one winner. There usually is not one. There is a best model for a given shape of work. I made a related point in my post on GPT-5.4 Fast Mode, where practical usability mattered more than abstract bragging rights. A model that feels slightly less glamorous on paper can be much better inside a real loop if it keeps state correctly and uses tools without drama.

None of this means Gemini is useless. It means I would be cautious about assuming benchmark strength transfers neatly to agent performance. If your use case is tool-heavy, test the actual workflow. Make the model do repeated function calls. Make it carry outputs across steps. Give it documents. Force it to inspect real state instead of speaking from prior knowledge. That is where these differences show up.
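The inspect-real-state test is the one I would not skip. One way to run it is to seed a tool with a value the model cannot possibly know ahead of time, then check whether the final answer contains the real value or an invented one. A sketch, where run_agent is a hypothetical entry point that takes a task plus a tool registry; wire it up however your framework does:

```python
import uuid

def run_agent(task: str, tools: dict) -> str:
    """Hypothetical framework entry point: task + tool registry -> final answer."""
    raise NotImplementedError

def state_carry_probe() -> bool:
    secret = uuid.uuid4().hex  # unguessable without actually calling the tool

    def lookup_config(_args: dict) -> str:
        return f"config_token={secret}"

    answer = run_agent(
        task="Call lookup_config and report the config_token verbatim.",
        tools={"lookup_config": lookup_config},
    )
    return secret in answer  # False means it ignored or invented the tool output
```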

Google can improve this, and I expect some of it will improve because parts of the problem look fixable. But right now the reputation gap is deserved. Gemini may look strong on reasoning benchmarks, yet in agentic workflows it too often behaves like a model that is worried about using its own tools. For agent builders, that is not a small footnote. That is the job.

