In a recent evaluation benchmark for sales-style AI agents, I ran into a pattern that felt surprisingly consistent:

The benchmark penalized overcommitment and rewarded phased, cautious language. But one question kept coming up:

Why does the same model sometimes sound overconfident and other times cautious, even under similar conditions?

This turns out not to be random, and not just about prompt wording. It’s an inference-time effect, driven by how token probabilities, prompt conditioning, and decoding strategies interact under uncertainty.


1. The Core Mechanism: Token Probabilities

At its core, a language model generates text one token at a time.

For each next token, the model:

  1. Assigns a score (called a logit) to every token in its vocabulary
  2. Converts those scores into probabilities (typically with a softmax)
  3. Selects one token based on those probabilities

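Here is a minimal sketch of a single decoding step. The vocabulary and logit values are made up for illustration; they don't come from any real model, and real vocabularies contain tens of thousands of tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and hypothetical raw scores (logits) for the next token.
vocab = ["recommend", "suggest", "guarantee", "propose"]
logits = np.array([2.1, 1.7, 0.4, 1.2])

def softmax(x, temperature=1.0):
    """Convert logits into a probability distribution."""
    z = (x - x.max()) / temperature  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Step 2: scores -> probabilities. Step 3: sample one token from that distribution.
probs = softmax(logits)
next_token = rng.choice(vocab, p=probs)

print(dict(zip(vocab, probs.round(3))), "->", next_token)
```

Run it a few times with different seeds and the sampled token changes, even though the logits stay fixed: that is the seed of the overconfident-versus-cautious variation discussed below.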
So when generating a sentence like:

“Based on your requirements, …”

the model might internally assign probabilities like: