What LLMs Know But Don't Show
Research shows LLMs often contain correct answers internally but fail to express them, revealing a gap between knowledge and behavior.
Most explanations of AI hallucinations share one assumption: the model doesn't know the answer, so it makes one up.
Recent research suggests this assumption is incomplete.
In work presented at ICLR 2025, Orgad et al. examined what large language models encode internally about the truthfulness of their own outputs. The findings point to a different kind of gap.
The Core Finding
The researchers found that signals in the model's internal representations could often predict whether an answer would be correct, even in cases where the model consistently generated the wrong response.
In other words: the signal is present. The behavior is not aligned with it.
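One way to picture such an internal signal is a probing classifier: a small model trained to predict correctness from hidden states. The sketch below is only an illustration on synthetic data, not the paper's setup. The "hidden states" are random vectors with an invented correctness direction, and the names DIRECTION, make_synthetic_states, train_probe, probe_score, and accuracy are all hypothetical.

```python
import math
import random

# Hypothetical "correctness direction" in a 4-dimensional toy hidden space.
DIRECTION = [1.0, -1.0, 0.5, 0.0]

def make_synthetic_states(n, rng, strength=1.5):
    """Build toy (hidden_state, label) pairs.

    Each state is Gaussian noise shifted along DIRECTION; the shift is
    positive when the (imaginary) answer was correct, negative otherwise.
    """
    data = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        sign = 1.0 if label == 1 else -1.0
        state = [rng.gauss(0.0, 1.0) + sign * strength * d for d in DIRECTION]
        data.append((state, label))
    return data

def probe_score(weights, bias, state):
    """Probe's estimated probability that the answer is correct."""
    z = sum(w * x for w, x in zip(weights, state)) + bias
    z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(data, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(data[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for state, label in data:
            error = probe_score(weights, bias, state) - label
            for i in range(dim):
                grad_w[i] += error * state[i]
            grad_b += error
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias

def accuracy(weights, bias, data):
    """Fraction of examples where the thresholded probe matches the label."""
    hits = sum((probe_score(weights, bias, s) >= 0.5) == (y == 1) for s, y in data)
    return hits / len(data)

rng = random.Random(0)
train = make_synthetic_states(400, rng)
held_out = make_synthetic_states(200, rng)
weights, bias = train_probe(train)
print(f"held-out probe accuracy: {accuracy(weights, bias, held_out):.2f}")
```

The point of the toy is that the probe never sees the generated answer. It reads only the state vector, so any accuracy it achieves comes from information already present in that state.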
What This Suggests
This reframes part of the hallucination problem.
In many cases, the model is not missing the information. The correct answer exists within the system's internal state. It simply does not reach the output.
Why This Happens
Language models are optimized to generate likely continuations. Likelihood and truthfulness are related, but not identical.
When they diverge, generation tends to favor the more likely continuation over the more accurate one.
This is not a failure of knowledge. It is a behavioral pattern at the point where the response is produced.
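A toy example makes the divergence concrete. The probabilities below are invented for illustration and were not measured from any model; greedy_decode is a hypothetical helper that picks the single most likely token, as greedy decoding would.

```python
# Invented next-token probabilities for the prompt
# "The capital of Australia is". Illustrative numbers only.
next_token_probs = {"Sydney": 0.55, "Canberra": 0.35, "Melbourne": 0.10}
TRUE_ANSWER = "Canberra"

def greedy_decode(probs):
    """Return the most likely token, ignoring whether it is true."""
    return max(probs, key=probs.get)

choice = greedy_decode(next_token_probs)
print(choice, choice == TRUE_ANSWER)  # Sydney False
```

In this made-up distribution the correct answer holds substantial probability mass, yet the most likely continuation is the wrong one.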
What This Implies
If relevant signals already exist internally, the question changes.
It is no longer only: how do we get better information into the model? It becomes: how do we ensure the model acts on the information it already has?
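One possible way to act on an internal signal is to sample several candidate answers and let a correctness score, rather than raw likelihood, decide which one is returned. The sketch below assumes such a score is available from some probe; the Candidate class, both selection functions, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    likelihood: float   # invented probability the model assigns to this answer
    probe_score: float  # invented output of a hypothetical correctness probe

def pick_most_likely(candidates):
    """Baseline: return the answer the model finds most probable."""
    return max(candidates, key=lambda c: c.likelihood)

def pick_by_probe(candidates):
    """Alternative: return the answer the probe rates most likely correct."""
    return max(candidates, key=lambda c: c.probe_score)

# Candidate answers to "What is the capital of Australia?"
candidates = [
    Candidate("Sydney", likelihood=0.55, probe_score=0.20),
    Candidate("Canberra", likelihood=0.35, probe_score=0.90),
    Candidate("Melbourne", likelihood=0.10, probe_score=0.15),
]

print(pick_most_likely(candidates).text)  # Sydney
print(pick_by_probe(candidates).text)     # Canberra
```

Whether this helps in practice depends entirely on how reliable the score is, which this sketch simply assumes.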
A Simple Conclusion
If models already know more than they show, the path forward is not only to teach them more. It is to ensure that what they already know shapes what they actually say.
We agree. So we did something about it.
This perspective is informed by ongoing work at XyloIQ on how AI behavior can be stabilized and governed as responses are formed.
Reference: Orgad, H., Toker, M., Gekhman, Z., Reichart, R., Szpektor, I., Kotek, H., Belinkov, Y. (2025). LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. ICLR 2025.