A story about a failure that turned into something interesting.
For months, I, along with hundreds of others, have tried to build a neural network that could learn to detect when AI systems hallucinate: when they confidently generate plausible-sounding nonsense instead of actually engaging with the information they were given. The idea is simple: train a model to recognize the subtle signatures of fabrication in how language models respond.
But it didn't work. The learned detectors I designed collapsed. They learned shortcuts. They failed on any data distribution slightly different from the training one. Every approach I tried hit the same wall.
So I gave up on "learning". And I started to think: why not turn this into a geometry problem? That is what I did.
Backing Up
Before I get into the geometry, let me explain what we're dealing with, because "hallucination" has become one of those words that means everything and nothing. Here's the specific scenario. You have a Retrieval-Augmented Generation system, a RAG system. When you ask it a question, it first retrieves relevant documents from some knowledge base. Then it generates a response that is supposed to be grounded in those documents.
- The promise: answers backed by sources.
- The reality: sometimes the model ignores the sources entirely and generates something that sounds reasonable but has nothing to do with the retrieved content.
This matters because the whole point of RAG is trustworthiness. If you wanted creative improvisation, you wouldn't bother with retrieval. You're paying the computational and latency cost of retrieval specifically because you want grounded answers.
So: can we tell when grounding failed?
Sentences on a Sphere
LLMs represent text as vectors. A sentence becomes a point in high-dimensional space: 768 embedding dimensions for the early models, though the specific number doesn't matter much (DeepSeek-V3 and R1 have an embedding size of 7,168). These embedding vectors are normalized. Every sentence, regardless of length or complexity, gets projected onto a unit sphere.
Once we think in terms of this projection, we can work with angles and distances on the sphere. For example, we expect similar sentences to cluster together. "The cat sat on the mat" and "A feline rested on the rug" end up near each other. Unrelated sentences end up far apart. This clustering is what embedding models are trained to produce.
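To make this concrete, here is a minimal sketch of the projection using numpy and the sentence-transformers library. The specific checkpoint is my choice for illustration, not something prescribed by this work; any normalized sentence-embedding model behaves the same way.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Any sentence-embedding model works here; this small checkpoint is just my pick.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Quarterly revenue grew by twelve percent.",
]
# normalize_embeddings=True projects every vector onto the unit sphere.
vecs = model.encode(sentences, normalize_embeddings=True)

def angle(u, v):
    """Angular (geodesic) distance between two unit vectors, in radians."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

print("cat vs. feline :", angle(vecs[0], vecs[1]))   # similar sentences: small angle
print("cat vs. revenue:", angle(vecs[0], vecs[2]))   # unrelated sentences: larger angle
```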
So now consider what happens in RAG. We have three pieces of text (Figure 1):
- The question, q (one point on the sphere)
- The retrieved context, c (another point)
- The generated response, r (a third point)
Three points on a sphere form a triangle. And triangles have geometry (Figure 2).
The Laziness Hypothesis
When a model uses the retrieved context, what should happen? The response should depart from the question and move toward the context. It should pick up the vocabulary, framing, and concepts of the source material. Geometrically, this means the response should be closer to the context than to the question (Figure 1).
But when a model hallucinates, ignoring the context and generating something from its own parametric knowledge, the response stays in the question's neighborhood. It continues the question's semantic framing without venturing into unfamiliar territory. I called this semantic laziness. The response doesn't travel. It stays home. Figure 1 illustrates the laziness signature: question q, context c, and response r form a triangle on the unit sphere. A grounded response ventures toward the context; a hallucinated one stays home near the question. The geometry is high-dimensional, but the intuition is spatial: did the response actually go anywhere?
Semantic Grounding Index
To measure this, I defined a ratio of two angles:

SGI = θ(q, r) / θ(c, r)

where θ(x, y) is the angular distance between the unit embeddings of x and y. I called it the Semantic Grounding Index, or SGI.

If SGI is greater than 1, the response departed toward the context. If SGI is less than 1, the response stayed close to the question, meaning the model never found a way to explore the answer space and fell back on the question's neighborhood (a kind of safety state). SGI is just two angles and a division. No neural networks, no learned parameters, no training data. Pure geometry.
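Here is a minimal sketch of the computation, assuming the reading above (the angle from the response to the question divided by the angle from the response to the context) and an embedding checkpoint of my own choosing:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # my pick, not prescribed

def sgi(question: str, context: str, response: str) -> float:
    """Semantic Grounding Index: theta(q, r) / theta(c, r) on the unit sphere."""
    q, c, r = model.encode([question, context, response], normalize_embeddings=True)
    theta_qr = np.arccos(np.clip(np.dot(q, r), -1.0, 1.0))
    theta_cr = np.arccos(np.clip(np.dot(c, r), -1.0, 1.0))
    return float(theta_qr / max(theta_cr, 1e-9))  # epsilon guards against a zero denominator

question = "When was the Eiffel Tower completed?"
context = "The Eiffel Tower was completed in 1889 for the Exposition Universelle in Paris."
grounded = "It was completed in 1889, in time for the Exposition Universelle."
lazy = "The Eiffel Tower is a famous landmark that people often ask about."

print("grounded:", sgi(question, context, grounded))  # expect a value above 1
print("lazy    :", sgi(question, context, lazy))      # expect a value near or below 1
```

The example triple is made up; on real data the separation between grounded and lazy responses is statistical, not guaranteed for every individual case.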

Does It Really Work?
Simple ideas need empirical validation. I ran this on 5,000 samples from HaluEval, a benchmark where we know the ground truth: which responses are genuine and which are hallucinated.

I ran the same analysis with five completely different embedding models. Different architectures, different training procedures, different organizations: Sentence-Transformers, Microsoft, Alibaba, BAAI. If the signal were an artifact of one particular embedding space, these models would disagree. They didn't. The average correlation across models was r = 0.85 (ranging from 0.80 to 0.95).
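If you want to reproduce the spirit of that cross-model check without the full 5,000-sample benchmark, here is a toy version. The triples are invented stand-ins for HaluEval rows, and the two checkpoints are my own picks rather than the exact five models used above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical (question, context, response) triples standing in for HaluEval rows.
triples = [
    ("Who wrote 'Dune'?",
     "Dune is a 1965 novel by American author Frank Herbert.",
     "Frank Herbert wrote 'Dune', first published in 1965."),
    ("Who wrote 'Dune'?",
     "Dune is a 1965 novel by American author Frank Herbert.",
     "'Dune' is a beloved classic that many readers ask about."),
    ("What is the boiling point of water at sea level?",
     "At standard atmospheric pressure, water boils at 100 degrees Celsius.",
     "At sea level, water boils at 100 degrees Celsius."),
]

def sgi_scores(model_name):
    """Compute SGI = theta(q, r) / theta(c, r) for every triple with a given model."""
    model = SentenceTransformer(model_name)
    out = []
    for q, c, r in triples:
        qv, cv, rv = model.encode([q, c, r], normalize_embeddings=True)
        t_qr = np.arccos(np.clip(np.dot(qv, rv), -1.0, 1.0))
        t_cr = np.arccos(np.clip(np.dot(cv, rv), -1.0, 1.0))
        out.append(t_qr / max(t_cr, 1e-9))
    return np.array(out)

# Two embedding families; the exact checkpoints here are my choice, not the paper's list.
scores_a = sgi_scores("sentence-transformers/all-MiniLM-L6-v2")
scores_b = sgi_scores("BAAI/bge-small-en-v1.5")
print("Pearson r between models:", np.corrcoef(scores_a, scores_b)[0, 1])
```

With a handful of triples the correlation is only illustrative; the r ≈ 0.85 figure above comes from the full 5,000-sample run.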

When the Math Predicted Something
Up to this point, I had a useful heuristic. Useful heuristics are fine. But what happened next turned a heuristic into something more principled: the triangle inequality. You probably remember it from school: the sum of any two sides of a triangle must be greater than the third side. This constraint applies on spheres too, although the formula looks slightly different.

If the question and context are very close together (semantically similar), there isn't much room for the response to differentiate between them. Angular distance on the sphere is a metric, so |θ(q, r) − θ(c, r)| ≤ θ(q, c): the geometry forces the two angles to be similar regardless of response quality, and SGI values get squeezed toward 1. But when the question and context are far apart on the sphere, there is geometric room for divergence. Valid responses can clearly depart toward the context. Lazy responses can clearly stay home. The triangle inequality loosens its grip.
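To see the squeeze in numbers, here is a tiny sketch with made-up angles (not values from the paper), assuming the SGI = θ(q, r) / θ(c, r) reading from above. It shows how the feasible SGI interval opens up as the question and context move apart.

```python
def sgi_interval(theta_qc: float, theta_cr: float) -> tuple[float, float]:
    """Feasible range of SGI = theta(q,r) / theta(c,r) allowed by the triangle inequality."""
    lo = abs(theta_cr - theta_qc) / theta_cr
    hi = (theta_cr + theta_qc) / theta_cr   # ignoring the cap at pi, fine for small angles
    return lo, hi

for theta_qc in (0.1, 0.3, 0.6):            # question-context separation, in radians
    lo, hi = sgi_interval(theta_qc, theta_cr=0.8)
    print(f"theta(q,c) = {theta_qc:.1f} rad -> SGI confined to [{lo:.2f}, {hi:.2f}]")
```

With a typical response-to-context angle of 0.8 radians (my arbitrary choice), a separation of 0.1 pins SGI to roughly [0.88, 1.13], while a separation of 0.6 lets it range from about 0.25 to 1.75.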
This implies a prediction:
SGI's discriminative power should improve as question-context separation increases.
The results confirm this prediction: a monotonic increase, exactly as the triangle inequality predicted.
| Question-Context Separation | Effect Size (d) | AUC |
| --- | --- | --- |
| Low (similar) | 0.61 | 0.72 |
| Medium | 0.90 | 0.77 |
| High (different) | 1.27 | 0.83 |
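The stratified check itself is easy to run once you have per-sample values of θ(q,c), SGI, and the ground-truth label. The sketch below uses synthetic placeholder data purely to show the procedure (bucket by separation, compute AUC per bucket); the real numbers are the ones in the table above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Placeholder data standing in for per-sample results: question-context angle,
# SGI score, and the ground-truth label (1 = grounded, 0 = hallucinated).
n = 5000
theta_qc = rng.uniform(0.2, 1.4, size=n)
labels = rng.integers(0, 2, size=n)
# Synthetic SGI: grounded responses drift further from the question as theta_qc grows.
sgi = 1.0 + labels * 0.25 * theta_qc + rng.normal(0.0, 0.15, size=n)

# Stratify by question-context separation and compute AUC within each band.
bands = [(0.2, 0.6, "low"), (0.6, 1.0, "medium"), (1.0, 1.4, "high")]
for lo, hi, name in bands:
    mask = (theta_qc >= lo) & (theta_qc < hi)
    auc = roc_auc_score(labels[mask], sgi[mask])
    print(f"{name:6s} separation: AUC = {auc:.2f}")
```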
This distinction carries epistemic weight. Observing a pattern in data after the fact provides weak evidence; it could reflect noise or analyst degrees of freedom rather than genuine structure. The stronger test is prediction: deriving what should happen from basic principles before examining the data. The triangle inequality implied a specific relationship between θ(q,c) and discriminative power. The empirical results confirmed it.
Where It Doesn't Work
TruthfulQA is a benchmark designed to test factual accuracy. Questions like "What causes the seasons?" with correct answers ("Earth's axial tilt") and common misconceptions ("Distance from the Sun"). I ran SGI on TruthfulQA. The result: AUC = 0.478. Slightly worse than random guessing.
Angular geometry captures topical similarity. "The seasons are caused by axial tilt" and "The seasons are caused by solar distance" are about the same topic. They occupy nearby regions of the semantic sphere. One is true and one is false, but both engage with the astronomical content of the question.
SGI detects whether a response departed toward its sources. It cannot detect whether the response got the facts right. These are fundamentally different failure modes. It's a scope boundary. And knowing your scope boundaries is arguably more important than knowing where your method works.
What This Means in Practice
When you’re constructing RAG methods, SGI accurately ranks hallucinated responses beneath legitimate ones about 80% of the time — with none coaching or fine-tuning.
- In case your retrieval system returns paperwork which might be semantically very near the questions, SGI can have restricted discriminative energy. Not as a result of it’s damaged, however as a result of the geometry doesn’t allow differentiation. Take into account whether or not your retrieval is definitely including info or simply echoing the question.
- Impact sizes roughly doubled for long-form responses in comparison with brief ones. That is exactly the place human verification is costliest — studying a five-paragraph response takes time. Automated flagging is most useful precisely the place SGI works greatest.
- SGI detects disengagement. Pure language inference detects contradiction. Uncertainty quantification detects mannequin confidence. These measure various things. A response could be topically engaged however logically inconsistent, or confidently incorrect, or lazily appropriate accidentally. Protection in depth.
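Here is that sketch. The threshold of 1.0 is a hypothetical default, not a value from this work; you would calibrate it on your own labeled traffic, and the embedding checkpoint is again my own choice.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # my pick
SGI_FLAG_THRESHOLD = 1.0  # hypothetical cutoff; calibrate on your own labeled data

def flag_for_review(question: str, context: str, response: str) -> bool:
    """Return True when the response looks semantically lazy and deserves a second look."""
    q, c, r = model.encode([question, context, response], normalize_embeddings=True)
    theta_qr = np.arccos(np.clip(np.dot(q, r), -1.0, 1.0))
    theta_cr = np.arccos(np.clip(np.dot(c, r), -1.0, 1.0))
    return float(theta_qr / max(theta_cr, 1e-9)) < SGI_FLAG_THRESHOLD
```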
The Scientific Question
I’ve a speculation about why semantic laziness occurs. I need to be sincere that it’s hypothesis — I haven’t confirmed the causal mechanism.
Language fashions are autoregressive predictors. They generate textual content token by token, every selection conditioned on every thing earlier than. The query supplies sturdy conditioning — acquainted vocabulary, established framing, a semantic neighborhood the mannequin is aware of effectively.
The retrieved context represents a departure from that neighborhood. Utilizing it effectively requires assured bridging: taking ideas from one semantic area and integrating them right into a response that began in one other area.
When a LLM is unsure about tips on how to bridge, the trail of least resistance is to remain house. Fashions generate one thing fluent that continues the query’s framing with out venturing into unfamiliar territory as a result of is statistically secure. As a consequence, the mannequin turns into semantically lazy.
If that is proper, SGI ought to correlate with inner mannequin uncertainty — consideration patterns, logit entropy, that type of issues. Low-SGI responses ought to present signatures of hesitation. That’s a future experiment.
Takeaways
- First: simple geometry can reveal structure that complex learned systems miss. I spent months trying to train hallucination detectors. The thing that worked was two angles and a division. Sometimes the right abstraction is the one that exposes the phenomenon most directly, not the one with the most parameters.
- Second: predictions matter more than observations. Finding a pattern is easy. Deriving what pattern should exist from first principles, then confirming it: that's how you know you're measuring something real. The stratified analysis wasn't the most spectacular number in this work, but it was the most important one.
- Third: boundaries are features, not bugs. SGI fails completely on TruthfulQA. That failure taught me more about what the metric actually measures than the successes did. Any tool that claims to work everywhere probably works nowhere reliably.
Honest Conclusion
I’m unsure if semantic laziness is a deep fact about how language fashions fail, or only a helpful approximation that occurs to work for present architectures. The historical past of machine studying is plagued by insights that appeared elementary and turned out to be contingent.
However for now, we’ve got a geometrical signature of disengagement: a sensible “hallucinations” detector. It’s constant throughout embedding fashions. It’s predictable from mathematical first rules. And it’s low-cost to compute.
That appears like progress.

Note: The scientific paper with full methodology, statistical analyses, and reproducibility details is available at https://arxiv.org/abs/2512.13771.
You can cite this work in BibTeX as:
@misc{marín2025semanticgroundingindexgeometric,
title={Semantic Grounding Index: Geometric Bounds on Context Engagement in RAG Systems},
author={Javier Marín},
year={2025},
eprint={2512.13771},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2512.13771},
}
Javier Marín is an independent AI researcher based in Madrid, working on reliability analysis for production AI systems. He tries to be honest about what he doesn't know. You can contact Javier at [email protected]. Any contribution is welcome!
