HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs
Abstract
A theoretical framework and detection method for identifying hallucinations in large language models by analyzing data-driven and reasoning-driven components through neural tangent kernel-based scoring.
The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: data-driven hallucinations and reasoning-driven hallucinations. However, existing detection methods usually address only one source and rely on task-specific heuristics, limiting their generalization to complex scenarios. To overcome these limitations, we introduce the Hallucination Risk Bound, a unified theoretical framework that formally decomposes hallucination risk into data-driven and reasoning-driven components, linked respectively to training-time mismatches and inference-time instabilities. This provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on this foundation, we propose HalluGuard, a neural tangent kernel (NTK)-based score that leverages the NTK's induced geometry and the representations it captures to jointly identify data-driven and reasoning-driven hallucinations. We evaluate HalluGuard on 10 diverse benchmarks and 9 popular LLM backbones against 11 competitive baselines, consistently achieving state-of-the-art performance in detecting diverse forms of LLM hallucinations.
Community
HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs
Accepted at ICLR 2026
In this work, we introduce HalluGuard, a unified, theory-driven framework for hallucination detection in large language models, accepted at ICLR 2026.
Rather than treating hallucination as a single failure mode, HalluGuard explicitly decomposes hallucinations into data-driven and reasoning-driven components, and detects both at inference time, with no retraining, no labels, and no external references.
Key Takeaways
Two Sources of Hallucination
LLM hallucinations arise from two fundamentally different mechanisms:
- Data-driven hallucinations: errors rooted in biased, incomplete, or mismatched knowledge acquired during pretraining or fine-tuning.
- Reasoning-driven hallucinations: errors caused by instability and error amplification during multi-step autoregressive decoding.
Most existing detectors focus on only one of these. HalluGuard shows that real hallucinations often emerge from their interaction and evolve across decoding steps.
Hallucination Risk Bound (Theory)
We introduce a Hallucination Risk Bound, which formally decomposes total hallucination risk into:
- a representation bias term (training-time mismatch), and
- a decoding instability term (inference-time amplification).
The analysis reveals a key insight:
hallucinations originate from semantic approximation gaps and are then exponentially amplified during long-horizon generation.
This provides a principled explanation of how hallucinations emerge and evolve in LLMs.
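To make the shape of this decomposition concrete, here is one schematic way such a bound can be written. The symbols below ($\varepsilon_{\text{rep}}$, $\varepsilon_{\text{dec}}$, $C$, $\lambda$, $T$) are illustrative placeholders chosen to match the description above, not the paper's exact notation or statement:

$$
\mathcal{R}_{\text{halluc}} \;\le\; \underbrace{\varepsilon_{\text{rep}}}_{\text{representation bias (training-time mismatch)}} \;+\; \underbrace{C\, e^{\lambda T}\, \varepsilon_{\text{dec}}}_{\text{decoding instability (inference-time amplification)}}
$$

Here $T$ is the generation horizon: a small per-step decoding error $\varepsilon_{\text{dec}}$ can be exponentially amplified as the sequence grows, matching the "originate, then amplify" picture above.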
HalluGuard Score (Method)
Building on this theory, we propose HalluGuard, a lightweight NTK-based hallucination score:
Higher HalluGuard score → lower hallucination risk.
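The exact HalluGuard formula is not reproduced in this post, but the following minimal sketch shows one way an NTK-style score can be computed in practice: it builds an empirical Gram matrix from per-token gradient features of a small parameter slice and uses the average kernel alignment between decoding steps as a stability proxy. The parameter choice (`transformer.ln_f.weight`), the alignment statistic, and the function name `ntk_style_score` are assumptions made for illustration, not the paper's method.

```python
# Minimal illustrative sketch (NOT the paper's exact HalluGuard score).
# Assumption: we approximate "NTK geometry" with an empirical Gram matrix of
# per-token gradient features taken w.r.t. a small parameter slice of GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def ntk_style_score(model, tokenizer, text, device="cpu"):
    model.to(device).eval()
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"].to(device)
    logits = model(input_ids).logits[0]        # (T, vocab)
    param = model.transformer.ln_f.weight      # small slice: final LayerNorm (GPT-2)

    feats = []
    for t in range(input_ids.shape[1] - 1):
        # Gradient of the realized next token's log-probability w.r.t. the chosen parameters.
        logp = torch.log_softmax(logits[t], dim=-1)[input_ids[0, t + 1]]
        (g,) = torch.autograd.grad(logp, param, retain_graph=True)
        feats.append(g.flatten())

    J = torch.stack(feats)                     # (T-1, P) gradient features
    K = J @ J.T                                # empirical NTK-style Gram matrix
    d = K.diag().clamp_min(1e-8).sqrt()
    K = K / d.outer(d)                         # cosine-normalize

    # Average off-diagonal alignment between steps: higher -> more stable trajectory,
    # which we read (following the post) as lower hallucination risk.
    n = K.shape[0]
    return ((K.sum() - n) / max(n * (n - 1), 1)).item()


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    print(ntk_style_score(lm, tok, "The capital of France is Paris."))
```

The score is reference-free and label-free in the same spirit as the post describes: it only inspects the model's own gradient geometry over the generated trajectory.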
Strong Empirical Results
We evaluate HalluGuard across:
- 10 benchmarks (QA, math reasoning, instruction following),
- 11 competitive baselines, and
- 9 LLM backbones (from GPT-2 to 70B-scale models).
Results:
- Consistent state-of-the-art AUROC / AUPRC across all task families
- Especially strong gains on multi-step reasoning benchmarks (MATH-500, BBH)
- Robust detection of fine-grained semantic hallucinations (PAWS), even when surface forms are nearly identical
Beyond Detection: Test-Time Guidance
HalluGuard can also be used to guide test-time inference, significantly improving reasoning accuracy by steering generation away from unstable trajectories, without modifying or retraining the model.
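As a hedged illustration of how such guidance could look in practice (the paper's actual steering procedure may differ), one simple option is best-of-N reranking: sample several candidate continuations and keep the one with the highest score. This sketch reuses the hypothetical `ntk_style_score` from the previous block.

```python
# Hedged sketch: best-of-N reranking with a HalluGuard-style score.
# This is one plausible guidance scheme, not necessarily the paper's procedure.
def guided_generate(model, tokenizer, prompt, n_candidates=4, max_new_tokens=64):
    enc = tokenizer(prompt, return_tensors="pt")
    outs = model.generate(
        **enc,
        do_sample=True,
        temperature=0.8,
        num_return_sequences=n_candidates,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outs]
    # Keep the candidate whose trajectory looks most stable under the score above.
    return max(candidates, key=lambda s: ntk_style_score(model, tokenizer, s))
```

The model itself is untouched; only the selection among sampled trajectories changes, which is what makes this kind of guidance plug-and-play.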
Takeaway
HalluGuard (ICLR 2026) provides:
- a theoretical lens for understanding how hallucinations emerge and evolve, and
- a practical, plug-and-play detector for modern LLMs.
It bridges representation geometry and decoding dynamics, offering a unified foundation for reliable reasoning and uncertainty-aware inference.
Feedback and discussion are very welcome!