Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning
Pith reviewed 2026-05-16 17:16 UTC · model grok-4.3
The pith
A label-flipping tweak to preference optimization cuts hallucination rates in large language models by up to a factor of five.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
F-DPO extends Direct Preference Optimization by applying a label-flipping transformation that ensures the chosen response is never less factual than the rejected response, together with a factuality-aware margin in the loss that reduces to standard DPO whenever the two responses share the same factuality level, producing models with substantially lower hallucination rates and higher factuality scores.
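One way to write an objective with the two properties named above is sketched here. The first line is the standard DPO loss. The second line is an assumption about how a factuality-aware margin could enter, not the paper's confirmed formula: the strength parameter $\lambda \ge 0$ is hypothetical, and $h = 1$ is taken to mean "hallucinated", following the $\Delta h = h_l - h_w$ convention quoted in the passage further down the page.

```latex
r_\theta(x, y_w, y_l) = \beta \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right],
\qquad
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\big( r_\theta(x, y_w, y_l) \big)

% Assumed margin form (illustrative only):
\mathcal{L}_{\mathrm{margin}} = -\log \sigma\big( r_\theta(x, y_w, y_l) - \lambda \, \Delta h \big),
\qquad
\Delta h = h_l - h_w \in \{0, 1\} \ \text{after label flipping}
```

Under this form, $\Delta h = 0$ (both responses share the same factuality label) recovers $\mathcal{L}_{\mathrm{DPO}}$ exactly, while $\Delta h = 1$ requires a larger reward gap before the loss saturates.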
What carries the argument
The label-flipping transformation on binary factuality-labeled preference pairs plus a factuality-aware margin term added to the DPO objective.
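The two components can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Pair` type, the function names, the `lam` margin strength, and the subtractive form of the margin are all hypothetical, and labels are assumed to use 1 for "hallucinated" and 0 for "factual".

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    chosen: str
    rejected: str
    h_chosen: int    # assumed convention: 1 = hallucinated, 0 = factual
    h_rejected: int


def flip_if_misordered(pair: Pair) -> Pair:
    """Swap chosen/rejected when the chosen response is the less factual one."""
    if pair.h_chosen > pair.h_rejected:
        return Pair(pair.rejected, pair.chosen, pair.h_rejected, pair.h_chosen)
    return pair


def margin_dpo_loss(logratio_chosen: float, logratio_rejected: float,
                    delta_h: int, beta: float = 0.1, lam: float = 1.0) -> float:
    """-log sigmoid(beta * (logratio_chosen - logratio_rejected) - lam * delta_h).

    Each logratio is log pi_theta(y|x) - log pi_ref(y|x) for one response.
    The margin term lam * delta_h is an assumed form, not taken from the paper.
    """
    z = beta * (logratio_chosen - logratio_rejected) - lam * delta_h
    # -log(sigmoid(z)) == softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# A misordered pair: the preferred answer is the hallucinated one.
pair = flip_if_misordered(Pair("fluent but wrong", "plain but right", 1, 0))
delta_h = pair.h_rejected - pair.h_chosen
loss = margin_dpo_loss(logratio_chosen=0.2, logratio_rejected=-0.1, delta_h=delta_h)
```

After the flip, `delta_h` can only be 0 or 1, and with `delta_h == 0` the function computes the ordinary DPO loss for that pair.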
If this is right
- F-DPO reduces hallucination rates by a factor of five on Qwen3-8B while raising factuality scores by half.
- It improves MC1 and MC2 accuracy on TruthfulQA for models up to 14B parameters without any out-of-distribution fine-tuning.
- No auxiliary reward model or token-level supervision is required.
- The method works across model sizes from 1B to 14B and falls back to standard DPO on equally factual pairs.
- Only existing DPO pairs augmented with simple binary labels are needed.
Where Pith is reading between the lines
- The same label-correction idea could be inserted into other preference-based alignment losses to isolate factuality from fluency signals.
- If factuality labeling can be automated at scale, the approach would support training on much larger preference sets with little added cost.
- Hallucinations may arise more from misordered training signals than from limits in model capacity itself.
- Applying the correction inside domain-specific datasets such as medical or legal question answering could test whether the factuality focus carries over to specialized tasks.
Load-bearing premise
Binary factuality labels must correctly identify which response is more factual so that the flips reorder pairs without adding systematic errors.
What would settle it
Training with deliberately noisy or inverted factuality labels and finding the same hallucination reductions would show that the claimed benefit does not depend on accurate labels.
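The label-corruption step of such an experiment could look like the sketch below. The function name `corrupt_labels` and its parameters are hypothetical; the training and evaluation runs that would follow each noise level are not shown.

```python
import random


def corrupt_labels(labels, noise_rate, seed=0):
    """Return a copy of binary labels, flipping each entry with probability noise_rate."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)  # seeded so each noise level is reproducible
    return [1 - y if rng.random() < noise_rate else y for y in labels]


clean = [0, 1, 1, 0, 1]
unchanged = corrupt_labels(clean, noise_rate=0.0)  # control condition
inverted = corrupt_labels(clean, noise_rate=1.0)   # fully inverted labels
```

Sweeping `noise_rate` from 0 to 1 and retraining at each level would show whether the gains degrade as label accuracy falls, which is what the premise predicts.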
Original abstract
Preference alignment methods such as RLHF and Direct Preference Optimization (DPO) improve instruction following, but they can also reinforce hallucinations when preference judgments reward fluency and confidence over factual correctness. We introduce F-DPO (Factuality-aware Direct Preference Optimization), a simple extension of DPO that uses only binary factuality labels. F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs so the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin that emphasizes pairs with clear correctness differences, while reducing to standard DPO when both responses share the same factuality. We construct factuality-aware preference data by augmenting DPO pairs with binary factuality indicators and synthetic hallucinated variants. Across seven open-weight LLMs (1B-14B), F-DPO consistently improves factuality and reduces hallucination rates relative to both base models and standard DPO. On Qwen3-8B, F-DPO reduces hallucination rates by 5x (from 0.424 to 0.084) while improving factuality scores by 50% (from 5.26 to 7.90). F-DPO also generalizes to out-of-distribution benchmarks: on TruthfulQA, Qwen2.5-14B achieves +17% MC1 accuracy (0.500 to 0.585) and +49% MC2 accuracy (0.357 to 0.531). F-DPO requires no auxiliary reward model, token-level annotations, or multi-stage training.
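The headline figures in the abstract are relative changes, and they can be checked directly from the before/after values it quotes. The helper names below are illustrative; the numbers are the ones stated in the abstract.

```python
def reduction_factor(before: float, after: float) -> float:
    """How many times smaller the value became."""
    return before / after


def relative_change_pct(before: float, after: float) -> float:
    """Relative change as a percentage of the starting value."""
    return 100.0 * (after - before) / before


hallucination_factor = reduction_factor(0.424, 0.084)  # Qwen3-8B hallucination rate
factuality_gain = relative_change_pct(5.26, 7.90)      # Qwen3-8B factuality score
mc1_gain = relative_change_pct(0.500, 0.585)           # Qwen2.5-14B TruthfulQA MC1
mc2_gain = relative_change_pct(0.357, 0.531)           # Qwen2.5-14B TruthfulQA MC2
```

These evaluate to roughly 5.05x, +50%, +17%, and +49%, matching the abstract. The MC1 and MC2 gains are therefore relative improvements, not absolute percentage-point differences (which would be 8.5 and 17.4 points).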
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces F-DPO, an extension of Direct Preference Optimization that augments standard DPO pairs with binary factuality labels (0/1), applies a label-flipping transformation to ensure the chosen response is never less factual than the rejected response, and adds a factuality-aware margin term that is zero when labels match. The method reduces to vanilla DPO on pairs with identical factuality labels. Across seven open-weight models (1B–14B), F-DPO is reported to reduce hallucination rates (e.g., 5× on Qwen3-8B from 0.424 to 0.084) and raise factuality scores while also improving OOD accuracy on TruthfulQA MC1/MC2; the approach requires no auxiliary reward model or multi-stage training.
Significance. If the gains prove robust to label noise and data-construction choices, F-DPO supplies a minimal, single-stage modification to DPO that directly targets factuality without extra models or token-level supervision, offering a practical route to more reliable preference-aligned LLMs.
major comments (1)
- [Data-construction paragraph (abstract and §3)] The claim that label-flipping and the margin term preserve the original DPO objective rests on the unverified assumption that binary factuality labels are sufficiently accurate; no error-rate analysis, judge accuracy numbers, or ablation on synthetic hallucination generation is provided, leaving open the possibility that correlated label errors systematically reorder pairs in a fluency-biased direction and thereby inflate the reported 5× hallucination reduction.
minor comments (2)
- [Results tables/figures] No error bars, run counts, or statistical significance tests are mentioned for the per-model improvements or the TruthfulQA OOD gains.
- [Abstract] The statement that F-DPO “requires no auxiliary reward model” is correct but should explicitly note the external source used to obtain the binary factuality labels.
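As an illustration of the error bars the first minor comment asks for, a percentile bootstrap over per-example outcomes is one common option. This is a generic sketch, not a procedure described in the paper; `bootstrap_ci` and its parameters are hypothetical, and `outcomes` is assumed to be a list of 0/1 flags (1 = the response was judged hallucinated).

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of 0/1 outcomes."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Reporting such an interval per model, alongside the number of training runs, would let readers judge whether differences between base, DPO, and F-DPO models exceed sampling noise.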
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our work. We address the major comment point by point below, providing clarifications and committing to revisions where appropriate.
- Referee: Data-construction paragraph (abstract and §3): the claim that label-flipping and the margin term preserve the original DPO objective rests on the unverified assumption that binary factuality labels are sufficiently accurate; no error-rate analysis, judge accuracy numbers, or ablation on synthetic hallucination generation is provided, leaving open the possibility that correlated label errors systematically reorder pairs in a fluency-biased direction and thereby inflate the reported 5× hallucination reduction.
  Authors: We agree that the accuracy of the binary factuality labels is central to the validity of F-DPO. The label-flipping transformation is intended to ensure that the chosen response is at least as factual as the rejected one, and the margin term modulates the loss based on label differences, reducing to standard DPO when labels match. However, we acknowledge that without explicit validation of label quality, there is a risk of systematic errors. In the revised version, we will add a new subsection in §3 detailing the factuality labeling process, including an error analysis on a representative sample of pairs where we compare automated labels to human judgments. We will also include an ablation study that varies the synthetic hallucination generation parameters and measures the impact on final performance to demonstrate robustness to potential label noise. These additions will directly address the concern about possible fluency bias in reordered pairs.
  Revision: yes
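The comparison of automated labels against human judgments promised in the rebuttal could be summarized with raw agreement and Cohen's kappa. The sketch below is one plausible way to compute both for binary labels; the function name and the choice of kappa are assumptions, since the rebuttal does not say which statistics will be reported.

```python
def agreement_stats(auto_labels, human_labels):
    """Return (raw agreement, Cohen's kappa) for two equal-length binary label lists."""
    if not auto_labels or len(auto_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(auto_labels)
    observed = sum(a == h for a, h in zip(auto_labels, human_labels)) / n
    p_auto = sum(auto_labels) / n    # fraction labeled 1 by the automated judge
    p_human = sum(human_labels) / n  # fraction labeled 1 by human annotators
    # Agreement expected by chance if the two labelers were independent.
    expected = p_auto * p_human + (1 - p_auto) * (1 - p_human)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa


# Toy example with made-up labels: one disagreement in four items.
observed, kappa = agreement_stats([1, 1, 0, 0], [1, 0, 0, 0])
```

Kappa corrects raw agreement for chance, which matters here because factuality labels are likely to be imbalanced and raw agreement alone can look high even for an uninformative judge.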
Circularity Check
No significant circularity detected in derivation or results
The paper defines F-DPO explicitly as an extension of standard DPO via binary factuality labels, a label-flipping transformation, and an optional margin term that reduces to vanilla DPO when labels match. All performance claims (hallucination rate reductions, factuality score gains, and TruthfulQA OOD accuracy) are measured on separate evaluation benchmarks and models after training, with no equations or steps that reduce the reported outcomes to fitted parameters or input data by construction. The data augmentation step (synthetic hallucinated variants) is an independent preprocessing choice whose effect is tested externally rather than assumed. No self-citations are load-bearing for the core method or results.
Axiom & Free-Parameter Ledger
free parameters (1)
- factuality-aware margin
axioms (1)
- domain assumption: Binary factuality labels accurately reflect response correctness without significant noise or bias
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs so the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin... reducing to standard DPO when both responses share the same factuality.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We augment preference data with binary factuality labels... $\Delta h = h_l - h_w$... $(h_w, h_l) \in \{(0,0), (0,1), (1,1)\}$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.