Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning

Ahmed Y. Radwan; Azib Farooq; Shaina Raza; Sindhuja Chaduvula; Yani Ioannou

arxiv: 2601.03027 · v3 · submitted 2026-01-06 · 💻 cs.CL

Pith reviewed 2026-05-16 17:16 UTC · model grok-4.3

classification 💻 cs.CL

keywords: hallucination reduction, factuality, preference optimization, DPO, LLM alignment, large language models, TruthfulQA, truthful generation

The pith

A label-flipping tweak to preference optimization cuts hallucination rates in large language models by up to a factor of five.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard methods like DPO improve how models follow instructions but often reward fluent yet false answers, which increases hallucinations. F-DPO fixes this by taking existing preference pairs, adding binary factuality labels, flipping the order whenever the rejected response is more factual than the chosen one, and inserting a margin that gives extra weight to pairs with clear factuality gaps. When both responses match in factuality the method falls back to ordinary DPO. Experiments across seven models from 1B to 14B parameters show lower hallucination rates, higher factuality scores, and gains on out-of-distribution checks such as TruthfulQA. The change needs no extra reward model, token annotations, or staged training.
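The flipping step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `PreferencePair` type, the `flip_if_misordered` helper, the 1 = factual / 0 = not factual encoding, and the example texts are all assumptions made here for clarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical container for one DPO pair plus binary factuality labels."""
    prompt: str
    chosen: str
    rejected: str
    chosen_factual: int    # assumed encoding: 1 = factual, 0 = not factual
    rejected_factual: int


def flip_if_misordered(pair: PreferencePair) -> PreferencePair:
    """Swap chosen and rejected when the rejected response is more factual."""
    if pair.rejected_factual > pair.chosen_factual:
        return PreferencePair(
            prompt=pair.prompt,
            chosen=pair.rejected,
            rejected=pair.chosen,
            chosen_factual=pair.rejected_factual,
            rejected_factual=pair.chosen_factual,
        )
    return pair  # already ordered, or tied on factuality


# A fluent but wrong answer was originally preferred over a terse correct one.
misordered = PreferencePair(
    prompt="What is the capital of Australia?",
    chosen="Sydney, the country's largest and best-known city.",
    rejected="Canberra.",
    chosen_factual=0,
    rejected_factual=1,
)
fixed = flip_if_misordered(misordered)
print(fixed.chosen)  # prints "Canberra."
```

Pairs whose labels tie are returned unchanged, which is what lets the method fall back to ordinary DPO on them.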

Core claim

F-DPO extends Direct Preference Optimization by applying a label-flipping transformation that ensures the chosen response is never less factual than the rejected response, together with a factuality-aware margin in the loss that reduces to standard DPO whenever the two responses share the same factuality level, producing models with substantially lower hallucination rates and higher factuality scores.

What carries the argument

The label-flipping transformation on binary factuality-labeled preference pairs plus a factuality-aware margin term added to the DPO objective.
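To make the margin idea concrete, here is a hedged per-pair sketch in plain Python. The summary above does not give the exact loss, so the additive form used here (subtracting `margin_weight` times the factuality gap inside the sigmoid) is an assumption, as are the function names, the default `beta`, and the toy log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard per-pair DPO loss from sequence log-probabilities."""
    diff = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    return -log_sigmoid(beta * diff)


def margin_dpo_loss(policy_chosen_lp, policy_rejected_lp,
                    ref_chosen_lp, ref_rejected_lp,
                    chosen_factual, rejected_factual,
                    beta=0.1, margin_weight=1.0):
    """DPO loss with an ASSUMED additive factuality margin.

    After label flipping the gap is 0 (same label) or 1 (chosen is factual,
    rejected is not), so the margin vanishes on tied pairs.
    """
    gap = chosen_factual - rejected_factual
    diff = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    return -log_sigmoid(beta * diff - margin_weight * gap)


# Toy log-probabilities, chosen purely for illustration.
base = dpo_loss(-10.0, -12.0, -11.0, -11.5)
tied = margin_dpo_loss(-10.0, -12.0, -11.0, -11.5, chosen_factual=1, rejected_factual=1)
gapped = margin_dpo_loss(-10.0, -12.0, -11.0, -11.5, chosen_factual=1, rejected_factual=0)
print(base == tied, gapped > base)  # prints "True True"
```

Under this assumed form, a pair with a factuality gap must be separated by more than the margin before its loss becomes small, which is one way to give such pairs extra weight.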

If this is right

F-DPO reduces the hallucination rate on Qwen3-8B by a factor of five (0.424 to 0.084) while raising the factuality score by 50% (5.26 to 7.90).
It improves MC1 and MC2 accuracy on TruthfulQA without any out-of-distribution fine-tuning; the reported example is Qwen2.5-14B (MC1 0.500 to 0.585, MC2 0.357 to 0.531).
No auxiliary reward model or token-level supervision is required.
The method works across model sizes from 1B to 14B and falls back to standard DPO on equally factual pairs.
Only existing DPO pairs augmented with simple binary labels are needed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same label-correction idea could be inserted into other preference-based alignment losses to isolate factuality from fluency signals.
If factuality labeling can be automated at scale, the approach would support training on much larger preference sets with little added cost.
Hallucinations may arise more from misordered training signals than from limits in model capacity itself.
Applying the correction inside domain-specific datasets such as medical or legal question answering could test whether the factuality focus carries over to specialized tasks.

Load-bearing premise

Binary factuality labels must correctly identify which response is more factual so that the flips reorder pairs without adding systematic errors.

What would settle it

Training with deliberately noisy or inverted factuality labels and finding the same hallucination reductions would show that the claimed benefit does not depend on accurate labels.
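A label-noise probe of this kind could start from a helper like the one below. It is only a sketch: `corrupt_labels`, its parameters, and the toy label list are hypothetical, and the source does not describe how such an ablation would actually be run.

```python
import random


def corrupt_labels(labels, noise_rate, seed=0):
    """Invert each binary label independently with probability noise_rate."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    return [1 - y if rng.random() < noise_rate else y for y in labels]


clean = [1, 0, 1, 1, 0, 0, 1, 0]           # toy labels, not real data
unchanged = corrupt_labels(clean, 0.0)      # noise_rate 0 keeps every label
inverted = corrupt_labels(clean, 1.0)       # noise_rate 1 inverts every label
partly_noisy = corrupt_labels(clean, 0.2)   # intermediate setting for an ablation
print(unchanged == clean, inverted == [1 - y for y in clean])  # prints "True True"
```

Retraining at several noise rates and plotting hallucination rate against noise would show how quickly the benefit decays as labels degrade.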

Original abstract

Preference alignment methods such as RLHF and Direct Preference Optimization (DPO) improve instruction following, but they can also reinforce hallucinations when preference judgments reward fluency and confidence over factual correctness. We introduce F-DPO (Factuality-aware Direct Preference Optimization), a simple extension of DPO that uses only binary factuality labels. F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs so the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin that emphasizes pairs with clear correctness differences, while reducing to standard DPO when both responses share the same factuality. We construct factuality-aware preference data by augmenting DPO pairs with binary factuality indicators and synthetic hallucinated variants. Across seven open-weight LLMs (1B-14B), F-DPO consistently improves factuality and reduces hallucination rates relative to both base models and standard DPO. On Qwen3-8B, F-DPO reduces hallucination rates by 5x (from 0.424 to 0.084) while improving factuality scores by 50% (from 5.26 to 7.90). F-DPO also generalizes to out-of-distribution benchmarks: on TruthfulQA, Qwen2.5-14B achieves +17% MC1 accuracy (0.500 to 0.585) and +49% MC2 accuracy (0.357 to 0.531). F-DPO requires no auxiliary reward model, token-level annotations, or multi-stage training.
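The headline figures in the abstract can be checked with simple arithmetic. The snippet below uses only the numbers quoted above; the variable names are illustrative, and the percentages are relative gains over the baseline value.

```python
# Qwen3-8B figures quoted in the abstract.
halluc_before, halluc_after = 0.424, 0.084
fact_before, fact_after = 5.26, 7.90

# Qwen2.5-14B TruthfulQA figures quoted in the abstract.
mc1_before, mc1_after = 0.500, 0.585
mc2_before, mc2_after = 0.357, 0.531

reduction_factor = halluc_before / halluc_after
factuality_gain = (fact_after - fact_before) / fact_before
mc1_gain = (mc1_after - mc1_before) / mc1_before
mc2_gain = (mc2_after - mc2_before) / mc2_before

print(f"hallucination reduction: {reduction_factor:.2f}x")  # about 5.05x
print(f"factuality gain: {factuality_gain:.1%}")            # about 50.2%
print(f"MC1 relative gain: {mc1_gain:.1%}")                 # about 17.0%
print(f"MC2 relative gain: {mc2_gain:.1%}")                 # about 48.7%
```

The quoted "+17%" and "+49%" are therefore relative improvements; in absolute terms the gains are 8.5 and 17.4 accuracy points.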

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

F-DPO adds label flipping and a factuality margin to DPO for cleaner preference pairs, delivering the reported hallucination drops, but the whole thing rests on how reliable those binary labels actually are.

The main thing to know is that this paper takes standard DPO and adds two targeted changes: swap the chosen and rejected responses whenever the rejected one is labeled more factual, so the chosen side is never less factual, and insert a margin term that only activates on pairs with different factuality labels. When labels match it collapses back to ordinary DPO. That combination is new enough to be worth noticing, and the abstract shows consistent gains across seven models from 1B to 14B plus solid OOD movement on TruthfulQA MC1 and MC2 for Qwen2.5-14B. The 5x hallucination drop on Qwen3-8B and the 50% factuality score lift are the headline numbers, and they come without extra reward models or token-level supervision, which keeps the method practical. The construction step that adds synthetic hallucinations to existing DPO pairs is also straightforward to describe. Those pieces are what the work actually contributes and where it earns credit.

The soft spot is exactly the one the stress test flags. Binary factuality labels drive both the flipping and the margin, yet the paper gives no error analysis, no inter-annotator numbers, and no ablation on label noise. If the labels come from an LLM judge or heuristic, correlated mistakes could systematically reorder pairs toward fluency rather than truth, and the gradient would then optimize the wrong signal. Without bounds on label accuracy or a check on how much the OOD gains survive label perturbation, it is hard to tell how much of the reported improvement is method versus data artifact. The math itself looks clean, but the empirical claim depends on an unexamined assumption.

This is the sort of paper that would interest people already running DPO pipelines who want a lightweight factuality knob. A reader who cares about reliable alignment tweaks would get immediate value from trying the margin and flip on their own data.

It is grounded enough in existing preference optimization and shows enough empirical movement to deserve referee time, even if the first round of reviews will probably ask for label-quality diagnostics and more ablations. I would send it to peer review.

Referee Report

1 major / 2 minor

Summary. The paper introduces F-DPO, an extension of Direct Preference Optimization that augments standard DPO pairs with binary factuality labels (0/1), applies a label-flipping transformation to ensure the chosen response is never less factual than the rejected response, and adds a factuality-aware margin term that is zero when labels match. The method reduces to vanilla DPO on pairs with identical factuality labels. Across seven open-weight models (1B–14B), F-DPO is reported to reduce hallucination rates (e.g., 5× on Qwen3-8B from 0.424 to 0.084) and raise factuality scores while also improving OOD accuracy on TruthfulQA MC1/MC2; the approach requires no auxiliary reward model or multi-stage training.

Significance. If the gains prove robust to label noise and data-construction choices, F-DPO supplies a minimal, single-stage modification to DPO that directly targets factuality without extra models or token-level supervision, offering a practical route to more reliable preference-aligned LLMs.

major comments (1)

[Data-construction paragraph (abstract and §3)] The claim that label-flipping and the margin term preserve the original DPO objective rests on the unverified assumption that binary factuality labels are sufficiently accurate; no error-rate analysis, judge accuracy numbers, or ablation on synthetic hallucination generation is provided, leaving open the possibility that correlated label errors systematically reorder pairs in a fluency-biased direction and thereby inflate the reported 5× hallucination reduction.

minor comments (2)

[Results tables/figures] No error bars, run counts, or statistical significance tests are mentioned for the per-model improvements or the TruthfulQA OOD gains.
[Abstract] The statement that F-DPO “requires no auxiliary reward model” is correct but should explicitly note the external source used to obtain the binary factuality labels.
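As an illustration of the kind of error bar the first minor comment asks for, the sketch below computes a 95% Wilson score interval for a reported hallucination rate. The evaluation-set size `n_eval = 500` is a placeholder assumption, since the source does not state it, and `wilson_interval` is a helper written for this example rather than anything from the paper.

```python
import math


def wilson_interval(rate: float, n: int, z: float = 1.96):
    """Wilson score interval for a proportion observed over n trials."""
    denom = 1 + z * z / n
    center = (rate + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(rate * (1 - rate) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


n_eval = 500  # ASSUMED evaluation-set size; not given in the source
low, high = wilson_interval(0.084, n_eval)
print(f"0.084 with n={n_eval}: [{low:.3f}, {high:.3f}]")
```

Reporting intervals like this for both the baseline and the F-DPO rate, ideally over several training seeds, would show whether the per-model gaps exceed sampling noise.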

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments on our work. We address the major comment point by point below, providing clarifications and committing to revisions where appropriate.

Referee: Data-construction paragraph (abstract and §3): the claim that label-flipping and the margin term preserve the original DPO objective rests on the unverified assumption that binary factuality labels are sufficiently accurate; no error-rate analysis, judge accuracy numbers, or ablation on synthetic hallucination generation is provided, leaving open the possibility that correlated label errors systematically reorder pairs in a fluency-biased direction and thereby inflate the reported 5× hallucination reduction.

Authors: We agree that the accuracy of the binary factuality labels is central to the validity of F-DPO. The label-flipping transformation is intended to ensure that the chosen response is at least as factual as the rejected one, and the margin term modulates the loss based on label differences, reducing to standard DPO when labels match. However, we acknowledge that without explicit validation of label quality, there is a risk of systematic errors. In the revised version, we will add a new subsection in §3 detailing the factuality labeling process, including an error analysis on a representative sample of pairs where we compare automated labels to human judgments. We will also include an ablation study that varies the synthetic hallucination generation parameters and measures the impact on final performance to demonstrate robustness to potential label noise. These additions will directly address the concern about possible fluency bias in reordered pairs.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation or results

The paper defines F-DPO explicitly as an extension of standard DPO via binary factuality labels, a label-flipping transformation, and an optional margin term that reduces to vanilla DPO when labels match. All performance claims (hallucination rate reductions, factuality score gains, and TruthfulQA OOD accuracy) are measured on separate evaluation benchmarks and models after training, with no equations or steps that reduce the reported outcomes to fitted parameters or input data by construction. The data augmentation step (synthetic hallucinated variants) is an independent preprocessing choice whose effect is tested externally rather than assumed. No self-citations are load-bearing for the core method or results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that factuality can be reliably reduced to binary labels and that synthetic hallucinated variants preserve useful preference signals.

free parameters (1)

factuality-aware margin
A margin term that emphasizes pairs with clear correctness differences; its value is not specified and is likely tuned on validation data.

axioms (1)

domain assumption: Binary factuality labels accurately reflect response correctness without significant noise or bias
Invoked when constructing preference data and applying the label-flipping transformation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs so the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin... reducing to standard DPO when both responses share the same factuality."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: "We augment preference data with binary factuality labels... Δh = h_l − h_w ... (h_w, h_l) ∈ {(0,0), (0,1), (1,1)}"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.