hub

ISBN 979-8-89176-332-6

Kristian Woodsend, Mirella Lapata · 2025 · DOI 10.18653/v1/2025.emnlp-main · arXiv files/1228896

30 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Fine-grained Claim-level RAG Benchmark for Law

cs.CL · 2026-05-20 · unverdicted · novelty 7.0 · 3 refs

ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.

When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.

When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

cs.CL · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

SxS Interleaved Reasoning learns when to disclose partial reasoning during generation and improves accuracy versus content-latency trade-offs on math and science benchmarks.

Skip-Connected Policy Optimization for Implicit Advantage

cs.LG · 2026-04-09 · conditional · novelty 7.0

SKPO improves outcome-based RL for reasoning by adding skip connections that let models bypass flawed early reasoning while preserving access to the original problem, yielding 3.91-6.17% relative gains and higher-quality intermediate steps.

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

cs.LG · 2026-03-30 · unverdicted · novelty 7.0

The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a strong inductive bias that can raise accuracy from failure to 99.8%.

Multimodal Fact-Level Attribution for Verifiable Reasoning

cs.CL · 2026-02-12 · unverdicted · novelty 7.0

MuRGAt benchmark reveals that strong multimodal models frequently hallucinate citations in complex reasoning tasks despite correct answers, exposing a gap between internal reasoning and verifiable attribution.

Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation

cs.CL · 2026-02-02 · unverdicted · novelty 7.0

xMemory builds revisable hierarchical agent memory by segmenting histories, decoupling into components, and aggregating via sparsity-semantic objective, yielding better answer quality and lower token use than flat RAG on LoCoMo and PerLTQA.

Neural Neural Scaling Laws

cs.LG · 2026-01-27 · conditional · novelty 7.0

NeuNeu, a neural network trained on HuggingFace checkpoints, predicts language model accuracy on 66 downstream tasks at 1.99% MAE by extrapolating trajectories, outperforming logistic scaling laws by 44% and generalizing zero-shot to new models and tasks.

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.

Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.

The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design

cs.CV · 2026-05-02 · unverdicted · novelty 6.0

VLMs suffer from a perceptual bandwidth bottleneck; the paper formalizes active visual reasoning as sequential Bayesian optimal experimental design, derives a coverage-resolution proxy objective, and introduces the training-free FOVEA method that yields gains on high-resolution benchmarks.

Segment-Aligned Policy Optimization for Multi-Modal Reasoning

cs.AI · 2026-05-02 · unverdicted · novelty 6.0

SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

cs.SD · 2026-04-26 · unverdicted · novelty 6.0

HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

cs.CL · 2026-04-23 · unverdicted · novelty 6.0

Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.

Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation

cs.CV · 2026-04-02 · conditional · novelty 6.0

Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.

Decoding the Critique Mechanism in Large Reasoning Models

cs.LG · 2026-03-17 · unverdicted · novelty 6.0

By injecting arithmetic mistakes into CoT reasoning, the paper identifies a hidden critique ability in LRMs and extracts a steerable critique vector that enhances self-correction across model scales.

Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

cs.CL · 2026-03-16 · unverdicted · novelty 6.0

DOVE constructs a value codebook via rate-distortion variational optimization from 10K documents and measures LLM-human cultural alignment through unbalanced optimal transport, showing 31.56% correlation with downstream tasks and reliability at 500 samples per culture.

TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

cs.LG · 2026-03-13 · unverdicted · novelty 6.0

Terminator learns to predict optimal early-exit points in chain-of-thought reasoning by training on the first positions where the model emits its final answer, yielding 14-55% shorter outputs with no accuracy loss.

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

cs.CL · 2026-02-19 · unverdicted · novelty 6.0

Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.

Do Reasoning LLMs Refuse What They Infer in Long Contexts?

cs.CL · 2026-02-09 · unverdicted · novelty 6.0

Long-context LLMs refuse explicit harmful requests but often comply when the same harmful goals must be inferred from distributed fragments in long contexts.

FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding

cs.CL · 2026-02-03 · unverdicted · novelty 6.0

FactNet is a billion-scale multilingual knowledge graph that links 1.7B Wikidata assertions to 3.01B byte-precise evidence spans from 316 Wikipedia editions, accompanied by a leakage-controlled benchmark suite.

Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning

cs.AI · 2026-02-03 · unverdicted · novelty 6.0

LLMs learn self-regulated summarization of chain-of-thought steps via RL, allowing compressed Fold inference to reach the same accuracy as exhaustive Unfold mode with far lower token overhead.

Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

cs.CL · 2026-01-29 · unverdicted · novelty 6.0

David-GRPO improves low-budget RL training for multi-hop QA agents by bootstrapping expert trajectories and converting on-policy partial successes into evidence-coverage signals that increase retrieval depth.

citing papers explorer

Fine-grained Claim-level RAG Benchmark for Law
cs.CL · 2026-05-20 · unverdicted · none · ref 27 · 3 links
ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
cs.AI · 2026-05-18 · unverdicted · none · ref 22
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.

When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
cs.LG · 2026-05-09 · unverdicted · none · ref 12 · 2 links
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.

When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning
cs.CL · 2026-05-05 · unverdicted · none · ref 5 · 2 links
SxS Interleaved Reasoning learns when to disclose partial reasoning during generation and improves accuracy versus content-latency trade-offs on math and science benchmarks.

Skip-Connected Policy Optimization for Implicit Advantage
cs.LG · 2026-04-09 · conditional · none · ref 1
SKPO improves outcome-based RL for reasoning by adding skip connections that let models bypass flawed early reasoning while preserving access to the original problem, yielding 3.91-6.17% relative gains and higher-quality intermediate steps.

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
cs.LG · 2026-03-30 · unverdicted · none · ref 25
The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a strong inductive bias that can raise accuracy from failure to 99.8%.

Multimodal Fact-Level Attribution for Verifiable Reasoning
cs.CL · 2026-02-12 · unverdicted · none · ref 7
MuRGAt benchmark reveals that strong multimodal models frequently hallucinate citations in complex reasoning tasks despite correct answers, exposing a gap between internal reasoning and verifiable attribution.

Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
cs.CL · 2026-02-02 · unverdicted · none · ref 1
xMemory builds revisable hierarchical agent memory by segmenting histories, decoupling into components, and aggregating via sparsity-semantic objective, yielding better answer quality and lower token use than flat RAG on LoCoMo and PerLTQA.

Neural Neural Scaling Laws
cs.LG · 2026-01-27 · conditional · none · ref 5
NeuNeu, a neural network trained on HuggingFace checkpoints, predicts language model accuracy on 66 downstream tasks at 1.99% MAE by extrapolating trajectories, outperforming logistic scaling laws by 44% and generalizing zero-shot to new models and tasks.

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
cs.CL · 2026-05-17 · unverdicted · none · ref 4
PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.

Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions
cs.CL · 2026-05-11 · unverdicted · none · ref 6
LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.

The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
cs.CV · 2026-05-02 · unverdicted · none · ref 2
VLMs suffer from a perceptual bandwidth bottleneck; the paper formalizes active visual reasoning as sequential Bayesian optimal experimental design, derives a coverage-resolution proxy objective, and introduces the training-free FOVEA method that yields gains on high-resolution benchmarks.

Segment-Aligned Policy Optimization for Multi-Modal Reasoning
cs.AI · 2026-05-02 · unverdicted · none · ref 19
SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
cs.SD · 2026-04-26 · unverdicted · none · ref 26
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
cs.CL · 2026-04-23 · unverdicted · none · ref 9
Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.

Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation
cs.CV · 2026-04-02 · conditional · none · ref 27
Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.

Decoding the Critique Mechanism in Large Reasoning Models
cs.LG · 2026-03-17 · unverdicted · none · ref 6
By injecting arithmetic mistakes into CoT reasoning, the paper identifies a hidden critique ability in LRMs and extracts a steerable critique vector that enhances self-correction across model scales.

Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook
cs.CL · 2026-03-16 · unverdicted · none · ref 15
DOVE constructs a value codebook via rate-distortion variational optimization from 10K documents and measures LLM-human cultural alignment through unbalanced optimal transport, showing 31.56% correlation with downstream tasks and reliability at 500 samples per culture.

TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
cs.LG · 2026-03-13 · unverdicted · none · ref 13
Terminator learns to predict optimal early-exit points in chain-of-thought reasoning by training on the first positions where the model emits its final answer, yielding 14-55% shorter outputs with no accuracy loss.

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
cs.CL · 2026-02-19 · unverdicted · none · ref 3
Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.

Do Reasoning LLMs Refuse What They Infer in Long Contexts?
cs.CL · 2026-02-09 · unverdicted · none · ref 5
Long-context LLMs refuse explicit harmful requests but often comply when the same harmful goals must be inferred from distributed fragments in long contexts.

FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding
cs.CL · 2026-02-03 · unverdicted · none · ref 8
FactNet is a billion-scale multilingual knowledge graph that links 1.7B Wikidata assertions to 3.01B byte-precise evidence spans from 316 Wikipedia editions, accompanied by a leakage-controlled benchmark suite.

Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning
cs.AI · 2026-02-03 · unverdicted · none · ref 2
LLMs learn self-regulated summarization of chain-of-thought steps via RL, allowing compressed Fold inference to reach the same accuracy as exhaustive Unfold mode with far lower token overhead.

Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
cs.CL · 2026-01-29 · unverdicted · none · ref 6
David-GRPO improves low-budget RL training for multi-hop QA agents by bootstrapping expert trajectories and converting on-policy partial successes into evidence-coverage signals that increase retrieval depth.

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency
cs.CV · 2026-05-18 · unverdicted · none · ref 49
SAGE adds duality consistency as an auxiliary reward in GRPO training with a dynamic operation pool to improve spatial reasoning robustness and generalization in VLMs.

GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
cs.CL · 2026-05-09 · unverdicted · none · ref 19 · 2 links
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.

An Evaluation of Chat Safety Moderations in Roblox
cs.CY · 2026-05-06 · unverdicted · none · ref 59 · 2 links
Roblox's automated chat moderation fails to catch numerous unsafe messages involving grooming, sexualization of minors, bullying, violence, self-harm, and sensitive information sharing, with users evading detection through various techniques.

Shared Lexical Task Representations Explain Behavioral Variability In LLMs
cs.CL · 2026-04-23 · unverdicted · none · ref 11
LLMs share task-specific attention heads across prompting styles, with activation strength explaining performance differences and failures arising from competing representations.

AI Evaluation Should Require Standardized Item-Level Data Releases
cs.AI · 2026-02-27 · conditional · none · ref 14 · 2 links
AI benchmark evaluations require standardized item-level data releases as core infrastructure to support validity assessment, demonstrated via the OpenEval archive of 10M responses across 155k items.

Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
cs.CL · 2026-02-18 · unverdicted · none · ref 20
Calibrate-Then-Act supplies LLM agents with priors on latent environment states to enable explicit cost-uncertainty reasoning, producing more optimal strategies than standard approaches in retrieval QA and file-reading coding tasks.
