ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.
hub
ISBN 979-8-89176-332-6
30 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 30roles
background 3polarities
background 3representative citing papers
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
SxS Interleaved Reasoning learns when to disclose partial reasoning during generation and improves accuracy versus content-latency trade-offs on math and science benchmarks.
SKPO improves outcome-based RL for reasoning by adding skip connections that let models bypass flawed early reasoning while preserving access to the original problem, yielding 3.91-6.17% relative gains and higher-quality intermediate steps.
The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a strong inductive bias that can raise accuracy from failure to 99.8%.
MuRGAt benchmark reveals that strong multimodal models frequently hallucinate citations in complex reasoning tasks despite correct answers, exposing a gap between internal reasoning and verifiable attribution.
xMemory builds revisable hierarchical agent memory by segmenting histories, decoupling into components, and aggregating via sparsity-semantic objective, yielding better answer quality and lower token use than flat RAG on LoCoMo and PerLTQA.
NeuNeu, a neural network trained on HuggingFace checkpoints, predicts language model accuracy on 66 downstream tasks at 1.99% MAE by extrapolating trajectories, outperforming logistic scaling laws by 44% and generalizing zero-shot to new models and tasks.
PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.
LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.
VLMs suffer from a perceptual bandwidth bottleneck; the paper formalizes active visual reasoning as sequential Bayesian optimal experimental design, derives a coverage-resolution proxy objective, and introduces the training-free FOVEA method that yields gains on high-resolution benchmarks.
SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.
Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.
By injecting arithmetic mistakes into CoT reasoning, the paper identifies a hidden critique ability in LRMs and extracts a steerable critique vector that enhances self-correction across model scales.
DOVE constructs a value codebook via rate-distortion variational optimization from 10K documents and measures LLM-human cultural alignment through unbalanced optimal transport, showing 31.56% correlation with downstream tasks and reliability at 500 samples per culture.
Terminator learns to predict optimal early-exit points in chain-of-thought reasoning by training on the first positions where the model emits its final answer, yielding 14-55% shorter outputs with no accuracy loss.
Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.
Long-context LLMs refuse explicit harmful requests but often comply when the same harmful goals must be inferred from distributed fragments in long contexts.
FactNet is a billion-scale multilingual knowledge graph that links 1.7B Wikidata assertions to 3.01B byte-precise evidence spans from 316 Wikipedia editions, accompanied by a leakage-controlled benchmark suite.
LLMs learn self-regulated summarization of chain-of-thought steps via RL, allowing compressed Fold inference to reach the same accuracy as exhaustive Unfold mode with far lower token overhead.
David-GRPO improves low-budget RL training for multi-hop QA agents by bootstrapping expert trajectories and converting on-policy partial successes into evidence-coverage signals that increase retrieval depth.
citing papers explorer
-
Fine-grained Claim-level RAG Benchmark for Law
ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.
-
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
-
When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
Prompt-boundary directional alignment enables geometry-guided search that cuts trials to 95% best utility by 39.8% on average, while concept granularity predicts remaining difficulty via directional heterogeneity.
-
When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning
SxS Interleaved Reasoning learns when to disclose partial reasoning during generation and improves accuracy versus content-latency trade-offs on math and science benchmarks.
-
Skip-Connected Policy Optimization for Implicit Advantage
SKPO improves outcome-based RL for reasoning by adding skip connections that let models bypass flawed early reasoning while preserving access to the original problem, yielding 3.91-6.17% relative gains and higher-quality intermediate steps.
-
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a strong inductive bias that can raise accuracy from failure to 99.8%.
-
Multimodal Fact-Level Attribution for Verifiable Reasoning
MuRGAt benchmark reveals that strong multimodal models frequently hallucinate citations in complex reasoning tasks despite correct answers, exposing a gap between internal reasoning and verifiable attribution.
-
Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
xMemory builds revisable hierarchical agent memory by segmenting histories, decoupling into components, and aggregating via sparsity-semantic objective, yielding better answer quality and lower token use than flat RAG on LoCoMo and PerLTQA.
-
Neural Neural Scaling Laws
NeuNeu, a neural network trained on HuggingFace checkpoints, predicts language model accuracy on 66 downstream tasks at 1.99% MAE by extrapolating trajectories, outperforming logistic scaling laws by 44% and generalizing zero-shot to new models and tasks.
-
Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.
-
Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions
LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.
-
The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
VLMs suffer from a perceptual bandwidth bottleneck; the paper formalizes active visual reasoning as sequential Bayesian optimal experimental design, derives a coverage-resolution proxy objective, and introduces the training-free FOVEA method that yields gains on high-resolution benchmarks.
-
Segment-Aligned Policy Optimization for Multi-Modal Reasoning
SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
-
HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
-
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.
-
Overconfidence and Calibration in Medical VQA: Empirical Findings and Hallucination-Aware Mitigation
Empirical study finds overconfidence persists in medical VLMs despite scaling and prompting; post-hoc calibration reduces error while hallucination-aware calibration improves both calibration and AUROC.
-
Decoding the Critique Mechanism in Large Reasoning Models
By injecting arithmetic mistakes into CoT reasoning, the paper identifies a hidden critique ability in LRMs and extracts a steerable critique vector that enhances self-correction across model scales.
-
Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook
DOVE constructs a value codebook via rate-distortion variational optimization from 10K documents and measures LLM-human cultural alignment through unbalanced optimal transport, showing 31.56% correlation with downstream tasks and reliability at 500 samples per culture.
-
TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning
Terminator learns to predict optimal early-exit points in chain-of-thought reasoning by training on the first positions where the model emits its final answer, yielding 14-55% shorter outputs with no accuracy loss.
-
Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.
-
Do Reasoning LLMs Refuse What They Infer in Long Contexts?
Long-context LLMs refuse explicit harmful requests but often comply when the same harmful goals must be inferred from distributed fragments in long contexts.
-
FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding
FactNet is a billion-scale multilingual knowledge graph that links 1.7B Wikidata assertions to 3.01B byte-precise evidence spans from 316 Wikipedia editions, accompanied by a leakage-controlled benchmark suite.
-
Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning
LLMs learn self-regulated summarization of chain-of-thought steps via RL, allowing compressed Fold inference to reach the same accuracy as exhaustive Unfold mode with far lower token overhead.
-
Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
David-GRPO improves low-budget RL training for multi-hop QA agents by bootstrapping expert trajectories and converting on-policy partial successes into evidence-coverage signals that increase retrieval depth.
-
Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency
SAGE adds duality consistency as an auxiliary reward in GRPO training with a dynamic operation pool to improve spatial reasoning robustness and generalization in VLMs.
-
GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression
GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.
-
An Evaluation of Chat Safety Moderations in Roblox
Roblox's automated chat moderation fails to catch numerous unsafe messages involving grooming, sexualization of minors, bullying, violence, self-harm, and sensitive information sharing, with users evading detection through various techniques.
-
Shared Lexical Task Representations Explain Behavioral Variability In LLMs
LLMs share task-specific attention heads across prompting styles, with activation strength explaining performance differences and failures arising from competing representations.
-
AI Evaluation Should Require Standardized Item-Level Data Releases
AI benchmark evaluations require standardized item-level data releases as core infrastructure to support validity assessment, demonstrated via the OpenEval archive of 10M responses across 155k items.
-
Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Calibrate-Then-Act supplies LLM agents with priors on latent environment states to enable explicit cost-uncertainty reasoning, producing more optimal strategies than standard approaches in retrieval QA and file-reading coding tasks.