Canonical reference

ISBN 979-8-89176-335-7

Association for Computational Linguistics · 2025 · DOI 10.18653/v1/2025

Canonical reference. 74% of citing Pith papers cite this work as background.

104 Pith papers citing it

Background 74% of classified citations

open at publisher browse 104 citing papers

citation-role summary

background 19 baseline 2 dataset 2

citation-polarity summary

background 17 baseline 2 use dataset 2 support 1 unclear 1

co-cited works

representative citing papers

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

As It Was: Aligning LLM Search Evaluation with Historical User Preferences

cs.IR · 2026-07-01 · unverdicted · novelty 7.0

Augmenting LLM search judges with historical QRI cards improves Spearman correlation with user preferences by ~5% overall (91% relative on disagreements) and 15% in multilingual settings, with better alignment to live A/B test outcomes.

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.

When Latent Agents Lie: KV-Cache Integrity in Multi-Agent LLM Collaboration

cs.MA · 2026-06-27 · conditional · novelty 7.0

KV-cache sharing boosts multi-agent QA performance but enables undetectable tampering; HMAC manifests binding agent, session, and payload reliably detect changes.

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

Apparent psychological profiles of LLMs are largely measurement artifacts driven by directional response bias rather than actual traits.

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ReproRepo uses GitHub issues as natural supervision to benchmark LLM agents on detecting reproducibility blockers across 1,149 ML papers, with the top agent finding related issues for roughly 90% of cases.

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

CAPER: Clause-Aligned Process Supervision for Text-to-SQL

cs.DB · 2026-06-02 · unverdicted · novelty 7.0

CAPER derives clause-aligned supervision via SQL AST counterfactuals to train a Clause-PRM that improves execution accuracy up to 15.3% relative and failure localization to 84.53% accuracy on BIRD and Spider.

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.

EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution

cs.SE · 2026-05-28 · unverdicted · novelty 7.0

EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

TASTE automates generation of high-coverage difficult agent benchmarks via adaptive contrastive n-gram sampling of tool sequences, yielding τ^c-Bench where models saturating τ²-Bench drop sharply and unique tool combinations more than double.

Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation

cs.CV · 2026-05-19 · conditional · novelty 7.0 · 2 refs

PPaint fuses expert pairwise preferences and ratings into ground truth; PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via Elo and trains the same VLM to produce a single-pass aesthetic scorer that improves SRCC across categories.

Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.

Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

cs.CL · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.

EvoPref: Multi-Objective Evolutionary Optimization Discovers Diverse LLM Alignments Beyond Gradient Descent

cs.NE · 2026-05-10 · unverdicted · novelty 7.0

EvoPref applies NSGA-II evolutionary optimization with archive-based diversity to populations of LoRA adapters, yielding 18% higher preference coverage and 47% lower collapse than gradient descent baselines while matching alignment quality.

Entropy-informed Decoding: Adaptive Information-Driven Branching

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fixed budget allocation.

Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.

Priming, Path-dependence, and Plasticity: Understanding the molding of user-LLM interaction and its implications from (many) chat logs in the wild

cs.HC · 2026-05-07 · unverdicted · novelty 7.0

Large-scale analysis of wild LLM chat logs finds that user interaction patterns stabilize quickly after initial use and correlate with long-term outcomes like retention, creating an agency paradox of limited exploration in unconstrained systems.

Large Language Models Explore by Latent Distilling

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

ESamp trains a test-time distiller to model LLM depth-wise representation transitions and biases decoding toward high prediction-error paths to increase semantic diversity.

Efficient Personalization of Generative User Interfaces

cs.LG · 2026-04-10 · unverdicted · novelty 7.0

A dataset revealing high inter-designer disagreement on UI preferences motivates a sample-efficient method that personalizes generative interfaces by embedding new users in the space of prior designers, outperforming baselines in both modeling and user preference.

MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

cs.CL · 2026-04-05 · unverdicted · novelty 7.0

MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.

Robust Reasoning Benchmark

cs.LG · 2026-03-26 · unverdicted · novelty 7.0 · 2 refs

The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

cs.SE · 2026-03-14 · unverdicted · novelty 7.0

VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.

citing papers explorer

Showing 50 of 104 citing papers.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents cs.CR · 2026-01-26 · unverdicted · none · ref 3
GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
As It Was: Aligning LLM Search Evaluation with Historical User Preferences cs.IR · 2026-07-01 · unverdicted · none · ref 8
Augmenting LLM search judges with historical QRI cards improves Spearman correlation with user preferences by ~5% overall (91% relative on disagreements) and 15% in multilingual settings, with better alignment to live A/B test outcomes.
Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction cs.CV · 2026-06-28 · unverdicted · none · ref 23
Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.
When Latent Agents Lie: KV-Cache Integrity in Multi-Agent LLM Collaboration cs.MA · 2026-06-27 · conditional · none · ref 35
KV-cache sharing boosts multi-agent QA performance but enables undetectable tampering; HMAC manifests binding agent, session, and payload reliably detect changes.
Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact cs.AI · 2026-06-18 · unverdicted · none · ref 54
Apparent psychological profiles of LLMs are largely measurement artifacts driven by directional response bias rather than actual traits.
ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues cs.CL · 2026-06-16 · unverdicted · none · ref 20
ReproRepo uses GitHub issues as natural supervision to benchmark LLM agents on detecting reproducibility blockers across 1,149 ML papers, with the top agent finding related issues for roughly 90% of cases.
Can AI Agents Synthesize Scientific Conclusions? cs.AI · 2026-06-09 · unverdicted · none · ref 124
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
CAPER: Clause-Aligned Process Supervision for Text-to-SQL cs.DB · 2026-06-02 · unverdicted · none · ref 27
CAPER derives clause-aligned supervision via SQL AST counterfactuals to train a Clause-PRM that improves execution accuracy up to 15.3% relative and failure localization to 84.53% accuracy on BIRD and Spider.
CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs cs.CL · 2026-06-01 · unverdicted · none · ref 12
CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.
EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution cs.SE · 2026-05-28 · unverdicted · none · ref 59
EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.
A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks cs.AI · 2026-05-27 · unverdicted · none · ref 26
TASTE automates generation of high-coverage difficult agent benchmarks via adaptive contrastive n-gram sampling of tool sequences, yielding τ^c-Bench where models saturating τ²-Bench drop sharply and unique tool combinations more than double.
Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation cs.CV · 2026-05-19 · conditional · none · ref 18 · 2 links
PPaint fuses expert pairwise preferences and ratings into ground truth; PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via Elo and trains the same VLM to produce a single-pass aesthetic scorer that improves SRCC across categories.
Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization cs.AI · 2026-05-12 · unverdicted · none · ref 25
DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking cs.CL · 2026-05-11 · unverdicted · none · ref 12 · 2 links
BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image cs.LG · 2026-05-11 · unverdicted · none · ref 65
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
EvoPref: Multi-Objective Evolutionary Optimization Discovers Diverse LLM Alignments Beyond Gradient Descent cs.NE · 2026-05-10 · unverdicted · none · ref 34
EvoPref applies NSGA-II evolutionary optimization with archive-based diversity to populations of LoRA adapters, yielding 18% higher preference coverage and 47% lower collapse than gradient descent baselines while matching alignment quality.
Entropy-informed Decoding: Adaptive Information-Driven Branching cs.LG · 2026-05-10 · unverdicted · none · ref 13
EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fixed budget allocation.
Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning cs.CL · 2026-05-08 · unverdicted · none · ref 9
Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
Priming, Path-dependence, and Plasticity: Understanding the molding of user-LLM interaction and its implications from (many) chat logs in the wild cs.HC · 2026-05-07 · unverdicted · none · ref 117
Large-scale analysis of wild LLM chat logs finds that user interaction patterns stabilize quickly after initial use and correlate with long-term outcomes like retention, creating an agency paradox of limited exploration in unconstrained systems.
Large Language Models Explore by Latent Distilling cs.CL · 2026-04-27 · unverdicted · none · ref 1
ESamp trains a test-time distiller to model LLM depth-wise representation transitions and biases decoding toward high prediction-error paths to increase semantic diversity.
Efficient Personalization of Generative User Interfaces cs.LG · 2026-04-10 · unverdicted · none · ref 22
A dataset revealing high inter-designer disagreement on UI preferences motivates a sample-efficient method that personalizes generative interfaces by embedding new users in the space of prior designers, outperforming baselines in both modeling and user preference.
MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction cs.CL · 2026-04-05 · unverdicted · none · ref 3
MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.
Robust Reasoning Benchmark cs.LG · 2026-03-26 · unverdicted · none · ref 62 · 2 links
The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.
Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging cs.SE · 2026-03-14 · unverdicted · none · ref 23
VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.
CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation cs.AI · 2026-02-24 · unverdicted · none · ref 99
CausalReasoningBenchmark supplies 173 real-world queries that separately grade causal identification specifications and point estimates to expose distinct failure modes in automated causal systems.
Hybrid Pooling with LLMs via Relevance Context Learning cs.IR · 2026-02-09 · unverdicted · none · ref 19
Relevance Context Learning generates explicit relevance narratives from judged examples to guide LLM assessors, outperforming zero-shot and standard in-context learning for IR relevance judgments.
Mixture of Masters: Sparse Chess Language Models with Player Routing cs.LG · 2026-02-04 · unverdicted · none · ref 11
Mixture-of-Masters routes moves among small grandmaster-specific GPT experts via a gating network, outperforming dense chess LMs against Stockfish while adding style control and variety.
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space cs.CV · 2025-12-14 · unverdicted · none · ref 41
DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning cs.CV · 2026-07-01 · unverdicted · none · ref 35
StochasT uses stochastic clustering of language tasks into varying turn depths for the same image to improve LVLMs on both single-turn and multi-turn scenarios without discarding data.
HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents cs.AI · 2026-06-30 · unverdicted · none · ref 20
HealthAgentBench is a new benchmark of 54 healthcare agent tasks where even the strongest frontier AI agent reaches only about 42% success rate on end-to-end clinical workflows.
GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots cs.AI · 2026-06-29 · unverdicted · none · ref 41
GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.
ScAle: Attention Head Scaling as a Minimal Adapter for Spatial Reasoning in Vision Language Models cs.CV · 2026-06-28 · unverdicted · none · ref 23
ScAle learns scalar coefficients to modulate last-token attention and MLP activations in frozen VLMs, achieving up to 134.1% relative accuracy gains on spatial benchmarks with only 1K parameters.
Flowing With Purpose: Latent Action Guided Flow Matching Policies For Robotic Manipulation cs.RO · 2026-06-22 · unverdicted · none · ref 35
LAFM adapts the source distribution in flow matching policies via a latent action model to better match fragmented robotic action spaces, claiming 23.4% higher real-world success and 10.4% on LIBERO-90 while beating larger pre-trained models.
HERALD: High-Throughput Block Diffusion LLM Serving via CPU-GPU Cooperative KV Cache Retrieval cs.LG · 2026-06-19 · unverdicted · none · ref 5
HERALD enables near-lossless accuracy at 5-10% KV budget for block dLLMs by amortizing top-k selection across denoising steps and overlapping CPU-GPU retrieval, yielding up to 2.47x higher throughput than GPU-only inference.
Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification cs.CL · 2026-06-16 · unverdicted · none · ref 6
A local cascade framework for educational dialogue de-identification reaches 0.958 macro F1 on math tutoring transcripts, outperforming same-family LLM-only and commercial baselines while remaining fully on-device.
M\"OVE: A Holistic LLM Benchmark for the German Public Sector cs.CL · 2026-06-11 · unverdicted · none · ref 8
MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.
SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows cs.AI · 2026-06-06 · unverdicted · none · ref 26
SKILL.nb uses selective formalization and gate-conditioned execution in auditable notebooks to improve durability of agent workflows, achieving 53.7% success on WebArena-Verified with 91.7% retention across re-executions.
AdaCodec: A Predictive Visual Code for Video MLLMs cs.CV · 2026-06-01 · unverdicted · none · ref 61
AdaCodec introduces a predictive visual code that cuts visual token use in video MLLMs by sending full frames only on high predictive cost and otherwise encoding inter-frame changes as P-tokens, yielding better benchmark scores at lower budgets.
Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches cs.AI · 2026-05-31 · unverdicted · none · ref 164 · 2 links
A survey of RLM use in 28 disciplines reveals uneven adoption and introduces a maturity assessment framework showing larger gaps when limited to public resources.
TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents cs.AI · 2026-05-31 · unverdicted · none · ref 13
TravelEval is a new benchmark with a six-dimensional evaluation framework, realistic data sandbox, and simulation-based global assessment for LLM-powered travel planning agents.
PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing cs.AI · 2026-05-28 · unverdicted · none · ref 2
PRAIB reveals LLM reviews are less variable, positively biased, overconfident, longer, and overlook atomic weaknesses noted by humans compared to real reviewer feedback.
GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human cs.CL · 2026-05-26 · unverdicted · none · ref 24
GrowLoop proposes a human-seeded self-evolving framework that co-evolves rubrics and cases to evaluate conversational human-likeness with differentiated agreement rules.
JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors cs.CL · 2026-05-26 · unverdicted · none · ref 25
JuICE is a new multilingual benchmark dataset showing top LLM judges reach only F1 0.52 on span-level cultural error detection and miss errors locals readily spot.
What Are We Actually Decoding? Source Attribution for Non-Invasive Brain-to-Language Retrieval cs.LG · 2026-05-23 · unverdicted · none · ref 13
An auditing framework for brain-to-audio retrieval isolates structural, stimulus-locked, and contextual performance sources via controls and a new Group Context Bias intervention, showing reduced performance under strict settings and measurable contextual gains.
Towards Direct Evaluation of Harness Optimizers via Priority Ranking cs.AI · 2026-05-21 · unverdicted · none · ref 29
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents cs.AI · 2026-05-14 · unverdicted · none · ref 21 · 2 links
SimPersona induces a discrete buyer-type space from clickstreams via VQ-VAE, maps types to LLM persona tokens, fine-tunes agents on traces, and samples from merchant distributions to achieve 78% conversion-rate alignment on 42 held-out storefronts.
Permit: Permission-Aware Representation Intervention for Controlled Generation in Large Language Models cs.CR · 2026-05-10 · unverdicted · none · ref 23
Permit identifies a permission-sensitive subspace in LLM hidden states and applies lightweight offset or gated interventions to enforce fine-grained generation control, outperforming prior methods with over 18% F1 gain and near-zero leakage using over 98% fewer parameters.
Can Revealed Preferences Clarify LLM Alignment and Steering? cs.LG · 2026-05-08 · unverdicted · none · ref 13
LLMs show partial internal coherence in medical decisions but frequently fail to accurately report their preferences or adopt user-directed ones via prompting.
Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 5
HFRU is a two-stage reinforcement unlearning method operating on the vision encoder with GRPO optimization and an abstraction reward that achieves over 98% forgetting and retention on object and face tasks with negligible hallucination.
ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression cs.LG · 2026-05-08 · unverdicted · none · ref 53 · 2 links
ExpThink reduces average CoT response length by up to 77% while improving accuracy on math benchmarks via experience-guided reward shaping and difficulty-adaptive advantage in RL.

ISBN 979-8-89176-335-7

citation-role summary

citation-polarity summary

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer