Canonical reference

ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang · 2025 · DOI 10.18653/v1/2025

Canonical reference. 76% of citing Pith papers cite this work as background.

83 Pith papers citing it

citation-role summary

background 17 · baseline 2 · dataset 2

citation-polarity summary

background 16 · baseline 2 · use dataset 2 · unclear 1

representative citing papers

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

cs.CL · 2026-06-10 · unverdicted · novelty 8.0

EDEN releases the largest freely available Italian clinical notes corpus (4M notes, 6k annotated) and proposes CRF-filling as a structured extraction benchmark with zero-shot baselines from Gemma models.

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

Apparent psychological profiles of LLMs are largely measurement artifacts driven by directional response bias rather than actual traits.

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ReproRepo uses GitHub issues as natural supervision to benchmark LLM agents on detecting reproducibility blockers across 1,149 ML papers, with the top agent finding related issues for roughly 90% of cases.

Can AI Agents Synthesize Scientific Conclusions?

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.

CAPER: Clause-Aligned Process Supervision for Text-to-SQL

cs.DB · 2026-06-02 · unverdicted · novelty 7.0

CAPER derives clause-aligned supervision via SQL AST counterfactuals to train a Clause-PRM that improves execution accuracy up to 15.3% relative and failure localization to 84.53% accuracy on BIRD and Spider.

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.

EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution

cs.SE · 2026-05-28 · unverdicted · novelty 7.0

EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

TASTE automates generation of high-coverage difficult agent benchmarks via adaptive contrastive n-gram sampling of tool sequences, yielding τ^c-Bench where models saturating τ²-Bench drop sharply and unique tool combinations more than double.

Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation

cs.CV · 2026-05-19 · conditional · novelty 7.0 · 2 refs

PPaint fuses expert pairwise preferences and ratings into ground truth; PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via Elo and trains the same VLM to produce a single-pass aesthetic scorer that improves SRCC across categories.

Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.

Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

cs.CL · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.

Entropy-informed Decoding: Adaptive Information-Driven Branching

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fixed budget allocation.

Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.

Priming, Path-dependence, and Plasticity: Understanding the molding of user-LLM interaction and its implications from (many) chat logs in the wild

cs.HC · 2026-05-07 · unverdicted · novelty 7.0

Large-scale analysis of wild LLM chat logs finds that user interaction patterns stabilize quickly after initial use and correlate with long-term outcomes like retention, creating an agency paradox of limited exploration in unconstrained systems.

Large Language Models Explore by Latent Distilling

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

ESamp trains a test-time distiller to model LLM depth-wise representation transitions and biases decoding toward high prediction-error paths to increase semantic diversity.

Efficient Personalization of Generative User Interfaces

cs.LG · 2026-04-10 · unverdicted · novelty 7.0

A dataset revealing high inter-designer disagreement on UI preferences motivates a sample-efficient method that personalizes generative interfaces by embedding new users in the space of prior designers, outperforming baselines in both modeling and user preference.

MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

cs.CL · 2026-04-05 · unverdicted · novelty 7.0

MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.

Robust Reasoning Benchmark

cs.LG · 2026-03-26 · unverdicted · novelty 7.0 · 2 refs

The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

cs.SE · 2026-03-14 · unverdicted · novelty 7.0

VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

cs.AI · 2026-02-24 · unverdicted · novelty 7.0

CausalReasoningBenchmark supplies 173 real-world queries that separately grade causal identification specifications and point estimates to expose distinct failure modes in automated causal systems.

Hybrid Pooling with LLMs via Relevance Context Learning

cs.IR · 2026-02-09 · unverdicted · novelty 7.0

Relevance Context Learning generates explicit relevance narratives from judged examples to guide LLM assessors, outperforming zero-shot and standard in-context learning for IR relevance judgments.

Mixture of Masters: Sparse Chess Language Models with Player Routing

cs.LG · 2026-02-04 · unverdicted · novelty 7.0

Mixture-of-Masters routes moves among small grandmaster-specific GPT experts via a gating network, outperforming dense chess LMs against Stockfish while adding style control and variety.

citing papers explorer

EDEN: A Large-Scale Corpus of Clinical Notes for Italian cs.CL · 2026-06-10 · unverdicted · none · ref 31
EDEN releases the largest freely available Italian clinical notes corpus (4M notes, 6k annotated) and proposes CRF-filling as a structured extraction benchmark with zero-shot baselines from Gemma models.
ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues cs.CL · 2026-06-16 · unverdicted · none · ref 20
ReproRepo uses GitHub issues as natural supervision to benchmark LLM agents on detecting reproducibility blockers across 1,149 ML papers, with the top agent finding related issues for roughly 90% of cases.
CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs cs.CL · 2026-06-01 · unverdicted · none · ref 12
CultureForest benchmark shows top LLMs degrade sharply on open-ended cultural reasoning tasks, exhibit regional disparities, and are limited more by effective use of knowledge than by lack of knowledge itself.
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking cs.CL · 2026-05-11 · unverdicted · none · ref 12 · 2 links
BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.
Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning cs.CL · 2026-05-08 · unverdicted · none · ref 9
Attention entropy splits RL training tokens into stable anchors and volatile explorers, and entropy-aware reweighting improves held-out reasoning performance.
Large Language Models Explore by Latent Distilling cs.CL · 2026-04-27 · unverdicted · none · ref 1
ESamp trains a test-time distiller to model LLM depth-wise representation transitions and biases decoding toward high prediction-error paths to increase semantic diversity.
MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction cs.CL · 2026-04-05 · unverdicted · none · ref 3
MedicalBench is a benchmark for implicit medical concept extraction and sentence-level evidence retrieval built from MIMIC-IV discharge summaries with human verification to test LLM reasoning on unstated medical ideas.
Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification cs.CL · 2026-06-16 · unverdicted · none · ref 6
A local cascade framework for educational dialogue de-identification reaches 0.958 macro F1 on math tutoring transcripts, outperforming same-family LLM-only and commercial baselines while remaining fully on-device.
MÖVE: A Holistic LLM Benchmark for the German Public Sector cs.CL · 2026-06-11 · unverdicted · none · ref 8
MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning cs.CL · 2026-05-07 · unverdicted · none · ref 24
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits cs.CL · 2026-05-07 · unverdicted · none · ref 27
Probabilistic circuits detect LLM hallucinations as residual-stream anomalies with up to 99% AUROC and enable dynamic correction that raises truthfulness scores while cutting unnecessary output corruption.
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills cs.CL · 2026-04-27 · unverdicted · none · ref 25
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs cs.CL · 2026-04-21 · unverdicted · none · ref 1
Multilingual LLMs exhibit US-centric global bias and population-size intra-lingual bias on locale-ambiguous questions, with the global bias stronger after instruction tuning.
Learning from Natural Language Feedback for Personalized Question Answering cs.CL · 2025-08-14 · unverdicted · none · ref 16
VAC replaces scalar rewards with natural language feedback in an alternating training loop between a feedback model and a policy model, yielding better personalized QA on the LaMP-QA benchmark.
Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations cs.CL · 2026-05-30 · unverdicted · none · ref 10
Empirical study claiming to be the first broad comparison of chunking methods in RAG, highlighting effectiveness, cost, and generalization limitations across scenarios.
User-Aware Active Knowledge Acquisition for Emotional Support Dialogue cs.CL · 2026-05-28 · unverdicted · none · ref 1
UKA is a gradient-free active dialogue learning framework using Theory-of-Mind uncertainty estimation to acquire user-aligned conversational knowledge, outperforming baselines in dialogue quality and user alignment across benchmarks.
CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation cs.CL · 2026-04-28 · unverdicted · none · ref 23
CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.
Quantum Knowledge Graph: Modeling Context-Dependent Triplet Validity cs.CL · 2026-04-27 · unverdicted · none · ref 6
Quantum Knowledge Graphs model context-dependent triplet validity and improve LLM medical reasoning accuracy by 1.4 to 6 percentage points over baselines.
Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation cs.CL · 2026-03-28 · unverdicted · none · ref 13
SDSR places human metadata at file primacy and combines it with prompt routing rules to reach 100% primary category accuracy on a 119-category benchmark, far above the 65% no-guidance baseline.
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models cs.CL · 2026-01-20 · unverdicted · none · ref 180
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
What Am I Missing? Question-Answering as Hidden State Probing cs.CL · 2026-05-29 · unverdicted · none · ref 16
Question generation produces a hidden-state signal that predicts final correctness before the answer is produced, yet gating interventions based on that signal do not reliably improve trajectories.
Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation cs.CL · 2025-04-02 · unverdicted · none · ref 141
A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.
Overview of HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts cs.CL · 2026-06-24 · unverdicted · none · ref 4
HIPE-2026 is an evaluation campaign with 17 teams testing relation extraction for person presence at locations in 19th-20th century newspapers across French, German, and English plus a literary generalization set.
Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook cs.CL · 2026-03-16 · unreviewed · ref 14
