archive

Every paper Pith has read. Search by title, abstract, or pith.

7661 papers in cs.CL · page 8

cs.CL 2026-05-19 reviewed

Modular platform enables concurrent LLM evaluation
OpenCompass: A Universal Evaluation Platform for Large Language Models

Maosong Cao +29
cs.CL 2026-05-19 reviewed

English pivots cut causal grounding of explanations by up to 5.7x
Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations

Somnath Banerjee +3
cs.CL 2026-05-19 reviewed

DECOR scores LLM responses on four manipulation dimensions for deception
DECOR: Auditing LLM Deception via Information Manipulation Theory

Linyue Cai +4
cs.CL 2026-05-19 reviewed

End-to-end models output formal text straight from Chinese speech
FormalASR: End-to-End Spoken Chinese to Formal Text

Wanyi Ning +5
cs.CL 2026-05-19 reviewed

Language access managers accept AI but require human oversight
AI Technologies in Language Access: Attitudes Towards AI and the Human Value of Language Access Managers

Miguel A. Jiménez-Crespo +2
cs.CL 2026-05-19 reviewed

Step-level scores flag reasoning errors in closed LLMs
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

Xiaoou Liu +5
cs.CL 2026-05-19 reviewed

Fine-tuning on fMRI boosts ECoG language predictions
Fine-tuning language encoding models on slow fMRI improves prediction for fast ECoG

Aditya R. Vaidya +2
cs.CL 2026-05-19 reviewed

LLM Uncertainty Scores Only Measure Output Consistency
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

Tiejin Chen +3
cs.CL 2026-05-18 reviewed

LLM judges spot agent failures less than half the time
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Leyao Wang +7
cs.CL 2026-05-18 reviewed

Recurrent router matches MoA accuracy with fewer active agents
MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

Rui Chu
cs.CL 2026-05-18 reviewed

English prompts improve LLM diagnostic accuracy over French
Prompting language influences diagnostic reasoning and accuracy of large language models

Adrien Bazoge +3
cs.CL 2026-05-18 reviewed

Agents launch unsafe actions after benign errors in 65% of trials
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

Rishi Jha +3
cs.LG 2026-05-18 reviewed

Local attack and support calls stabilize global argument rankings
GRASP: Deterministic argument ranking in interaction graphs

Diganta Misra +3
cs.LG 2026-05-18 reviewed

One model trained on text and time series matches both specialists
Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

Paul Quinlan +3
cs.LG 2026-05-18 reviewed

VLMs need tight data alignment and miss weak signals in egocentric video
EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

Dongyan Lin +21
cs.AI 2026-05-18 reviewed

Benchmark shows 15-31 point headroom for better AI delegation
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

Yuxuan Gao +4
cs.LG 2026-05-18 reviewed

Graph separation shows public channels carry all indirect private influence
Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

Alexander Boesgaard Lorup (Openhagen)
cs.CL 2026-05-18 reviewed

Bounded ReAct loop boosts zero-shot DST by 14 points
ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

Yanjun Lin +9
cs.CL 2026-05-18 reviewed

ElevenLabs Scribe v2 leads on code-switched Arabic
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

Sajjad Abdoli +4
cs.CL 2026-05-18 reviewed

Model scaling outpaces evaluation capacity in low-resource NLP
The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

Vukosi Marivate
cs.AI 2026-05-18 reviewed

Control layer above optimizer keeps LLM training stable under stress
Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

Anis Radianis
cs.CL 2026-05-18 reviewed

Adaptive block selection matches full attention at 75% sparsity
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

Yuxiang Huang +7
cs.CL 2026-05-18 reviewed

Code harness turns LLMs into verifiable AI agents
Code as Agent Harness

Xuying Ning +41
cs.CV 2026-05-18 reviewed

Active exploration outperforms passive in spatial intelligence tasks
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Yining Hong +7
cs.CV 2026-05-18 reviewed

Self-distillation from crops boosts MLLM detail recognition
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Qianhao Yuan +6
cs.CL 2026-05-18 reviewed

LLM fact recall improves with model size and topic frequency in data
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

Matthew L. Smith +4
cs.LG 2026-05-18 reviewed

Multi-dimensional preferences resist reward hacking in LLM training
General Preference Reinforcement Learning

Muhammad Umer +7
cs.CL 2026-05-18 reviewed

EnvFactory uses 85 environments for 15% tool-use gains
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

Minrui Xu +14
cs.LG 2026-05-18 reviewed

FL nearly matches centralized results for depression detection
FedMental: Evaluating Federated Learning for Mental Health Detection from Social Media Data

Nuredin Ali Abdelkadir +3
cs.CY 2026-05-18 reviewed

Generative AI ads intervene in model generation rather than visible placements
Generative AI Advertising as a Problem of Trustworthy Commercial Intervention

Jingyi Qiu +1
cs.AI 2026-05-18 reviewed

Config choices rival model selection on GIM benchmark
GIM: Evaluating models via tasks that integrate multiple cognitive domains

Rohit Patel +2
cs.LG 2026-05-18 reviewed

Human soft labels improve calibration and training stability
An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

Maja Pavlovic +2
cs.CL 2026-05-18 reviewed

Backdoor circuit routes trigger to switch model language output
Language-Switching Triggers Take a Latent Detour Through Language Models

Francis Kulumba +4
cs.LG 2026-05-18 reviewed

Trained MoE models skip over half their experts after adaptation
Post-Trained MoE Can Skip Half Experts via Self-Distillation

Xingtai Lv +14
cs.CL 2026-05-18 reviewed

Token statistics on expert solutions forecast LLM performance
Forecasting Downstream Performance of LLMs With Proxy Metrics

Arkil Patel +3
cs.LG 2026-05-18 reviewed

Memory of past evaluations improves rubric updates for RL
AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

Peilin Wu +6
cs.SE 2026-05-18 reviewed

Stripping consent declarations raises overeager rate in coding agents
Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

Yubin Qu +6
cs.CL 2026-05-18 reviewed

Meta-cognitive configurator lifts agent persuasion success rates
MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion

Dingyi Zhang +4
cs.CL 2026-05-18 reviewed

Embeddings and clustering unify inconsistent IS constructs
GUT-IS: A Data-Driven Approach to Integrating Constructs and Their Relations in Information Systems

Maximilian Reinhardt +2
cs.CL 2026-05-18 reviewed

Memory systems score 27.9% under fact interference in long contexts
MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Hyunji Lee +5
cs.CL 2026-05-18 reviewed

Readers regress to likely error sites in garden-path sentences
Readers make targeted regressions to plausible errors in reanalysis of "noisy-channel garden-path" sentences

Thomas Hikaru Clark +2
cs.CL 2026-05-18 reviewed

Probe trajectories predict model future better than static checks
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

Maciej Chrabąszcz +4
cs.CL 2026-05-18 reviewed

Frontier LLMs score under 40% on dynamic tool-use benchmark
STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

Tingfeng Hui +7
cs.CL 2026-05-18 reviewed

Continuous diffusion scales to 20x compute gap of autoregressive models
Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

Zhihan Yang +7
cs.CL 2026-05-18 reviewed

Judging ICL demonstration success yields 23x speedup and higher accuracy
Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection

Haochun Wang +7
cs.CL 2026-05-18 reviewed

Fine-tuning lifts Ancient-to-Modern Greek translation by 10 BLEU points
Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models

Spyridon Mavromatis +3
