archive

Every paper Pith has read. Search by title, abstract, or pith.

7661 papers in cs.CL · page 4

cs.CL 2026-05-21 reviewed

Fixing the main failure point can hurt LLM agents
Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines

Yoon Jeonghun +1
cs.CL 2026-05-21 reviewed

Medical RAG certifies claims with zero unsupported risk
Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

Shao Kan
cs.AI 2026-05-21 reviewed

LLMs now build planners instead of one-off plans
Planning in the LLM Era: Building for Reliability and Efficiency

Michael Katz +3
cs.AI 2026-05-21 reviewed

7B model beats larger ones at Lean proof optimization
ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

Riyaz Ahuja +3
cs.CL 2026-05-21 reviewed

LLM attention weights tokens to improve DPO
Token-weighted Direct Preference Optimization with Attention

Chengyu Huang +3
cs.CL 2026-05-21 reviewed

Hyper-Align turns hypergraphs into LLM tokens
Hypergraph as Language

Mengqi Lei +6
cs.CL 2026-05-21 reviewed

Agent trajectories compiled into QA pairs improve long-context performance
ACC: Compiling Agent Trajectories for Long-Context Training

Qisheng Su +10
cs.LG 2026-05-21 reviewed

Dictionary realignment keeps OOD explanations faithful
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

Sungjun Lim +3
cs.CL 2026-05-21 reviewed

LLMs beat fine-tuned models on rare suicide circumstances
Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity

Geoffrey Martin +2
cs.LG 2026-05-21 reviewed

Energy gating improves transformer loss by 0.1 with tiny overhead
Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention

Athanasios Zeris
cs.CL 2026-05-20 reviewed

LLMs reduce ten intensity words to five numeric values
Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions

Daniel Tabach (Georgia Institute of Technology)
cs.CL 2026-05-20 reviewed

Retrieval lifts LLM accuracy on rare medical cases from 56% to 82%
When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

Doeun Lee +13
cs.LG 2026-05-20 reviewed

Geometry-aware calibration closes entropy gaps for LLM optimization
Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

Zheyuan Zhang +5
cs.CV 2026-05-20 reviewed

Context rewrite lifts 3D grounding accuracy by up to 22 points
MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue

Anna Deichler +6

3 Piths
cs.CL 2026-05-20 reviewed

DivSkill-SQL lifts Text-to-SQL accuracy by up to 11 points
Residual Skill Optimization for Text-to-SQL Ensembles

Jiongli Zhu +10
cs.CL 2026-05-20 reviewed

LLM optimizer diagnoses full-set errors to tune prompts
Reflective Prompt Tuning through Language Model Function-Calling

Farima Fatahi Bayat +3
cs.CL 2026-05-20 reviewed

Contrastive prompts with 'other' turn LLMs into probability estimators
PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts

Juliette Woodrow +1
cs.CL 2026-05-20 reviewed

Single-flaw pairs create clear tests for multi-turn LLM judges
RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

Zhenwei Tang +5
cs.CV 2026-05-20 reviewed

Lightweight cross-encoder matches LLM judges for caption evaluation
BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

Gonçalo Gomes +2
cs.CL 2026-05-20 reviewed

Bayes rule gives LLMs token-by-token attribution scores
Probabilistic Attribution For Large Language Models

Shilpika Shilpika +4
cs.CL 2026-05-20 reviewed

Semantic comparison catches AI peer reviews at low false positives
Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

André V. Duarte +5
cs.CL 2026-05-20 reviewed

Natural language queries reach safety data with schema validation
Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

Mahdi Azhdari +1
cs.LG 2026-05-20 reviewed

Projection matrix aligns tokenizers for better distillation
X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

Sharath Turuvekere Sreenivas +6
cs.CL 2026-05-20 reviewed

Open-source LLMs lean left on politics
How Far Will They Go? Red-Teaming Online Influence with Large Language Models

Daniel C. Ruiz +4
cs.LG 2026-05-20 reviewed

Actor updates match value gradients under differentiable rollouts
Value-Gradient Hypothesis of RL for LLMs

Arip Asadulaev +3
cs.LG 2026-05-20 reviewed

Fine-tuned detectors amplify a pretrained typicality axis
Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

Alexander Smirnov
cs.LG 2026-05-20 reviewed

Entmax turns KV cache truncation into exact support recovery
EntmaxKV: Support-Aware Decoding for Entmax Attention

Gonçalo Duarte +2

4 Piths
cs.CV 2026-05-20 reviewed

New benchmark shows LVLMs falter on furniture assembly videos
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Aditya Chetan +7
cs.CL 2026-05-20 reviewed

Rewriting cuts unsafe LLM outputs for teen users
CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety

Heajun An +3
cs.AI 2026-05-20 reviewed

Platform lets humans and AIs co-author and iterate on papers
AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

Junshu Pan +7
cs.LG 2026-05-20 reviewed

Rank-1 line from first 50 steps matches full RLVR at 15% cost
You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Zhepei Wei +5
cs.LG 2026-05-20 reviewed

DelTA raises math scores by over 3 points on 8B models
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Kaiyi Zhang +2
cs.CL 2026-05-20 reviewed

LLMs reach 100% consistency adapting grammars to metamodel changes
Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

Weixing Zhang +4
cs.CL 2026-05-20 reviewed

Separate model learns when to generate agent guidance
Mem-π: Adaptive Memory through Learning When and What to Generate

Xiaoqiang Wang +7
cs.CL 2026-05-20 reviewed

LLM measures track syncretism effects on agreement attraction
Quantifying the cross-linguistic effects of syncretism on agreement attraction

Utku Turk +1
cs.CL 2026-05-20 reviewed

Metaphors widen spectral breadth in transformer layers
Post-Hoc Understanding of Metaphor Processing in Decoder-Only Language Models via Conditional Scale Entropy

Lawhori Chakrabarti +5
cs.SE 2026-05-20 reviewed

Agents pass visible tests but fail held-out usage tests as tasks lengthen
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Bingchen Zhao +3
cs.CL 2026-05-20 reviewed

Traditional systems still lead in multilingual coreference task
Findings of the Fifth Shared Task on Multilingual Coreference Resolution: Expanding Datasets for Long-Range Entities

Michal Novák +8
cs.CL 2026-05-20 reviewed

AI shapes 11-26% of goals in human collaborations
"I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration

Eunsu Kim +3
cs.CL 2026-05-20 reviewed

Hybrid jailbreak method reaches 84% success with 30 queries
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

Abdullah Al Nomaan Nafi +3
cs.CL 2026-05-20 reviewed

LLMs degrade on numerical tasks beyond 500 social media posts
Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media

Yuefeng Shi +2
cs.AI 2026-05-20 reviewed

43M-paper graph gives AI agents deterministic cross-field links
SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research

Shuofei Qiao +10
cs.CL 2026-05-20 reviewed

Spike-gated model reaches 89% sparsity at 8.9 perplexity
SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence

Ting Liu
cs.CL 2026-05-20 reviewed

Regularization curbs prompt overfitting for better LLM generalization
TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

Lucheng Fu +6
cs.CL 2026-05-20 reviewed

LLMs follow logical rules for conditionals but miss human implications
Tracing the ongoing emergence of human-like reasoning in Large Language Models

Paolo Morosi +4
cs.CL 2026-05-20 reviewed

Dual safeguards create reliable HIV triage domain in Spanish notes
Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification

Rodrigo Morales-Sánchez +2
cs.CL 2026-05-20 reviewed

Pairwise rewards stabilize RL for reasoning models
LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

Redacted by arXiv +6
cs.LG 2026-05-20 reviewed

10% heads on 10% data deliver 8.3 pp gain with 7x speedup in LLM alignment
From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

Hao Chen +9
cs.CL 2026-05-20 reviewed

Knowledge graphs lift LLM borrowing detection in Luxembourgish to 81%
Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models

Nina Hosseini-Kivanani
cs.CL 2026-05-20 reviewed

Manga109 revised to correct 29,000 dialogue annotations
Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding

Jeonghun Baek +4