archive

Every paper Pith has read. Search by title, abstract, or pith.

7661 papers in cs.CL · page 1

cs.AI 2026-05-22 reviewed

Optimizer model improves agent skills only via validation-raising text edits
SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Yifan Yang +14
cs.CV 2026-05-22 reviewed

Dedicated image editor lifts multimodal reasoning by 5 points
ETCHR: Editing To Clarify and Harness Reasoning

Beichen Zhang +5
cs.CL 2026-05-22 reviewed

Word swaps in English data speed multilingual training 2x
Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

Anastasiia Sedova +3
cs.LG 2026-05-22 reviewed

Weak teachers boost larger LLMs via loss mixing
Strong Teacher Not Needed? On Distillation in LLM Pretraining

Taiming Lu +1
cs.CV 2026-05-22 reviewed

LLM splits video queries into tool calls merged by boolean logic
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

Michal Shlapentokh-Rothman +3
cs.CL 2026-05-22 reviewed

Word co-occurrence creates hierarchical geometry in embeddings
Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence

Andres Nava +1
cs.CL 2026-05-22 reviewed

NLG evaluation moves from rare to essential
NLG Evaluation: Past, Present, Future

Ehud Reiter
cs.CL 2026-05-22 reviewed

Sense-enhanced embeddings organize semantic types better in graphs
A graph-based analysis of semantic types and coercion in contextualized word embeddings

Long Chen +1
cs.CL 2026-05-22 reviewed

Metadata checks alone miss evidence dependence in benchmarks
Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks

Kan Shao
cs.CL 2026-05-22 reviewed

Benchmark exposes weaknesses in MLLM chart descriptions
ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

Fen Wang +7
cs.CL 2026-05-22 reviewed

Recursive memory predicts next queries with 22x fewer tokens
OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations

Jiangwang Chen +6
cs.CL 2026-05-22 reviewed

Popular skills often fail to improve LLM agent performance
OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Jiahao Ying +4
cs.CL 2026-05-22 reviewed

Register, not size, picks the most human-like LLM
How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework

Björn Nieth +15
cs.CL 2026-05-22 reviewed

GE2 leads retrieval accuracy but trails in latency by 14x
Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems

Stefano Cirillo +3
cs.LG 2026-05-22 reviewed

Latent space lets diffusion language models sample faster with better quality
DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

Jean-Marie Lemercier +5
cs.CL 2026-05-22 reviewed

Two-phase curriculum reaches 99.02% accuracy on name matching
Structure-Guided Entity Resolution: Fine-Tuning LLMs for Robust Name Matching in Complex Linguistic Contexts

Shivam Chourasia +2
cs.CL 2026-05-22 reviewed

Date-filtered retrieval fixes LLM errors on changed laws
Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

Max Prior +2
cs.LG 2026-05-22 reviewed

Self-generated tests and code co-evolve to match RLVR results
CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

Zhangyi Hu +8
cs.CL 2026-05-22 reviewed

Automated rubrics let RL scale to open-ended LLM tasks
ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning

Xiaoyuan Li +7
cs.CL 2026-05-22 reviewed

SSDAU cuts ambiguity F1 drop in joint extraction from 32% to 8%
SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction

Jiawei He +7
cs.CL 2026-05-22 reviewed

Solution matching measures model alignment with social norms
Naturalistic measure of social norms alignment

Yevhen Kostiuk +4
cs.CL 2026-05-22 reviewed

Tongue shape in /i/ predicts diphthong formant timing
Articulatory strategy as a source of variation in acoustic vowel dynamics

Patrycja Strycharczuk +2
cs.CL 2026-05-22 reviewed

EquiSumm models gender to create fairer tweet summaries
EquiSumm: A Gender Bias-Aware Framework for Inclusive Tweet Summarization

Chaitanya Wanjari +4
cs.CL 2026-05-22 reviewed

Metacognitive rewards lift LLM reasoning up to 11 percent
Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

Sirui Chen +8
cs.CL 2026-05-22 reviewed

RL framework decouples user preferences from task rewards
From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

Ranxu Zhang +7
cs.CL 2026-05-22 reviewed

Cultural adaptation required before LLMs handle political discourse across cultures
Cultural Adaptation in Large Language Models for Political Discourse

Wajdi Zaghouani
cs.CL 2026-05-22 reviewed

Sign language ERC models reveal domain gap from generic approaches
Emotion Recognition in Sign Language Conversation

Yusong Wang +4
cs.CL 2026-05-22 reviewed

300K Facebook climate posts released as open dataset
ClimateChat-300K: A Multi-Modal Facebook Dataset for Understanding Diverse Perspectives in Climate Communication

Wajdi Zaghouani +4
cs.CL 2026-05-22 reviewed

Hope speech makes up over 64 percent of Arabic Gaza comments
AraHopeCorpus: Annotation Guidelines and Dataset for Hope Speech in Arabic Social Media Crisis Discourse

Esra'a Sharqawi +1
cs.CL 2026-05-22 reviewed

Models converge on representations but diverge on reasoning
Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

Muhammad Usama +1
cs.CL 2026-05-22 reviewed

Next-token prediction works only if text prefixes suffice for latent context
When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

Francesco Corielli
cs.LG 2026-05-22 reviewed

Multi-gate residuals stabilize deep nets without extra comms cost
Multi-Gate Residuals

Zhizhan Zheng +6
cs.LG 2026-05-22 reviewed

Kernel agents top out at 0.94x production baselines
FastKernels: Benchmarking GPU Kernel Generation in Production

Gabriele Oliaro +7
cs.HC 2026-05-22 reviewed

Multi-agent AI raises gardener confidence and trust scores
CultivAgents: Cultivating Relationship-Centered Multi-Agent Systems for Personalized Gardening

Yiyang Wang +5
cs.CL 2026-05-22 reviewed

Machine texts hide human-like spans that complicate detection
Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement

Chenwang Wu +3
cs.CL 2026-05-22 reviewed

Optimizing prompt embeddings boosts in-context learning
Self-Improving In-Context Learning

Baturay Saglam +1
cs.CR 2026-05-22 reviewed

Key-selected synonyms watermark LLM text at 98% detection
Robust LLM Watermarking with Minimal Semantic Distortion for IP Protection

Kieu Dang +4
cs.CL 2026-05-22 reviewed

LLMs drop up to 88 points when tasks move to context middle
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

Chuyifei Zhang +3
cs.RO 2026-05-22 reviewed

VLM boosts robot map coverage by 24% in tests
Autonomous Frontier-Based Exploration with VLM Guidance

Aarush Aitha +1
cs.CL 2026-05-22 reviewed

Block-diffusion VLA reaches SOTA driving accuracy at 12x AR speed
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

Kewei Zhang +11
cs.CR 2026-05-22 reviewed

ActInv recovers inputs from LLM split-inference activations
What Does the Server See? Understanding Privacy Leakage from Large Language Models in Split Inference

Mingyuan Fan +3
cs.CL 2026-05-22 reviewed

Language flips which jailbreaks work on frontier MLLMs
Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

Casey Ford +3
cs.CL 2026-05-22 reviewed

LLMs miss psychiatric symptoms when functioning looks intact
When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

Jianfeng Zhu +3
cs.CL 2026-05-22 reviewed

Role prompts split into additive persona and task vectors at one site
As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

Eric Xu
cs.CL 2026-05-21 reviewed

BERT classifier labels 55k Ming-Qing letters from title lists
A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works

Queenie Luo
cs.CL 2026-05-21 reviewed

BERTopic beats STM on coherence for short survey texts
A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses

Yan Jiang +2
cs.LG 2026-05-21 reviewed

Global LP ranks every MoE expert to cut memory at low bits
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

Jianing Deng +6
cs.CL 2026-05-21 reviewed

Optimization cuts LLM token use 25% at F1 0.78
The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management

Binqi Shen +4
cs.CL 2026-05-21 reviewed

Steering vectors modestly lift cultural reasoning in LLMs
DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge

Yusser Al Ghussin +5
cs.CL 2026-05-21 reviewed

Mixed curriculum trains memory agents with highest overall QA F1
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Xinjie He +6