archive
Every paper Pith has read. Search by title, abstract, or pith.
7661 papers in cs.CL · page 1
-
Optimizer model improves agent skills only via validation-raising text edits
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
-
Dedicated image editor lifts multimodal reasoning by 5 points
ETCHR: Editing To Clarify and Harness Reasoning
-
Word swaps in English data speed multilingual training 2x
Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions
-
Weak teachers boost larger LLMs via loss mixing
Strong Teacher Not Needed? On Distillation in LLM Pretraining
-
LLM splits video queries into tool calls merged by boolean logic
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
-
Word co-occurrence creates hierarchical geometry in embeddings
Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence
-
NLG evaluation moves from rare to essential
NLG Evaluation: Past, Present, Future
-
Sense-enhanced embeddings organize semantic types better in graphs
A graph-based analysis of semantic types and coercion in contextualized word embeddings
-
Metadata checks alone miss evidence dependence in benchmarks
Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks
-
Benchmark exposes weaknesses in MLLM chart descriptions
ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models
-
Recursive memory predicts next queries with 22x fewer tokens
OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations
-
Popular skills often fail to improve LLM agent performance
OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
-
Register, not size, picks the most human-like LLM
How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework
-
GE2 leads retrieval accuracy but trails in latency by 14x
Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems
-
Latent space lets diffusion language models sample faster with better quality
DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling
-
Two-phase curriculum reaches 99.02% accuracy on name matching
Structure-Guided Entity Resolution: Fine-Tuning LLMs for Robust Name Matching in Complex Linguistic Contexts
-
Date-filtered retrieval fixes LLM errors on changed laws
Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering
-
Self-generated tests and code co-evolve to match RLVR results
CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test
-
Automated rubrics let RL scale to open-ended LLM tasks
ARES: Automated Rubric Synthesis for Scalable LLM Reinforcement Learning
-
SSDAU cuts ambiguity F1 drop in joint extraction from 32% to 8%
SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction
-
Solution matching measures model alignment with social norms
Naturalistic measure of social norms alignment
-
Tongue shape in /i/ predicts diphthong formant timing
Articulatory strategy as a source of variation in acoustic vowel dynamics
-
EquiSumm models gender to create fairer tweet summaries
EquiSumm: A Gender Bias-Aware Framework for Inclusive Tweet Summarization
-
Metacognitive rewards lift LLM reasoning up to 11 percent
Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
-
RL framework decouples user preferences from task rewards
From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning
-
Cultural adaptation required before LLMs handle political discourse across cultures
Cultural Adaptation in Large Language Models for Political Discourse
-
Sign language ERC models reveal domain gap from generic approaches
Emotion Recognition in Sign Language Conversation
-
300K Facebook climate posts released as open dataset
ClimateChat-300K: A Multi-Modal Facebook Dataset for Understanding Diverse Perspectives in Climate Communication
-
Hope speech makes up over 64 percent of Arabic Gaza comments
AraHopeCorpus: Annotation Guidelines and Dataset for Hope Speech in Arabic Social Media Crisis Discourse
-
Models converge on representations but diverge on reasoning
Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning
-
Next-token prediction works only if text prefixes suffice for latent context
When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming
-
Multi-gate residuals stabilize deep nets without extra comms cost
Multi-Gate Residuals
-
Kernel agents top out at 0.94x production baselines
FastKernels: Benchmarking GPU Kernel Generation in Production
-
Multi-agent AI raises gardener confidence and trust scores
CultivAgents: Cultivating Relationship-Centered Multi-Agent Systems for Personalized Gardening
-
Machine texts hide human-like spans that complicate detection
Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement
-
Optimizing prompt embeddings boosts in-context learning
Self-Improving In-Context Learning
-
Key-selected synonyms watermark LLM text at 98% detection
Robust LLM Watermarking with Minimal Semantic Distortion for IP Protection
-
LLMs drop up to 88 points when tasks move to context middle
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
-
VLM boosts robot map coverage by 24% in tests
Autonomous Frontier-Based Exploration with VLM Guidance
-
Block-diffusion VLA reaches SOTA driving accuracy at 12x AR speed
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
-
ActInv recovers inputs from LLM split-inference activations
What Does the Server See? Understanding Privacy Leakage from Large Language Models in Split Inference
-
Language flips which jailbreaks work on frontier MLLMs
Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs
-
LLMs miss psychiatric symptoms when functioning looks intact
When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening
-
Role prompts split into additive persona and task vectors at one site
As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs
-
BERT classifier labels 55k Ming-Qing letters from title lists
A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works
-
BERTopic beats STM on coherence for short survey texts
A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses
-
Global LP ranks every MoE expert to cut memory at low bits
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
-
Optimization cuts LLM token use 25% at F1 0.78
The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management
-
Steering vectors modestly lift cultural reasoning in LLMs
DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge
-
Mixed curriculum trains memory agents with highest overall QA F1
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA