archive
Every paper Pith has read. Search by title, abstract, or pith.
7661 papers in cs.CL · page 2
-
Controller routes LLM requests to best mode for 2x speedup
ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU
-
Recognition of evaluations depends on model-benchmark pairs
Decomposing and Measuring Evaluation Awareness
-
Compositionality rises then falls in LLM self-training
Model Collapse as Cultural Evolution
-
RAG method leads in mental health improvement detection
DreamerNLplus: Interpretable Modeling of Mental Health Dynamics from Social Media Timelines using Hybrid Rule-Based and RAG Methods
-
Hawkes process lifts late alignment in news text simulations
HawkesLLM: Semantic Uncertainty Propagation in Agentic Text Simulation
-
LLMs learn what not to say via frequency competition
Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs
-
Multilingual SAEs enable reliable language steering without layer search
Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection
-
SAE features from LLMs map onto brain semantic regions
Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography
-
Training data language, not English, drives brain-LLM alignment
Brain-LLM Alignment Tracks Training Data, Not Typology
-
RADAR forecasts transfer by comparing representation trajectories
RADAR: Relative Angular Divergence Across Representations
-
Transformers have fixed accuracy limits set by layers and width
The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems
-
Proactive AI questions uncover 82% of autism language traits
A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism
-
FIM pretraining yields linear verbatim memorization growth
Memorization Dynamics of Fill-in-the-Middle Pretraining
-
Pipeline creates first UD treebank for Katharevousa Greek
A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text
-
AI models favor some religions over others in conversion advice
When AI Takes Sides on Questions of Faith: Persistent Asymmetries in AI-Mediated Faith Guidance
-
LLMs estimate expertise from Slack logs with 21% error
Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs
-
Graph alignment detects LLM hallucinations better than GPT-4o
Graph Alignment Topology as an Inductive Bias for Grounding Detection
-
LIFT gives diffusion models up to 3x reasoning gains on math tests
Learnability-Informed Fine-Tuning of Diffusion Language Models
-
Error feedback in prompts halves Cypher query execution errors
RAS: Reflection-Augmented Scaling with In-Context Learning for Executable Cypher Query Generation
-
LaTeX source yields better RAG chunks than PDF text
AI-Friendly LaTeX: Using LaTeX Code as a Knowledge Source for Retrieval-Augmented Generation
-
Linear program yields tokenizers within 1% of optimal
Tokenisation via Convex Relaxations
-
Vector rewards produce diverse LLM outputs that raise search scores
Vector Policy Optimization: Training for Diversity Improves Test-Time Search
-
Evidence verifier scores spans by accuracy gain in self-evolving agents
EVE-Agent: Evidence-Verifiable Self-Evolving Agents
-
AI chatbots hit 90% on fresh news but drop in open answers
Evaluating Commercial AI Chatbots as News Intermediaries
-
VLMs keep high scores after most image tokens are deleted
Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?
-
Transcoders trace VLM grounding and predict hallucinations at 0.68 AUC
Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models
-
Consistency training cuts covert political bias in LLMs
Reducing Political Manipulation with Consistency Training
-
Time-ordered training keeps LLM facts fresher than shuffling
Understanding Data Temporality Impact on Large Language Models Pre-training
-
Temporal biomedical graph rescues up to 65% of LLM errors on disease timelines
ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning
-
LLM analysis outperforms acoustics for political pathos
Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
-
Simulated dense placements train IMU model that ignores sensor setup
AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
-
Conversation history pulls LLM judgments toward its tone
AMEL: Accumulated Message Effects on LLM Judgments
-
ToaST cuts tokens over 11% vs BPE at large vocabularies
Tokenization with Split Trees
-
Gradient subspace projection boosts LLM self-distillation
Self-Policy Distillation via Capability-Selective Subspace Projection
-
Moral cues survive machine translation to Polish
Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora
-
Images boost LLM poetry detectors past RoBERTa
Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs
-
AI Action Plan echoes private sector over public life concerns
Whose Voice Counts? Mapping Stakeholder Perspectives on AI Through Public Submissions to the U.S. Government
-
AI office agents fail 44% of gradual attack tests
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
-
Benchmark shows AI agents accept gradual risks in 44% of cases
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
-
Moral knowledge beats extra context and model scaling for value detection
More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts
-
Moral knowledge retrieval beats extra context for political value detection
More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts
-
CAME-Grad fixes gradient double dilemma in report generation
The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
-
CAME-Grad optimizer lifts radiology reports by 2%
The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
-
Dual rewards stabilize unsupervised LLM reasoning
Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework
-
Sensorimotor ratings speed Chinese word recognition
Chinese sensorimotor and embodiment norms for 3,000 lexicalized concepts
-
Agentic CLEAR automates multi-level LLM agent evaluation
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
-
Noise prediction loss matches score matching up to constant
A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models
-
Hyperfitting expands final LLM layer to promote rare tokens
Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion
-
Decaying hints lift non-English reasoning without drift
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
-
Multiple metrics required to judge synthetic data for tool-calling agents
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations