archive
Every paper Pith has read.
7661 papers in cs.CL · page 4
-
Fixing the main failure point can hurt LLM agents
Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines
-
Medical RAG certifies claims with zero unsupported risk
Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation
-
LLMs now build planners instead of one-off plans
Planning in the LLM Era: Building for Reliability and Efficiency
-
7B model beats larger ones at Lean proof optimization
ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization
-
LLM attention weights tokens to improve DPO
Token-weighted Direct Preference Optimization with Attention
-
Hyper-Align turns hypergraphs into LLM tokens
Hypergraph as Language
-
Agent trajectories compiled into QA pairs improve long-context performance
ACC: Compiling Agent Trajectories for Long-Context Training
-
Dictionary realignment keeps OOD explanations faithful
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
-
LLMs beat fine-tuned models on rare suicide circumstances
Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity
-
Energy gating improves transformer loss by 0.1 with tiny overhead
Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention
-
LLMs reduce ten intensity words to five numeric values
Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions
-
Retrieval lifts LLM accuracy on rare medical cases from 56% to 82%
When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering
-
Geometry-aware calibration closes entropy gaps for LLM optimization
Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization
-
Context rewrite lifts 3D grounding accuracy by up to 22 points
MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue
-
DivSkill-SQL lifts Text-to-SQL accuracy by up to 11 points
Residual Skill Optimization for Text-to-SQL Ensembles
-
LLM optimizer diagnoses full-set errors to tune prompts
Reflective Prompt Tuning through Language Model Function-Calling
-
Contrastive prompts with 'other' turn LLMs into probability estimators
PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts
-
Single-flaw pairs create clear tests for multi-turn LLM judges
RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
-
Lightweight cross-encoder matches LLM judges for caption evaluation
BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model
-
Bayes rule gives LLMs token-by-token attribution scores
Probabilistic Attribution For Large Language Models
-
Semantic comparison catches AI peer reviews at low false positives
Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews
-
Natural language queries reach safety data with schema validation
Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries
-
Projection matrix aligns tokenizers for better distillation
X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation
-
Open-source LLMs lean left on politics
How Far Will They Go? Red-Teaming Online Influence with Large Language Models
-
Actor updates match value gradients under differentiable rollouts
Value-Gradient Hypothesis of RL for LLMs
-
Fine-tuned detectors amplify a pretrained typicality axis
Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction
-
Entmax turns KV cache truncation into exact support recovery
EntmaxKV: Support-Aware Decoding for Entmax Attention
-
New benchmark shows LVLMs falter on furniture assembly videos
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
-
Rewriting cuts unsafe LLM outputs for teen users
CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety
-
Platform lets humans and AIs co-author and iterate on papers
AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists
-
Rank-1 line from first 50 steps matches full RLVR at 15% cost
You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
-
DelTA raises math scores by over 3 points on 8B models
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
-
LLMs reach 100% consistency adapting grammars to metamodel changes
Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution
-
Separate model learns when to generate agent guidance
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate
-
LLM measures track syncretism effects on agreement attraction
Quantifying the cross-linguistic effects of syncretism on agreement attraction
-
Metaphors widen spectral breadth in transformer layers
Post-Hoc Understanding of Metaphor Processing in Decoder-Only Language Models via Conditional Scale Entropy
-
Agents pass visible tests but fail held-out usage tests as tasks lengthen
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
-
Traditional systems still lead in multilingual coreference task
Findings of the Fifth Shared Task on Multilingual Coreference Resolution: Expanding Datasets for Long-Range Entities
-
AI shapes 11-26% of goals in human collaborations
"I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration
-
Hybrid jailbreak method reaches 84% success with 30 queries
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
-
LLMs degrade on numerical tasks beyond 500 social media posts
Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media
-
43M-paper graph gives AI agents deterministic cross-field links
SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
-
Spike-gated model reaches 89% sparsity at 8.9 perplexity
SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence
-
Regularization curbs prompt overfitting for better LLM generalization
TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization
-
LLMs follow logical rules for conditionals but miss human implications
Tracing the ongoing emergence of human-like reasoning in Large Language Models
-
Dual safeguards create reliable HIV triage domain in Spanish notes
Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification
-
Pairwise rewards stabilize RL for reasoning models
LamPO: A Lambda Style Policy Optimization for Reasoning Language Models
-
10% heads on 10% data deliver 8.3 pp gain with 7x speedup in LLM alignment
From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment
-
Knowledge graphs lift LLM borrowing detection in Luxembourgish to 81%
Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models
-
Manga109 revised to correct 29,000 dialogue annotations
Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding