archive
Every paper Pith has read. Search by title, abstract, or pith.
7661 papers in cs.CL · page 11
-
Stigmatizing language skews LLMs toward less aggressive medical advice
Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making
-
ChemVA lifts LLMs on chemical diagrams by 20 points
ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding
-
LLMs annotate Mandarin narratives nearly as well as humans
LLMs for automatic annotation of Mandarin narrative transcripts
-
AI models barely beat baseline on pluralistic community moderation
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
-
Many models show weaker safety in English than low-resource languages
Why Do Safety Guardrails Degrade Across Languages?
-
On-device specs match cloud accuracy on 4 of 8 benchmarks
OpenJarvis: Personal AI, On Personal Devices
-
Explicit provenance required to compute AI responsibility
Responsible Agentic AI Requires Explicit Provenance
-
Low-cost adapters enable multimodal LLMs for low-resource languages
Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages
-
Models collapse on multi-sequence brain MRI questions
UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation
-
VLMs collapse on multi-sequence brain tumor MRI scans
UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation
-
Small attention-head sets suppress deceptive commitment across environments
The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning
-
Router matches top LLM quality at half the cost
HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools
-
Three agents boost medical QA accuracy by 6.46 points
SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning
-
Density weighting recovers 8.7 OCR points in hybrid VLM distillation
HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation
-
Auto-generated reasoning chains lift ICL accuracy on multi-step tasks
ACIL: Auto Chain of Thoughts for In-Context Learning
-
Scale decides if language model geometry stays organized for prediction
Scale Determines Whether Language Models Organize Representation Geometry for Prediction
-
Top LLMs cover only 47.8% of real consumer reactions
Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench
-
LLM agent builds traceable knowledge graphs autonomously
RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation
-
AI agents differ sharply in solo ML model training on one GPU
1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?
-
Agentic cycle makes translation serve communication goals first
Agentic AI Translate: An Agentic Translator Prototype for Translation as Communication Design
-
Self-evolution trains math-reasoning LLMs with under 2K samples
D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning
-
Prompt leaks let simple text match fake hallucination detection
PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts
-
Algorithmic feeds reshape how users write
Algorithmic Cultivation: How Social Media Feeds Shape User Language
-
Every string over its alphabet is a valid program
The IsalProgram Programming Language
-
HalluScore benchmarks LLM hallucination in question answering
HalluScore: Large Language Model Hallucination Question Answering Benchmark
-
Fine-tuning stabilizes LLM personality scores but accuracy stays near chance
Evaluation Drift in LLM Personality Induction: Are We Moving the Goalpost?
-
Transformers recover item difficulty signal from wording alone
Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning
-
Test-time skill synthesis raises LLM agent success rates
Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents
-
Two-stage adapters put LLM first in coreference task
Closing the Gap at CRAC 2026: Two-Stage Adaptation for LLM-Based Multilingual Coreference Resolution
-
Two-stage adapters lead LLM multilingual coreference resolution
Closing the Gap at CRAC 2026: Two-Stage Adaptation for LLM-Based Multilingual Coreference Resolution
-
EEG shows why people miss some AI hallucinations
How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study
-
Diffusion LLMs learn faster decoding by rolling back mistakes
Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers
-
Reasoning effort fails to change LRM alignment with humans
Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models
-
Full-attention LLMs sparsify in hundreds of steps
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
-
Pinyin and glyph features fix homophone errors in Chinese keyword filtering
JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR
-
LLM trading alpha is not deployment evidence
The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence
-
DriveSafe uses scene captions to improve driving risk detection
DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios
-
Expert targets raise merged-model 4-bit accuracy from 35% to 77%
E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring
-
Multiple translations become one benchmark for Pali
PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks
-
Mixing a model's own predictions lets it add facts without forgetting old skills
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
-
MixSD retains 100% of base skills while injecting new facts
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
-
Induced patterns let VLMs plan beyond single-step vision
Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction
-
First structured dataset released for Indian RTI decisions
RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis
-
Block-union tables cut chunked prefill attention time by 2.72x
CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection
-
Diffusion code generation meets constraints through local edits
Constrained Code Generation with Discrete Diffusion
-
Decoupling KL and prefixes creates four LLM distillation objectives
Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
-
LLM confidence trajectories separate correct reasoning without content
Confidence Geometry Reveals Trace-Level Correctness in Large Language Model Reasoning
-
AI agents reach 6.89x GPU kernel speedups but drop on unseen shapes
AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents
-
Eight calibration passes set LoRA ranks by layer
FIM-LoRA: Task-Informative Rank Allocation for LoRA via Calibration-Time Gradient-Variance Estimation
-
Execution rewards keep tool accuracy above 90% at depth 6
TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition