archive
Every paper Pith has read.
7661 papers in cs.CL · page 9
-
Decoupling tool use from execution boosts LLM math reasoning
Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning
-
Wiki beats RAG on cross-paper links but costs more tokens
Vector RAG vs LLM-Compiled Wiki: A Preregistered Comparison on a Small Multi-Domain Research
-
Generator turns text prompts into LLM fingerprints in one pass
Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation
-
BERT and T5 differ in NER performance by tag scheme
From BERT to T5: A Study of Named Entity Recognition
-
Accuracy unchanged when latent visual tokens replaced by dummies
What's Holding Back Latent Visual Reasoning?
-
No memory method works consistently for LLM agents
EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective
-
Governed skill libraries boost frozen agents on benchmarks
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution
-
LLMs match human conditional ratings without pragmatic reasoning
Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs
-
Index lets researchers search 1.35 billion news articles in under a second
Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles
-
Self-distillation supplies step-level search signals from own rollouts
SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
-
Preference focus cuts on-device RAG memory by 2400x
From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG
-
K2V extends RLVR to knowledge domains via process verification
Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains
-
Shared codebook bridges modalities without full data pairs
CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook
-
MDU unlearns data in masked diffusion models by KL reversal
Machine Unlearning for Masked Diffusion Language Models
-
Multi-turn chats in low-resource languages jailbreak LLMs
Multilingual jailbreaking of LLMs using low-resource languages
-
SomaliWeb v1 delivers 303M tokens of cleaned Somali text
SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark
-
Memory of precomputed states cuts LLM prefix attention costs
Context Memorization for Efficient Long Context Generation
-
Speech audio accelerates MRI reconstruction of vocal tracts
SIREM: Speech-Informed MRI Reconstruction with Learned Sampling
-
GA-S2S adds k-hop graph structure to raise link prediction 19%
Leveraging Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction
-
Varying environment rules builds agents that generalize
Scalable Environments Drive Generalizable Agents
-
One universal fix reduces hallucinations in 15 models
TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction
-
Hybrid system generates natural sentences from nested logic
FOL2NS: Generating Natural Sentences from First-Order Logic
-
Explanation guidelines lift LLM prompt accuracy by 35 percent
iPOE: Interpretable Prompt Optimization via Explanations
-
Bangla medical questions trip up top AI models
How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking
-
German news overreports European landslides vs risk data
How Loud Rumbles Hit Newsstands: A Data Analysis of Coverage and Spatial Bias in German News about Landslides Around the World
-
Grafting MoE-expanded deltas adds languages to LLMs efficiently
A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAMΔ Integration into Upcycled MoE
-
Low-precision softmax transformers simulate Turing machines via CoT
The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought
-
KVDrive lifts long-context LLM speed 1.74x with SSD tier
KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference
-
P2P edge agents boost LLM task accuracy 8% and reduce latency 16%
PPAI: Enabling Personalized LLM Agent Interoperability for Collaborative Edge Intelligence
-
Boundary protection recovers 69-90% quality at 13% KV retention
Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction
-
Tool localizes node errors in multi-agent LLM workflows
PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows
-
Reranking by label semantics lifts hard-case F1 by over 9 points
Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling
-
Neural tweaks make read speech sound like real conversation
Bridging the Gap: Converting Read Text to Conversational Dialogue
-
Predictive prefetching cuts RAG latency up to 43.5%
Predictive Prefetching for Retrieval-Augmented Generation
-
LLM generates explicit vectorized code beating compiler -O3
AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code
-
BacktestBench tests LLMs on 18k backtesting QA pairs from real markets
BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting
-
Natural triggers drop sentiment accuracy to 0.04
Universal Adversarial Triggers
-
Prompt compression fails to transfer to diffusion LLMs
Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA
-
Transient expert steers MoE updates to cut forgetting
CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning
-
Benchmark turns NASA mission text into logic formulas
A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration
-
AI chunking builds maps predicting war in Thucydides model
Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap
-
AI agent teams beat human teams at generating creative ideas
Multi-agent AI systems outperform human teams in creativity
-
Hindsight targets fix actions to cut agent training time 2.26x
HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents
-
New multi-accent dataset lowers ASR errors on technical talks
PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions
-
SynPro yields 3.7-5.2x more effective tokens from organic data
Generating Pretraining Tokens from Organic Data for Data-Bound Scaling
-
Retrieval system compresses Lean proofs over 70 percent
Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search
-
Memory-equipped agents show rising safety risks over time
Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents
-
Memory systems score 0.12-0.18 on social group benchmark
SocialMemBench: Are AI Memory Systems Ready for Social Group Settings?
-
LLM-rephrased notes keep broad utility but lose ICD details
Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale
-
Fine-tuned small models plan with tools without any catalog in the prompt
Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning