archive
Every paper Pith has read. Search by title, abstract, or pith.
7661 papers in cs.CL · page 7
-
LLM use adds complex words and syntax to NLP papers
What Are LLMs Doing to Scientific Communication? Measuring Changes in Writing Practices and Reading Experience
-
Context map cache raises LLM agent accuracy 6-34% on recurring tasks
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
-
Scorer choice sets the layer where authorship signals consolidate
Where Does Authorship Signal Emerge in Encoder-Based Language Models?
-
Model learns when to skip tools for better multimodal answers
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
-
Influence functions fix model errors via key sample and concept tweaks
CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models
-
Dense benchmark exposes open VLMs' gaps on subtle human actions
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
-
Open VLMs struggle with fine details in human video actions
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
-
Dual-stream network lifts weather detection at full speed
CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving
-
Scaled simulations cut speech recognition errors over 30 percent
Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
-
Temporal conditioning changes AV planner style but not scores
From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning
-
Rubric shows LLMs generate mostly high-quality legal propositions
LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation
-
Section-based chunking tops recall in German legal retrieval
Chunking German Legal Code
-
LLMs generate coherent multimodal behaviors for ability and benevolence
Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs
-
Long-term medical dialogue benchmark reveals LLM limitations
Synthesis and Evaluation of Long-term History-aware Medical Dialogue
-
Pure code boosts programming but hurts complex math reasoning
What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code
-
Node topology turned into text improves graph anomaly detection
TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection
-
Fuzzy concept graph cuts RAG indexing to 30 LLM calls
ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation
-
Review of 120 studies maps LLM math reasoning gaps
Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges
-
Parser trained on CHILDES beats general tools on child speech
CAIT: A Syntactic Parsing Toolkit for Child-Adult InTeractions
-
84K Arabic samples built for Saudi financial sentiment analysis
LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets
-
LLMs fix West Frisian ASR errors on unseen texts
Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian
-
OScaR reaches near-lossless INT2 KV cache quantization
OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
-
2-bit LLMs retain most accuracy on reasoning tasks
K-Quantization and its Impact on Output Performance
-
One LLM system optimizes text to beat specialists on six tasks
optimize_anything: A Universal API for Optimizing any Text Parameter
-
New Chinese benchmark caps LLM logical accuracy at 37.5 percent
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
-
Open dataset and reweighting match big models in long-context RL
GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment
-
Governance recipe lifts LLM skill-library performance from 0.26 to 0.58
Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries
-
No multi-word expression is absolutely idiomatic
A Data-Driven Approach to Idiomaticity Based on Experts' Criteria in Theoretical Linguistics
-
One model serves many embedding sizes in retrieval
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
-
Merging LLMs into VLMs boosts instructions but not math
Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters
-
Base models fool AI detectors into rating text as human
Base Models Look Human To AI Detectors
-
Context management determines real-world Transformer Turing-completeness
Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management
-
TokenDrift cuts Gen-PPL by 89% at 4 steps in DDLMs
Drifting Objectives for Refining Discrete Diffusion Language Models
-
CEPO boosts math reasoning to 43.43% at 2B and 60.56% at 4B
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
-
Backtracking fixes dual biases in LLM reasoning distillation
Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation
-
Pairwise confidence weights sharpen LLM policy optimization
LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
-
Pairwise sums replace group means in LLM policy optimization
LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
-
Reassembling entity pairs boosts synthetic QA accuracy by 88.9%
EmbGen: Teaching with Reassembled Corpora
-
Entropy shaping makes LLMs concise on easy math and deeper on hard ones
Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning
-
Framework creates custom science benchmarks for LLMs from existing data
SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
-
Architecture lets AI agents break rules legitimately when justified
PAVE: A Cognitive Architecture for Legitimate Violation in Generative Agent Societies
-
Supreme Court quashes 18 points more matrimonial petitions than Karnataka HC
IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis
-
Retrieval rewriting lifts LLM calibration up to 58%
Retrieval-Augmented Linguistic Calibration
-
Benchmark labels hallucinations via explicit reference worlds
HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models
-
STAR-PólyaMath hits perfect scores on Putnam and IMO
STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision
-
LLMs close 99% of deals but earn low profits in hidden pricing
PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations
-
Multi-agent evaluators lock reading items to target difficulty
A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
-
Small targeted probes break document parsers as much as large ones
How Do Document Parsers Break? Auditing Structural Vulnerability in Document Intelligence
-
Metric selects only necessary rationales for LLM misinformation checks
Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection
-
LLMs learn redundant copies of concepts across languages
Language models struggle with compartmentalization