archive
Every paper Pith has read.
7661 papers in cs.CL · page 8
-
Modular platform enables concurrent LLM evaluation
OpenCompass: A Universal Evaluation Platform for Large Language Models
-
English pivots cut causal grounding of explanations by up to 5.7x
Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations
-
DECOR scores LLM responses on four manipulation dimensions for deception
DECOR: Auditing LLM Deception via Information Manipulation Theory
-
End-to-end models output formal text straight from Chinese speech
FormalASR: End-to-End Spoken Chinese to Formal Text
-
Language access managers accept AI but require human oversight
AI Technologies in Language Access: Attitudes Towards AI and the Human Value of Language Access Managers
-
Step-level scores flag reasoning errors in closed LLMs
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution
-
Fine-tuning on fMRI boosts ECoG language predictions
Fine-tuning language encoding models on slow fMRI improves prediction for fast ECoG
-
LLM uncertainty scores only measure output consistency
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
-
LLM judges spot agent failures less than half the time
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
-
Recurrent router matches MoA accuracy with fewer active agents
MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent
-
English prompts improve LLM diagnostic accuracy over French
Prompting language influences diagnostic reasoning and accuracy of large language models
-
Agents launch unsafe actions after benign errors in 65% of trials
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
-
Local attack and support calls stabilize global argument rankings
GRASP: Deterministic argument ranking in interaction graphs
-
One model trained on text and time series matches both specialists
Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
-
VLMs need tight data alignment and miss weak signals in egocentric video
EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data
-
Benchmark shows 15-31 point headroom for better AI delegation
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
-
Graph separation shows public channels carry all indirect private influence
Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels
-
Bounded ReAct loop boosts zero-shot DST by 14 points
ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking
-
ElevenLabs Scribe v2 leads on code-switched Arabic
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
-
ElevenLabs Scribe leads on code-switched ASR with 13.2% WER
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
-
ElevenLabs ASR leads on code-switched speech at 13 percent error
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
-
Model scaling outpaces evaluation capacity in low-resource NLP
The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints
-
Control layer above optimizer keeps LLM training stable under stress
Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency
-
Adaptive block selection matches full attention at 75% sparsity
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
-
Code harness turns LLMs into verifiable AI agents
Code as Agent Harness
-
Active exploration outperforms passive in spatial intelligence tasks
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
-
Self-distillation from crops boosts MLLM detail recognition
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
-
LLM fact recall improves with model size and topic frequency in data
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency
-
Multi-dimensional preferences resist reward hacking in LLM training
General Preference Reinforcement Learning
-
Multi-dimensional preferences stop reward hacking in LLM reinforcement learning
General Preference Reinforcement Learning
-
Multi-dimensional preferences prevent reward hacking in LLM alignment
General Preference Reinforcement Learning
-
EnvFactory uses 85 environments for 15% tool-use gains
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL
-
FL nearly matches centralized results for depression detection
FedMental: Evaluating Federated Learning for Mental Health Detection from Social Media Data
-
Generative AI ads intervene in model generation rather than visible placements
Generative AI Advertising as a Problem of Trustworthy Commercial Intervention
-
Config choices rival model selection on GIM benchmark
GIM: Evaluating models via tasks that integrate multiple cognitive domains
-
Human soft labels improve calibration and training stability
An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration
-
Backdoor circuit routes trigger to switch model language output
Language-Switching Triggers Take a Latent Detour Through Language Models
-
Trained MoE models skip over half their experts after adaptation
Post-Trained MoE Can Skip Half Experts via Self-Distillation
-
Token statistics on expert solutions forecast LLM performance
Forecasting Downstream Performance of LLMs With Proxy Metrics
-
Memory of past evaluations improves rubric updates for RL
AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
-
Stripping consent declarations raises overeager rate in coding agents
Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks
-
Meta-cognitive configurator lifts agent persuasion success rates
MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion
-
Embeddings and clustering unify inconsistent IS constructs
GUT-IS: A Data-Driven Approach to Integrating Constructs and Their Relations in Information Systems
-
Memory systems score 27.9% under fact interference in long contexts
MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems
-
Readers regress to likely error sites in garden-path sentences
Readers make targeted regressions to plausible errors in reanalysis of "noisy-channel garden-path" sentences
-
Probe trajectories predict model future better than static checks
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
-
Frontier LLMs score under 40% on dynamic tool-use benchmark
STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics
-
Continuous diffusion scales to 20x compute gap of autoregressive models
Continuous Diffusion Scales Competitively with Discrete Diffusion for Language
-
Judging ICL demonstration success yields 23x speedup and higher accuracy
Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection
-
Fine-tuning lifts Ancient-to-Modern Greek translation by 10 BLEU points
Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models