archive
Every paper Pith has read.
14513 papers in cs.AI · page 17
-
FAGER metric leads in factual checks for AI image generators
FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models
-
One model predicts shapes for many tendon-driven continuum robots
Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots
-
Benchmark shows 15-31 point headroom for better AI delegation
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
-
ScheduleFree+ beats WSD schedules on long LLM pretraining
ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
-
LLM elicits dynamic features to optimize system prompts
Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts
-
Graph separation shows public channels carry all indirect private influence
Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels
-
MANGO achieves top results in online continual learning benchmarks
MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning
-
Bounded ReAct loop boosts zero-shot DST by 14 points
ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking
-
CRAFT pipeline leads MAGMaR video QA at 0.739 average
CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering
-
Multi-horizon training captures longer solar forecast dependencies
Learning Long-Term Temporal Dependencies in Photovoltaic Power Output Prediction Through Multi-Horizon Forecasting
-
Networks on correlation matrices beat SPD and Grassmannian baselines
Riemannian Networks over Full-Rank Correlation Matrices
-
ElevenLabs ASR leads on code-switched speech at 13 percent error
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
-
ElevenLabs Scribe v2 leads on code-switched Arabic
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
-
ElevenLabs Scribe leads on code-switched ASR with 13.2% WER
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
-
AI agents simulate employee responses to AI workplace changes
Toward an AI-Powered Computational Testbed for Workforce Policy
-
LiFT lifts 2D generators to coherent 3D medical volumes
LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators
-
KVBuffer cuts linear attention decoding latency by up to 45%
KVBuffer: IO-aware Serving for Linear Attention
-
Vision LLMs grade handwritten math with high accuracy
Automated Grading of Handwritten Mathematics Using Vision-Capable LLMs
-
Gradient projection and orthogonalization cut multi-task unlearning interference
Interference-Aware Multi-Task Unlearning
-
Agent networks need trust built in from the start
Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On
-
RL fine-tuning aligns traffic simulations with real data
RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
-
Hybrid KAN-MLP raises F1 scores 5.33% in IMU activity recognition
KAN-MLP-Mixer: A comprehensive investigation of the usage of Kolmogorov-Arnold Networks (KANs) for improving IMU-based Human Activity Recognition
-
Multi-agent LLM method hits 78.1% accuracy on NL2SQL benchmark
AgentNLQ: A General-Purpose Agent for Natural Language to SQL
-
Control layer above optimizer keeps LLM training stable under stress
Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency
-
Oracle routing lifts selective refusal scores by 12.9 points
Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing
-
Distillation transfers linearized task arithmetic to non-linear models
Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic
-
Distillation gives non-linear models linearized task arithmetic
Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic
-
Treat AI models as untrusted to secure agents
Agent Security is a Systems Problem
-
Agent security requires system-level enforcement treating models as untrusted
Agent Security is a Systems Problem
-
TRIAD bounds time-to-failure for multi-turn multimodal attacks
Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks
-
Self-supervised backbones boost artwork classification
Harnessing Self-Supervised Features for Art Classification
-
Synthetic prior with stress and realism lifts tabular model performance
Shaping the Prior: How Synthetic Task Distributions Determine Tabular Foundation Model Quality
-
Adaptive block selection matches full attention at 75% sparsity
DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
-
Code harness turns LLMs into verifiable AI agents
Code as Agent Harness
-
Active exploration outperforms passive in spatial intelligence tasks
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
-
Neural architecture learns object state manifolds from sensor data
WorldString: Actionable World Representation
-
Neural architecture learns object state changes from 3D scans
WorldString: Actionable World Representation
-
Self-distillation from crops boosts MLLM detail recognition
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
-
AI medical advisors underweight patient autonomy
What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models
-
PHR context boosts helpfulness of LLM health answers
Evaluating the Utility of Personal Health Records in Personalized Health AI
-
LLM fact recall improves with model size and topic frequency in data
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency
-
Benchmark tests dexterous Texas Hold'em play at 61 percent success
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System
-
Segmentation proxy aligns multimodal understanding and generation
Semantic Generative Tuning for Unified Multimodal Models
-
Distilled students match 90% AUC from health foundation models
Distilling Tabular Foundation Models for Structured Health Data
-
PopPy speeds Python AI apps up to 6.4x by parallelizing external calls
PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications
-
Tabular foundation models show little diversity for ensembling
Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap
-
Benchmark tests LLM agents on generating reusable skills
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents
-
LLM converts user prompts into optimization model patches
Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches
-
Multi-agent pipeline extracts traceable specs from legacy code
Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents
-
Perturbation metric scores and trains better AI explanations
Learning Quantifiable Visual Explanations Without Ground-Truth