Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
The cost of dynamic reasoning: Demystifying ai agents and test- time scaling from an ai infrastructure perspective.arXiv preprint arXiv:2506.04301
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
AgentCARD benchmark shows heterogeneous LLM agent teams with mixed deployments reach the cost-accuracy frontier, delivering up to 44% higher accuracy or 12x lower cost than uniform teams, with domain-specific role bottlenecks.
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
Agentic workloads with context caching become decode-dominated with high KV-cache reuse and show tool use shifting from early read/explore to later execute/write phases.
Frontier LLMs homogenize toward systematic and analytical personalities, suppressing emotional traits like remorseful or sycophantic, indicating an implicit consensus on optimal assistant behavior.
The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.
AgentStop uses execution signals to early-terminate failing local LLM agent trajectories, cutting energy use 15-20% with minimal utility loss.
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
citing papers explorer
-
Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
-
Specialize Roles, Mix Deployments: Pushing the Cost-Accuracy Frontier of LLM Agent Teams
AgentCARD benchmark shows heterogeneous LLM agent teams with mixed deployments reach the cost-accuracy frontier, delivering up to 44% higher accuracy or 12x lower cost than uniform teams, with domain-specific role bottlenecks.
-
Inference-Time Budget Control for LLM Search Agents
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
-
Agentic AI Workload Characteristics
Agentic workloads with context caching become decode-dominated with high KV-cache reuse and show tool use shifting from early read/explore to later execute/write phases.
-
Same Voice, Different Lab: On the Homogenization of Frontier LLM Personalities
Frontier LLMs homogenize toward systematic and analytical personalities, suppressing emotional traits like remorseful or sycophantic, indicating an implicit consensus on optimal assistant behavior.
-
Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.
-
AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices
AgentStop uses execution signals to early-terminate failing local LLM agent trajectories, cutting energy use 15-20% with minimal utility loss.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.