hub Mixed citations

TextGrad: Automatic "Differentiation" via Text

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin · 2024 · cs.CL · arXiv 2406.07496

Mixed citation behavior. Most common role is background (62%).

67 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 67 citing papers arXiv PDF

abstract

AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in its early days until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework performing automatic ``differentiation'' via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch's syntax and abstraction and is flexible and easy-to-use. It works out-of-the-box for a variety of tasks, where the users only provide the objective function without tuning components or prompts of the framework. We showcase TextGrad's effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from $51\%$ to $55\%$, yields $20\%$ relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next-generation of AI systems.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 method 2 baseline 1

citation-polarity summary

background 5 use method 2 baseline 1

representative citing papers

VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents

cs.RO · 2026-06-03 · unverdicted · novelty 7.0

VASO is a verification-guided self-evolution framework for LLM robot skill contracts that reaches 97.2% formal-specification compliance on Jackal and quadcopter tasks using under 100 samples.

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

cs.LG · 2026-05-29 · unverdicted · novelty 7.0 · 2 refs

PromptPO shows LLMs can act as black-box policy optimizers for sequential RL when leveraging prior knowledge, matching baselines in exploration and robotics but underperforming in MuJoCo.

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Life-Harness evolves reusable interventions from training trajectories to enhance frozen LLM agents on unseen tasks across seven deterministic environments, yielding 88.5% average relative improvement in 116 of 126 model-environment settings.

Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Causal diagnosis identifies the routing module as bottleneck in LLM agents but prompt patching there degrades results due to linguistic co-adaptation, while upstream patching improves them.

TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.

Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

ReElicit uses LLMs to elicit adaptive feature embeddings for Gaussian process Bayesian optimization of system prompts under aggregate-only feedback, outperforming baselines across ten tasks with a 30-evaluation budget.

Harnessing Agentic Evolution

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding

cs.AI · 2026-05-07 · conditional · novelty 7.0

Full factorial testing of five LLM agent components reveals that the complete 'All-In' combination is consistently outperformed by smaller subsets due to cross-component interference, with optimal subsets being task- and scale-dependent.

TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

cs.SE · 2026-05-04 · unverdicted · novelty 7.0

TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

cs.LG · 2026-04-19 · unverdicted · novelty 7.0

RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than single-pass models like LigandMPNN.

Meta-Harness: End-to-End Optimization of Model Harnesses

cs.AI · 2026-03-30 · unverdicted · novelty 7.0

Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across held-out models.

FVRuleLearner: Operator-Level Reasoning Tree (OP-Tree)-Based Rules Learning for Formal Verification

cs.AR · 2026-03-06 · unverdicted · novelty 7.0

FVRuleLearner introduces an Operator Reasoning Tree to learn operator-specific rules that improve natural-language to SystemVerilog assertion generation, raising syntax correctness by 3.95% and functional correctness by 31.17% over baselines.

Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

cs.AI · 2026-01-22 · conditional · novelty 7.0

DeepVerifier enables self-evolving deep research agents via rubric-guided verification at test time, delivering 8-11% accuracy gains on GAIA and XBench-DeepSearch subsets.

Gen-n-Val: Agentic Image Data Generation and Validation

cs.CV · 2025-06-05 · conditional · novelty 7.0

Gen-n-Val uses LLM and VLLM agents with Layer Diffusion and TextGrad to generate and validate synthetic instance data, cutting invalid samples from 50% to 7% and improving rare-class performance on LVIS and COCO benchmarks.

Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation

cs.HC · 2024-09-23 · unverdicted · novelty 7.0

Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity support than a baseline LLM in a user study with CS researchers.

Automated Design of Agentic Systems

cs.AI · 2024-08-15 · conditional · novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.

Teaching LLMs to Recommend and Defer in Underrepresented Epilepsy Care

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

MANANA prompt-learning framework improves LLM prescription prediction accuracy by 4-8 points on Ugandan epilepsy cohorts and supports selective deferral at 95-99% precision on high-confidence cases.

Automating the Design of Embodied AgentArchitectures

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

Automated architecture search for embodied agents produces directional success-rate gains on vision-language and manipulation tasks while exposing limits from simulation noise and incomplete credit assignment.

Towards Spec Learning: Inference-Time Alignment from Preference Pairs

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

Proposes compiling preference pairs into readable natural-language specifications for inference-time LLM alignment, claiming outperformance over DPO on dense-preference domains.

VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

VTOS jointly searches solution and observer programs to adaptively orchestrate vision tools, outperforming static pipelines on dense object counting and zero-shot plant disease segmentation.

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

EvoDS adds autonomous skill acquisition via synthesis-validation-reuse and adaptive context compression via learned control within a two-stage multi-agent RL scheme, claiming 28.9% average gains over prior agents on four benchmarks plus elimination of out-of-token failures.

MUSE: A Unified Agentic Harness for MLLMs

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

MUSE is a unified agentic harness that improves off-the-shelf MLLMs on visual spatial planning, perception, multimodal reasoning, and fine-grained discrimination benchmarks through structured execution modules and verifier-guided repair without model retraining.

citing papers explorer

Showing 50 of 67 citing papers.

VASO: Formally Verifiable Self-Evolving Skills for Physical AI Agents cs.RO · 2026-06-03 · unverdicted · none · ref 21 · internal anchor
VASO is a verification-guided self-evolution framework for LLM robot skill contracts that reaches 97.2% formal-specification compliance on Jackal and quadcopter tasks using under 100 samples.
When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks? cs.LG · 2026-05-29 · unverdicted · none · ref 22 · 2 links · internal anchor
PromptPO shows LLMs can act as black-box policy optimizers for sequential RL when leveraging prior knowledge, matching baselines in exploration and robotics but underperforming in MuJoCo.
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents cs.AI · 2026-05-21 · unverdicted · none · ref 36 · internal anchor
Life-Harness evolves reusable interventions from training trajectories to enhance frozen LLM agents on unseen tasks across seven deterministic environments, yielding 88.5% average relative improvement in 116 of 126 model-environment settings.
Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines cs.CL · 2026-05-21 · unverdicted · none · ref 14 · internal anchor
Causal diagnosis identifies the routing module as bottleneck in LLM agents but prompt patching there degrades results due to linguistic co-adaptation, while upstream patching improves them.
TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization cs.CL · 2026-05-20 · unverdicted · none · ref 9 · internal anchor
TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.
Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts cs.AI · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
ReElicit uses LLMs to elicit adaptive feature embeddings for Gaussian process Bayesian optimization of system prompts under aggregate-only feedback, outperforming baselines across ten tasks with a 30-evaluation budget.
Harnessing Agentic Evolution cs.AI · 2026-05-13 · unverdicted · none · ref 38 · internal anchor
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
Learning, Fast and Slow: Towards LLMs That Adapt Continually cs.LG · 2026-05-12 · unverdicted · none · ref 67 · 2 links · internal anchor
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment cs.CL · 2026-05-08 · unverdicted · none · ref 28 · internal anchor
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding cs.AI · 2026-05-07 · conditional · none · ref 34 · internal anchor
Full factorial testing of five LLM agent components reveals that the complete 'All-In' combination is consistently outperformed by smaller subsets due to cross-component interference, with optimal subsets being task- and scale-dependent.
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments cs.SE · 2026-05-04 · unverdicted · none · ref 30 · internal anchor
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design cs.LG · 2026-04-19 · unverdicted · none · ref 17 · internal anchor
RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than single-pass models like LigandMPNN.
Meta-Harness: End-to-End Optimization of Model Harnesses cs.AI · 2026-03-30 · unverdicted · none · ref 55 · internal anchor
Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across held-out models.
FVRuleLearner: Operator-Level Reasoning Tree (OP-Tree)-Based Rules Learning for Formal Verification cs.AR · 2026-03-06 · unverdicted · none · ref 60 · internal anchor
FVRuleLearner introduces an Operator Reasoning Tree to learn operator-specific rules that improve natural-language to SystemVerilog assertion generation, raising syntax correctness by 3.95% and functional correctness by 31.17% over baselines.
Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification cs.AI · 2026-01-22 · conditional · none · ref 21 · internal anchor
DeepVerifier enables self-evolving deep research agents via rubric-guided verification at test time, delivering 8-11% accuracy gains on GAIA and XBench-DeepSearch subsets.
Gen-n-Val: Agentic Image Data Generation and Validation cs.CV · 2025-06-05 · conditional · none · ref 41 · internal anchor
Gen-n-Val uses LLM and VLLM agents with Layer Diffusion and TextGrad to generate and validate synthetic instance data, cutting invalid samples from 50% to 7% and improving rare-class performance on LVIS and COCO benchmarks.
Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation cs.HC · 2024-09-23 · unverdicted · none · ref 78 · internal anchor
Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity support than a baseline LLM in a user study with CS researchers.
Automated Design of Agentic Systems cs.AI · 2024-08-15 · conditional · none · ref 230 · internal anchor
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
Teaching LLMs to Recommend and Defer in Underrepresented Epilepsy Care cs.LG · 2026-06-30 · unverdicted · none · ref 4 · internal anchor
MANANA prompt-learning framework improves LLM prescription prediction accuracy by 4-8 points on Ugandan epilepsy cohorts and supports selective deferral at 95-99% precision on high-confidence cases.
Automating the Design of Embodied AgentArchitectures cs.RO · 2026-06-29 · unverdicted · none · ref 31 · internal anchor
Automated architecture search for embodied agents produces directional success-rate gains on vision-language and manipulation tasks while exposing limits from simulation noise and incomplete credit assignment.
Towards Spec Learning: Inference-Time Alignment from Preference Pairs cs.CL · 2026-06-22 · unverdicted · none · ref 120 · internal anchor
Proposes compiling preference pairs into readable natural-language specifications for inference-time LLM alignment, claiming outperformance over DPO on dense-preference domains.
VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers cs.CV · 2026-06-17 · unverdicted · none · ref 16 · internal anchor
VTOS jointly searches solution and observer programs to adaptively orchestrate vision tools, outperforming static pipelines on dense object counting and zero-shot plant disease segmentation.
EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management cs.AI · 2026-06-02 · unverdicted · none · ref 74 · internal anchor
EvoDS adds autonomous skill acquisition via synthesis-validation-reuse and adaptive context compression via learned control within a two-stage multi-agent RL scheme, claiming 28.9% average gains over prior agents on four benchmarks plus elimination of out-of-token failures.
MUSE: A Unified Agentic Harness for MLLMs cs.CV · 2026-06-02 · unverdicted · none · ref 60 · internal anchor
MUSE is a unified agentic harness that improves off-the-shelf MLLMs on visual spatial planning, perception, multimodal reasoning, and fine-grained discrimination benchmarks through structured execution modules and verifier-guided repair without model retraining.
FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search cs.AI · 2026-05-30 · unverdicted · none · ref 37 · internal anchor
FALAT improves failure attribution in LLM agent trajectories via dependency-guided search, achieving 46.0% step-level accuracy on algorithm-generated and 29.1% on hand-crafted trajectories in the Who&When benchmark.
ExpGraph: Model-Agnostic Experience Learning with Graph-Structured Memory for LLM Agents cs.CL · 2026-05-29 · unverdicted · none · ref 54 · internal anchor
ExpGraph builds a graph of summarized agent experiences and uses graph diffusion plus an RL-trained retrieval copilot to improve frozen LLM executors on QA, math, code, and agentic tasks without parameter updates.
LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring cs.CL · 2026-05-26 · unverdicted · none · ref 21 · internal anchor
Training-free prompt optimization methods, including five new education-focused ones, surpass the strongest RL-trained baseline across five conditions on two OOD suites while showing distinct teaching behavior patterns.
Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis cs.CL · 2026-05-26 · unverdicted · none · ref 57 · internal anchor
Observational causal-inspired analysis finds prompt optimization failures arise from systematic interactions between edit families and task characteristics rather than random artifacts.
SPEAR: Code-Augmented Agentic Prompt Optimization cs.CL · 2026-05-25 · unverdicted · none · ref 5 · internal anchor
SPEAR augments automatic prompt engineering with an agent that uses evaluate, python, set_prompt and finish tools plus auto-rollback guardrails, outperforming baselines on industrial LLM-judge tasks and BBH by leveraging Python-authored structural error analysis.
SkillOpt: Executive Strategy for Self-Evolving Agent Skills cs.AI · 2026-05-22 · unverdicted · none · ref 41 · 2 links · internal anchor
SkillOpt introduces a controllable text-space optimizer that evolves agent skills via add/delete/replace edits accepted only on strict held-out validation improvement, reporting consistent gains across 52 model-benchmark-harness combinations.
Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents cs.AI · 2026-05-21 · conditional · none · ref 16 · internal anchor
Ratchet provides a minimal hygiene recipe for self-managing skill libraries in frozen LLM agents, delivering +0.328 rolling-mean pass@1 gain on MBPP+ hard-100 and +0.22 peak lift on SWE-bench Verified.
optimize_anything: A Universal API for Optimizing any Text Parameter cs.CL · 2026-05-19 · unverdicted · none · ref 32 · internal anchor
A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.
Harnesses for Inference-Time Alignment over Execution Trajectories cs.LG · 2026-05-15 · unverdicted · none · ref 35 · internal anchor
Partial harnesses for LLM agents, specifying only initial execution steps, achieve higher pass rates than fully decomposed workflows, as analyzed through trajectory alignment and validated in synthetic and terminal benchmarks.
Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering cs.CL · 2026-05-15 · conditional · none · ref 37 · internal anchor
NCCE reframes context engineering as instance-level recommendation via bootstrapped anchor contexts and a co-evolving neural collaborative filtering router that assigns specialized contexts per input.
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement cs.AI · 2026-05-11 · unverdicted · none · ref 23 · internal anchor
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks cs.CR · 2026-05-10 · unverdicted · none · ref 43 · internal anchor
MT-JailBench is a modular benchmark that standardizes evaluation of multi-turn jailbreaks to identify key success drivers and enable stronger combined attacks.
AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems cs.LG · 2026-05-09 · unverdicted · none · ref 7 · internal anchor
AgentSlimming compresses graph-structured multi-agent systems by estimating agent importance and removing or replacing low-value agents, cutting token costs by up to 78.9% with negligible performance loss.
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration cs.LG · 2026-05-08 · unverdicted · none · ref 32 · internal anchor
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.
Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization cs.AI · 2026-04-22 · unverdicted · none · ref 3 · internal anchor
TPGO represents multi-agent systems as graphs of textual parameters and applies group relative optimization to enable self-improvement from execution history.
How Far Are Video Models from True Multimodal Reasoning? cs.CV · 2026-04-21 · unverdicted · none · ref 89 · internal anchor
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis cs.AI · 2026-04-20 · unverdicted · none · ref 9 · internal anchor
ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization cs.AI · 2026-04-19 · unverdicted · none · ref 114 · internal anchor
SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.
Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees cs.AI · 2026-04-13 · unverdicted · none · ref 12 · internal anchor
POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.
Pioneer Agent: Continual Improvement of Small Language Models in Production cs.AI · 2026-04-10 · unverdicted · none · ref 103 · internal anchor
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
Reflective Context Learning: Studying the Optimization Primitives of Context Space cs.LG · 2026-04-03 · unverdicted · none · ref 13 · internal anchor
Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene
Self-Optimizing Multi-Agent Systems for Deep Research cs.IR · 2026-04-03 · unverdicted · none · ref 19 · internal anchor
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
UNBOX: Unveiling Black-box visual models with Natural-language cs.CV · 2026-03-09 · unverdicted · none · ref 9 · internal anchor
UNBOX recovers interpretable text concepts that maximally activate classes in black-box vision models by recasting activation maximization as semantic search with LLMs and diffusion models.
Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search cs.LG · 2026-03-02 · unverdicted · none · ref 34 · internal anchor
Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.
Memory in the Age of AI Agents cs.CL · 2025-12-15 · unverdicted · none · ref 110 · internal anchor
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory cs.AI · 2025-11-26 · unverdicted · none · ref 44 · internal anchor
ViLoMem is a dual-stream grow-and-refine memory system that separates visual and logical error patterns in MLLMs to improve pass@1 accuracy and reduce repeated mistakes across six multimodal benchmarks.

TextGrad: Automatic "Differentiation" via Text

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer