Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
Introduces TA-MDP and proves GRPO convergence at O(1/sqrt(T)), a reward decomposition bound, and PAC-Bayes generalization for tool-augmented LVLM policies.
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
citing papers explorer
-
AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?
Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.
-
Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization
Introduces TA-MDP and proves GRPO convergence at O(1/sqrt(T)), a reward decomposition bound, and PAC-Bayes generalization for tool-augmented LVLM policies.
-
Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.