arxiv: 2605.06642 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.AI

Recognition: unknown

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Xiangyuan Xue , Yifan Zhou , Zidong Wang , Shengji Tang , Philip Torr , Wanli Ouyang , Lei Bai , Zhenfei Yin

Authors on Pith no claims yet

Pith reviewed 2026-05-08 09:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Strategic Trajectory AbstractionAgentic Reinforcement LearningLong-horizon TasksLLM AgentsHierarchical RolloutsALFWorldWebShopSciWorld

0 comments

The pith

StraTA conditions all agent actions on one compact strategy sampled once from the initial state to strengthen exploration and credit assignment in long-horizon tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models used as interactive agents often remain purely reactive, which limits their ability to explore effectively or assign credit across extended sequences. StraTA addresses this by sampling a single compact strategy from the starting task state and then conditioning every subsequent action on that fixed strategy. The method trains strategy generation together with action execution through a hierarchical rollout structure that incorporates diverse strategy sampling and self-judgment steps. On standard agent benchmarks the approach produces higher success rates and faster learning than reactive baselines.

Core claim

StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design, further enhanced by diverse strategy rollout and critical self-judgment.

What carries the argument

The explicit trajectory-level strategy sampled once from the initial state that is used to condition every later action.

If this is right

Success rates on household simulation tasks rise above 90 percent.
Web-based shopping tasks reach success rates above 80 percent with fewer samples.
Scientific reasoning environments yield overall scores that exceed those of some closed-source models.
Sample efficiency improves across multiple interactive agent benchmarks.
Joint training of strategy and action modules produces more coherent trajectories than purely reactive policies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same once-per-trajectory strategy idea could be tested in robotic control settings where replanning frequency is costly.
If the strategy abstraction works, it suggests that other hierarchical RL methods might benefit from freezing high-level guidance early rather than regenerating it at every step.
Longer tasks than those tested might require mechanisms to update or switch strategies mid-trajectory without losing the credit-assignment benefit.
The approach may reduce the total number of environment interactions needed to reach a given performance level in any domain where credit must propagate over dozens of steps.

Load-bearing premise

A single strategy chosen at the start remains useful and non-restrictive for guiding actions across the entire length of a long task.

What would settle it

An ablation in which removing the strategy-conditioning step produces no change or a drop in success rate on long-horizon tasks such as ALFWorld would falsify the central claim.

read the original abstract

Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design, further enhanced by diverse strategy rollout and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, StraTA attains a 63.5% overall score, outperforming frontier closed-source models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

StraTA gets solid benchmark numbers by fixing one strategy at the start and training it hierarchically, but the abstract gives no ablations to show that fixed conditioning is what actually helps.

read the letter

StraTA samples a compact strategy from the initial state and then conditions every later action on that same strategy. The training uses a hierarchical GRPO-style rollout that also adds diverse strategy sampling and a self-judgment step to critique the choices. The authors train the strategy generator and the action policy jointly under this setup. On the reported runs, this produces 93.1 percent success on ALFWorld, 84.2 percent on WebShop, and a 63.5 percent overall score on SciWorld that beats some closed-source models. Those are the kinds of numbers that would interest anyone already running RL on LLM agents for multi-step tasks. The method itself is straightforward enough that groups could implement the pieces without much extra machinery. The main gap is that the abstract supplies no ablation that turns the fixed-strategy conditioning on and off, nor any test of whether the strategy needs to be refreshed partway through an episode. Without those controls it is hard to know whether the gains come from the initial abstraction or simply from the extra rollout diversity and self-judgment. The stress-test concern about the strategy going stale in changing environments therefore still stands on the information given. The work stays purely empirical; there are no derivations or formal checks. Researchers who fine-tune agents for embodied or web environments would find the recipe worth trying on their own benchmarks. I would send it to peer review once the authors add the missing ablations, baseline details, and basic statistical reporting, because the evaluation settings are the right ones and the core idea is easy to test.

Referee Report

3 major / 2 minor

Summary. The paper introduces Strategic Trajectory Abstraction (StraTA), a framework for LLM-based agents that samples a single compact strategy from the initial task state and conditions all subsequent actions on it. Training uses a hierarchical GRPO-style rollout that jointly optimizes strategy generation and action execution, with additions for diverse strategy sampling and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld report consistent gains in sample efficiency and final performance over baselines, with success rates of 93.1% (ALFWorld), 84.2% (WebShop), and 63.5% overall (SciWorld), exceeding some closed-source models.

Significance. If the empirical results are robust, StraTA offers a lightweight way to inject trajectory-level planning into reactive LLM agents, potentially improving exploration and credit assignment in long-horizon interactive tasks. The reported numbers on three standard benchmarks indicate practical value for agentic RL, and the hierarchical rollout design is a simple, implementable extension of existing GRPO methods. The work is entirely empirical with no parameter-free derivations or machine-checked proofs, so its theoretical contribution is limited to the specific mechanism tested.

major comments (3)

[Experiments] Experiments section: the central claim that the fixed initial strategy improves exploration and credit assignment rests on the assumption that this strategy remains relevant across state changes, yet no ablation is reported that varies strategy update frequency, removes strategy conditioning, or tests contingency handling. Without these controls it is impossible to attribute the reported gains (93.1% ALFWorld, 84.2% WebShop, 63.5% SciWorld) to the abstraction mechanism rather than the GRPO design or self-judgment component.
[Abstract] Abstract and Experiments section: performance numbers are stated without baseline implementation details, number of independent runs, error bars, statistical tests, or data-exclusion rules. This absence directly affects soundness of the claim that StraTA “consistently improves” over strong baselines, as post-hoc choices cannot be ruled out.
[Method] Method section: the hierarchical GRPO rollout is described at a high level but lacks explicit equations or pseudocode showing how the joint reward is computed across strategy and action levels or how the self-judgment signal is incorporated into the policy gradient. This makes it difficult to verify that the training procedure is correctly aligned with the stated objective of incentivizing strategic abstraction.

minor comments (2)

[Experiments] Figure captions and axis labels in the experimental plots could be expanded to include the exact baseline names and the number of trajectories used for each curve.
The paper would benefit from a short related-work paragraph explicitly contrasting StraTA with prior hierarchical RL and plan-and-execute agent methods.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [Experiments] Experiments section: the central claim that the fixed initial strategy improves exploration and credit assignment rests on the assumption that this strategy remains relevant across state changes, yet no ablation is reported that varies strategy update frequency, removes strategy conditioning, or tests contingency handling. Without these controls it is impossible to attribute the reported gains (93.1% ALFWorld, 84.2% WebShop, 63.5% SciWorld) to the abstraction mechanism rather than the GRPO design or self-judgment component.

Authors: We agree that additional controls are needed to isolate the contribution of the initial strategy abstraction. In the revised manuscript we will add ablations that (i) remove strategy conditioning while keeping the rest of the hierarchical GRPO and self-judgment pipeline intact and (ii) vary strategy update frequency (including re-sampling at fixed intervals). We will also analyze cases where the initial strategy becomes outdated and how the critical self-judgment component mitigates this. These experiments will be reported with the same evaluation protocol as the main results. revision: yes
Referee: [Abstract] Abstract and Experiments section: performance numbers are stated without baseline implementation details, number of independent runs, error bars, statistical tests, or data-exclusion rules. This absence directly affects soundness of the claim that StraTA “consistently improves” over strong baselines, as post-hoc choices cannot be ruled out.

Authors: We acknowledge that the current presentation lacks sufficient experimental rigor details. The revised manuscript will expand the Experiments section with a new subsection that reports: exact baseline implementations and hyper-parameters, number of independent random seeds (with at least 5 runs per setting), standard error bars on all success rates and scores, results of paired statistical tests, and any data-exclusion rules applied. The abstract will be lightly revised to reference these details. revision: yes
Referee: [Method] Method section: the hierarchical GRPO rollout is described at a high level but lacks explicit equations or pseudocode showing how the joint reward is computed across strategy and action levels or how the self-judgment signal is incorporated into the policy gradient. This makes it difficult to verify that the training procedure is correctly aligned with the stated objective of incentivizing strategic abstraction.

Authors: We agree that the method description would benefit from greater formality. In the revised manuscript we will insert explicit equations for the hierarchical GRPO objective, showing the joint reward formulation that combines strategy-level and action-level returns, and the precise manner in which the self-judgment signal is used to modulate the advantage estimates and policy gradient. We will also add pseudocode for the full rollout and update loop. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical framework with independent experimental results

full rationale

The paper presents StraTA as an empirical method for agentic RL: it samples a compact strategy from the initial state, conditions actions on it via hierarchical GRPO rollouts, and augments with diverse strategy rollout and self-judgment. All performance claims (93.1% on ALFWorld, 84.2% on WebShop, 63.5% on SciWorld) are reported as outcomes of experiments against baselines on fixed benchmarks. No equations, derivations, or fitted parameters are described that would reduce any result to an input quantity by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The derivation chain is therefore self-contained as a description of a training procedure whose validity rests on external benchmark measurements rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger reflects high-level assumptions visible in the description. No new physical entities are postulated. Standard RL assumptions about hierarchical credit assignment are invoked without independent verification in the provided text.

axioms (1)

domain assumption Hierarchical decomposition of strategy and action improves long-horizon credit assignment in agentic RL
The framework is built on this premise to justify the joint training design.

pith-pipeline@v0.9.0 · 5493 in / 1243 out tokens · 40069 ms · 2026-05-08T09:54:16.762253+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

70 extracted references · 36 canonical work pages · 21 internal anchors

[1]

Science China Information Sciences , year=

The rise and potential of large language model based agents: A survey , author=. Science China Information Sciences , year=
[2]

Frontiers of Computer Science , year=

A survey on large language model based autonomous agents , author=. Frontiers of Computer Science , year=
[3]

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

The landscape of agentic reinforcement learning for llms: A survey , author=. arXiv preprint arXiv:2509.02547 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

International Conference on Learning Representations , year=

ReAct: Synergizing reasoning and acting in language models , author=. International Conference on Learning Representations , year=
[5]

Forty-first International Conference on Machine Learning , year=

Executable code actions elicit better llm agents , author=. Forty-first International Conference on Machine Learning , year=
[6]

Advances in Neural Information Processing Systems , year=

Reflexion: Language agents with verbal reinforcement learning , author=. Advances in Neural Information Processing Systems , year=
[7]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , year=

Reflact: World-grounded decision making in llm agents via goal-state reflection , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , year=

2025
[8]

arXiv preprint arXiv:2503.09572 , year =

Plan-and-act: Improving planning of agents for long-horizon tasks , author=. arXiv preprint arXiv:2503.09572 , year=

work page arXiv
[9]

Advances in Neural Information Processing Systems , year=

Toolformer: Language models can teach themselves to use tools , author=. Advances in Neural Information Processing Systems , year=
[10]

Findings of the Association for Computational Linguistics: ACL 2024 , year=

Agenttuning: Enabling generalized agent abilities for llms , author=. Findings of the Association for Computational Linguistics: ACL 2024 , year=

2024
[11]

Findings of the Association for Computational Linguistics: EMNLP 2024 , year=

Agentbank: Towards generalized llm agents via fine-tuning on 50000+ interaction trajectories , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , year=

2024
[12]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year=

Agentgym: Evaluating and training large language model-based agents across diverse environments , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year=
[13]

Agentgym-rl: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning

Agentgym-rl: Training llm agents for long-horizon decision making through multi-turn reinforcement learning , author=. arXiv preprint arXiv:2509.08755 , year=

work page arXiv
[14]

The Twelfth International Conference on Learning Representations , year=

MetaGPT: Meta programming for a multi-agent collaborative framework , author=. The Twelfth International Conference on Learning Representations , year=
[15]

First Conference on Language Modeling , year=

AutoGen: Enabling next-gen LLM applications via multi-agent conversations , author=. First Conference on Language Modeling , year=
[16]

Forty-first international conference on machine learning , year=

Improving factuality and reasoning in language models through multiagent debate , author=. Forty-first international conference on machine learning , year=
[17]

Forty-first International Conference on Machine Learning , year=

Gptswarm: Language agents as optimizable graphs , author=. Forty-first International Conference on Machine Learning , year=
[18]

Comas: Co-evolving multi-agent systems via interaction rewards.CoRR, abs/2510.08529, 2025

CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards , author=. arXiv preprint arXiv:2510.08529 , year=

work page arXiv
[19]

Advances in Neural Information Processing Systems , year=

WebShop: Towards scalable real-world web interaction with grounded language agents , author=. Advances in Neural Information Processing Systems , year=
[20]

Zhou, Shuyan and Xu, Frank F and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Bisk, Yonatan and Fried, Daniel and Alon, Uri and others , booktitle=
[21]

Xie, Tianbao and Zhang, Danyang and Chen, Jixuan and Li, Xiaochuan and Zhao, Siheng and Cao, Ruisheng and Hua, Toh Jing and Cheng, Zhoujun and Shi, Dongchan and Liu, Fangyu and Yu, Zirui and Ma, Baian and Li, Guanghui and Uddin, Arman and Yu, Hao and Zhang, Ruisheng and Chen, Xuanzhe and Zhou, Victor and Shi, Shuyan and Zhu, Minjie and Fan, Leyla and Li, ...
[22]

International Conference on Learning Representations , year=

ALFWorld: Aligning text and embodied environments for interactive learning , author=. International Conference on Learning Representations , year=
[23]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Voyager: An open-ended embodied agent with large language models , author=. arXiv preprint arXiv:2305.16291 , year=

work page internal anchor Pith review arXiv
[24]

RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation , author=. arXiv preprint arXiv:2506.18088 , year=

work page internal anchor Pith review arXiv
[25]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , year=

ScienceWorld: Is your agent smarter than a 5th grader? , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , year=

2022
[26]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

The ai scientist: Towards fully automated open-ended scientific discovery , author=. arXiv preprint arXiv:2408.06292 , year=

work page internal anchor Pith review arXiv
[27]

Probing scientific general intelligence of llms with scientist- aligned workflows.arXiv preprint arXiv:2512.16969, 2025

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows , author=. arXiv preprint arXiv:2512.16969 , year=

work page arXiv
[28]

Fine-Tuning Language Models from Human Preferences

Fine-tuning language models from human preferences , author=. arXiv preprint arXiv:1909.08593 , year=

work page internal anchor Pith review arXiv 1909
[29]

Advances in Neural Information Processing Systems , year=

Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , year=
[30]

Proximal Policy Optimization Algorithms

Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=

work page internal anchor Pith review arXiv
[31]

Advances in Neural Information Processing Systems , year=

Direct preference optimization: Your language model is secretly a reward model , author=. Advances in Neural Information Processing Systems , year=
[32]

International Conference on Artificial Intelligence and Statistics , year=

A general theoretical paradigm to understand learning from human preferences , author=. International Conference on Artificial Intelligence and Statistics , year=
[33]

KTO: Model Alignment as Prospect Theoretic Optimization

Kto: Model alignment as prospect theoretic optimization , author=. arXiv preprint arXiv:2402.01306 , year=

work page internal anchor Pith review arXiv
[34]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year=

Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year=
[35]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

REINFORCE++: An efficient RLHF algorithm with robustness to both prompt and reward models , author=. arXiv preprint arXiv:2501.03262 , year=

work page internal anchor Pith review arXiv
[36]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

work page internal anchor Pith review arXiv
[37]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review arXiv
[38]

Understanding R1-Zero-Like Training: A Critical Perspective

Understanding r1-zero-like training: A critical perspective , author=. arXiv preprint arXiv:2503.20783 , year=

work page internal anchor Pith review arXiv
[39]

Cppo: Accelerating the training of group relative policy optimization-based reasoning models.arXiv preprint arXiv:2503.22342,

Cppo: Accelerating the training of group relative policy optimization-based reasoning models , author=. arXiv preprint arXiv:2503.22342 , year=

work page arXiv
[40]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Dapo: An open-source llm reinforcement learning system at scale , author=. arXiv preprint arXiv:2503.14476 , year=

work page internal anchor Pith review arXiv
[41]

Group Sequence Policy Optimization

Group sequence policy optimization , author=. arXiv preprint arXiv:2507.18071 , year=

work page internal anchor Pith review arXiv
[42]

Self-Rewarding Language Models

Self-Rewarding Language Models , author=. arXiv preprint arXiv:2401.10020 , year=

work page internal anchor Pith review arXiv
[43]

arXiv preprint arXiv:2505.22660

Maximizing confidence alone improves reasoning , author=. arXiv preprint arXiv:2505.22660 , year=

work page arXiv
[44]

Learning to reason without external rewards.arXiv preprint arXiv:2505.19590, 2025

Learning to reason without external rewards , author=. arXiv preprint arXiv:2505.19590 , year=

work page arXiv
[45]

Right question is already half the answer: Fully unsupervised llm reasoning incentivization.arXiv preprint arXiv:2504.05812, 2025

Right question is already half the answer: Fully unsupervised llm reasoning incentivization , author=. arXiv preprint arXiv:2504.05812 , year=

work page arXiv
[46]

TTRL: Test-Time Reinforcement Learning

TTRL: Test-time reinforcement learning , author=. arXiv preprint arXiv:2504.16084 , year=

work page Pith review arXiv
[47]

Advances in neural information processing systems , year=

Fine-tuning large vision-language models as decision-making agents via reinforcement learning , author=. Advances in neural information processing systems , year=
[48]

Agent lightning: Train any ai agents with reinforcement learning,

Agent lightning: Train any ai agents with reinforcement learning , author=. arXiv preprint arXiv:2508.03680 , year=

work page arXiv
[49]

rLLM: A Framework for Post-Training Language Agents , author=
[50]

Agent q: Advanced reasoning and learning for autonomous ai agents.arXiv preprint arXiv:2408.07199, 2024

Agent q: Advanced reasoning and learning for autonomous ai agents , author=. arXiv preprint arXiv:2408.07199 , year=

work page arXiv
[51]

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning , author=. arXiv preprint arXiv:2504.20073 , year=

work page internal anchor Pith review arXiv
[52]

arXiv preprint arXiv:2505.03792 , year=

Towards efficient online tuning of vlm agents via counterfactual soft reinforcement learning , author=. arXiv preprint arXiv:2505.03792 , year=

work page arXiv
[53]

arXiv preprint arXiv:2402.19446 , year=

Archer: Training language model agents via hierarchical multi-turn rl , author=. arXiv preprint arXiv:2402.19446 , year=

work page arXiv
[54]

The Fourteenth International Conference on Learning Representations , year=

Tree search for LLM agent reinforcement learning , author=. The Fourteenth International Conference on Learning Representations , year=
[55]

The Fourteenth International Conference on Learning Representations , year=

Agentic reinforcement learning with implicit step rewards , author=. The Fourteenth International Conference on Learning Representations , year=
[56]

Group-in-Group Policy Optimization for LLM Agent Training

Group-in-group policy optimization for llm agent training , author=. arXiv preprint arXiv:2505.10978 , year=

work page internal anchor Pith review arXiv
[57]

Hierarchy-of-groups policy optimization for long-horizon agentic tasks.arXiv preprint arXiv:2602.22817,

Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks , author=. arXiv preprint arXiv:2602.22817 , year=

work page arXiv
[58]

HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents , author=. arXiv preprint arXiv:2602.16165 , year=

work page internal anchor Pith review arXiv
[59]

Demystifying long chain-of-thought reasoning in llms, 2025

Demystifying long chain-of-thought reasoning in llms , author=. arXiv preprint arXiv:2502.03373 , year=

work page arXiv
[60]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

The entropy mechanism of reinforcement learning for reasoning language models , author=. arXiv preprint arXiv:2505.22617 , year=

work page internal anchor Pith review arXiv
[61]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Minimax-m1: Scaling test-time compute efficiently with lightning attention , author=. arXiv preprint arXiv:2506.13585 , year=

work page internal anchor Pith review arXiv
[62]

Sentence-

Reimers, Nils and Gurevych, Iryna , booktitle=. Sentence-
[63]

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P and Zhang, Hao and Gonzalez, Joseph E and Stoica, Ion , booktitle=. Judging
[64]

The Innovation , year=

A survey on llm-as-a-judge , author=. The Innovation , year=
[65]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

From generation to judgment: Opportunities and challenges of llm-as-a-judge , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[66]

Advances in Neural Information Processing Systems , year=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in Neural Information Processing Systems , year=
[67]

GPT‑5.1: A smarter, more conversational ChatGPT , author=
[68]

Introducing Claude 4 , author=
[69]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review arXiv
[70]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author=. arXiv preprint arXiv:2412.15115 , year=

work page internal anchor Pith review arXiv