Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models
Pith reviewed 2026-05-10 04:28 UTC · model grok-4.3
The pith
Small language models learn a policy to request adaptive help from large ones during multi-step reasoning and outperform static approaches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An SLM learns to proactively decide when and how to request help from an LLM during multi-step reasoning, while the LLM provides adaptive feedback instead of acting as a passive tool. Collaboration strategies are shaped by SLM and LLM capabilities as well as efficiency and privacy constraints, with a scaling effect in which stronger SLMs become more self-reliant and stronger LLMs enable fewer but more informative interactions. The learned dynamic strategies significantly outperform static pipelines and standalone inference, and transfer robustly to unseen LLMs.
What carries the argument
A learned policy inside the SLM that decides, within a dynamic collaboration framework, when and how to request adaptive feedback from the LLM.
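The collaboration loop can be sketched in miniature. This is an illustrative stand-in, not the paper's implementation: `slm_step`, `help_policy`, and `llm_feedback` are hypothetical stubs, and the toy gating rule stands in for the trained policy head.

```python
# Hypothetical sketch of the dynamic collaboration loop: the SLM reasons
# locally, and a learned policy decides at each step whether to request
# adaptive feedback from the LLM. All three components are stubs.

def slm_step(state):
    # The SLM proposes the next reasoning step locally (stubbed).
    return state + ["slm-step"]

def help_policy(state):
    # Stand-in for the learned policy head: returns "continue" or
    # "request_feedback". A toy rule replaces the trained decision.
    return "request_feedback" if len(state) % 3 == 2 else "continue"

def llm_feedback(state):
    # The LLM returns adaptive feedback on the partial reasoning trace,
    # rather than being invoked as a passive tool on the raw input.
    return f"feedback-on-{len(state)}-steps"

def collaborate(max_steps=6):
    state, llm_calls = [], 0
    for _ in range(max_steps):
        state = slm_step(state)
        if help_policy(state) == "request_feedback":
            state.append(llm_feedback(state))
            llm_calls += 1
    return state, llm_calls
```

The point of the structure is that the expensive `llm_feedback` call is gated by the SLM's own policy, so a stronger SLM (one whose policy fires less often) directly translates into fewer LLM invocations.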
If this is right
- Stronger SLMs become more self-reliant and request LLM help less frequently.
- Stronger LLMs produce fewer but more informative interactions.
- Dynamic strategies outperform both static pipelines and standalone inference.
- Learned strategies transfer effectively to LLMs not seen during training.
Where Pith is reading between the lines
- Hybrid systems could reduce overall inference cost by limiting expensive LLM calls to only necessary moments.
- Most computation could stay local on small models, supporting privacy-sensitive applications.
- Similar learned help-seeking policies might generalize to other pairs of resource-light and resource-heavy models beyond language tasks.
Load-bearing premise
An SLM can be trained to reliably learn an effective policy for when and how to request help such that overall system performance improves without hidden costs in training or inference overhead.
What would settle it
A controlled test in which the trained dynamic policy is run on new multi-step reasoning tasks or previously unseen LLMs and fails to exceed the accuracy or efficiency of either standalone models or fixed static pipelines.
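The settling experiment reduces to a per-backend comparison. The sketch below is an assumed harness, not the paper's evaluation code: the score lists are hypothetical per-backend accuracies for the dynamic policy and the two baselines.

```python
# Hedged sketch of the falsification check: the claim survives only if
# the frozen dynamic policy beats BOTH the static pipeline and the
# standalone model on every unseen LLM backend. Scores are hypothetical.

def transfer_check(dynamic_scores, static_scores, standalone_scores):
    """Return True if the dynamic policy strictly exceeds both baselines
    on each unseen backend, i.e. the failure mode above did not occur."""
    return all(
        d > max(s, a)
        for d, s, a in zip(dynamic_scores, static_scores, standalone_scores)
    )
```

A single backend where the dynamic policy ties or loses to either baseline is enough to falsify the transfer claim under this reading.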
Original abstract
Large language models (LLMs) offer strong capabilities but raise cost and privacy concerns, whereas small language models (SLMs) facilitate efficient and private local inference yet suffer from limited capacity. To synergize the complementary strengths, we introduce a dynamic collaboration framework, where an SLM learns to proactively decide how to request an LLM during multi-step reasoning, while the LLM provides adaptive feedback instead of acting as a passive tool. We further systematically investigate how collaboration strategies are shaped by SLM and LLM capabilities as well as efficiency and privacy constraints. Evaluation results reveal a distinct scaling effect: stronger SLMs become more self-reliant, while stronger LLMs enable fewer and more informative interactions. In addition, the learned dynamic collaboration strategies significantly outperform static pipelines and standalone inference, and transfer robustly to unseen LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a dynamic collaboration framework where an SLM learns a policy to decide when and how to request adaptive feedback from an LLM during multi-step reasoning, rather than using the LLM as a passive tool. It systematically examines how these strategies depend on SLM/LLM capabilities, efficiency, and privacy constraints. Results highlight a scaling effect (stronger SLMs become more self-reliant; stronger LLMs yield fewer, more informative interactions), with the learned policies outperforming static pipelines and standalone inference while transferring robustly to unseen LLMs.
Significance. If the results hold under the reported controls, this work is significant for practical hybrid LLM/SLM systems that balance performance with cost and privacy. The scaling trends and cross-LLM transfer experiments provide actionable insights beyond single-model or static routing approaches. Explicit controls for model sizes, ablations, and reproducible policy-training details strengthen the contribution.
Minor comments (3)
- [§4] §4 (Evaluation setup): the description of the reward signal for policy learning should explicitly state whether it incorporates only task accuracy or also penalizes interaction count and latency; this detail is load-bearing for interpreting the efficiency claims.
- [Figure 3] Figure 3 (scaling trends): the x-axis labels for SLM/LLM sizes are not uniformly formatted across panels, making direct comparison of the self-reliance trend difficult.
- [Table 2] Table 2 (transfer results): report the number of unseen LLMs tested and the variance across runs; the current aggregate numbers leave open whether transfer holds for all model families.
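The first comment's concern can be made concrete. The reward below is a sketch under stated assumptions: the paper does not report this functional form or these coefficients; `call_penalty` and `latency_penalty` are hypothetical knobs for trading accuracy against interaction cost.

```python
# Illustrative reward shaping for the help-seeking policy, ASSUMING the
# signal combines task accuracy with penalties on interaction count and
# latency. Form and coefficients are assumptions, not reported values.

def shaped_reward(correct, num_llm_calls, latency_s,
                  call_penalty=0.05, latency_penalty=0.01):
    accuracy_term = 1.0 if correct else 0.0
    return (accuracy_term
            - call_penalty * num_llm_calls
            - latency_penalty * latency_s)
```

Whether the efficiency gains are an explicit optimization target (as here) or an emergent byproduct of an accuracy-only reward changes how the efficiency claims should be read, which is why the referee flags the detail as load-bearing.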
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work, the recognition of its significance for hybrid SLM-LLM systems, and the recommendation for minor revision. We are pleased that the scaling behaviors, cross-LLM transfer, and advantages over static pipelines are highlighted as actionable insights.
Circularity Check
No significant circularity detected
Full rationale
The manuscript describes an empirical policy-learning setup in which an SLM is trained to decide when and how to query an LLM for feedback during multi-step reasoning. All central claims rest on experimental comparisons to static baselines, standalone inference, and transfer tests across LLMs, with reported controls for model size, scaling trends, and ablations. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the headline result to its own inputs appear in the work. The evaluation metrics and training procedure are externally falsifiable on held-out tasks and models, so the argument does not rest on its own outputs.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: An SLM can be trained via standard learning methods to produce a policy that improves joint performance when paired with an LLM.