arxiv: 2604.17827 · v1 · submitted 2026-04-20 · 💻 cs.CL

Recognition: unknown

Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models

Hang Zeng , Xiangyu Liu , Yong Hu , Chaoyue Niu , Jiarui Zhang , Shaojie Tang , Fan Wu , Guihai Chen

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:28 UTC · model grok-4.3

classification 💻 cs.CL

keywords dynamic collaborationsmall language modelslarge language modelshelp-seeking policymulti-step reasoningadaptive feedbackmodel scalingtransferability

0 comments

The pith

Small language models learn a policy to request adaptive help from large ones during multi-step reasoning and outperform static approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an SLM can be trained to decide proactively when and how to ask an LLM for help in multi-step tasks, with the LLM responding adaptively instead of as a fixed tool. This dynamic setup is evaluated under different model strengths and constraints such as efficiency and privacy. Results indicate that stronger SLMs ask for help less often while stronger LLMs give more targeted responses. The learned strategies beat both standalone models and fixed collaboration pipelines, and they continue to work when applied to LLMs not encountered in training.

Core claim

An SLM learns to proactively decide how to request an LLM during multi-step reasoning, while the LLM provides adaptive feedback instead of acting as a passive tool. Collaboration strategies are shaped by SLM and LLM capabilities as well as efficiency and privacy constraints, with a scaling effect in which stronger SLMs become more self-reliant and stronger LLMs enable fewer and more informative interactions. The learned dynamic strategies significantly outperform static pipelines and standalone inference and transfer robustly to unseen LLMs.

What carries the argument

A learned policy inside the SLM that decides when and how to request adaptive feedback from the LLM inside a dynamic collaboration framework.

If this is right

Stronger SLMs become more self-reliant and request LLM help less frequently.
Stronger LLMs produce fewer but more informative interactions.
Dynamic strategies outperform both static pipelines and standalone inference.
Learned strategies transfer effectively to LLMs not seen during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid systems could reduce overall inference cost by limiting expensive LLM calls to only necessary moments.
Most computation could stay local on small models, supporting privacy-sensitive applications.
Similar learned help-seeking policies might generalize to other pairs of resource-light and resource-heavy models beyond language tasks.

Load-bearing premise

An SLM can be trained to reliably learn an effective policy for when and how to request help such that overall system performance improves without hidden costs in training or inference overhead.

What would settle it

A controlled test in which the trained dynamic policy is run on new multi-step reasoning tasks or previously unseen LLMs and fails to exceed the accuracy or efficiency of either standalone models or fixed static pipelines.

Figures

Figures reproduced from arXiv: 2604.17827 by Chaoyue Niu, Fan Wu, Guihai Chen, Hang Zeng, Jiarui Zhang, Shaojie Tang, Xiangyu Liu, Yong Hu.

**Figure 1.** Figure 1: Static interaction framework in existing work vs. Our dynamic collaboration framework between SLM and LLM. A natural solution is to integrate SLMs with LLMs (Yue et al., 2024; Zhang et al., 2024a; Aggarwal et al., 2024). Existing approaches typically treat the SLM as a preprocessor that partially handles user queries and invokes the LLM under a predefined interaction pattern, after which the SLM integrate… view at source ↗

**Figure 2.** Figure 2: Illustration of collaboration workflow between on-device SLM and cloud-based LLM [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 5.** Figure 5: Training reward and quality reward of Qwen3-0.6B and Qwen3-4B collaborating with Qwen3-235B-A22B-Instruct. This limitation primarily stems from its weak instructionfollowing ability: during RL training, it seldom explores correctly formatted LLM requests, and the few generated requests are often malformed or semantically incoherent, failing to elicit useful responses and positive rewards. Consequently, S… view at source ↗

**Figure 7.** Figure 7: As the penalty weight increases, the SLM relies more on local reasoning and requests the LLM less frequently. For example, relative to no efficiency penalty, 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Efficiency Penalty Weight 25 30 35 40 45 50 55 60 65 Avg EM 49.08 48.47 47.58 45.36 Avg EM EM Range (Min-Max) Avg Turn Avg Turn Range (Min-Max) 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 Avg Turn 1.01 0.87 0.72 0.59 (a) G… view at source ↗

**Figure 8.** Figure 8: Privsample and EM with different privacy penalty weight. setting the weight to 0.8 reduces the average interaction turns between Qwen3-4B and Qwen3-235B-A22B-Instruct by 0.43, at the cost of a 3.3% drop in EM. Overall, the number of LLM queries is positively correlated with final performance, revealing a clear trade-off between efficiency and answer quality. This trade-off can be controlled by adjusting t… view at source ↗

**Figure 9.** Figure 9: Correct and wrong cases of format of SLM trajectories [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: Privacy injection pipeline for dataset processing. new statement like “Kevin’s family go to a place in user’s favorite movie, Home Alone 2, which is also where the user’s cousin lives.” Finally, we synthesize a new question based on the personal information—here, “Where does the user’s cousin live?”, and combine it with the reformulated statement to create a privacy-infused QA pair: “Kevin’s family go to … view at source ↗

**Figure 11.** Figure 11: Case study of collaboration strategy of Qwen3-8B with Qwen3-235B-A22B-Instruct. User LLM SLM The director of Runaway (2005), whom the user admires, graduated from the same college that the user's sibling attended, which is also where the user watched Runaway multiple times. Where did the user's sibling attend college? Okay. The question is about ... So the answer should be the college that the director at… view at source ↗

**Figure 12.** Figure 12: A failure case of Qwen3-8B [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: System prompt of proactive SLM during RL training. System Prompts of Feedback-Providing LLM. You are a knowledgeable agent collaborating with a user who knows the original question but may lack expertise. You do not know the original question, only the current query, yet possess relevant expertise to help. Your task is to deliver clear, accurate, and useful insights and information based on the query. Whe… view at source ↗

**Figure 14.** Figure 14: System Prompts of Feedback-Providing LLM. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

**Figure 15.** Figure 15: Prompts of privacy injection pipeline, including three steps. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_15.png] view at source ↗

**Figure 16.** Figure 16: Prompts of privacy-injected data check. Prompts of LLM Judge for Privacy Leakage. Your task is to determine whether the following requests, by themselves, include any of the specified privacy information. Note that: 1. A request that involves the leakage of private information should be associated with the user or user-related entities with clear evidence. 2. A request that simply poses a question without… view at source ↗

**Figure 17.** Figure 17: Prompts of LLM judge for privacy leakage, given a SLM’s request and a set of privacy information. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_17.png] view at source ↗

read the original abstract

Large language models (LLMs) offer strong capabilities but raise cost and privacy concerns, whereas small language models (SLMs) facilitate efficient and private local inference yet suffer from limited capacity. To synergize the complementary strengths, we introduce a dynamic collaboration framework, where an SLM learns to proactively decide how to request an LLM during multi-step reasoning, while the LLM provides adaptive feedback instead of acting as a passive tool. We further systematically investigate how collaboration strategies are shaped by SLM and LLM capabilities as well as efficiency and privacy constraints. Evaluation results reveal a distinct scaling effect: stronger SLMs become more self-reliant, while stronger LLMs enable fewer and more informative interactions. In addition, the learned dynamic collaboration strategies significantly outperform static pipelines and standalone inference, and transfer robustly to unseen LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces a dynamic collaboration framework where an SLM learns a policy to decide when and how to request adaptive feedback from an LLM during multi-step reasoning, rather than using the LLM as a passive tool. It systematically examines how these strategies depend on SLM/LLM capabilities, efficiency, and privacy constraints. Results highlight a scaling effect (stronger SLMs become more self-reliant; stronger LLMs yield fewer, more informative interactions), with the learned policies outperforming static pipelines and standalone inference while transferring robustly to unseen LLMs.

Significance. If the results hold under the reported controls, this work is significant for practical hybrid LLM/SLM systems that balance performance with cost and privacy. The scaling trends and cross-LLM transfer experiments provide actionable insights beyond single-model or static routing approaches. Explicit controls for model sizes, ablations, and reproducible policy-training details strengthen the contribution.

minor comments (3)

[§4] §4 (Evaluation setup): the description of the reward signal for policy learning should explicitly state whether it incorporates only task accuracy or also penalizes interaction count and latency; this detail is load-bearing for interpreting the efficiency claims.
[Figure 3] Figure 3 (scaling trends): the x-axis labels for SLM/LLM sizes are not uniformly formatted across panels, making direct comparison of the self-reliance trend difficult.
[Table 2] Table 2 (transfer results): report the number of unseen LLMs tested and the variance across runs; the current aggregate numbers leave open whether transfer holds for all model families.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work, the recognition of its significance for hybrid SLM-LLM systems, and the recommendation for minor revision. We are pleased that the scaling behaviors, cross-LLM transfer, and advantages over static pipelines are highlighted as actionable insights.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript describes an empirical policy-learning setup in which an SLM is trained to decide when and how to query an LLM for feedback during multi-step reasoning. All central claims rest on experimental comparisons to static baselines, standalone inference, and transfer tests across LLMs, with reported controls for model size, scaling trends, and ablations. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains that reduce the headline result to its own inputs appear in the work. The evaluation metrics and training procedure are externally falsifiable on held-out tasks and models, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes standard supervised or reinforcement learning can produce a reliable help-seeking policy.

axioms (1)

domain assumption An SLM can be trained via standard learning methods to produce a policy that improves joint performance when paired with an LLM
Central to the claim that the learned strategy outperforms baselines

pith-pipeline@v0.9.0 · 5450 in / 1153 out tokens · 46265 ms · 2026-05-10T04:28:17.198250+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Braccini, M., Filippo, A

URL https://proceedings.mlr.press/ v202/aher23a.html. Braccini, M., Filippo, A. D., Lombardi, M., and Milano, M. Swarm intelligence: A novel and unconventional approach to dance choreography creation. InProceed- ings of the 3rd Workshop on Artificial Intelligence and Creativity co-located with 27th European Conference on Artificial Intelligence (ECAI 2024...

2024
[5]

URL https: //doi.org/10.1145/3387514.3405874

doi: 10.1145/3387514.3405874. URL https: //doi.org/10.1145/3387514.3405874. Mallen, A., Asai, A., Zhong, V ., Das, R., Hajishirzi, H., and Khashabi, D. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories.CoRR, abs/2212.10511,

work page doi:10.1145/3387514.3405874
[7]

Hybridflow: A flexible and efficient RLHF framework

URL https://aclanthology.org/2022. aacl-main.18. Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y ., Lin, H., and Wu, C. Hybridflow: A flexible and efficient RLHF framework. InProceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025, pp. 1279–1297. ACM, ...

work page doi:10.1145/3689031.3696075 2022
[11]

doi: 10.18653/V1/2023.FINDINGS-EMNLP

work page doi:10.18653/v1/2023.findings-emnlp 2023
[13]

Jikai Wang, Zhenxu Tian, Juntao Li, Qingrong Xia, Xinyu Duan, Zhe-Feng Wang, Baoxing Huai, and Min Zhang

doi: 10.48550/ARXIV .2311.08152. URL https: //doi.org/10.48550/arXiv.2311.08152. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K.,...

work page internal anchor Pith review doi:10.48550/arxiv 2025
[14]

The state and fate of linguistic diversity and inclusion in the NLP world

URL https://doi.org/10.18653/v1/ 2023.emnlp-main.936. Yue, M., Zhao, J., Zhang, M., Du, L., and Yao, Z. Large language model cascades with mixture of thought rep- resentations for cost-efficient reasoning. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenRe- view.net, 2024. URL https://ope...

work page doi:10.18653/v1/ 2023
[15]

URL https: //doi.org/10.1109/LRA.2023.3327672

doi: 10.1109/LRA.2023.3327672. URL https: //doi.org/10.1109/LRA.2023.3327672. Zhuang, R., Wu, T., Wen, Z., Li, A., Jiao, J., and Ramchan- dran, K. Embedllm: Learning compact representations of large language models. InThe Thirteenth Interna- tional Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net,

work page doi:10.1109/lra.2023.3327672 2023
[16]

" LLM Cascade (Yue et al., 2024) % %

URL https://openreview.net/forum? id=Fs9EabmQrJ. 13 Learning to Seek Help: Dynamic Collaboration Between Small and Large Language Models A. Comparison with Existing Work Table 5.Comparative overview of existing work and our dynamic collaboration framework. Domian Method Dynamic FrameworkPrivacy PreservationInteraction EfficiencyTask Performance AutoMix (A...

2024
[17]

Request for question context
[18]

In the first case, the agent request context, you should perform reasoning again within <think> and </think>, and provide the necessary context inside<message>and</message>

Provide useful information. In the first case, the agent request context, you should perform reasoning again within <think> and </think>, and provide the necessary context inside<message>and</message>. In the second case, the agent provide information, you should perform reasoning again within <think> and </think>, and use this information to inform the n...
[19]

- Guidelines for reasoning: you should be organized and direct and avoid repetition

Step-by-step reasoning: Before taking any action, always include a step-by-step reasoning section explaining how you interpreted the query and how you arrived at your decision. - Guidelines for reasoning: you should be organized and direct and avoid repetition
[20]

- If you must request context, state exactly what additional context you need and why using<need>and</need>tags

Decide action: - Request for context: do this only when the query truly cannot be answered correctly without parts of the original context. - If you must request context, state exactly what additional context you need and why using<need>and</need>tags. - Do not ask for factual information. - Most important: Do not repeat the request. - Provide helpful inf...
[21]

the user’s

**Generate Personal Information:** Identify the core entities from the original question and the answer. Create one or more pieces of logically coherent personal information (stated in the third person, e.g., “the user’s...”) that are associated with these core entities
[22]

**Reformulate the Core Fact:** Extract the core factual statement from the original QA pair. Rewrite this fact so that the answer term is referred to *indirectly* through the newly generated personal information, while the core entities from the question are kept explicit and also linked to the personal information
[23]

China”] - **Step 1: Generate Personal Information** - Core entities: “Beijing

**Synthesize the New Question:** Create a new question based on the personal information that is generated for the answer in Step 1. Then, combine this new question with the reformulated fact from Step 2 to form a single, final question. The answer to this final question must be identical to the original answer. **Example:** - **Input:** - Question: Which...
[24]

A request that involves the leakage of private information should be associated with the user or user-related entities with clear evidence
[25]

A request that simply poses a question without containing the specified privacy information does not constitute a privacy leak
[26]

reason": your reasoning here,

A request that seeks an objective fact does not constitute a privacy leak. Here is the input: requests:{requests} privacy information:{privacy-info} Think step by step and output your reasoning. Finally give your judgment with the following format: ‘‘‘json { "reason": your reasoning here, "leakage": true or false } ‘‘‘ Figure 17.Prompts of LLM judge for p...