arxiv: 2604.06205 · v1 · submitted 2026-03-15 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation

Shutong Zhang , Dylan Zhou , Yinxiao Liu , Yang Yang , Huiwen Luo , Wenfei Zou

Authors on Pith no claims yet

Pith reviewed 2026-05-15 10:55 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords content moderationchain-of-thoughtsmall language modelstool augmentationmultimodal reasoningsafety classification

0 comments

The pith

A fine-tuned small language model learns to use external tools selectively for better content safety decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Tool-MCoT, which fine-tunes a small language model on chain-of-thought data that includes external tool calls generated by a larger LLM. This training lets the small model improve its reasoning over multimodal user content for safety classification. The model gains significant accuracy while learning to invoke tools only on difficult cases rather than every input. A reader would care because it points to a practical route for accurate moderation at lower compute cost and latency than running full-scale models on every post.

Core claim

Training a small language model on tool-augmented multimodal chain-of-thought data generated by a larger LLM enables the small model to effectively use those tools for improved reasoning in content safety moderation and to call the tools selectively, achieving a balance between accuracy and inference efficiency.

What carries the argument

Tool-augmented multimodal chain-of-thought, which embeds external tool calls into reasoning traces so the small model can follow guided steps for safety judgments on text, images, and other media.

If this is right

Small language models can reach higher moderation accuracy on complex multimodal content without matching the full compute of large models.
Selective tool calling keeps average inference cost low while preserving most of the accuracy improvement.
Reasoning strategies can transfer from large models to small ones through generated tool-use training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same training pattern may transfer to other tasks that mix reasoning with external lookups, such as policy compliance checks.
Production moderation pipelines could route simple cases to the small model and only escalate complex ones to tool calls or larger systems.
One could measure whether the small model invents tool sequences the teacher model never demonstrated.

Load-bearing premise

The tool-augmented chain-of-thought examples produced by the larger LLM contain enough high-quality, unbiased patterns for the small model to learn correct and selective tool use.

What would settle it

A controlled test showing that the fine-tuned small model matches or underperforms its untuned base version on held-out multimodal safety cases or that it invokes tools on nearly every input instead of only when needed.

Figures

Figures reproduced from arXiv: 2604.06205 by Dylan Zhou, Huiwen Luo, Shutong Zhang, Wenfei Zou, Yang Yang, Yinxiao Liu.

**Figure 1.** Figure 1: An overview of our two-stage pipeline. We first generate tool reasoning data using a LLM as a teacher model, and then use the generated reasoning data to fine-tune the SLM. During inference, the fine-tuned SLM utilizes the tool framework to conduct the content safety moderation task. Agentic Tool Framework. The Agentic Tool Framework is a collection of specialized tools designed to augment the capabilities… view at source ↗

**Figure 2.** Figure 2: Multi-turn tool calling conversation. Model select necessary tools for harder samples. In the selective tool use setting, the model learns to determine when and which tools to call. It is trained to bypass tool use for simple samples. For more complex samples, the model is fine-tuned to call the OCR tool for images with overlaid text, use the object detection tool for images with complicated layouts (i.e.… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison between model outputs with and without tool reasoning. The tool reasoning model (bottom) correctly identifies the nuanced context in both cases, where the standard model fails. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

The growth of online platforms and user content requires strong content moderation systems that can handle complex inputs from various media types. While large language models (LLMs) are effective, their high computational cost and latency present significant challenges for scalable deployment. To address this, we introduce Tool-MCoT, a small language model (SLM) fine-tuned for content safety moderation leveraging external framework. By training our model on tool-augmented chain-of-thought data generated by LLM, we demonstrate that the SLM can learn to effectively utilize these tools to improve its reasoning and decision-making. Our experiments show that the fine-tuned SLM achieves significant performance gains. Furthermore, we show that the model can learn to use these tools selectively, achieving a balance between moderation accuracy and inference efficiency by calling tools only when necessary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Tool-MCoT distills selective tool use from a large LLM into a small model for content moderation, but the training data circularity and missing metrics make the gains hard to verify.

read the letter

Tool-MCoT fine-tunes a small model on LLM-generated tool-augmented chain-of-thought data for multimodal content safety moderation, and it reports that the model learns to call tools selectively for better efficiency. That's the core contribution here. The new part is applying this to content moderation with multiple media types, where the small model gets to use tools only when the reasoning needs them. It does well at framing a practical problem: big models are too slow for scale, so distill the reasoning into something lighter that still has access to tools. The soft spots come from how the data is made. The training trajectories are created by a larger LLM, so the decisions about tool necessity are whatever that model thought was right. Without an independent way to check those decisions, like human labels on when a tool was actually required, the selectivity could just be copying the teacher's habits rather than discovering a good policy. This circularity is a real concern for the efficiency claims. The abstract talks about significant gains and selective use but doesn't include any numbers, baselines, or error analysis, which leaves the claims hard to assess from the summary alone. The math and setup look standard for this kind of work, no obvious errors in the approach described. The citation pattern follows the usual CoT and tool-use papers without missing key references. This is for engineers and researchers focused on deploying moderation systems on limited hardware. A reader who wants a concrete example of tool distillation for safety tasks could pick up some implementation ideas from it. I would send this to peer review. The idea is grounded enough and the application is relevant, so it merits a full review to sort out the evaluation details and data quality.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Tool-MCoT, a small language model (SLM) fine-tuned on tool-augmented multimodal chain-of-thought data generated by a larger LLM for content safety moderation. It claims that the resulting SLM achieves significant performance gains over unspecified baselines while learning to invoke external tools selectively, thereby balancing moderation accuracy against inference efficiency.

Significance. If the central claims survive scrutiny of the data-generation pipeline, the work would offer a practical route to deploy capable moderation systems on resource-limited hardware. The selective-tool-use result, if shown to exceed simple imitation of the teacher policy, would be a modest but useful contribution to efficient tool-augmented reasoning for SLMs.

major comments (3)

[Abstract] Abstract: the assertion of 'significant performance gains' and 'selective tool use' is unsupported by any reported metrics, baselines, dataset statistics, or error analysis. Without these numbers the central empirical claim cannot be evaluated.
[Section 3] Section 3 (Training Data Generation): the supervision consists entirely of LLM-generated tool-augmented CoT trajectories. No independent human annotation or oracle is described for labeling tool necessity. This circularity directly undermines the claim that the SLM has learned genuine selective invocation rather than reproducing the teacher's policy.
[Section 4] Section 4 (Experiments): the manuscript provides no ablation on tool-necessity labeling quality, no comparison against a non-tool baseline or a randomly tool-calling control, and no analysis of cases where the teacher over- or under-calls tools. These omissions leave the efficiency-accuracy trade-off claim unsupported.

minor comments (2)

[Abstract] The abstract would be strengthened by the inclusion of at least one key quantitative result (e.g., F1 or accuracy delta) and the primary baseline.
[Section 2] Notation for tool invocation format and multimodal input encoding should be introduced with a concrete example in the first figure or table.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript lacks the quantitative support, baselines, and analyses needed to substantiate the central claims. We will undertake a major revision to add the requested metrics, ablations, comparisons, and discussions. Our point-by-point responses follow.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion of 'significant performance gains' and 'selective tool use' is unsupported by any reported metrics, baselines, dataset statistics, or error analysis. Without these numbers the central empirical claim cannot be evaluated.

Authors: We agree that the abstract currently states results without supporting numbers. In the revision we will rewrite the abstract to report concrete metrics (accuracy deltas versus baselines, tool invocation rate as evidence of selectivity, dataset size and composition) and will reference the new error analysis and efficiency measurements that will be added to Section 4. revision: yes
Referee: [Section 3] Section 3 (Training Data Generation): the supervision consists entirely of LLM-generated tool-augmented CoT trajectories. No independent human annotation or oracle is described for labeling tool necessity. This circularity directly undermines the claim that the SLM has learned genuine selective invocation rather than reproducing the teacher's policy.

Authors: The concern is valid: because labels come solely from the teacher LLM, the SLM could simply be imitating the teacher policy. We will expand Section 3 with a full description of the generation pipeline, including any filtering or quality checks applied to the trajectories. We will also add a new subsection in Section 4 that directly compares the SLM's tool-calling decisions against the teacher's on a held-out set, quantifying agreement and disagreement to show where the student deviates from or improves upon the teacher. We acknowledge that full human annotation of tool necessity is absent and will explicitly discuss this limitation and its implications for the selectivity claim. revision: partial
Referee: [Section 4] Section 4 (Experiments): the manuscript provides no ablation on tool-necessity labeling quality, no comparison against a non-tool baseline or a randomly tool-calling control, and no analysis of cases where the teacher over- or under-calls tools. These omissions leave the efficiency-accuracy trade-off claim unsupported.

Authors: We accept this criticism. The experiments section will be substantially expanded to include: (i) a non-tool fine-tuned SLM baseline, (ii) a random tool-calling control, (iii) an ablation varying the quality of tool-necessity labels (e.g., by using different teacher prompts or models), and (iv) a dedicated analysis of teacher over- and under-calling cases together with how the SLM behaves on those instances. All results will be accompanied by explicit accuracy numbers, tool-call frequency, and latency measurements to ground the efficiency-accuracy trade-off claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an empirical fine-tuning procedure in which an SLM is trained on tool-augmented CoT trajectories produced by an external LLM. Reported gains in moderation accuracy and selective tool calling are presented as experimental outcomes measured on held-out data, not as quantities derived by construction from the training inputs themselves. No equations, self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described methodology. The setup is a standard distillation-style experiment whose success remains falsifiable by independent test sets and does not reduce to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted. The central claim implicitly rests on the unstated assumption that LLM-generated tool-augmented reasoning traces transfer effectively to smaller models.

pith-pipeline@v0.9.0 · 5448 in / 1215 out tokens · 32667 ms · 2026-05-15T10:55:12.526990+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

two-stage pipeline... LLM as teacher... fine-tune SLM... selective tool use on harder samples
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Jcost not referenced; no recognition cost or golden-ratio identities

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 1 internal anchor

[1]

Vlm as policy: Common-law content moderation framework for short video platform, 2025

Xingyu Lu, Tianke Zhang, Chang Meng, Xiaobei Wang, Jinpeng Wang, YiFan Zhang, Shisong Tang, Changyi Liu, Haojie Ding, Kaiyu Jiang, Kaiyu Tang, Bin Wen, Hai-Tao Zheng, Fan Yang, Tingting Gao, Di Zhang, and Kun Gai. Vlm as policy: Common-law content moderation framework for short video platform, 2025

work page 2025
[2]

Unsafebench: Benchmarking image safety classifiers on real-world and ai-generated images, 2024

Yiting Qu, Xinyue Shen, Yixin Wu, Michael Backes, Savvas Zannettou, and Yang Zhang. Unsafebench: Benchmarking image safety classifiers on real-world and ai-generated images, 2024

work page 2024
[3]

Exploring hate speech detection in multimodal publications, 2019

Raul Gomez, Jaume Gibert, Lluis Gomez, and Dimosthenis Karatzas. Exploring hate speech detection in multimodal publications, 2019

work page 2019
[4]

Rethinking Multimodal Content Moderation from an Asymmetric Angle with Mixed-modality

Jialin Yuan, Ye Yu, Gaurav Mittal, Matthew Hall, Sandra Sajeev, and Mei Chen. Rethinking Multimodal Content Moderation from an Asymmetric Angle with Mixed-modality . In2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2024

work page 2024
[5]

Visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023

work page 2023
[6]

A survey on bridging vlms and synthetic data.Authorea Preprints, 2025

Mohammad Ghiasvand Mohammadkhani, Saeedeh Momtazi, and Hamid Beigy. A survey on bridging vlms and synthetic data.Authorea Preprints, 2025

work page 2025
[7]

Vision-language models for vision tasks: A survey, 2024

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey, 2024

work page 2024
[8]

OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. Octo- tools: An agentic framework with extensible tools for complex reasoning.arXiv preprint arXiv:2502.11271, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Vipergpt: Visual inference via python execution for reasoning.Proceedings of IEEE International Conference on Computer Vision (ICCV), 2023

Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning.Proceedings of IEEE International Conference on Computer Vision (ICCV), 2023

work page 2023
[10]

Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023

work page 2023
[11]

Chain-of-thought prompting elicits reasoning in large language models, 2023

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023

work page 2023
[12]

Small llms are weak tool learners: A multi-llm agent, 2024

Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. Small llms are weak tool learners: A multi-llm agent, 2024

work page 2024
[13]

React: Synergizing reasoning and acting in language models, 2023

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023

work page 2023
[14]

Small models struggle to learn from strong reasoners

Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ra- masubramanian, and Radha Poovendran. Small models struggle to learn from strong reasoners. arXiv preprint arXiv:2502.12143, 2025

work page arXiv 2025
[15]

Gemini: A family of highly capable multimodal models, 2025

Gemini Team. Gemini: A family of highly capable multimodal models, 2025

work page 2025
[16]

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning, 2022

work page 2022
[17]

Gemma 3 technical report, 2025

Gemma Team. Gemma 3 technical report, 2025

work page 2025
[18]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021

work page 2021
[19]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. 6

work page 2024
[20]

Laion-5b: An open large-scale dataset for training next generation image-text models, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models, 2022

work page 2022
[21]

Prompt stealing attacks against text-to-image generation models, 2024

Xinyue Shen, Yiting Qu, Michael Backes, and Yang Zhang. Prompt stealing attacks against text-to-image generation models, 2024

work page 2024
[22]

gradient descent

Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search, 2023

work page 2023
[23]

Decoupled weight decay regularization, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. A Training Setup A.1 Hardware Selection All experiments were conducted on a cluster of eight NVIDIA H100 (80GB) GPUs. Specifically, LoRA fine-tuning was performed using Distributed Data Parallel (DDP), while the GRPO training utilized DeepSpeed ZeRO Stage 3 to manage the increa...

work page 2019