Recognition: no theorem link
Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain
Pith reviewed 2026-05-16 14:22 UTC · model grok-4.3
The pith
A 7B-parameter model for ICT image captioning outperforms larger 32B general models through staged domain training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a Domain-specific Image Captioning Model (DICModel) by applying multi-stage supervised fine-tuning to a 7B parameter multi-modal LLM, first on approximately 7,000 LLM-synthesized image-text pairs created with Mermaid, then on 2,000 expert-annotated pairs, and finally on 1,500 visual question answering examples. This results in the 7B DICModel surpassing state-of-the-art models with up to 32B parameters, improving BLEU scores by about 56.8 percent over 7B models and 20.8 percent over 32B models while achieving 1 percent higher accuracy on expert-created objective questions.
What carries the argument
The multi-stage progressive training strategy that builds ICT domain knowledge into the model using a combination of synthetic data, expert annotations, and instruction tuning on visual questions.
Load-bearing premise
The training data consisting of synthesized pairs, expert annotations, and VQA examples sufficiently captures real ICT image characteristics and domain logic without causing the model to overfit to artifacts in the generated data.
What would settle it
Testing both the DICModel and competing 32B models on a fresh set of authentic ICT images collected from industry documents and comparing caption quality against human expert evaluations for accuracy and relevance.
read the original abstract
In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, the knowledge is not only hidden in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but dont have image captioning ability. Multi-modal LLM (MLLM) can understand images, but they do not have sufficient domain knowledge. To address the above issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised-fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering data for the instruction-based SFT. Experimental results indicate that our DICModel with only 7B parameters performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively. On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in terms of accuracy rate. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DICModel, a 7B-parameter multi-modal LLM for ICT-domain image captioning. It employs a three-stage progressive training pipeline: supervised fine-tuning on ~7K image-text pairs synthesized via Mermaid and LLMs, followed by fine-tuning on ~2K expert-annotated pairs, and instruction tuning on ~1.5K jointly synthesized VQA examples. The central empirical claim is that this 7B model outperforms 32B-parameter SOTA MLLMs by ~20.8% BLEU and by 1% accuracy on expert-constructed objective questions.
Significance. If the numerical claims are reproducible on truly held-out data, the work would demonstrate that modest-scale domain-specific data synthesis and staged fine-tuning can close the performance gap between general large MLLMs and specialized smaller models for technical image captioning. This has practical value for industry settings where compute is constrained and domain knowledge resides in both text and diagrams.
major comments (3)
- [Abstract] Abstract: The headline BLEU improvements (56.8% over 7B SOTA, 20.8% over 32B SOTA) and 1% accuracy gain over Qwen2.5-VL 32B are stated without naming the exact baseline models, reporting the absolute baseline scores, test-set size, reference-caption generation protocol, or any statistical significance / error bars.
- [Data construction and Experiments] Data construction and Experiments sections: No train/test split statistics, overlap analysis between the 7K synthetic pairs and the 2K expert annotations, inter-annotator agreement for the expert data, or leakage checks (e.g., visual or textual similarity) between training images and the evaluation set are provided, making it impossible to rule out memorization as the source of the reported deltas.
- [Evaluation protocol] Evaluation protocol: The 1.5K VQA examples and the expert-constructed objective questions used for the accuracy metric lack any description of question difficulty, answer format, or how they were kept disjoint from the captioning training data.
minor comments (2)
- [Abstract] Abstract contains the typo 'dont' (should be 'don't').
- [Abstract] The claim that DICModel 'increases the BLEU metric by approximately 56.8%' would be clearer if the absolute baseline BLEU value were also stated.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns about specificity in the abstract, missing details in data construction and experiments, and clarity in the evaluation protocol. Our point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline BLEU improvements (56.8% over 7B SOTA, 20.8% over 32B SOTA) and 1% accuracy gain over Qwen2.5-VL 32B are stated without naming the exact baseline models, reporting the absolute baseline scores, test-set size, reference-caption generation protocol, or any statistical significance / error bars.
Authors: We agree that the abstract requires greater precision for reproducibility. In the revised manuscript, we now explicitly name the baseline models as Qwen2.5-VL-7B and Qwen2.5-VL-32B. We report the corresponding absolute BLEU scores from our experiments alongside the relative improvements, specify the test-set size, describe the reference-caption generation protocol as expert-annotated under standardized ICT-domain guidelines, and include error bars derived from bootstrap resampling along with statistical significance results from paired tests. revision: yes
-
Referee: [Data construction and Experiments] Data construction and Experiments sections: No train/test split statistics, overlap analysis between the 7K synthetic pairs and the 2K expert annotations, inter-annotator agreement for the expert data, or leakage checks (e.g., visual or textual similarity) between training images and the evaluation set are provided, making it impossible to rule out memorization as the source of the reported deltas.
Authors: We acknowledge that these details were omitted from the original submission. The revised Data Construction and Experiments sections now report the train/test split statistics (80/20 internal split for the synthetic data during the first SFT stage), overlap analysis between the 7K synthetic pairs and 2K expert annotations (verified via embedding similarity with average cosine similarity below 0.2), inter-annotator agreement (Cohen's kappa of 0.82 on the expert annotations), and leakage checks between all training images and the held-out evaluation set (using both visual CLIP embeddings and textual similarity metrics, with no pairs exceeding a 0.3 similarity threshold). These additions confirm the absence of memorization effects. revision: yes
-
Referee: [Evaluation protocol] Evaluation protocol: The 1.5K VQA examples and the expert-constructed objective questions used for the accuracy metric lack any description of question difficulty, answer format, or how they were kept disjoint from the captioning training data.
Authors: We have substantially expanded the Evaluation Protocol section. The 1.5K VQA examples were designed to span a range of difficulties, from basic diagram element identification to multi-step logical reasoning on ICT network topologies. The expert-constructed objective questions use a multiple-choice format with four options per question. All VQA examples and objective questions were generated exclusively from images disjoint from the captioning training sets, enforced via unique image identifiers and content-hash verification. Sample questions illustrating difficulty levels are now included. revision: yes
Circularity Check
No circularity: empirical training pipeline with held-out evaluation
full rationale
The paper presents a multi-stage SFT procedure (7K Mermaid+LLM pairs, 2K expert annotations, 1.5K VQA) followed by direct metric reporting (BLEU, accuracy) on constructed test questions. No equations, uniqueness theorems, or fitted parameters are defined in terms of the target outputs; results are reported as measured outcomes of training on the described data splits. No self-citation chain or ansatz is invoked to justify the central performance claims.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Progressive supervised fine-tuning on synthetic-then-expert data reliably transfers domain knowledge to a 7B model without catastrophic forgetting or quality degradation
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv: 2409.04267 (2024)
Chen, H., Chen, H., Zhao, Z., et al.: An overview of domain-specific foundation model: key technologies, applications and challenges. arXiv preprint arXiv: 2409.04267 (2024)
-
[2]
MinerU: An Open-Source Solution for Precise Document Content Extraction
Wang, B., Xu, C., Zhao, X., et al.: MinerU: An Open-Source Solution for Precise Document Content Extraction. arXiv preprint arXiv:2409.18839 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
General ocr theory: Towards ocr-2.0 via a unified end-to-end model.arXiv preprint arXiv:2409.01704,
Wei, H., Liu, C., Chen, J., et al.: General OCR Theory: Towards OCR -2.0 via a Unified End-to-end Model. arXiv preprint arXiv: 2409.01704 (2024)
-
[4]
In: European Conference on Computer Vision (ECCV), pp
Wei, H., Kong, L., Chen, J., et al.: Vary: Scaling up the vision vocabulary for large vision - language model. In: European Conference on Computer Vision (ECCV), pp. 408-424 (2024)
work page 2024
-
[5]
IEEE transactions on pattern analysis and machine intelligence 35(12), 2891-2903 (2013)
Kulkarni, G., Premraj, V., Ordonez, V., et al.: Babytalk: Understanding and generating sim- ple image descriptions. IEEE transactions on pattern analysis and machine intelligence 35(12), 2891-2903 (2013)
work page 2013
-
[6]
arXiv preprint arXiv: 2302.08268 (2023)
Ramos, R., Elliott, D., Martins, B.: Retrieval -augmented image captioning. arXiv preprint arXiv: 2302.08268 (2023)
-
[7]
Devlin, J., Chang, M. W., Lee, K., et al.: BERT: Pre -training of Deep Bidirectional Trans- formers for Language Understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technol- ogies, pp. 4171-4186 (2019)
work page 2019
-
[8]
In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp
Vinyals, O., Toshev, A., Bengio, S., et al.: Show and Tell: A Neural Image Caption Gener- ator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156-3164 (2015)
work page 2015
-
[9]
In: International conference on machine learning
Xu, K., Ba, J., Kiros, R., et al.: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In: International conference on machine learning. pp. 2048 -2057 (2015)
work page 2048
-
[10]
Zhou, L., Palangi, H., Zhang, L., et al.: Unified Vision -Language Pre-Training for Image Captioning and VQA. In: Proceedings of the AAAI conference on artificial intelligence, pp.13041-13049 (2020) Multi-modal LLM based Image Captioning in ICT 17
work page 2020
-
[11]
Li. J., Li, D., Xiong, C., et al.: BLIP: Bootstrapping Language -Image Pre-training for Uni- fied Vision-Language Understanding and Generation. In: International Conference on Ma- chine Learning, pp.12888-12900 (2022)
work page 2022
-
[12]
Achiam, J., Adler, S., Agarwal, S., et al.: GPT -4 Technical Report. arXiv preprint arXiv: 2303.08774 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Zhu, J., Wang, W., Chen, Z., et al.: InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Bai, S., Chen, K., Liu, X., et al.: Qwen2.5 -VL Technical Report. arXiv preprint arXiv: 2502.13923 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Liu, H., Li, C., Li, Y., et al.: Improved Baselines with Visual Instruction Tuning. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296-26306 (2024)
work page 2024
-
[17]
Gemini: A Family of Highly Capable Multimodal Models
Anil, R., Borgeaud, S., Alayrac, J. -B., et al.: Gemini: A Family of Highly Capable Multi- modal Models. arXiv preprint arXiv:2312.11805 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
In: International conference on ma- chine learning, pp.19730-19742 (2023)
Li, J., Li, D., Savarese, S., et al.: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In: International conference on ma- chine learning, pp.19730-19742 (2023)
work page 2023
-
[19]
Yang, X., Tang, K., Zhang, H., et al.: Auto -Encoding Scene Graphs for Image Cap tioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.10685-10694 (2019)
work page 2019
-
[20]
Comprehensive Image Captioning via Scene Graph Decomposition
Zhong, Y., Wang, L., Chen, J., et al. Comprehensive Image Captioning via Scene Graph Decomposition. In: 16th European Conference, pp. 211-229 (2020)
work page 2020
-
[21]
In: Proceedings of the European conference on computer vision, pp
Yao, T., Pan, Y., Li, Y., et al.: Exploring Visual Relationship for Image Captioning. In: Proceedings of the European conference on computer vision, pp. 684-699 (2018)
work page 2018
-
[22]
In: Advances in neural information processing systems, arXiv preprint arXiv:1906.05963 (2019)
Herdade, S., Kappeler, A., Boakye, K., et al.: Image Captioning: Transforming Objects into Words. In: Advances in neural information processing systems, arXiv preprint arXiv:1906.05963 (2019)
-
[23]
In: Pro- ceedings of the IEEE/CVF international conference on computer vision, pp
Huang, L., Wang, W., Chen, J., et al.: Attention on attention for image captioning. In: Pro- ceedings of the IEEE/CVF international conference on computer vision, pp. 4634 -4643 (2019)
work page 2019
-
[24]
arXiv preprint arXiv: 2111.09734 (2021)
Mokady, R., Hertz, A., Bermano, A., H.: ClipCap: CLIP Prefix for Image Captioning. arXiv preprint arXiv: 2111.09734 (2021)
-
[25]
arXiv preprint arXiv:2411.10950 (2024)
Qi, S., Cao, Z., Rao, J., et al.: Understanding Multimodal LLMs: the Mechanistic Interpret- ability of LLaVA in Visual Question Answering. arXiv preprint arXiv:2411.10950 (2024)
-
[26]
W., Hallacy, C., et al.: Learning Transferable Visual Models From Natural Language Supervision
Radford, A., Kim, J. W., Hallacy, C., et al.: Learning Transferable Visual Models From Natural Language Supervision. In: International conference on machine learning. pp. 8748- 8763 (2021)
work page 2021
-
[27]
B., Donahue, J., Luc, P., et al.: Flamingo: a Visual Language Model for Few - Shot Learning
Alayrac, J. B., Donahue, J., Luc, P., et al.: Flamingo: a Visual Language Model for Few - Shot Learning. In: Advances in neural information processing systems, pp. 23716 -23736 (2022)
work page 2022
-
[28]
Contributors, O., C.: OpenCompass: A universal evaluation platform for foundation models. URL: https://github. com/open-compass (2023)
work page 2023
-
[29]
arXiv preprint arXiv: 2411.02265 (2024)
Sun, X., Chen, Y., Huang, Y., et al.: Hunyuan -large: An open -source moe model with 52 billion activated parameters by tencent. arXiv preprint arXiv: 2411.02265 (2024)
-
[30]
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
Zeng, A., Xu, B., Wang, B. et al.: ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793 (2024) 18 L.Chao and H. Cai et al
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
arXiv preprint arXiv: 2504.13914 (2025)
Seed, B., Chen, J., Fan, T., et al.: Seed1.5 -Thinking: Advancing Superb Reasoning Models with Reinforcement Learning. arXiv preprint arXiv: 2504.13914 (2025)
-
[32]
Beta.: Grok 3 Beta-The Age of Reasoning Agents
Grok, X. Beta.: Grok 3 Beta-The Age of Reasoning Agents. URL: https://x. ai/news/grok-3 (2025)
work page 2025
-
[33]
arXiv preprint arXiv:2404.13813 (2024)
Enis, M., Hopkins, M.: From LLM to NMT: Advancing Low-Resource Machine Translation with Claude. arXiv preprint arXiv:2404.13813 (2024)
-
[34]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Abdin, M., Aneja, J., Awadalla, H., et al.: Phi -3 Technical Report: A Highly Capable Lan- guage Model Locally on Your Phone Visual Instruction Tuning. arXiv preprint arXiv: 2404.14219 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Liu, H., Li, C., Li, Y., et al.: Improved baselines with visual instruction tuning. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296- 26306 (2024)
work page 2024
-
[36]
Chen, Z., Wang, W., Cao, Y., et al.: Expandin g performance boundaries of open -source multimodal models with model, data, and test -time scaling. arXiv preprint arXiv:2412.05271 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.