
arxiv: 2509.26158 · v2 · submitted 2025-09-30 · 💻 cs.CV · cs.AI

Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis

Pith reviewed 2026-05-18 12:43 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: edge-case synthesis · text-to-image generation · large language models · preference learning · object detection · data augmentation · robustness · FishEye8K

The pith

An LLM fine-tuned with preference learning rephrases captions to steer text-to-image models toward synthesizing difficult edge-case images that improve object detection robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that manually curating challenging edge cases is a major bottleneck for reducing dataset bias in deep neural networks, and proposes to automate it with a pipeline in which a preference-tuned large language model rewrites image captions into diverse prompts. These prompts then direct a text-to-image model to generate hard visual scenarios for training. A sympathetic reader would care because this shifts data expansion from slow human effort to scalable, targeted synthesis, potentially making models more reliable on real-world distributions such as fisheye images. On the FishEye8K object detection benchmark, the paper reports that the method outperforms both naive augmentation and hand-engineered prompts.

Core claim

The paper claims that fine-tuning a large language model via preference learning to rephrase image captions into diverse textual prompts allows a text-to-image model to generate difficult visual scenarios, and that training object detectors on these synthetic edge cases yields superior robustness on the FishEye8K benchmark compared with naive augmentation or manually engineered prompts.

What carries the argument

The preference-learned LLM rephraser that converts standard image captions into targeted, diverse prompts to steer text-to-image generation toward edge cases.
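
The flow described above (caption, then rephrased prompt, then generated image) can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: `synthesize_edge_cases`, `toy_rephrase`, and `toy_generate` are hypothetical names, and the two toy functions merely stand in for the preference-tuned LLM and the text-to-image model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class SyntheticSample:
    source_caption: str  # caption of the real image
    prompt: str          # rephrased prompt sent to the image generator
    image: Any           # whatever the text-to-image backend returns


def synthesize_edge_cases(
    captions: List[str],
    rephrase: Callable[[str, int], str],
    generate: Callable[[str], Any],
    prompts_per_caption: int = 2,
) -> List[SyntheticSample]:
    """Rephrase each caption several times and generate one image per prompt."""
    samples = []
    for caption in captions:
        for variant in range(prompts_per_caption):
            prompt = rephrase(caption, variant)
            samples.append(SyntheticSample(caption, prompt, generate(prompt)))
    return samples


# Hypothetical stand-in for the preference-tuned LLM rephraser.
def toy_rephrase(caption: str, variant: int) -> str:
    conditions = ["at night in heavy rain", "at dusk with strong glare"]
    return f"{caption}, {conditions[variant % len(conditions)]}"


# Hypothetical stand-in for the text-to-image model; returns a placeholder.
def toy_generate(prompt: str) -> dict:
    return {"prompt": prompt, "pixels": None}


samples = synthesize_edge_cases(
    ["a fisheye view of an intersection"], toy_rephrase, toy_generate
)
for sample in samples:
    print(sample.prompt)
```

Passing the rephraser and generator as callables keeps the loop independent of any particular LLM or image model, so a real backend could be swapped in without changing the loop itself.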

If this is right

  • Data curation for robustness can move from manual labeling of rare cases to automated, repeatable synthesis.
  • Models can be iteratively improved by repeatedly generating new targeted edge cases without additional human annotation.
  • The approach scales to new domains by applying the same caption-rephrasing and generation steps to other datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could be tested on other vision tasks such as segmentation or depth estimation to check whether the robustness gains generalize.
  • If the generated images contain subtle domain shifts, combining them with real data filtering steps might be needed to avoid performance plateaus.
  • Extending the method to video or multi-view synthesis could address temporal or geometric edge cases that single images miss.
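
The filtering idea in the second bullet could look roughly like the sketch below, which keeps a synthetic sample only if its embedding is close enough to at least one real embedding. Everything here is an assumption for illustration: the function names, the cosine-similarity criterion, and the threshold are not taken from the paper, and the embeddings are toy 2-D vectors rather than outputs of a real image encoder.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_by_nearest_real(
    synthetic: List[Sequence[float]],
    real: List[Sequence[float]],
    min_similarity: float = 0.5,  # hypothetical threshold
) -> List[int]:
    """Return indices of synthetic embeddings whose best match among the
    real embeddings reaches the similarity threshold."""
    kept = []
    for index, embedding in enumerate(synthetic):
        best = max(cosine(embedding, r) for r in real)
        if best >= min_similarity:
            kept.append(index)
    return kept


# Toy embeddings: the first synthetic vector is near a real one, the second is not.
real_embeddings = [[1.0, 0.0], [0.0, 1.0]]
synthetic_embeddings = [[0.9, 0.1], [-1.0, -1.0]]
kept_indices = filter_by_nearest_real(synthetic_embeddings, real_embeddings)
print(kept_indices)
```

A threshold that is too strict would discard exactly the unusual samples the method is meant to produce, so in practice it would need tuning against detector performance.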

Load-bearing premise

The images produced by the text-to-image model from the LLM-rephrased prompts are realistic and lie close enough to the real data distribution to improve robustness without introducing harmful artifacts or biases.

What would settle it

Train an object detector on the FishEye8K training set augmented with the synthetic edge cases, then measure whether its mean average precision on a held-out real test set is higher than both the naive-augmentation baseline and the manual-prompt baseline while standard metrics show no degradation.
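
The test above can be written as an explicit decision rule. The sketch below is one possible reading, with assumptions stated plainly: "no degradation" is taken to mean that the standard metric of the detector trained with automatic synthesis does not fall below that of a detector trained on real data only, and all numbers are invented placeholders, not results from the paper.

```python
from typing import Dict


def claim_supported(
    heldout_map: Dict[str, float],
    standard_metric: Dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """Check the two conditions of the settling experiment.

    heldout_map: mAP on the held-out real test set, per training condition.
    standard_metric: a standard metric per condition, including "real_only".
    """
    automatic = heldout_map["automatic"]
    beats_baselines = (
        automatic > heldout_map["naive"] and automatic > heldout_map["manual"]
    )
    # Assumption: degradation is measured against the real-data-only detector.
    no_degradation = (
        standard_metric["automatic"] >= standard_metric["real_only"] - tolerance
    )
    return beats_baselines and no_degradation


# Placeholder numbers for illustration only.
example_map = {"naive": 0.41, "manual": 0.43, "automatic": 0.46}
example_standard = {"real_only": 0.60, "automatic": 0.61}
print(claim_supported(example_map, example_standard))
```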

Figures

Figures reproduced from arXiv: 2509.26158 by Kyeongryeol Go.

Figure 1: Training pipeline of the rephrasing LLM. Best viewed in color.
Figure 2: Data augmentation pipeline by inference of the preference-tuned LLM.
Figure 3: Number of images per camera ID and time-of-day in FishEye8K. The dataset is split into …
Figure 4: Filtered ground-truth annotations and predictions to compute mAP w/o TP.
Figure 5: Comparison of naive, manual, and automatic by UMAP […]
Figure 6: Comparison of real and synthetic data. Pseudo annotations are marked with different colors …
Original abstract

The performance of deep neural networks is strongly influenced by the quality of their training data. However, mitigating dataset bias by manually curating challenging edge cases remains a major bottleneck. To address this, we propose an automated pipeline for text-guided edge-case synthesis. Our approach employs a Large Language Model, fine-tuned via preference learning, to rephrase image captions into diverse textual prompts that steer a Text-to-Image model toward generating difficult visual scenarios. Evaluated on the FishEye8K object detection benchmark, our method achieves superior robustness, surpassing both naive augmentation and manually engineered prompts. This work establishes a scalable framework that shifts data curation from manual effort to automated, targeted synthesis, offering a promising direction for developing more reliable and continuously improving AI systems. Code is available at https://github.com/gokyeongryeol/ATES.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces an automated pipeline for text-guided edge-case synthesis in which an LLM fine-tuned via preference learning rephrases image captions into diverse prompts that guide a text-to-image model to generate challenging visual scenarios. These synthetic images are intended to expand training data coverage and mitigate dataset bias. The central empirical claim is that the resulting augmentation yields superior robustness on the FishEye8K object detection benchmark relative to both naive augmentation and manually engineered prompts.

Significance. If the performance gains are shown to arise from targeted edge-case synthesis rather than data volume alone, the work would provide a scalable, automated alternative to manual curation of difficult examples, with potential impact on continual robustness improvement for vision models under distribution shift. The public release of code at https://github.com/gokyeongryeol/ATES is a clear strength that supports reproducibility.

major comments (2)
  1. [Evaluation] Evaluation section: the claim that the method 'achieves superior robustness' on FishEye8K is presented without any reported quantitative metrics (e.g., mAP, AP on rare classes), error bars, statistical significance tests, train/validation/test splits, or the exact number of synthetic images added. This absence prevents assessment of whether observed gains exceed those obtainable by simply scaling the training set size.
  2. [Experiments] Experiments / Ablation studies: no controls are described that match the number of added images across the naive-augmentation, manual-prompt, and proposed-method conditions, nor is there quantification of how often the generated samples increase model error on held-out real edge cases from FishEye8K. Without these, any measured improvement could be explained by data scaling rather than the targeted synthesis mechanism.
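
The size-matched control requested in major comment 2 can be set up by subsampling every condition's synthetic pool to a common budget before training. The sketch below is illustrative only: `match_pool_sizes`, the file names, and the pool sizes are hypothetical and do not come from the paper.

```python
import random
from typing import Dict, List


def match_pool_sizes(pools: Dict[str, List[str]], seed: int = 0) -> Dict[str, List[str]]:
    """Subsample each condition's pool to the size of the smallest pool,
    so every condition adds the same number of images."""
    budget = min(len(items) for items in pools.values())
    rng = random.Random(seed)  # fixed seed so the subsample is repeatable
    return {name: rng.sample(list(items), budget) for name, items in pools.items()}


# Hypothetical pools of synthetic image paths, one per augmentation condition.
pools = {
    "naive": [f"naive_{i}.png" for i in range(120)],
    "manual": [f"manual_{i}.png" for i in range(80)],
    "automatic": [f"auto_{i}.png" for i in range(100)],
}
matched = match_pool_sizes(pools)
print({name: len(items) for name, items in matched.items()})
```

With equal budgets, any remaining gap between conditions cannot be attributed to the number of added images alone.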
minor comments (2)
  1. [Abstract] Abstract: the phrase 'superior robustness' should be accompanied by the concrete metric and the magnitude of improvement to allow readers to gauge the result immediately.
  2. [Method] Method description: clarify the precise preference-learning objective and the criteria used to label preferred versus dispreferred prompt pairs, as these choices directly affect the diversity and difficulty of the generated edge cases.
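
For orientation on minor comment 2, one widely used preference-learning objective is Direct Preference Optimization (DPO), which appears in the paper's reference list. Whether the paper uses exactly this form is an assumption here, since the abstract does not state the objective. In the notation below, $x$ is the input caption, $y_w$ and $y_l$ are the preferred and dispreferred rephrasings, $\pi_\theta$ is the model being tuned, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the logistic function, and $\beta$ is a temperature hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Under an objective of this kind, the referee's question reduces to how the pairs $(y_w, y_l)$ in $\mathcal{D}$ are labeled, because that labeling is what defines "difficult" for the rephraser.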

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects for improving the rigor of our evaluation. We will revise the manuscript accordingly to address the concerns raised regarding quantitative metrics and experimental controls.

point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the claim that the method 'achieves superior robustness' on FishEye8K is presented without any reported quantitative metrics (e.g., mAP, AP on rare classes), error bars, statistical significance tests, train/validation/test splits, or the exact number of synthetic images added. This absence prevents assessment of whether observed gains exceed those obtainable by simply scaling the training set size.

    Authors: We agree that the current presentation lacks sufficient quantitative detail to fully substantiate the robustness claims. In the revised manuscript, we will report mAP scores, AP for rare classes, error bars computed over multiple random seeds, statistical significance tests comparing our method to baselines, details on the data splits, and the exact number of synthetic images added. These revisions will allow a clearer determination of whether the gains are due to targeted edge-case synthesis or simply increased data volume.

    revision: yes

  2. Referee: [Experiments] Experiments / Ablation studies: no controls are described that match the number of added images across the naive-augmentation, manual-prompt, and proposed-method conditions, nor is there quantification of how often the generated samples increase model error on held-out real edge cases from FishEye8K. Without these, any measured improvement could be explained by data scaling rather than the targeted synthesis mechanism.

    Authors: We acknowledge the need for better-controlled experiments to isolate the contribution of our method. We will incorporate ablation studies where the number of added synthetic images is matched across all conditions (naive augmentation, manual prompts, and our automated approach). Furthermore, we will add an analysis measuring how frequently the generated samples correspond to increased error rates on held-out real edge cases from FishEye8K. This will help demonstrate that improvements arise from the targeted nature of the synthesis rather than data scaling alone.

    revision: yes
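
The error bars and significance tests promised in the first response could be computed along the lines below. This is a sketch under stated assumptions: the per-seed mAP values are invented placeholders, the function names are hypothetical, and the exact paired sign-flip permutation test is one reasonable choice of test rather than the one the authors commit to.

```python
from itertools import product
from statistics import mean, stdev
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across random seeds."""
    return mean(scores), stdev(scores)


def paired_permutation_pvalue(a: List[float], b: List[float]) -> float:
    """Two-sided exact sign-flip test on the per-seed differences a[i] - b[i].

    Enumerates all 2^n sign assignments, so it is only practical for a small
    number of seeds.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:  # small slack for float rounding
            hits += 1
    return hits / total


# Placeholder per-seed mAP values, paired by seed; not results from the paper.
automatic_scores = [0.50, 0.52, 0.51, 0.53, 0.49]
manual_scores = [0.45, 0.47, 0.46, 0.47, 0.44]

auto_mean, auto_std = summarize(automatic_scores)
p_value = paired_permutation_pvalue(automatic_scores, manual_scores)
print(f"automatic: {auto_mean:.3f} +/- {auto_std:.3f}, p = {p_value:.4f}")
```

With only five seeds, the smallest two-sided p-value this test can return is 2/32, which is worth keeping in mind when deciding how many seeds to run.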

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external benchmark with no self-referential derivations

full rationale

The paper describes an automated pipeline that uses an LLM (fine-tuned via preference learning) to rephrase captions into prompts for a text-to-image model, then evaluates the resulting synthetic data on the independent FishEye8K object detection benchmark. Claims of superior robustness are supported by direct comparisons to naive augmentation and manual prompts rather than any internal equations, fitted parameters, or predictions that reduce by construction to the method's own inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way within the provided text, and the central results remain falsifiable against external real-world data.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only view limits visibility into exact hyperparameters and modeling choices; the approach rests on standard generative modeling assumptions rather than novel axioms.

free parameters (1)
  • LLM preference learning hyperparameters
    Fine-tuning details and reward model parameters are not specified in the abstract but are required for the pipeline.
axioms (1)
  • domain assumption: Text-to-image models can produce images that serve as effective training data for object detection when guided by complex prompts
    Core assumption enabling the synthesis step; invoked implicitly in the pipeline description.


