
arxiv: 2605.21411 · v1 · pith:3MGJGZGBnew · submitted 2026-05-20 · 💻 cs.CV

RoadTones: Tone Controllable Text Generation from Road Event Videos

Pith reviewed 2026-05-21 04:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords road event videos · tone controllable captioning · video-language models · chain-of-thought generation · RoadTones-51K dataset · factual consistency · tone adherence · user study validation

The pith

A new dataset, model, and evaluation suite enables video models to generate road event captions with controllable tone while preserving facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing video-language models produce factual descriptions of road events but cannot adjust their tone, urgency, or style. This limits their usefulness in settings where how a message is delivered matters as much as its accuracy. The paper builds a human-validated pipeline to expand road video data with tonal annotations and multiple captions per event, creating the RoadTones-51K dataset. It introduces the RoadTones-VL-CoT model that conditions generation on tone through intermediate chain-of-thought drafts. A dedicated evaluation suite and user study then measure both factual consistency and successful tone control.

Core claim

By expanding existing road-video corpora through a human-validated pipeline that adds diverse tonal annotations and multi-tone captions, the work creates the RoadTones-51K dataset. The RoadTones-VL-CoT model then generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability, while RoadTones-Eval jointly assesses factual consistency and tone adherence. User studies confirm that the resulting captions maintain quality, adhere to the requested tone, and remain factually consistent.

What carries the argument

RoadTones-VL-CoT, a controllable video-to-text model that produces tone-conditioned Chain-of-Thought intermediate drafts to separate stylistic control from factual content.

If this is right

  • The same road event can be described with different urgency or calmness without changing the underlying facts.
  • Communication in safety-critical driving scenarios becomes more adaptable to audience or context needs.
  • Interpretability increases because the model exposes its tone-conditioned reasoning steps before the final caption.
  • Evaluation now requires joint measurement of factual accuracy and stylistic adherence rather than accuracy alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could extend to real-time alert systems that match warning tone to the severity of detected hazards.
  • Similar tone-conditioning pipelines might apply to other video domains such as medical or sports footage.
  • If the data pipeline can be partially automated, larger-scale tone control becomes feasible with less human effort.

Load-bearing premise

The human-validated data generation pipeline produces reliable and unbiased tonal annotations and multi-tone captions that accurately reflect controllable stylistic variations independent of factual content.

What would settle it

A controlled test in which human raters or automated metrics show that changing the tone instruction consistently alters factual content or fails to produce the requested tone across multiple road events.
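
One way to make this test concrete is sketched below. It is illustrative only and is not taken from the paper: the record layout, the function name `tone_swap_report`, and both thresholds (`fact_tol`, `tone_min`) are assumptions chosen for the example. Each record holds one caption's factual-consistency and tone-adherence scores (from human raters or automated metrics, both scaled to [0, 1]) for a given road event and tone instruction.

```python
from collections import defaultdict


def tone_swap_report(records, fact_tol=0.1, tone_min=0.7):
    """Summarise a tone-swap test.

    records: iterable of (event_id, tone, factual_score, tone_score),
             with both scores in [0, 1]. Layout is hypothetical.
    fact_tol: assumed tolerance for factual-score spread within one event.
    tone_min: assumed minimum acceptable tone-adherence score.

    Returns (fraction of events whose factual score drifts across tones,
             fraction of captions that miss the requested tone).
    """
    factual_by_event = defaultdict(list)
    tone_failures = 0
    total = 0
    for event_id, _tone, factual_score, tone_score in records:
        factual_by_event[event_id].append(factual_score)
        total += 1
        if tone_score < tone_min:
            tone_failures += 1
    if total == 0:
        raise ValueError("no records supplied")
    # An event "drifts" when changing only the tone moves the factual score
    # by more than the tolerance.
    drifting = sum(
        1 for scores in factual_by_event.values()
        if max(scores) - min(scores) > fact_tol
    )
    return drifting / len(factual_by_event), tone_failures / total
```

A high value for either fraction across many road events would count against the load-bearing premise; values near zero would support it.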

Figures

Figures reproduced from arXiv: 2605.21411 by Chirag Parikh, Ravi Kiran Sarvadevabhatla, Siddhi Pravin Lipare.

Figure 1. Tone-Controlled Road Video Captioning. Our user interface demonstrates captioning capability with fine-grained control across five tone dimensions: Personality, Writing Style, Event Details (Informativeness), Structural Attributes, and Caption Length. (1) The user uploads a Road Video, (2) selects the desired controls with specific intensities, and adds structural attributes such as ‘Emojis’ and ‘Hashtags’. …
Figure 2. Generating Distinct Tone Captions Per-Video. (A) Given a reference video Cr, we first retrieve similar road events using a k-nearest neighbor approach. (B) We obtain tone profiles from captions using the Tone Extractor (TX, Sec. 4.2). The Tone Evaluator (TE, Sec. 6) then selects the tone profiles most dissimilar to the reference tr. (C) The selected tone profiles (t2, t3, tk) are fed to TC-Gen (Sec. 4.1)…
Figure 3. Tone-controlled Caption Generation pipeline. Inputs A.1 and A.2 include the target Tone Controls (Narrative and Structural) and a detailed video summary, respectively. The inputs are fed to B. Tone-controlled Caption Generator, a two-stage pipeline (1, 2). At each stage, the generator conditions on the pipeline inputs to produce candidate captions. Stage 1 infuses Writing style and enforces Structural C…
Figure 4. Controlling individual tonal attributes in the generated caption. The central panel shows a video V, tone controls TC, and its corresponding caption 0 from our dataset. The surrounding captions (1-8) correspond to changes in one of the tonal attributes shown in their header. For example, caption 1 was obtained by increasing the tonal intensity of Caring Personality from Absent (0-0.2) to Very Str…
Figure 5. Distribution of Personality, Writing Style and Structural at…
Figure 6. Applicability of our tone-controlled caption generation pipeline on popular road video datasets: SUTD-TrafficQA [28] and LingoQA [13]. For each video sample, we show two distinct tone captions (1, 2) with corresponding dominant Personality and Writing Style attributes highlighted in text. Video summary is shown for reference. More samples are provided in Sec. A.1.
Figure 7. Applicability of our tone-controlled caption generation pipeline on popular road video dataset: LingoQA [13]. For each video sample, we show two distinct tone captions (1, 2) with corresponding dominant Personality and Writing Style attributes highlighted in text. Video summary is shown for reference.
Figure 10. Applicability of our tone-controlled caption generation pipeline on popular road video dataset: SUTD-TrafficQA [28]. For each video sample, we show two distinct tone captions (1, 2) with corresponding dominant Personality and Writing Style attributes highlighted in text. Video summary is shown for reference.
Figure 11. Top 25 Personality Traits Intensity Distribution in RoadTones-51K. The chart visualizes the total instances of the 25 most frequent personality traits segmented into three intensity bins: Weak (0.4-0.6), Moderate (0.6-0.8), and Strong (0.8-1.0). Traits with intensity level less than 0.4 are not considered for tone-controllable captioning.
Figure 12. Intensity Distribution of Writing Styles in RoadTones-51K. The chart visualizes the total instances of 16 writing styles segmented into four intensity bins: Very Weak (0.2-0.4), Weak (0.4-0.6), Moderate (0.6-0.8), and Strong (0.8-1.0). Attributes with intensity level less than 0.2 are not considered for tone-controllable captioning.
Figure 15. Personality Trait Distribution. This figure displays the distribution of the 75 most frequent personality traits in RoadTones-51K. Less frequent traits are aggregated into the “Others” category.
Figure 16. Representative samples from RoadTones-51K with potential use cases/applications. TC-Gen’s tone-controlled captions can be used in diverse domains, such as issuing (1) Safety Advisories, conducting (2) Post-Drive Analysis, or creating (3) Engaging Posts for social media.

    Dataset               mAP@5   mAP@all   ACC     NMI
    nexar-collision [5]   0.986   0.852     0.75    0.341
    ROADWork [4]          0.983   0.884     0.847   0.504
    …
Figure 17. User Study Interface for TC-Gen Caption Quality Assessment. Participants viewed a video and its video summary, and evaluated the quality of the corresponding caption generated by TC-Gen based on Tone Alignment, Tone Relevance, Factual Consistency, Usefulness and Human-Likeness on a 5-point Likert scale.
Figure 18. User Study Interface for Agreement on RoadTones-Eval Metrics. Participants viewed a video and its video summary, and rated the corresponding caption generated by RoadTones-VL-CoT based on Tone Alignment and Factual Consistency on a 5-point Likert scale. The user ratings were then correlated with scores computed by RoadTones-Eval metrics.
Figure 19. User Study Interface for Tone Controllability Evaluation. Participants viewed a video and its video summary, and evaluated Tone Controllability and Factual Consistency of the corresponding captions generated by TC-Gen.
Figure 20. TC-Gen Stage-1 prompt.
Figure 21. TC-Gen Stage-2 prompt.
Figure 22. Writing Style tone extraction prompt. The prompt defines the task, context, scoring criteria, and restrictions provided to the LLM for writing style attributes intensity prediction based on caption text. Video summary about the key road event is also provided to disentangle the factual from the tonal content of the caption.
Figure 23. Writing Style tone schema defining the 16 attributes along with examples based on intensity levels.
Figure 24. Personality trait extraction prompt. The prompt defines the task, context, scoring criteria, and restrictions provided to the LLM for personality traits intensity prediction based on caption text. Video summary about the key road event is also provided to disentangle the factual from the tonal content of the caption.
Figure 25. Informativeness level extraction prompt. The prompt defines the task, context, scoring criteria, and restrictions provided to the LLM for informativeness level prediction based on the amount of factual information conveyed through the caption relative to the detailed road video summary.
Figure 26. Structural attributes extraction prompt. The prompt guides the LLM to classify the presence (‘yes’ or ‘no’) of Location, Date/Time, and First-Person View based on the provided definitions.
Figure 27. Personality tone alignment evaluation prompt (Sp).
Figure 28. Writing Style tone alignment evaluation prompt (Sw).
Figure 29. Factual Consistency score evaluation prompt (FC).
Figure 30. Qualitative comparison of RoadTones-VL-CoT model predictions with respect to TC-Gen generated ground truth captions and intermediate stage-level outcomes provided as rationales. Reasoning step 4 selects the stage-level caption that best satisfies the tone controls (marked in the figure).
Figure 31. RoadTones-VL-CoT consistently follows the specified tonal controls. Gemini-2.5-pro [3] exhibits minor tonal misalignment, whereas Qwen3-VL-8B-Instruct [21] and Mini-CPM-V 4.5 [29] show significantly poor adherence to the tone controls.
Figure 32. Qualitative comparison of tone-controlled captions generated by…
Figure 33. Qualitative comparison of tone-controlled captions generated by…
Figure 34. Qualitative comparison of tone-controlled captions generated by…
Figure 35. Qualitative comparison of tone-controlled captions generated by…
Figure 36. Qualitative comparison of tone-controlled captions generated by…
Figure 37. Qualitative comparison of tone-controlled captions generated by…
Figure 38. Qualitative comparison of tone-controlled captions generated by…
Figure 39. Interface for RoadTones User Study familiarization phase. For the image shown, participants viewed a video and its video summary, and identified the presence of the dominant tone in the caption. Questionnaire for all tasks can be viewed in the supplementary video: RoadTones UserStudy familiarization.mp4
Figure 40. Controlling individual tonal attributes in the generated caption. The central panel shows a video V, tone controls TC, and its corresponding caption 0 from our dataset. The surrounding captions (1-8) correspond to changes in one of the tonal attributes shown in their header. For example, caption 1 was obtained by increasing the tonal intensity of Caring Personality from Absent (0-0.2) to Very St…
Figure 41. Controlling individual tonal attributes in the generated caption. The central panel shows a video V, tone controls TC, and its corresponding caption 0 from our dataset. The surrounding captions (1-8) correspond to changes in one of the tonal attributes shown in their header. For example, caption 1 was obtained by increasing the tonal intensity of Caring Personality from Absent (0-0.2) to Very St…
Figure 42. Controlling individual tonal attributes in the generated caption. The central panel shows a video V, tone controls TC, and its corresponding caption 0 from our dataset. The surrounding captions (1-8) correspond to changes in one of the tonal attributes shown in their header. For example, caption 1 was obtained by increasing the tonal intensity of Caring Personality from Absent (0-0.2) to Very St…
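
The Figure 11 and Figure 12 captions describe how continuous tone intensities in [0, 1] are grouped into named bins, with low-intensity attributes excluded from tone-controllable captioning. The sketch below shows one way such a mapping could look. The bin edges come from those two captions; the function name `intensity_bin`, the constant names, and the choice to treat each lower edge as inclusive are assumptions, since the captions do not say which bin a boundary value such as 0.6 belongs to.

```python
from typing import List, Optional, Tuple

# (lower edge, label) pairs, in ascending order. Edges follow the captions of
# Figure 11 (personality traits) and Figure 12 (writing styles).
PERSONALITY_BINS: List[Tuple[float, str]] = [
    (0.4, "Weak"), (0.6, "Moderate"), (0.8, "Strong"),
]
WRITING_STYLE_BINS: List[Tuple[float, str]] = [
    (0.2, "Very Weak"), (0.4, "Weak"), (0.6, "Moderate"), (0.8, "Strong"),
]


def intensity_bin(score: float, bins: List[Tuple[float, str]]) -> Optional[str]:
    """Return the label of the bin containing `score`.

    Returns None when the score is below the lowest edge, i.e. the attribute
    is not considered for tone-controllable captioning.
    Assumption: lower edges are inclusive and 1.0 falls in the top bin.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    label = None
    for lower_edge, name in bins:
        if score >= lower_edge:
            label = name  # keep the highest bin whose lower edge is reached
    return label
```

For example, a personality trait scored 0.3 is dropped (`None`), while a writing-style attribute with the same score lands in the "Very Weak" bin, because the two schemas use different cut-offs.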
Original abstract

Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To mitigate this, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning. Our human-validated data generation pipeline expands road-video corpora with diverse tonal annotations and multi-tone captions, yielding the RoadTones-51K dataset. We propose RoadTones-VL-CoT, a controllable video-to-text model that also generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability. We also introduce RoadTones-Eval, a new evaluation suite that jointly measures factual consistency and tone adherence. In addition, we conducted a user study whose results validate caption quality, tone control, and factual consistency. Together, these contributions lay the foundation for context-sensitive tone-controllable video captioning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RoadTones, a comprehensive dataset-model-evaluation suite for tone-controllable captioning of road event videos. It constructs the RoadTones-51K dataset via a human-validated pipeline that augments existing road-video corpora with tonal annotations and multi-tone captions. The proposed RoadTones-VL-CoT model generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability, while RoadTones-Eval jointly assesses factual consistency and tone adherence. A user study is reported to support claims of caption quality, tone control, and factual consistency.

Significance. If the central claims hold, the work supplies a useful resource for controllable video-to-text generation in safety-critical domains such as traffic monitoring and emergency communication, where stylistic presentation can influence message effectiveness. The combination of a sizable annotated dataset, an interpretable CoT-based model, and a dedicated joint evaluation suite could facilitate further research on style-content disentanglement in multimodal models. The human-validation step and user study represent constructive empirical grounding when accompanied by the necessary quantitative details.

major comments (2)
  1. [§3.2 (Human-Validated Data Generation Pipeline)] The manuscript does not report inter-annotator agreement metrics or annotation guidelines that explicitly instruct annotators to decouple tone from factual event properties (e.g., collision severity or vehicle count). Without post-hoc correlation tests between tone labels and factual metadata, it remains unclear whether the multi-tone captions achieve stylistic variation independent of content; this directly undermines the claim that RoadTones-VL-CoT performs controllable tone generation rather than implicit content prediction.
  2. [§5 (User Study and RoadTones-Eval)] The user study is cited as validating caption quality, tone control, and factual consistency, yet the text supplies no quantitative results (e.g., mean scores, statistical tests, or inter-rater reliability), error analysis, or ablation details. This absence prevents verification of the strength of the empirical support for the central claims on tone adherence and factual consistency.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by briefly stating the scale of the user study (number of participants, videos evaluated) and the key quantitative outcomes rather than only qualitative validation language.
  2. [Figure 2] Figure captions for the model architecture diagram could more explicitly label the tone-conditioning input pathway and the CoT draft generation branch to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to improve transparency and empirical detail.

  1. Referee: [§3.2 (Human-Validated Data Generation Pipeline)] The manuscript does not report inter-annotator agreement metrics or annotation guidelines that explicitly instruct annotators to decouple tone from factual event properties (e.g., collision severity or vehicle count). Without post-hoc correlation tests between tone labels and factual metadata, it remains unclear whether the multi-tone captions achieve stylistic variation independent of content; this directly undermines the claim that RoadTones-VL-CoT performs controllable tone generation rather than implicit content prediction.

    Authors: We acknowledge the value of explicit documentation for the annotation process. Although our human-validated pipeline was designed with instructions to prioritize tonal style over factual content, we did not include inter-annotator agreement statistics or the full guidelines in the manuscript. In the revised version, we will add the complete annotation guidelines to the supplementary material, report inter-annotator agreement (e.g., Fleiss’ kappa across annotators), and include post-hoc correlation analyses (Pearson and Spearman) between tone labels and factual metadata such as event severity and object counts. These additions will strengthen the evidence that stylistic variation is independent of content. revision: yes

  2. Referee: [§5 (User Study and RoadTones-Eval)] The user study is cited as validating caption quality, tone control, and factual consistency, yet the text supplies no quantitative results (e.g., mean scores, statistical tests, or inter-rater reliability), error analysis, or ablation details. This absence prevents verification of the strength of the empirical support for the central claims on tone adherence and factual consistency.

    Authors: We agree that the current description of the user study is insufficiently detailed. While the manuscript states that the study validates the relevant properties, specific quantitative results were summarized rather than fully reported. In the revision, we will expand §5 to include mean scores and standard deviations from the Likert-scale ratings, statistical tests (e.g., paired t-tests for tone control comparisons), inter-rater reliability metrics (e.g., Cronbach’s alpha), a concise error analysis, and any relevant ablation results. These details will be placed in the main text or appendix as appropriate to allow full verification of our claims. revision: yes
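
The simulated rebuttal names Fleiss’ kappa and Cronbach’s alpha as the agreement statistics to be reported. For readers unfamiliar with them, the sketch below computes both from their standard textbook definitions using only the Python standard library. It is not the authors' code: the function names and the input layouts (a per-item category count table for kappa, a captions-by-raters score matrix for alpha) are assumptions made for illustration.

```python
from statistics import variance
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa, where counts[i][j] is the number of raters who
    assigned item i to category j. Every item needs the same rater count."""
    if not counts:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_items = len(counts)
    n_categories = len(counts[0])
    # Observed agreement: mean over items of pairwise rater agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the marginal category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when one category takes all ratings")
    return (p_bar - p_e) / (1.0 - p_e)


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha, where scores[i][k] is the rating of caption i
    given by rater (or questionnaire item) k."""
    if len(scores) < 2 or len(scores[0]) < 2:
        raise ValueError("need at least two captions and two raters/items")
    k = len(scores[0])
    column_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    return k / (k - 1) * (1.0 - sum(column_vars) / total_var)
```

Kappa suits categorical tone labels (for instance, which tone is dominant), while alpha applies to the numeric Likert ratings; both equal 1.0 when raters agree perfectly.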

Circularity Check

0 steps flagged

No significant circularity in empirical dataset-model construction


The paper presents an empirical contribution consisting of a new dataset (RoadTones-51K), a controllable video-to-text model (RoadTones-VL-CoT), and an evaluation suite (RoadTones-Eval), all built via a human-validated data generation pipeline and validated through user studies. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. The central claims rest on external human annotation and validation processes rather than reducing to self-referential definitions or inputs by construction, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; typical large-model training details such as learning rates or prompt templates are not disclosed.

pith-pipeline@v0.9.0 · 5709 in / 1117 out tokens · 34021 ms · 2026-05-21T04:32:19.676154+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 5 internal anchors

  [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  [2] Tianlang Chen, Zhongping Zhang, Quanzeng You, Chen Fang, Zhaowen Wang, Hailin Jin, and Jiebo Luo. “Factual” or “emotional”: Stylized image captioning with adaptive learning and attention. In Proceedings of the European Conference on Computer Vision (ECCV), pages 519–535, 2018.

  [3] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  [4] Ghosh et al. Roadwork dataset...ICCV, 2025.

  [5] Moura et al. Nexar dashcam collision...CVPR, 2025.

  [6] Longteng Guo, Jing Liu, Peng Yao, Jiangwei Li, and Hanqing Lu. Mscap: Multi-style image captioning with unpaired stylized text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4204–4213,

  [7] Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Robotron-drive: All-in-one large multimodal model for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8011–8021, 2025.

  [8] Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–578, 2018.

  [9] Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, et al. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791, 2025.

  [10] Fenglin Liu, Xuancheng Ren, Xian Wu, Bang Yang, Shen Ge, Yuexian Zou, and Xu Sun. O2na: An object-oriented non-autoregressive approach for controllable video captioning. arXiv preprint arXiv:2108.02359, 2021.

  [11] Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving. In European Conference on Computer Vision, pages 403–420. Springer, 2024.

  [12] Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint risk localization and captioning in driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1043–1052, 2023.

  [13] Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. LingoQA: Visual question answering for autonomous driving. In European Conference on Computer Vision, pages 252–269. Springer, 2024.

  [14] Alexander Mathews, Lexing Xie, and Xuming He. Senticap: Generating image descriptions with sentiments. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

  [15] Tomoya Nitta, Takumi Fukuzawa, and Toru Tamaki. Fine-grained length controllable video captioning with ordinal embeddings. IEEE Access, 12:189667–189688, 2024.

  [16] OpenAI. GPT-5 system card. https://cdn.openai.com/gpt-5-system-card.pdf, 2025.

  [17] OpenAI. Introducing GPT-4.1. https://openai.com/index/gpt-4-1/, 2025.

  [18] Chirag Parikh, Rohit Saluja, CV Jawahar, and Ravi Kiran Sarvadevabhatla. Idd-x: A multi-view dataset for ego-relative important object localization and explanation in dense and unstructured traffic. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 14815–14821. IEEE, 2024.

  [19] Chirag Parikh, Deepti Rawat, Rakshitha R. T., Tathagata Ghosh, and Ravi Kiran Sarvadevabhatla. RoadSocial: A diverse videoqa dataset and benchmark for road event understanding from social video narratives, 2025.

  [20] Tianheng Qiu, Jingchun Gao, Jingyu Li, Huiyi Leong, Xuan Huang, Xi Wang, Xiaocheng Zhang, Kele Xu, and Lan Zhang. Intentvcnet: Bridging spatio-temporal gaps for intention-oriented controllable video captioning. arXiv preprint arXiv:2507.18531, 2025.

  21. [21]

    Qwen3-vl-8b-instruct

    Qwen. Qwen3-vl-8b-instruct. https://huggingface. co/Qwen/Qwen3-VL-8B-Instruct , 2025. 5, 6, 7, 14, 15, 34, 35, 36, 37, 38, 39, 40, 41

  22. [22]

    Captionsmiths: Flexi- bly controlling language pattern in image captioning.ICCV,

    Kuniaki Saito, Donghyun Kim, Kwanyong Park, Atsushi Hashimoto, and Yoshitaka Ushiku. Captionsmiths: Flexi- bly controlling language pattern in image captioning.ICCV,

  23. [23]

    Engaging image captioning via personality

    Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. Engaging image captioning via personality,

  24. [24]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European conference on computer vision, pages 256–274. Springer, 2024. 1, 2

  25. [25]

    Emotional video captioning with vision-based emotion interpretation network

    Peipei Song, Dan Guo, Xun Yang, Shengeng Tang, and Meng Wang. Emotional video captioning with vision-based emotion interpretation network. IEEE Transactions on Image Processing, 33:1122–1135, 2024. 2

  26. [26]

    Controllable video captioning with pos sequence guidance based on gated fusion network

    Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, and Wei Liu. Controllable video captioning with pos sequence guidance based on gated fusion network. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2641–2650, 2019. 2

  27. [27]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 6

  28. [28]

    Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events

    Li Xu, He Huang, and Jun Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9878–9888, 2021. 1, 2, 7, 8, 13, 16, 17

  29. [29]

    MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

    Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, et al. Minicpm-v 4.5: Cooking efficient mllms via architecture, data, and training recipe. arXiv preprint arXiv:2509.18154, 2025. 6, 14, 15, 34, 35, 36, 37, 38, 39, 40, 41

  30. [30]

    Controllable video captioning with an exemplar sentence

    Yitian Yuan, Lin Ma, Jingwen Wang, and Wenwu Zhu. Controllable video captioning with an exemplar sentence. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1085–1093, 2020. 2

  31. [31]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025. 6

  32. [32]

    Embodied understanding of driving scenarios

    Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, and Hongyang Li. Embodied understanding of driving scenarios. In European Conference on Computer Vision, pages 129–148. Springer, 2024. 1, 2

    Supplementary Material: Contents

  33. [33]

    TC-Gen Tone-Controlled Caption Generation

    Dataset Creation: 4.1. TC-Gen Tone-Controlled Caption Generation; 4.2. TX Tone Extractor; 4.3. Data Statistics ...

  34. [34]

    VLM for Tone-Controllable Captioning

  35. [35]

    TE Tone Evaluation Metrics

  36. [36]

    Let’s look out for one another

    User Study; 10. Conclusion; List of Figures; A. Dataset Creation: A.1. TC-Gen Tone-Controlled Caption Generation; A.2. TX Tone Extractor; A.3. Data Statistics ...

  37. [37]

    Struct”) annotations in RoadTones-51K dataset. Only top 30 attributes of Personality tone are shown and remaining 86 are shown as “Others

    The colors in the generated captions map to blue for Personality and brown for Writing Style. 3. Tone-controlled Caption Generation pipeline. Inputs A.1 and A.2 include the target Tone Controls (Narrative and Structural) and a detailed video summary respectively. T...

  38. [38]

    A car pulled out from a side road without leaving sufficient space, nearly hitting a cyclist

    Key Event summary: A dashcam video captured a near-miss incident involving dangerous and careless driving in London, United Kingdom. A car pulled out from a side road without leaving sufficient space, nearly hitting a cyclist. The primary reason for the near-miss was the driver’s carelessness and failure to yield right of way to the cyclist. The incident ...

  39. [39]

    #CyclistLife

    Caption with Writing style and structure applied (informativeness, word count, binary toggles): I seriously can’t believe how close that car came to hitting me today! [scream emoji] Some drivers... #CyclistLife

  40. [40]

    Instructions

    Caption with Personality traits refined (preserving writing style and structural controls): I seriously can’t believe how close that car came to hitting me today! [scream emoji] Some drivers... #CyclistLife Selection: The third step candidate best satisfies the provided personality, writing style and structural controls; returning it as final. [/REASONI...

  41. [41]

    Assertive

    Compare meanings of traits. Map generated traits to ground truth traits based on semantic similarity.
    - Example: "Assertive" can align with "Dominant" if close in meaning.
    - If a ground-truth trait has no similar trait in generated, mark it as missing.
    - If generated has extra traits not related to ground-truth, treat as mismatch.

  42. [42]

    - If meanings align and scores are close, reward higher similarity

    After mapping, compare the intensity scores (0 to 1).
    - If meanings align and scores are close, reward higher similarity.
    - If meanings align but scores differ a lot, penalize slightly.

  43. [43]

    personality_similarity_score

    Produce a final similarity score. Return the score as a single float value between 0.0 and 1.0 (two decimals), with no other text or characters. Output (JSON only): { "personality_similarity_score": float }

    Personality Trait Evaluation Prompt (TE). Figure 27. Personality tone alignment evaluation prompt (S_p).

    SYSTEM PROMPT: You are an expert writing style ann...
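    In the paper, the personality comparison steps of this prompt (map traits by meaning, compare intensity scores, return one score as JSON) are carried out by a language-model judge. As a rough illustration only, the Python sketch below mimics that logic deterministically. The synonym table, the 0.5 intensity weight, and the normalization over ground-truth plus extra traits are all assumptions made for this example and do not come from the paper.

```python
import json

# Hypothetical synonym groups standing in for the semantic matching
# that the language-model judge performs in the paper.
SYNONYMS = [
    {"assertive", "dominant"},
    {"empathetic", "compassionate"},
]


def same_meaning(a: str, b: str) -> bool:
    """Return True if two trait names are equal or share a synonym group."""
    a, b = a.lower(), b.lower()
    return a == b or any(a in group and b in group for group in SYNONYMS)


def personality_similarity(ground_truth: dict, generated: dict,
                           intensity_weight: float = 0.5) -> str:
    """Score generated traits against ground truth; both map trait -> intensity in [0, 1].

    intensity_weight is an assumed value modelling "penalize slightly".
    """
    unused = dict(generated)  # generated traits not yet mapped to a ground-truth trait
    credit = 0.0
    for gt_trait, gt_score in ground_truth.items():
        match = next((g for g in unused if same_meaning(gt_trait, g)), None)
        if match is None:
            continue  # missing trait: earns no credit
        gap = abs(gt_score - unused.pop(match))
        credit += 1.0 - intensity_weight * gap  # close scores keep most of the credit
    # Leftover generated traits count as mismatches and enlarge the denominator.
    slots = len(ground_truth) + len(unused)
    score = credit / slots if slots else 1.0
    return json.dumps({"personality_similarity_score": round(score, 2)})


if __name__ == "__main__":
    print(personality_similarity(
        {"Assertive": 0.8, "Sarcastic": 0.5},
        {"Dominant": 0.6, "Cheerful": 0.4},
    ))
```

    A fixed synonym table cannot capture the open-ended semantic matching the prompt asks for, so this sketch is useful mainly for seeing how missing traits, extra traits, and intensity gaps each pull the score down.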