Assessing the Feasibility of a Video-Based Conversational Chatbot Survey for Measuring Perceived Cycling Safety: A Pilot Study in New York City

Feiyang Ren; Takahiro Yabe; Tamir Mendel; Zhaoxi Zhang

arxiv: 2604.07375 · v1 · submitted 2026-04-07 · 💻 cs.CY · cs.HC


Pith reviewed 2026-05-10 17:56 UTC · model grok-4.3


keywords: cycling safety, video-based survey, conversational AI, perceived safety, transport planning, LLM chatbot, New York City, pilot study


The pith

Video-based conversational AI chatbots can feasibly collect in-the-moment perceptions of cycling safety.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops and tests a method that pairs short videos of city streets with a conversational AI chatbot to gather immediate feedback on how safe those streets feel for biking. Traditional surveys depend on people's memory of past experiences, which can be incomplete or biased. By having participants watch videos and chat with the AI about their impressions right then, the approach aims to capture more accurate and detailed perceptions along with the specific reasons for them. A small pilot with sixteen people in New York City produced positive ratings for the chatbot's usability and allowed extraction of key safety factors through language analysis and clustering. If this works at scale, planners could get better data to design streets that encourage more people to bike.

Core claim

The study shows that a video-based conversational chatbot survey is feasible for measuring perceived cycling safety, as demonstrated by structured interactions with sixteen participants across nine New York City street segments, positive user experience and usability ratings, and the successful application of natural language processing to extract built-environment attributes, cluster reasons and suggestions, and regress safety outcomes against environmental and demographic variables.

What carries the argument

A chatbot built on a modular LLM architecture that integrates prompt engineering, state management, and rule-based control. Together these structure the human-AI conversation so that it captures safety perceptions, and the reasons behind them, while participants view each video.
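To make the state-management and rule-based-control idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the stage names, prompt wording, and the `step` function are all hypothetical, and the LLM call is deliberately left out so that only the rule-driven flow is shown.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    """Hypothetical survey stages for one video segment."""
    ASK_RATING = auto()
    ASK_REASON = auto()
    ASK_SUGGESTION = auto()
    DONE = auto()


@dataclass
class SegmentState:
    """State kept outside the LLM so the question order cannot drift."""
    stage: Stage = Stage.ASK_RATING
    answers: dict = field(default_factory=dict)


# Fixed prompts (assumed wording, not taken from the paper).
PROMPTS = {
    Stage.ASK_RATING: "Did this street feel safe or unsafe for cycling?",
    Stage.ASK_REASON: "What in the video made it feel that way?",
    Stage.ASK_SUGGESTION: "What would you change to make it feel safer?",
}


def step(state: SegmentState, user_text: str) -> str:
    """Record one reply, advance the stage by rule, return the next message."""
    text = user_text.strip()
    if state.stage is Stage.ASK_RATING:
        words = [w.strip(".,!?") for w in text.lower().split()]
        # Rule-based check: re-ask until the reply contains a usable label.
        if "unsafe" in words:
            state.answers["rating"] = "unsafe"
        elif "safe" in words:
            state.answers["rating"] = "safe"
        else:
            return "Please answer with 'safe' or 'unsafe'."
        state.stage = Stage.ASK_REASON
    elif state.stage is Stage.ASK_REASON:
        state.answers["reason"] = text
        state.stage = Stage.ASK_SUGGESTION
    elif state.stage is Stage.ASK_SUGGESTION:
        state.answers["suggestion"] = text
        state.stage = Stage.DONE
    # An LLM could rephrase the fixed prompt here; that call is omitted.
    return PROMPTS.get(state.stage, "Thanks, moving on to the next video.")
```

Keeping the stage transitions in plain code rather than in the prompt is one way a design like this could limit how far the model steers the conversation, which is the concern raised later in the referee report.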

If this is right

Built-environment attributes linked to safety can be extracted directly from open-ended responses using keyword extraction tools.
Semantic clustering of responses identifies recurring reasons for safety perceptions and user suggestions for improvements.
Regression models can quantify the influence of street features and rider demographics on perceived safety scores.
The approach enables collection of data on future visions for transport planning in addition to current perceptions.
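As a rough illustration of the regression step, the sketch below fits ordinary least squares with a single binary predictor. The rows are invented for illustration and are not the paper's data, the variable names are hypothetical, and the paper's actual model specification may differ.

```python
from statistics import mean

# Hypothetical ratings: (has_protected_lane, felt_safe), both coded 0/1.
# These rows are invented for illustration and are not the paper's data.
ratings = [
    (1, 1), (1, 1), (1, 1), (1, 0),
    (0, 1), (0, 0), (0, 0), (0, 0),
]


def ols_slope_intercept(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


xs = [x for x, _ in ratings]
ys = [y for _, y in ratings]
slope, intercept = ols_slope_intercept(xs, ys)
print(f"share rated safe without lane: {intercept:.2f}")
print(f"change with protected lane: {slope:+.2f}")
```

With a 0/1 predictor the slope is simply the difference between the two group means. A real analysis of this kind of data would also need to account for the same participant rating several streets, which this sketch ignores.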

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be adapted to study perceptions for other transport modes such as walking or public transit.
It may complement traditional surveys by providing richer, context-specific data that reduces reliance on long-term recall.
Scaling the chatbot to larger participant groups could support city-wide infrastructure decisions based on aggregated perceptual maps.

Load-bearing premise

The chatbot interactions after watching selected videos produce unbiased, in-the-moment perceptions of cycling safety without meaningful influence from the AI's design or the particular video clips chosen.

What would settle it

A direct comparison in which the same participants rate the same streets twice: once through the chatbot after watching the videos, and once immediately after cycling the streets in person. Large differences in reported safety levels or reasons would undermine the method.
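One way such a comparison could be scored is chance-corrected agreement between the two sets of labels. The sketch below computes Cohen's kappa in plain Python; the paired labels are hypothetical, and kappa is only one of several reasonable agreement measures.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if the two sets of labels were independent.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists use one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for the same participant-street pairs.
video_chatbot = ["safe", "safe", "unsafe", "unsafe",
                 "safe", "unsafe", "safe", "unsafe"]
in_person = ["safe", "unsafe", "unsafe", "unsafe",
             "safe", "unsafe", "safe", "safe"]

kappa = cohens_kappa(video_chatbot, in_person)
print(f"kappa = {kappa:.2f}")
```

Here six of eight pairs agree (0.75 observed) against 0.50 expected by chance, so kappa is 0.50. Values near zero in a real study would suggest the video-plus-chatbot ratings track in-person perception no better than chance.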

Figures

Figures reproduced from arXiv: 2604.07375 by Feiyang Ren, Takahiro Yabe, Tamir Mendel, Zhaoxi Zhang.

**Figure 2.** Chatbot design and user interface. Participants viewed short first-person cycling videos recorded ...

**Figure 3.** Counts of "safe" and "unsafe" across the nine selected street segments. For each street segment, ...

**Figure 4.** Comparison of "safe" (left) and "unsafe" (right) feature ratios across four bike lane types (no lane, ...

**Figure 5.** The left panel presents the distribution of semantic differential ratings measuring User Experience ...

Original abstract

Bicycle safety is important for bikeability and transportation efficiency. However, conventional surveys often fall short in capturing how people actually perceive cycling environments because they rely heavily on respondents' recall rather than in-the-moment experience. By leveraging large language models (LLMs), this study proposes a new method of combining video-based surveys with a conversational AI chatbot to collect human perceptions of cycling safety and the reasons behind these perceptions. The paper developed the AI chatbot using a modular LLM architecture, integrating prompt engineering, state management, and rule-based control to support the structure of human-AI interaction. This paper evaluates the feasibility of the proposed video-based conversational chatbot using complete responses from sixteen participants to the pilot survey across nine street segments in New York City. The method feasibility was assessed using a seven-point scale rating for user experience (i.e., ease of use, supportiveness, efficiency) and a five-point scale for chatbot usability (i.e., personality, roboticness, friendliness), yielding positive results with mean scores of 5.00 out of 7 (standard deviation = 1.6) and 3.47 out of 5 (standard deviation = 0.43), respectively. The data feasibility was assessed using multiple techniques: (1) Natural language processing (NLP), such as KeyBERT, for overall safety and feature analysis to extract built-environment attributes; (2) K-means clustering for semantic analysis to identify reasons and suggestions; and (3) regression to estimate the effects of built-environment and demographic variables on perceived safety outcomes. The results show the potential of AI chatbots as a novel approach to collecting data on human perception, behavior, and future visions for transport planning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Small pilot of a video-plus-chatbot survey for cycling safety gets decent UX feedback but offers thin evidence that the method actually yields unbiased perceptions.


The main thing here is a pilot testing whether a video-triggered conversational LLM can replace or supplement traditional recall-based surveys on bike safety. They built a modular chatbot with prompt engineering and state management, showed it to 16 people across nine NYC street segments, and report mean UX scores around 5/7 with usability at 3.47/5. They then ran KeyBERT for feature extraction, K-means on reasons, and some regression on perceived safety.

That combination of video stimulus and live dialogue is the concrete new piece; prior work has done video surveys or chatbots separately, but not this exact pairing for transport perceptions. The implementation details on the architecture are clear enough to replicate, and the NLP steps are standard and reproducible.

The soft spot is the sample. With N=16 and SD of 1.6 on the main rating, any clustering or regression results are noisy and sensitive to the specific prompts or video choices. There is no control arm against a plain video survey or a paper questionnaire, so we cannot tell whether the chatbot is pulling out genuine in-the-moment views or steering responses through its own framing. The feasibility claim for data quality therefore stays preliminary.

This is the kind of methods note that belongs in a transportation or HCI venue rather than a top methods journal. Readers working on urban data collection tools would get some practical ideas from the architecture and the participant comments. It deserves peer review because the idea is straightforward to test further and the authors are transparent about the pilot scale, but any referee will rightly press for a larger sample and at least one comparison condition before the method can be recommended for planning use.

Referee Report

3 major / 2 minor

Summary. The paper proposes and pilots a video-based conversational chatbot survey using a modular LLM architecture (with prompt engineering, state management, and rule-based control) to capture in-the-moment perceptions of cycling safety. Feasibility is assessed via UX ratings (mean 5/7, SD=1.6) and usability scores (mean 3.47/5, SD=0.43) from N=16 participants across nine NYC street segments, followed by KeyBERT extraction of built-environment attributes, K-means clustering of reasons/suggestions, and regression on perceived safety.

Significance. If the central claim holds after addressing validation gaps, the work demonstrates a promising direction for richer, context-aware data collection in transportation planning that goes beyond recall-based surveys. The modular chatbot design is a concrete implementation strength that could be extended, though the pilot scale keeps immediate impact modest.

major comments (3)

[Abstract / Results] Abstract and results on data feasibility: The K-means clustering, KeyBERT analysis, and regression to estimate effects of built-environment and demographic variables on perceived safety are performed on only 16 responses; with high variability (SD=1.6 on the 7-point UX scale), these analyses have low power and are sensitive to outliers or design choices, weakening the claim that the method feasibly yields reliable perceptual data.
[Methods] Methods section on chatbot implementation: No control arm, non-chatbot survey comparison, or ablation of the prompt/state/rule-based components is reported, so it is impossible to rule out that extracted attributes, clusters, or regression coefficients reflect AI dialogue steering rather than unbiased participant perceptions of cycling safety.
[Discussion] Discussion or limitations: The feasibility conclusion rests on self-reported UX without external validation against established cycling safety instruments or in-the-moment measures (e.g., think-aloud protocols), leaving open whether the positive scores (5/7 and 3.47/5) indicate genuine data quality or simply acceptable interaction.
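The sample-size concern in the first major comment can be made concrete with a quick confidence-interval calculation from the reported summary statistics. This is a sketch under two assumptions not confirmed by the source: that each reported standard deviation is a between-participant value over 16 independent observations, and that a Student's t interval is appropriate for these ratings.

```python
import math

# Two-sided 95% critical value of Student's t with 15 degrees of freedom.
T_CRIT_DF15 = 2.131


def mean_ci(mean, sd, n, t_crit):
    """Confidence interval for a mean, computed from summary statistics."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Means and SDs as reported in the abstract; n = 16 is assumed to be the
# number of independent observations behind each standard deviation.
ux_low, ux_high = mean_ci(5.00, 1.6, 16, T_CRIT_DF15)
us_low, us_high = mean_ci(3.47, 0.43, 16, T_CRIT_DF15)
print(f"UX (7-point): {ux_low:.2f} to {ux_high:.2f}")
print(f"Usability (5-point): {us_low:.2f} to {us_high:.2f}")
```

Under those assumptions the UX mean of 5.00 carries an interval of roughly 4.15 to 5.85, close to a fifth of the 7-point scale's range, while the usability interval is narrower at roughly 3.24 to 3.70.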

minor comments (2)

[Abstract] The abstract reports SD=0.43 for usability but does not specify the scale anchors or provide item-level breakdowns; adding these would improve interpretability of the 3.47/5 mean.
[Methods] Clarify in the methods how the nine street segments and associated videos were selected and whether they represent a range of safety conditions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and limitations of this pilot study. We address each major point below and will make targeted revisions to better frame the exploratory nature of the work.


Referee: [Abstract / Results] Abstract and results on data feasibility: The K-means clustering, KeyBERT analysis, and regression to estimate effects of built-environment and demographic variables on perceived safety are performed on only 16 responses; with high variability (SD=1.6 on the 7-point UX scale), these analyses have low power and are sensitive to outliers or design choices, weakening the claim that the method feasibly yields reliable perceptual data.

Authors: We agree that N=16 and the observed variability limit the power and robustness of the secondary analyses. As this is a pilot study, the primary aim was to assess technical and user-experience feasibility of the chatbot approach; the KeyBERT, clustering, and regression results are intended as illustrative examples of extractable data rather than definitive inferences. In revision we will reframe the abstract and results to emphasize the exploratory character of these analyses and add explicit discussion of sample-size limitations and sensitivity to the limitations section. revision: partial
Referee: [Methods] Methods section on chatbot implementation: No control arm, non-chatbot survey comparison, or ablation of the prompt/state/rule-based components is reported, so it is impossible to rule out that extracted attributes, clusters, or regression coefficients reflect AI dialogue steering rather than unbiased participant perceptions of cycling safety.

Authors: The lack of a control arm or component ablation is a genuine limitation of the current pilot, which focused on demonstrating a working modular implementation rather than comparative validation. The rule-based control layer was introduced precisely to constrain dialogue flow and reduce steering, yet without a non-chatbot baseline we cannot empirically isolate its effect. We will expand the methods and limitations sections to describe these design choices more fully and to state clearly that future work must include controlled comparisons to assess potential AI influence on the collected perceptions. revision: partial
Referee: [Discussion] Discussion or limitations: The feasibility conclusion rests on self-reported UX without external validation against established cycling safety instruments or in-the-moment measures (e.g., think-aloud protocols), leaving open whether the positive scores (5/7 and 3.47/5) indicate genuine data quality or simply acceptable interaction.

Authors: Self-reported UX is the standard initial metric for feasibility pilots, but we recognize it does not substitute for external validation of data quality. We will revise the discussion and limitations sections to acknowledge this gap explicitly, note that positive UX scores demonstrate acceptable interaction but do not yet confirm perceptual accuracy, and outline plans for future validation against established instruments or think-aloud protocols. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical pilot with direct data collection


The paper presents a pilot study that develops a modular LLM chatbot, collects responses from N=16 participants on video-based cycling safety perceptions, and analyzes them via standard off-the-shelf techniques (KeyBERT for attribute extraction, K-means for clustering reasons, and regression for variable effects). Feasibility is assessed through direct self-reported UX and usability scales with no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations or uniqueness theorems. The central claim that the method shows potential for collecting perception data follows from the observed participant scores and extracted patterns rather than reducing to the input design by construction. This is a standard empirical feasibility assessment with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on assumptions about the validity of the survey method and analysis techniques rather than new physical or mathematical entities.

axioms (2)

domain assumption: The seven-point and five-point scales accurately measure user experience and usability.
Standard Likert scales are assumed valid for this context.
domain assumption: NLP methods like KeyBERT and K-means can reliably extract safety-related features and reasons from chatbot responses.
Assumes these tools are suitable for qualitative text analysis in this domain.


