MILD: Mediator Agent System with Bidirectional Perception and Multi-Layered Alignment for Human-Vehicle Collaboration

Dengbo He; Jiangbo Yu; Jiyao Wang; Luis Miranda-Moreno; Raphael Frank; Sasan Jafarnejad; Xiao Yang; Yubo Jiao; Yunbiao Wang

arxiv: 2605.01507 · v2 · submitted 2026-05-02 · 💻 cs.AI

Pith reviewed 2026-05-12 01:27 UTC · model grok-4.3

keywords human-vehicle collaboration · mediator agent · bidirectional perception · policy optimization · driving automation · constraint alignment · explainable actions

The pith

The MILD mediator agent uses bidirectional perception and constraint-weighted optimization to align vehicle actions with safety rules and driver preferences, turning humans into active managers of partial automation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that partial driving automation raises cognitive load on humans because vehicles ignore driver states while drivers cannot see the vehicle's logic. It introduces the MILD system, an agent architecture with one module that perceives both inside the cabin and the road ahead, plus a strategy module that proposes actions. These proposals are refined by Evidence- and Constraint-weighted Policy Optimization, which applies automatic validators and retrieved traffic rules, speed limits, and preferences to keep outputs compliant and explainable. Experiments on three open datasets show gains in perception accuracy and strategy quality under offline metrics, plus higher human ratings for policy adequacy, comfort, and explanations. If the approach holds, drivers shift from passive supervisors to active managers who receive transparent, aligned suggestions.

Core claim

The central claim is that the MILD agentic system, built from a joint in-cabin and out-of-cabin perception agent and a lightweight strategy agent optimized by Evidence- and Constraint-weighted Policy Optimization with automatic validators and retrieval-augmented constraints, produces driving suggestions that are more accurate, structurally complete, evidence-based, and free of violations than baseline methods, as shown by superior performance on three open datasets and improved human ratings for adequacy, comfort, and explanation.

What carries the argument

The MILD agentic architecture that pairs a bidirectional perception agent with a strategy agent refined through Evidence- and Constraint-weighted Policy Optimization using automatic validators and dynamic retrieval of regulations and preferences.
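
As a rough illustration of how validator outputs might be folded into one training signal, the sketch below scores a candidate suggestion on the four properties the abstract names: accuracy, structural completeness, evidence support, and absence of violations. The field names, the weights, and the linear penalty are assumptions made for this example; the abstract does not state the actual ECPO objective.

```python
from dataclasses import dataclass, field

# Hypothetical required fields of a structured suggestion (not from the paper).
REQUIRED_FIELDS = ("action", "rationale", "evidence")


@dataclass
class Suggestion:
    action: str = ""
    rationale: str = ""
    evidence: list = field(default_factory=list)    # IDs of cited observations or clauses
    violations: list = field(default_factory=list)  # validator severities in [0, 1]


def completeness(s: Suggestion) -> float:
    """Fraction of required fields that are non-empty."""
    filled = sum(bool(getattr(s, name)) for name in REQUIRED_FIELDS)
    return filled / len(REQUIRED_FIELDS)


def evidence_support(s: Suggestion, known_ids: set) -> float:
    """Fraction of cited evidence IDs that exist in the available context."""
    if not s.evidence:
        return 0.0
    return sum(e in known_ids for e in s.evidence) / len(s.evidence)


def weighted_reward(s, known_ids, accuracy,
                    w_acc=1.0, w_struct=0.5, w_evid=0.5, w_viol=2.0):
    """Weighted sum of validator scores minus a severity-scaled violation penalty.

    All weights are illustrative defaults, not values reported by the authors.
    """
    penalty = sum(s.violations)
    return (w_acc * accuracy
            + w_struct * completeness(s)
            + w_evid * evidence_support(s, known_ids)
            - w_viol * penalty)
```

With weights like these, a complete and well-grounded suggestion outranks one that is incomplete, cites evidence that does not exist, or triggers a validator, which is the ordering the optimization step is described as encouraging.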

If this is right

Vehicle suggestions gain structural completeness and evidence backing while remaining free of constraint violations.
Drivers receive transparent, preference-aligned explanations that raise shared situational awareness.
Policy quality improves across auditable metrics for both perception and strategy on standard driving datasets.
Human ratings rise for policy adequacy, comfort, and explanation quality relative to prior methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same validator-plus-retrieval pattern might reduce coordination failures in other human-AI control tasks such as shared robotics.
Real-time operation would require testing whether the lightweight strategy agent maintains speed under live sensor streams.
Updating the constraint database could let the system adapt to new traffic laws without retraining the agents.
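
The last extension can be made concrete with a small sketch of a constraint store whose contents change while the agents stay fixed. The `Clause` and `ConstraintStore` names and the exact-match jurisdiction filter are hypothetical; the paper describes a dense retriever, which this example deliberately replaces with a simple lookup to keep the idea visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    clause_id: str     # internal clause ID, kept for provenance logging
    jurisdiction: str
    source: str        # e.g. "traffic_law" or "driver_preference"
    text: str


class ConstraintStore:
    """In-memory stand-in for the retrieval corpus (illustrative only)."""

    def __init__(self):
        self._clauses = {}

    def upsert(self, clause: Clause) -> None:
        """Add a clause, or replace the existing clause with the same ID."""
        self._clauses[clause.clause_id] = clause

    def retrieve(self, jurisdiction: str) -> list:
        """Return clauses for one jurisdiction, sorted by ID for stable logs."""
        hits = [c for c in self._clauses.values() if c.jurisdiction == jurisdiction]
        return sorted(hits, key=lambda c: c.clause_id)


store = ConstraintStore()
store.upsert(Clause("MA-001", "MA", "traffic_law", "Urban limit 30 mph"))
# A later law change only touches the store; no model weights are involved.
store.upsert(Clause("MA-001", "MA", "traffic_law", "Urban limit 25 mph"))
```

Whether swapping clauses this way is safe in practice depends on the strategy agent handling constraints it never saw in training, which is exactly the open question this extension raises.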

Load-bearing premise

Automatic validators inside the optimization step can catch and block every constraint violation and value misalignment that arises in real driving without missing edge cases or creating new biases.

What would settle it

A recorded driving sequence in which the MILD system outputs a suggested action that violates a traffic rule, safety constraint, or stated driver preference even after the validators and retrieval module have run.
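
A minimal sketch of such a check, assuming each logged suggestion records a clip ID, a suggested speed, the posted limit, and optionally a driver-stated maximum. These field names are hypothetical, and speed is only one of many constraint types that a real audit would cover.

```python
def find_counterexamples(log):
    """Return clip IDs where the suggested speed exceeds the posted limit
    or the driver's stated maximum, whichever is lower."""
    bad = []
    for entry in log:
        # A missing driver preference imposes no additional cap.
        cap = min(entry["speed_limit_kph"],
                  entry.get("driver_max_kph", float("inf")))
        if entry["suggested_speed_kph"] > cap:
            bad.append(entry["clip_id"])
    return bad
```

A single non-empty result on post-validator outputs would be enough to refute the "free from constraint violations" claim for that constraint type.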

Original abstract

Prior studies report that partial driving automation can increase the cognitive demands on human drivers. This effect largely arises from human drivers' lack of transparent insight into the vehicle's intentions and decision logic, as well as from automated systems' limited awareness of the driver's dynamic state and preferences. This bidirectional misalignment undermines shared situational awareness and exacerbates coordination failures in human-vehicle interaction. To address these limitations, we argue for a paradigm shift that elevates the human role from passive supervisor to active manager. We introduce the Mediator-in-the-Loop-Driving (MILD) system, based on an agentic system architecture to facilitate synergistic human-vehicle collaboration. MILD integrates a perception agent for joint in-cabin and out-of-cabin understanding with a lightweight strategy agent that generates compliant and explainable action suggestions. To ensure these strategies are strictly aligned with safety regulations and human values, we develop Evidence- and Constraint-weighted Policy Optimization (ECPO). ECPO leverages automatic validators to steer the agent toward behaviors that are not only accurate but also structurally complete, substantiated by evidence, and free from constraint violations. Furthermore, a retrieval-augmented generation module dynamically incorporates constraints from traffic regulations, speed recommendations, and driver preferences into the decision loop. Field experiments across three open datasets demonstrate that MILD consistently outperforms baselines in both perception accuracy and strategy quality under auditable offline metrics, and yields higher human-rated policy adequacy, comfort, and explanation than baselines. This work offers a practical pathway for building auditable and aligned agents for human-vehicle collaborative driving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MILD introduces a mediator agent architecture with ECPO optimization to fix bidirectional misalignment in partial driving automation, but the performance claims rest on experiments whose details are too thin to assess.

The paper's main move is to treat the human as an active manager rather than a passive supervisor and build a two-agent system around that. A perception agent fuses cabin and road views, while a lightweight strategy agent produces action suggestions. ECPO then tunes the strategy agent by mixing automatic validators, evidence weighting, and retrieval of rules plus driver preferences. The abstract reports gains on three open datasets in both offline metrics and human ratings for adequacy, comfort, and explanation.

That framing of the cognitive-load problem is clear and the architecture is a reasonable way to operationalize alignment constraints without relying solely on model prompting. The integration of RAG for dynamic constraints and the validator loop in ECPO are the parts that feel like actual engineering work rather than another generic agent wrapper.

The soft spots sit in the results. The outperformance claims are stated without numbers, baseline descriptions, statistical tests, or any account of how the validators were checked for coverage. The stress-test point about missing edge cases (sensor noise, conflicting preferences, rare rule interactions) lands because nothing in the write-up shows recall on held-out violation types or stress tests. Without those, it is hard to know whether the validators are doing the heavy lifting or just avoiding common failures.

This is for applied researchers working on human-AI collaboration in vehicles or safety-critical agents. Someone already thinking about mediator-style systems or constraint-weighted RL could pull useful pieces from the design, but anyone planning to cite or extend the results will need the full experimental section first.

I would send it for peer review. The problem is real, the architecture is specified enough to review, and referees can push for the missing metrics and validator details.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Mediator-in-the-Loop-Driving (MILD) system, an agentic architecture with a perception agent for bidirectional in-cabin/out-of-cabin understanding and a lightweight strategy agent. It develops Evidence- and Constraint-weighted Policy Optimization (ECPO) that uses automatic validators plus retrieval-augmented generation to enforce alignment with safety regulations, speed limits, and driver preferences. The central empirical claim is that field experiments across three open datasets show MILD consistently outperforming baselines in perception accuracy and strategy quality under offline metrics, while also achieving higher human ratings for policy adequacy, comfort, and explanation.

Significance. If the reported outperformance and human ratings hold under rigorous validation, the work could meaningfully advance human-vehicle collaboration by shifting from passive supervision to active management via aligned, explainable agents. The bidirectional perception and multi-layered alignment approach directly targets documented issues in partial automation, and the ECPO framework offers a concrete mechanism for constraint-aware policy generation that may generalize beyond driving.

major comments (2)

[Abstract] Abstract: The central claim that 'MILD consistently outperforms baselines in both perception accuracy and strategy quality' and yields higher human ratings rests entirely on field experiments across three open datasets. No quantitative results, baseline descriptions, metric definitions, statistical tests, or dataset details are provided, making it impossible to assess whether the outperformance is substantive or merely stated.
[Abstract] Abstract (ECPO description): The claim that ECPO steers agents toward behaviors 'free from constraint violations' via automatic validators is load-bearing for all alignment and safety assertions. The manuscript supplies no enumeration of validator rules, no coverage or recall metrics on held-out violation types, and no stress tests for edge cases such as conflicting multi-driver preferences, sensor noise, or rare regulatory interactions. This directly matches the stress-test concern and leaves the completeness of the validators unproven.
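
To make the requested coverage evidence concrete, the sketch below computes validator recall per violation type on held-out cases that are known to contain a violation. The input format and the type labels are assumptions for illustration, not the authors' evaluation protocol.

```python
from collections import defaultdict


def recall_by_type(cases):
    """Per-type recall of a validator suite.

    cases: iterable of (violation_type, caught) pairs, one per held-out
    example that is known to contain a violation of that type.
    """
    total = defaultdict(int)
    caught = defaultdict(int)
    for vtype, was_caught in cases:
        total[vtype] += 1
        caught[vtype] += int(was_caught)
    return {t: caught[t] / total[t] for t in total}
```

Reporting this table per type, rather than one pooled number, would show whether rare categories such as conflicting preferences are caught as reliably as common ones.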

minor comments (1)

[Abstract] Abstract: The acronym 'MILD' and the phrase 'Mediator-in-the-Loop-Driving' are introduced without a brief definition or reference to prior mediator concepts in human-AI collaboration literature, which could improve immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions have been made to the manuscript.

Referee: [Abstract] Abstract: The central claim that 'MILD consistently outperforms baselines in both perception accuracy and strategy quality' and yields higher human ratings rests entirely on field experiments across three open datasets. No quantitative results, baseline descriptions, metric definitions, statistical tests, or dataset details are provided, making it impossible to assess whether the outperformance is substantive or merely stated.

Authors: We agree that the abstract would benefit from greater specificity to allow readers to immediately evaluate the strength of the claims. The full manuscript details the three open datasets, the baselines (including LLM-based and rule-based agents), the offline metrics for perception accuracy and strategy quality, the human evaluation criteria, and the statistical tests in the Experiments section. We have revised the abstract to incorporate key quantitative highlights from those results, such as relative performance gains, while respecting length constraints. This makes the central claims more transparent without altering the manuscript's core contributions.
Revision: yes

Referee: [Abstract] Abstract (ECPO description): The claim that ECPO steers agents toward behaviors 'free from constraint violations' via automatic validators is load-bearing for all alignment and safety assertions. The manuscript supplies no enumeration of validator rules, no coverage or recall metrics on held-out violation types, and no stress tests for edge cases such as conflicting multi-driver preferences, sensor noise, or rare regulatory interactions. This directly matches the stress-test concern and leaves the completeness of the validators unproven.

Authors: We acknowledge that while Section 3 describes the ECPO framework, the integration of automatic validators, and the retrieval-augmented generation for incorporating constraints, the manuscript does not provide an explicit enumeration of validator rules or supporting empirical metrics. We have added this information in the revised manuscript, including a detailed list of validator rules, coverage and recall metrics on held-out violation types, and stress-test results for edge cases such as conflicting preferences, sensor noise, and rare regulatory interactions. These additions directly address the completeness concern and strengthen the safety and alignment claims.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external open datasets and human ratings

The paper introduces the MILD architecture (perception agent + strategy agent + ECPO with validators and RAG) as a design contribution and grounds its strongest claims in field experiments on three independent open datasets plus human-rated metrics for policy adequacy, comfort, and explanation. No equations, derivations, or fitted parameters are presented that reduce any 'prediction' to the inputs by construction. No self-citations appear in the provided text, and the validation chain relies on external benchmarks rather than internal self-reference or renaming. The system is self-referential by design (as any agent architecture is), but this does not create circularity in the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Abstract-only review yields limited visibility into exact parameters or axioms; inferred components rest on domain assumptions about validator reliability and human preference modeling.

axioms (2)

[ad hoc to paper] Automatic validators can steer agents toward behaviors that are accurate, structurally complete, substantiated by evidence, and free from constraint violations.
Core premise of ECPO described in abstract.
[domain assumption] Retrieval-augmented generation can dynamically incorporate traffic regulations and driver preferences without introducing new violations.
Stated as part of the decision loop.

invented entities (2)

Mediator-in-the-Loop-Driving (MILD) system [no independent evidence]
purpose: Facilitate synergistic human-vehicle collaboration via bidirectional perception and aligned strategy generation
New named architecture introduced in the paper.
Evidence- and Constraint-weighted Policy Optimization (ECPO) [no independent evidence]
purpose: Optimize strategies to be compliant with safety regulations and human values using automatic validators
Custom optimization method presented as novel.

pith-pipeline@v0.9.0 · 5609 in / 1412 out tokens · 64417 ms · 2026-05-12T01:27:03.030348+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1]

age is stored with provenance metadata (e.g., source category, jurisdiction/market, vehicle configuration, and an internal clause ID), enabling policies to log which constraints were consulted. Specifically, given 𝑢=(𝑧,𝑝&'(,𝑝()*,𝑘), we form a retrieval query using jurisdiction and operating mode (from 𝑝()*), driver sensitivities/preferences (from 𝑝&'(), a...

work page 2025
[2]

All baselines, including GPT-5, receive identical schema definitions, profile/constraint fields, and evidence citation instructions to ensure fair comparison

and Llama-3.2-3B (Grattafiori et al., 2024). All baselines, including GPT-5, receive identical schema definitions, profile/constraint fields, and evidence citation instructions to ensure fair comparison. The RAG module is implemented as a frozen external retriever of MILD. We use BAAI/bge-base-en-v1.5 as the dense bi-encoder retriever. For training, we fr...

work page 2024
[3]

Inference of agents uses temperature=0.7, top-p=0.8, and max new tokens=5096. At runtime, the perception agent operates on overlapping temporal clips rather than frame-wise inputs. Each refreshed clip is converted into one joint structured summary, which then triggers one policy refresh by the strategy agent. In our implementation, the perception agent us...

work page 2020
[4]

Objective scores are ECPO-based metrics

Ablation of joint in-/out-of-cabin perception on strategy quality on the AIDE dataset. Objective scores are ECPO-based metrics. Human ratings are collected from twenty drivers on a 5-point Likert scale (higher is better). Scenario Variant Valid% ↑ ViolSev ↓ LowCtrl% ↓ HazF1 ↑ HAS ↑ Adequacy ↑ Comfort ↑ Explanation ↑ Driver-critical w/o 𝑥#$ 99.90 ± 0.15 0.42 ± 0.1...

work page 2025
[5]

Self-driving cars: A survey,

The example output policy of zero-shot GPT-5 and MILD(Qwen) in different scenarios from AIDE. The key differences in policy are marked in red. more modular strategy layer. This trade-off is also reflected in Section 4.6.1, where stronger perception mainly improves grounding-related measures rather than the basic role-boundary and safety behavior of the st...

work page doi:10.1016/j.eswa.2020.113816 2026
[6]

Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., & Li, H

arXiv Preprint arXiv:2310.03026. Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., & Li, H. (2024). Drivelm: Driving with graph visual question answering. European Conference on Computer Vision, 256–274. Standing Committee of the National People’s Congress of the People’s Republic of China. (n.d.). Law of t...

work page doi:10.1016/j.trf.2018.11.006 2024
[7]

https://doi.org/10.1057/s41599-022-01463-3 The General Court of the Commonwealth of Massachusetts,

work page doi:10.1057/s41599-022-01463-3
[8]

Massachusetts general laws, part i, title xiv, chapter 90: Motor vehicles and aircraft

(n.d.). Massachusetts general laws, part i, title xiv, chapter 90: Motor vehicles and aircraft. Retrieved January 14, 2026, from https://malegislature.gov/Laws/GeneralLaws/PartI/TitleXIV/Chapter90 Van Der Laan, J. D., Heino, A., & De Waard, D. (1997). A simple procedure for the assessment of acceptance of advanced transport telematics. Transportation Rese...

work page doi:10.1016/s0968-090x(96)00025-3 2026
[9]

His research interests include physiological signal measurement, intelligent transport systems, and human factors

Currently, he is a postdoctoral researcher at McGill University, Canada. His research interests include physiological signal measurement, intelligent transport systems, and human factors. Yunbiao Wang received his Bachelor’s and Master’s degrees from Southwest Jiaotong University. He is currently a Ph.D. candidate at McGill University. His research focuse...

work page 2016
[10]

He is also affiliated with the Department of Civil and Environmental Engineering, HKUST, Hong Kong SAR

He is currently an assistant professor at the Intelligent Transportation Thrust and Robotics and Autonomous Systems Thrust, the HKUST(Guangzhou). He is also affiliated with the Department of Civil and Environmental Engineering, HKUST, Hong Kong SAR. From 2020 to 2021, he was a post-doctoral fellow at the University of Toronto. Sasan Jafarnejad obtained his...

work page 2020
