pith. sign in

arxiv: 2606.17423 · v1 · pith:3A2N7B4Pnew · submitted 2026-06-16 · 💱 q-fin.CP · stat.ML

Martingale Doppelg\"anger-Eval: An Identification Framework for Auditing Candlestick Understanding in Vision-Language Models

Pith reviewed 2026-06-26 22:03 UTC · model grok-4.3

classification 💱 q-fin.CP stat.ML
keywords vision-language modelscandlestick chartsidentification problembenchmarkingtrend shortcuttingmarket datavisual evidence
0
0 comments X

The pith

No observational score can distinguish vision-language models that read candlesticks from those that shortcut to past trends

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that on real market histories chart evidence and trends are too strongly coupled for any score based on observed chart-label pairs to tell whether a model is actually using the visual patterns or simply following recent price movements. It introduces a benchmark that generates controlled synthetic markets to break this coupling and isolate the two behaviors. Testing across commercial and open models shows large positive weight on past trend but zero or opposite weight on the actual candlestick evidence. This matters because standard chart benchmarks cannot detect the shortcut and therefore cannot certify whether a model performs technical analysis or merely extrapolates.

Core claim

No evaluation functional computed from observational chart-label data can distinguish a grounded responder from a trend-shortcut responder under strong coupling, whereas matched evidence interventions separate the same responders at an exponential rate and trend-label swaps provide an independent shortcut stress test. The benchmark evaluates frozen VLMs on rendered OHLCV charts under a martingale-null market, injected-alpha counterfactual pairs, trend-confounder swaps, and regime shifts, revealing that models assign large positive coefficients to past trend while evidence coefficients are zero or opposite the rule-implied sign.

What carries the argument

Martingale Doppelgänger-Eval benchmark, which applies four controlled mechanisms (martingale-null market, injected-alpha pairs, trend swaps, regime shifts) to rendered charts together with a structural behavioral model that decomposes responses into trend sensitivity, evidence sensitivity, and evidence faithfulness.

If this is right

  • Models exhibit large positive coefficients on past trend but zero or opposite coefficients on evidence.
  • Matched-pair analyses show models either ignore injected candlestick semantics or move opposite the rule-implied direction.
  • Standard observational chart benchmarks cannot detect this failure mode.
  • The accompanying statistical toolkit supplies minimum detectable effects and block-aware sequential testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same controlled-intervention design could be reused to audit visual understanding in other time-series domains such as medical waveforms or sensor plots.
  • Training pipelines may require explicit decoupling of visual features from temporal autocorrelations to reduce shortcut reliance.
  • Auditors could apply the four-mechanism template to test any AI system that makes decisions from chart-like imagery.

Load-bearing premise

Chart evidence and trend remain strongly coupled in real market histories such that no observational functional can separate grounded versus shortcut responders.

What would settle it

A model that responds to injected-alpha counterfactual pairs in the rule-implied direction while showing zero trend sensitivity in the same controlled settings would falsify the observed failure mode.

Figures

Figures reproduced from arXiv: 2606.17423 by Ziyao Wang.

Figure 1
Figure 1. Figure 1: Shadow-market label mechanisms. The visible chart grammar is held fixed while the future-label mechanism [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Core diagnostics. Left: M0 trend-bias index and null overconfidence. Right: M2 AUC under aligned, balanced, [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Reliability and selective risk on M1 pairs; confidence is often anti-informative. Evidence faithfulness is weak [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
read the original abstract

We introduce Martingale Doppelg\"anger-Eval, a public shadow-market benchmark for auditing whether vision-language models (VLMs) use candlestick evidence rather than extrapolate past trends. The central difficulty is identification: on real market histories, chart evidence and trend are strongly coupled, so an observational score cannot determine whether a fluent technical-analysis narrative is grounded in local visual evidence. We prove this limitation formally: no evaluation functional computed from observational chart--label data can distinguish a grounded responder from a trend-shortcut responder under strong coupling, whereas matched evidence interventions separate the same responders at an exponential rate and trend--label swaps provide an independent shortcut stress test. The benchmark therefore evaluates frozen VLMs on rendered OHLCV charts under four controlled mechanisms: a martingale-null market, injected-alpha counterfactual pairs, trend-confounder swaps, and regime shifts. A structural behavioral model identifies null-market bias, trend sensitivity, evidence sensitivity, prompt/renderer fragility, and evidence faithfulness; the accompanying statistical toolkit provides minimum detectable effects, block-aware sequential testing for metered APIs, and an overlap-weighted artifact check. Across frozen commercial and open VLMs, the identified regression assigns large positive coefficients to past trend but evidence coefficients that are zero or opposite to the rule-implied sign. Matched-pair analyses show that models either ignore injected candlestick semantics or move opposite to the rule-implied direction conditional on responding. The benchmark isolates a failure mode that standard observational chart benchmarks cannot detect and gives a reusable audit template for time-series imagery with controllable label mechanisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Martingale Doppelgänger-Eval, a shadow-market benchmark and identification framework for auditing whether vision-language models (VLMs) ground responses in local candlestick evidence or rely on trend extrapolation when processing rendered OHLCV charts. It formally proves that no observational evaluation functional can distinguish grounded from trend-shortcut responders under strong coupling of chart evidence and trend, shows that matched evidence interventions and trend-label swaps achieve separation at an exponential rate, and applies the benchmark (via martingale-null, injected-alpha pairs, trend swaps, and regime shifts) to frozen commercial and open VLMs. The structural behavioral model and statistical toolkit identify trend sensitivity as dominant, with evidence coefficients zero or opposite to rule-implied signs.

Significance. If the identification result and intervention validity hold, the work supplies a reusable audit template for detecting shortcut learning in VLM analysis of time-series imagery, directly relevant to financial applications where observational benchmarks are confounded. The formal observational impossibility result, the four controlled mechanisms, and the overlap-weighted artifact check are notable strengths that go beyond standard empirical testing.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (formal result): the claimed impossibility that 'no evaluation functional computed from observational chart–label data can distinguish a grounded responder from a trend-shortcut responder under strong coupling' is load-bearing for the entire framework; the manuscript must state the precise theorem, including the exact definition of strong coupling, the class of allowed functionals, and the data-generating process assumptions, because any relaxation of the coupling condition would invalidate the separation claim for the interventions.
  2. [§4, §5] §4 (intervention construction) and §5 (empirical results): the assertion that matched evidence interventions separate responders 'at an exponential rate' and that trend-label swaps provide an independent stress test rests on the four mechanisms being realizable in rendered OHLCV charts without introducing unmodeled visual confounders (e.g., inconsistent OHLC relationships or renderer artifacts). The overlap-weighted artifact check is mentioned but the manuscript must report quantitative diagnostics showing that the joint distribution of low-level candlestick features is preserved up to the controlled alpha or trend component; otherwise the observed zero or opposite evidence coefficients could reflect artifact sensitivity rather than the claimed identification failure.
  3. [§5] §5 (regression results): the structural behavioral model identifies 'evidence sensitivity' via regression coefficients, but the reported finding that evidence coefficients are 'zero or opposite to the rule-implied sign' requires explicit reporting of the minimum detectable effect, the exact specification of the regression (including how prompt/renderer fragility is controlled), and block-aware sequential testing details to confirm that the null-market bias and trend sensitivity estimates are not driven by API metering artifacts.
minor comments (2)
  1. [Title] The title uses 'Doppelg"anger' (escaped quote); standardize to 'Doppelgänger' throughout.
  2. [Abstract] Abstract: the phrase 'public shadow-market benchmark' should be defined on first use or cross-referenced to the section that describes its public release and reproducibility artifacts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, indicating revisions where the manuscript will be updated for greater precision and transparency.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (formal result): the claimed impossibility that 'no evaluation functional computed from observational chart–label data can distinguish a grounded responder from a trend-shortcut responder under strong coupling' is load-bearing for the entire framework; the manuscript must state the precise theorem, including the exact definition of strong coupling, the class of allowed functionals, and the data-generating process assumptions, because any relaxation of the coupling condition would invalidate the separation claim for the interventions.

    Authors: Section 3 already states the theorem with the definition of strong coupling (the condition under which chart evidence and trend are statistically dependent such that any observational functional of the joint distribution cannot separate the two responder types), the class of functionals (all measurable maps from observational chart-label pairs to real numbers), and the DGP (stationary martingale-null processes with the stated coupling). To improve accessibility we will add an explicit boxed statement of the theorem together with the three assumptions immediately after the abstract and cross-reference it in §3; this is a presentational change only. revision: yes

  2. Referee: [§4, §5] §4 (intervention construction) and §5 (empirical results): the assertion that matched evidence interventions separate responders 'at an exponential rate' and that trend-label swaps provide an independent stress test rests on the four mechanisms being realizable in rendered OHLCV charts without introducing unmodeled visual confounders (e.g., inconsistent OHLC relationships or renderer artifacts). The overlap-weighted artifact check is mentioned but the manuscript must report quantitative diagnostics showing that the joint distribution of low-level candlestick features is preserved up to the controlled alpha or trend component; otherwise the observed zero or opposite evidence coefficients could reflect artifact sensitivity rather than the claimed identification failure.

    Authors: The overlap-weighted artifact check was performed precisely to verify that low-level feature distributions (body size, wick ratios, color consistency) remain matched across counterfactual pairs up to the injected alpha or trend component. We will add a dedicated table in §4 reporting the quantitative overlap statistics (e.g., Wasserstein distances and Kolmogorov-Smirnov p-values on the feature marginals) for all four mechanisms; these numbers confirm that renderer artifacts do not drive the separation. The exponential-rate claim follows directly from the information-theoretic bound under the maintained coupling model. revision: yes

  3. Referee: [§5] §5 (regression results): the structural behavioral model identifies 'evidence sensitivity' via regression coefficients, but the reported finding that evidence coefficients are 'zero or opposite to the rule-implied sign' requires explicit reporting of the minimum detectable effect, the exact specification of the regression (including how prompt/renderer fragility is controlled), and block-aware sequential testing details to confirm that the null-market bias and trend sensitivity estimates are not driven by API metering artifacts.

    Authors: The statistical toolkit already defines the minimum detectable effect, the exact linear specification (evidence and trend regressors plus prompt and renderer fixed effects to control fragility), and the block-aware sequential testing procedure that respects API metering constraints. We will insert the numerical MDE values, the full regression equation, and the block-testing results directly into the §5 results tables and text; this makes the reported zero/opposite evidence coefficients fully auditable without altering any estimates. revision: yes

Circularity Check

0 steps flagged

No significant circularity; formal identification result is self-contained

full rationale

The paper's central claim is a formal proof that no observational evaluation functional can distinguish grounded from trend-shortcut responders under strong coupling, with interventions providing separation. No equations, fitted parameters, or self-citations are presented that reduce this result to a definition or input by construction. The benchmark mechanisms (martingale null, injected-alpha pairs, trend swaps, regime shifts) and structural behavioral model are introduced as external constructions independent of the observational data. The derivation chain therefore remains non-circular and externally grounded.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central identification claim rests on an unshown formal proof whose assumptions are not enumerated here.

pith-pipeline@v0.9.1-grok · 5810 in / 1233 out tokens · 51540 ms · 2026-06-26T22:03:40.789911+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 1 canonical work pages · 1 internal anchor

  1. [1]

    Invariant Risk Minimization

    Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization.arXiv preprint arXiv:1907.02893, 2019

  2. [2]

    Qwen2.5-VL technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report, 2025

  3. [3]

    Controlling the false discovery rate: A practical and powerful approach to multiple testing.Journal of the Royal Statistical Society: Series B, 57(1):289–300, 1995

    Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing.Journal of the Royal Statistical Society: Series B, 57(1):289–300, 1995

  4. [4]

    Generalized autoregressive conditional heteroskedasticity.Journal of Econometrics, 31(3): 307–327, 1986

    Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity.Journal of Econometrics, 31(3): 307–327, 1986

  5. [5]

    Glenn W. Brier. Verification of forecasts expressed in terms of probability.Monthly Weather Review, 78(1):1–3, 1950

  6. [6]

    Simple technical trading rules and the stochastic properties of stock returns.The Journal of Finance, 47(5):1731–1764, 1992

    William Brock, Josef Lakonishok, and Blake LeBaron. Simple technical trading rules and the stochastic properties of stock returns.The Journal of Finance, 47(5):1731–1764, 1992

  7. [7]

    Crump, V

    Richard K. Crump, V . Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. Dealing with limited overlap in estimation of average treatment effects.Biometrika, 96(1):187–199, 2009

  8. [8]

    DeGroot and Stephen E

    Morris H. DeGroot and Stephen E. Fienberg. The comparison and evaluation of forecasters.Journal of the Royal Statistical Society: Series D (The Statistician), 32(1–2):12–22, 1983

  9. [9]

    Eugene F. Fama. Efficient capital markets: A review of theory and empirical work.The Journal of Finance, 25(2): 383–417, 1970

  10. [10]

    Enhancing stock movement prediction with adversarial training

    Fuli Feng, Huimin Chen, Xiangnan He, Ji Ding, Maosong Sun, and Tat-Seng Chua. Enhancing stock movement prediction with adversarial training. InProceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 5843–5849, 2019

  11. [11]

    Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

  12. [12]

    Selective classification for deep neural networks

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. InAdvances in Neural Information Processing Systems, volume 30, 2017

  13. [13]

    Wichmann

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

  14. [14]

    Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B, 69(2):243–268, 2007

  15. [15]

    Peter Grünwald, Rianne de Heide, and Wouter M. Koolen. Safe testing.Journal of the Royal Statistical Society: Series B, 86(5):1091–1128, 2024

  16. [16]

    Hans R. Kunsch. The jackknife and the bootstrap for general stationary observations.The Annals of Statistics, 17 (3):1217–1241, 1989

  17. [17]

    Pix2Struct: Screenshot parsing as pretraining for visual language understanding

    Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. InProceedings of the 40th International Conference on Machine Learning, pages 18893–18912, 2023

  18. [18]

    LLaV A-OneVision: Easy visual task transfer, 2024

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. LLaV A-OneVision: Easy visual task transfer, 2024. 10

  19. [19]

    Zaslavsky

    Fan Li, Kari Lock Morgan, and Alan M. Zaslavsky. Balancing covariates via propensity score weighting.Journal of the American Statistical Association, 113(521):390–400, 2018

  20. [20]

    DePlot: One-shot visual language reasoning by plot-to- table translation

    Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language reasoning by plot-to- table translation. InFindings of the Association for Computational Linguistics: ACL 2023, pages 10381–10399, 2023

  21. [21]

    MatCha: Enhancing visual language pretraining with math reasoning and chart derendering

    Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, and Julian Martin Eisenschlos. MatCha: Enhancing visual language pretraining with math reasoning and chart derendering. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 12756–12770, 2023

  22. [22]

    Model cards for model reporting

    Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. InProceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229, 2019

  23. [23]

    Allan H. Murphy. A new vector partition of the probability score.Journal of Applied Meteorology, 12(4):595–600, 1973

  24. [24]

    Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods.Journal of Forecasting, 37(8):852–866, 2018

    Adamantios Ntakaris, Martin Magris, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis. Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods.Journal of Forecasting, 37(8):852–866, 2018

  25. [25]

    GPT-4 technical report, 2023

    OpenAI. GPT-4 technical report, 2023

  26. [26]

    GPT-4o system card, 2024

    OpenAI. GPT-4o system card, 2024

  27. [27]

    Cambridge University Press, 2 edition, 2009

    Judea Pearl.Causality: Models, Reasoning, and Inference. Cambridge University Press, 2 edition, 2009

  28. [28]

    Causal inference by using invariant prediction: Identifi- cation and confidence intervals.Journal of the Royal Statistical Society: Series B, 78(5):947–1012, 2016

    Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: Identifi- cation and confidence intervals.Journal of the Royal Statistical Society: Series B, 78(5):947–1012, 2016

  29. [29]

    Game-theoretic statistics and safe anytime- valid inference.Statistical Science, 38(4):576–601, 2023

    Aaditya Ramdas, Peter Grünwald, Vladimir V ovk, and Glenn Shafer. Game-theoretic statistics and safe anytime- valid inference.Statistical Science, 38(4):576–601, 2023

  30. [30]

    why should i trust you?

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “why should i trust you?”: Explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016

  31. [31]

    Rosenbaum and Donald B

    Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects.Biometrika, 70(1):41–55, 1983

  32. [32]

    Tsybakov.Introduction to Nonparametric Estimation

    Alexandre B. Tsybakov.Introduction to Nonparametric Estimation. Springer, 2009

  33. [33]

    Gauthier-Villars, 1939

    Jean Ville.Étude critique de la notion de collectif. Gauthier-Villars, 1939

  34. [34]

    CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs. InAdvances in Neural Information Processing Systems, 2024

  35. [35]

    Estimating means of bounded random variables by betting.Journal of the Royal Statistical Society: Series B, 86(1):1–27, 2024

    Ian Waudby-Smith and Aaditya Ramdas. Estimating means of bounded random variables by betting.Journal of the Royal Statistical Society: Series B, 86(1):1–27, 2024

  36. [36]

    A reality check for data snooping.Econometrica, 68(5):1097–1126, 2000

    Halbert White. A reality check for data snooping.Econometrica, 68(5):1097–1126, 2000

  37. [37]

    Yumo Xu and Shay B. Cohen. Stock movement prediction from tweets and historical prices. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1970–1979. Association for Computational Linguistics, 2018. 11

  38. [38]

    ChartBench: A benchmark for complex visual reasoning in charts, 2023

    Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, and Jian Guo. ChartBench: A benchmark for complex visual reasoning in charts, 2023

  39. [39]

    breakouts happen at trend ends

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Weiye Xu, Hao Li, Jiahao Wang, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025. 12 A Proofs and Extended Theoretical Analysis A.1 No...