arxiv: 2604.12659 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.CL

Do VLMs Truly "Read" Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting

Pith reviewed 2026-05-10 15:20 UTC · model grok-4.3

keywords vision-language models · candlestick charts · stock price forecasting · multi-scale analysis · visual pattern recognition · prediction bias · temporal reasoning · benchmark evaluation

The pith

Most vision-language models forecast stock prices from candlestick charts well only under persistent uptrends or downtrends.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a dataset of multi-scale candlestick charts and an evaluation framework to test whether vision-language models truly interpret visual stock patterns for forecasting. It compares model outputs to an XGBoost baseline using confusion matrices and information coefficient metrics to separate visual understanding from memorized trends or prompt cues. Results show strong performance only in steady directional markets but weakness in typical mixed conditions, along with prediction biases and little reaction to specified time horizons. A reader would care because many financial tools now rely on these models for chart analysis, yet the findings suggest they may miss the varied signals that define everyday trading.

Core claim

We construct a multi-scale candlestick charts dataset and a standardized evaluation framework to assess VLMs' ability to utilize multi-scale visual market signals. Evaluation combines confusion-matrix-based diagnostics with information coefficient time series metrics and includes XGBoost as a feature-based temporal baseline. Experimental results show that most VLMs perform well only under persistent uptrend or downtrend conditions, while exhibiting weak predictive capability in more common market scenarios. We also identify significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts.

What carries the argument

The multi-scale candlestick charts dataset and evaluation framework that uses confusion-matrix diagnostics, information coefficient time series metrics, and an XGBoost feature-based baseline to measure genuine visual comprehension of patterns.
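To make the information coefficient concrete, here is a minimal sketch. This summary does not give the paper's exact IC definition, so the code assumes the common rank-IC convention: the Spearman correlation between predicted scores and realized returns across stocks on one date, collected per date into a time series. The function names `rank_ic` and `ic_series` are hypothetical.

```python
from statistics import mean


def _ranks(values):
    """Return 1-based average ranks; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def _pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def rank_ic(scores, returns):
    """Assumed IC: Spearman correlation of scores vs. realized returns."""
    return _pearson(_ranks(scores), _ranks(returns))


def ic_series(by_date):
    """Map {date: (scores, returns)} to {date: IC}, ordered by date."""
    return {d: rank_ic(s, r) for d, (s, r) in sorted(by_date.items())}
```

A value near +1 means the model ranks stocks in the same order as their realized returns on that date; a value near 0 means the ranking carries little information.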

If this is right

  • VLMs cannot reliably combine short-term inflection cues with long-term trends for accurate forecasts.
  • Predictions carry systematic biases outside of clear directional persistence.
  • Specifying forecast horizons in prompts does not improve the models' temporal reasoning.
  • Feature-based models such as XGBoost provide stronger performance in non-trending market regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Financial applications using VLMs may require regime-specific safeguards or hybrid statistical components to handle mixed conditions.
  • The benchmark approach could extend to other visual time-series tasks to check for similar trend-only behavior.
  • Training adjustments focused on diverse market regimes might reduce the observed biases if applied to new models.

Load-bearing premise

The new multi-scale dataset and evaluation framework successfully isolate genuine visual comprehension of candlestick patterns by VLMs without contamination from prior training data or prompt artifacts.

What would settle it

Demonstrating that VLMs achieve balanced confusion matrices and high information coefficients across sideways, volatile, and non-persistent market conditions after controls for data leakage would disprove the identified limitations.

Figures

Figures reproduced from arXiv: 2604.12659 by Kaiqi Hu, Linda Xiao, Mingwen Liu, Shiyue Xu, Ziyi Tang.

Figure 1: Pipeline for constructing visual inputs and evaluation.
Figure 2: Daily candlestick example.
Figure 3: Weekly candlestick example.
After candlestick chart generation, all visual outputs are organized using a hierarchical directory structure to support systematic visual data processing. The top-level directory is indexed by the cutoff date associated with each observation, followed by the stock code at the second level. Within each stock-specific directory, candlestick charts at different temporal frequenc…
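The directory layout described above can be sketched as a small path helper. The excerpt is truncated before it names the files inside each stock directory, so the `<frequency>.png` file name, the date format, and the function name `chart_path` are all assumptions made for illustration.

```python
from pathlib import PurePosixPath


def chart_path(root, cutoff_date, stock_code, frequency):
    """Build <root>/<cutoff_date>/<stock_code>/<frequency>.png.

    The first two levels follow the excerpt (cutoff date, then stock code).
    The file name per frequency is an assumption, not the paper's convention.
    """
    return PurePosixPath(root) / cutoff_date / stock_code / f"{frequency}.png"


# Hypothetical usage: daily and weekly charts for one stock and cutoff date.
daily = chart_path("charts", "2025-01-03", "000001", "daily")
weekly = chart_path("charts", "2025-01-03", "000001", "weekly")
```

Keeping the cutoff date at the top level means all charts for one observation date sit under a single directory, which makes it harder to mix in data from after the cutoff by accident.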
Figure 4: Heatmap of Claude-Haiku-4-5.
5.1.2 Metrics based on Confusion Matrix. Based on the confusion matrix, accuracy, precision, recall, specificity, and F1 score can be computed as defined in Section 4.3.1. The right part of …
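The five metrics named in the excerpt can be computed from a binary confusion matrix as in the sketch below. It uses the standard textbook definitions with "up" as the positive class; Section 4.3.1 of the paper is not reproduced here, so its exact definitions (and any multi-class handling) may differ. The function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts.

    tp/fp/fn/tn are counts with "up" as the positive class (an assumption).
    Undefined ratios (zero denominator) are returned as NaN.
    """
    def safe_div(num, den):
        return num / den if den else float("nan")

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }
```

Reporting specificity alongside recall matters here: a model that always predicts "up" scores perfect recall but zero specificity, which accuracy alone can hide in a rising market.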
Figure 5: Bias of each model. Well-calibrated (|bias| ≤ 0.05), Slightly biased (0.05 < |bias| ≤ 0.15), Moderately biased (0.15 < |bias| ≤ 0.40), Strongly biased (|bias| > 0.40).
Model bias evaluation reveals significant divergence across architectures, with traditional Machine Learning and VLMs exhibiting distinct directional preferences. Underlying causes warrant further investigation. This systematic heterogeneit…
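The four bias bands in the Figure 5 caption map onto a simple threshold function. The sketch below covers only the banding, taking a signed bias value as input; how the paper computes the bias value itself is not stated in this excerpt and is not assumed here. The function name `bias_label` is hypothetical.

```python
def bias_label(bias):
    """Map a signed bias value to the bands listed in the Figure 5 caption."""
    magnitude = abs(bias)
    if magnitude <= 0.05:
        return "Well-calibrated"
    if magnitude <= 0.15:
        return "Slightly biased"
    if magnitude <= 0.40:
        return "Moderately biased"
    return "Strongly biased"
```

Because the bands depend only on the magnitude, the sign of the bias (a preference for predicting up versus down) has to be reported separately.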
Figure 8: Bull or Bear Market Heatmap.
5.4.2 Extreme Stock Cases Analysis. In this paper, extreme stocks are defined as follows: an individual stock is a continuous rise if the ratio of its rising days exceeds 70% during the test period, and a continuous fall if the ratio of its falling days exceeds 70%. The predictive behavior of various models toward these stocks is examined, with results presented in the …
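The extreme-stock rule can be sketched as follows. The excerpt does not say how a "rising day" is measured, so the code assumes a rising day is one with a positive daily return and a falling day is one with a negative daily return; flat days count toward neither. The function name `extreme_label` is hypothetical.

```python
def extreme_label(daily_returns, threshold=0.70):
    """Label a stock by its share of rising or falling days in the test period.

    Assumption: rising day = return > 0, falling day = return < 0.
    Returns "continuous rise", "continuous fall", or None.
    """
    n = len(daily_returns)
    if n == 0:
        return None
    up_ratio = sum(r > 0 for r in daily_returns) / n
    down_ratio = sum(r < 0 for r in daily_returns) / n
    if up_ratio > threshold:
        return "continuous rise"
    if down_ratio > threshold:
        return "continuous fall"
    return None
```

The comparison is strict, matching the "> 70%" wording, so a stock with exactly 70% rising days is not treated as extreme.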
Figure 9: Continuous Rise or Fall Heatmap.
Original abstract

Vision-language models(VLMs) are increasingly applied to visual stock price forecasting, yet existing benchmarks inadequately evaluate their understanding of stock price in candlestick charts. First, prior studies fail to isolate VLMs' comprehension of visual inputs genuinely improves predictive performance and whether VLMs truly comprehend candlestick patterns. Further, most existing datasets and evaluation setups are designed around single-period or tabular inputs. However, human analysts strongly rely on multi-scale candlestick charts, where longer-term horizons capture trend direction and shorter-term horizons provide cues for inflection points, making it difficult to systematically assess VLMs' ability to integrate short-term and long-term visual market dynamics. To bridge this gap, we construct a multi-scale candlestick charts dataset and a standardized evaluation framework to assess VLMs' ability to utilize multi-scale visual market signals. Evaluation combines confusion-matrix-based diagnostics with information coefficient(IC) time series metrics and includes XGBoost as a feature-based temporal baseline. Using this dataset, we benchmark representative VLMs and analyze their ability to leverage multi-scale stock price data. Experimental results show that most VLMs perform well only under persistent uptrend or downtrend conditions, while exhibiting weak predictive capability in more common market scenarios. We also identify significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts, indicating inherent limitations in precise temporal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a multi-scale candlestick chart dataset and standardized evaluation framework to benchmark vision-language models (VLMs) on visual stock price forecasting. It claims that existing benchmarks fail to isolate genuine visual comprehension of candlestick patterns or multi-scale integration, and presents results showing VLMs perform well only under persistent uptrend/downtrend conditions while exhibiting weak predictive capability in common scenarios, significant biases, and limited sensitivity to specified forecast horizons; evaluation uses confusion-matrix diagnostics, information coefficient (IC) metrics, and an XGBoost feature-based temporal baseline.

Significance. If the central empirical claims hold, the work would usefully document limitations in current VLMs for multi-scale visual temporal reasoning in finance and motivate better architectures or training for such tasks. Credit is due for constructing a multi-scale dataset that aligns with human analyst practices (longer horizons for trends, shorter for inflections), including a reproducible XGBoost baseline, and combining classification diagnostics with IC time-series metrics rather than relying on accuracy alone.

major comments (2)
  1. [Evaluation Framework and Dataset Construction] The central claim that VLMs fail to 'truly read' candlesticks outside persistent trends and exhibit inherent limitations in temporal reasoning depends on the multi-scale dataset and evaluation framework successfully isolating genuine visual comprehension. The manuscript does not report controls for pretraining contamination on public financial imagery (e.g., synthetic/OOD chart generation, semantic-preserving perturbations such as color inversion or scale randomization, or direct comparisons against text-only equivalents of the same price series). Without these, the reported weak performance in common market scenarios and horizon insensitivity could reflect distribution shift or memorized patterns rather than visual reasoning deficits. This issue is load-bearing for the 'truly read' and 'inherent limitations' conclusions (see Abstract and the evaluation framework description).
  2. [Experimental Results] The XGBoost baseline is described as a feature-based temporal comparator, but it does not address VLM-specific memorization risks on real stock chart images. Adding a text-only VLM ablation or a vision-only control would strengthen the isolation of visual comprehension effects.
minor comments (1)
  1. [Abstract] Abstract contains a minor typographical issue: 'models(VLMs)' should read 'models (VLMs)'.
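One of the perturbations named in major comment 1, color inversion, can be sketched on a raw RGB pixel grid. This is an illustration of the referee's suggestion, not an experiment from the paper. The nested-list image representation and the function name `invert_colors` are assumptions chosen to keep the example dependency-free.

```python
def invert_colors(pixels):
    """Return a new image with every 8-bit RGB channel inverted.

    pixels is a list of rows, each row a list of (r, g, b) tuples in 0..255.
    Candle geometry is unchanged; only the colors differ.
    """
    return [[tuple(255 - c for c in px) for px in row] for row in pixels]


# Hypothetical 1x2 image: one red pixel and one dark-green pixel.
image = [[(255, 0, 0), (0, 128, 0)]]
inverted = invert_colors(image)
```

If a model's forecasts change sharply between a chart and its inverted copy, while the prompt states the color convention in both cases, that points to reliance on familiar rendering rather than on the candle shapes themselves.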

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important considerations for strengthening the isolation of visual reasoning effects in our evaluation. We address each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Evaluation Framework and Dataset Construction] The central claim that VLMs fail to 'truly read' candlesticks outside persistent trends and exhibit inherent limitations in temporal reasoning depends on the multi-scale dataset and evaluation framework successfully isolating genuine visual comprehension. The manuscript does not report controls for pretraining contamination on public financial imagery (e.g., synthetic/OOD chart generation, semantic-preserving perturbations such as color inversion or scale randomization, or direct comparisons against text-only equivalents of the same price series). Without these, the reported weak performance in common market scenarios and horizon insensitivity could reflect distribution shift or memorized patterns rather than visual reasoning deficits. This issue is load-bearing for the 'truly read' and 'inherent limitations' conclusions (see Abstract and t

    Authors: We agree that explicit controls for pretraining contamination and distribution shift would provide stronger support for attributing performance differences to visual reasoning limitations rather than memorization. Our current framework uses real historical stock data rendered as standard candlestick charts and compares against an XGBoost model trained on numerical temporal features extracted from the same series, which helps separate visual from non-visual temporal modeling. However, we did not include OOD perturbations or text-only equivalents in the submitted version. In revision, we will add a dedicated limitations subsection discussing potential contamination risks and include new experiments with text-only prompts that describe the identical price series in natural language, allowing direct comparison of visual versus textual input for the same forecasting task. This addresses the load-bearing concern without changing the core empirical observations.

    Revision: partial

  2. Referee: [Experimental Results] The XGBoost baseline is described as a feature-based temporal comparator, but it does not address VLM-specific memorization risks on real stock chart images. Adding a text-only VLM ablation or a vision-only control would strengthen the isolation of visual comprehension effects.

    Authors: The XGBoost baseline is intended as a non-visual temporal reference using engineered features from the price series, not as a direct control for VLM memorization. We acknowledge that it does not fully isolate VLM-specific image memorization effects. To strengthen this, the revised manuscript will include a text-only VLM ablation in which the same multi-scale price information is provided via textual descriptions rather than chart images. A pure vision-only control is less applicable to standard VLMs, which integrate vision and language components by design; we will instead reference ablation studies from the broader VLM literature on vision-encoder contributions and note this as a direction for future work. These additions will better isolate the role of visual chart comprehension.

    Revision: yes
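A text-only ablation of the kind proposed in this response needs the same price bars rendered as text. The sketch below shows one possible rendering; the line format, the dictionary keys, and the function name `describe_bars` are hypothetical and are not taken from the paper or the rebuttal.

```python
def describe_bars(bars):
    """Render OHLCV bars as one plain-text line per bar.

    bars is a list of dicts with keys date, open, high, low, close, volume
    (an assumed schema). Prices are printed with two decimals.
    """
    lines = []
    for bar in bars:
        lines.append(
            f"{bar['date']}: open {bar['open']:.2f}, high {bar['high']:.2f}, "
            f"low {bar['low']:.2f}, close {bar['close']:.2f}, "
            f"volume {bar['volume']}"
        )
    return "\n".join(lines)
```

Feeding the model this text for the same window that the chart covers keeps the information content comparable, so a gap between the two conditions can be attributed to the input modality.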

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivation chain

Full rationale

The paper constructs a new multi-scale candlestick dataset and standardized evaluation framework (confusion matrices, IC metrics, XGBoost baseline) then reports direct experimental results on VLM performance across trend conditions. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology. All claims reduce to observable benchmark outputs rather than reducing to inputs by construction. This matches the expected non-circular outcome for an empirical evaluation study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work relies on standard machine learning evaluation practices and existing VLM architectures.

pith-pipeline@v0.9.0 · 5550 in / 1007 out tokens · 36464 ms · 2026-05-10T15:20:25.185204+00:00 · methodology


Appendix excerpts

A Construction of Numerical Inputs. Numerical time-series data are constructed to support baseline models and to ensure a fair and controlled comparison with the visual modality. Raw OHLCV reco...

Prompt excerpt:

Decompose charts:
- Red/green candle clusters: body/wick strength, reversal signals
- Moving averages: price position, crossovers, support/resistance (5MA = black, 20MA = blue, 90MA = purple)
- Volume: key levels, spikes, and trends
- Inflection points: trend shifts, momentum changes

Cross-timeframe link:
- Connect short-term (daily) and long-term (weekly) signals

Rules:
- Output ONLY <score>NUM</score>, where NUM is [-0.5, 1.0] with 3 decimal places.
- +0.001 to +1.000: upward trend
- -0.500 to -0.001: downward trend
- 0.000: unclear trend
- No default positive bias
- One unique score per stock/date
- No additional ...
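The output rules in the prompt excerpt suggest a strict parser for model replies. The sketch below accepts only a reply of the form `<score>NUM</score>` with three decimal places and a value in [-0.5, 1.0], then maps the sign to a direction as the rules describe. Treating any other reply as invalid is a design choice of this sketch, and the names `parse_score` and `direction` are hypothetical; the paper's own parsing code is not shown in this excerpt.

```python
import re

# Exactly one <score> tag holding a signed number with three decimals.
_SCORE_RE = re.compile(r"^\s*<score>([+-]?\d+\.\d{3})</score>\s*$")


def parse_score(reply):
    """Return the score as a float, or None if the reply breaks the rules."""
    match = _SCORE_RE.match(reply)
    if not match:
        return None
    value = float(match.group(1))
    if not -0.5 <= value <= 1.0:
        return None
    return value


def direction(value):
    """Map a valid score to the trend label used in the prompt rules."""
    if value > 0:
        return "up"
    if value < 0:
        return "down"
    return "unclear"
```

Note that the allowed range is asymmetric (down to -0.5, up to +1.0), so raw scores are not symmetric around zero; any bias analysis on these scores should account for that.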