Do VLMs Truly "Read" Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting
Pith reviewed 2026-05-10 15:20 UTC · model grok-4.3
The pith
Most vision-language models forecast stock prices from candlestick charts well only during persistent uptrends or downtrends.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We construct a multi-scale candlestick charts dataset and a standardized evaluation framework to assess VLMs' ability to utilize multi-scale visual market signals. Evaluation combines confusion-matrix-based diagnostics with information coefficient time series metrics and includes XGBoost as a feature-based temporal baseline. Experimental results show that most VLMs perform well only under persistent uptrend or downtrend conditions, while exhibiting weak predictive capability in more common market scenarios. We also identify significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts.
What carries the argument
The multi-scale candlestick charts dataset and evaluation framework that uses confusion-matrix diagnostics, information coefficient time series metrics, and an XGBoost feature-based baseline to measure genuine visual comprehension of patterns.
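As a rough illustration of the information coefficient (IC) time series named above, the sketch below computes one rank IC per date between model scores and realized forward returns, then summarizes the series. The function names, the `panel` layout, and the use of Spearman rank correlation are assumptions made for illustration; the paper's exact IC definition is not stated on this page.

```python
from statistics import mean, stdev


def ranks(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant (an assumption)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def rank_ic(scores, returns):
    """Spearman rank correlation between predicted scores and forward returns."""
    return pearson(ranks(scores), ranks(returns))


def ic_series(panel):
    """panel maps date -> list of (score, forward_return) pairs, one pair per stock."""
    return {
        date: rank_ic([s for s, _ in rows], [r for _, r in rows])
        for date, rows in sorted(panel.items())
    }


def summarize(ics):
    """Mean IC and IC information ratio (mean divided by sample standard deviation)."""
    vals = list(ics.values())
    m = mean(vals)
    sd = stdev(vals) if len(vals) > 1 else 0.0
    return {"mean_ic": m, "icir": m / sd if sd > 0 else float("nan")}
```

A mean IC near zero with a low information ratio would be consistent with the weak predictive capability the paper reports outside trending conditions, though the thresholds for "weak" are a judgment call.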
If this is right
- VLMs cannot reliably combine short-term inflection cues with long-term trends for accurate forecasts.
- Predictions carry systematic biases outside of clear directional persistence.
- Specifying a forecast horizon in the prompt changes the models' predictions only slightly, which points to weak temporal reasoning.
- Feature-based models such as XGBoost provide stronger performance in non-trending market regimes.
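To make "persistent uptrend or downtrend" concrete, one could label each evaluation window by how consistently closes move in one direction. The sketch below is a hypothetical labeling rule, not the paper's: the function name `label_regime` and the 0.7 persistence threshold are assumptions chosen only for illustration.

```python
def label_regime(closes, persistence=0.7):
    """Label a window of closing prices as 'uptrend', 'downtrend', or 'mixed'.

    A window counts as trending when at least `persistence` of its
    bar-to-bar moves share one sign and the net change has the same sign.
    The threshold is a hypothetical choice, not taken from the paper.
    """
    if len(closes) < 2:
        raise ValueError("need at least two closes")
    moves = [b - a for a, b in zip(closes, closes[1:])]
    up_share = sum(m > 0 for m in moves) / len(moves)
    down_share = sum(m < 0 for m in moves) / len(moves)
    net = closes[-1] - closes[0]
    if up_share >= persistence and net > 0:
        return "uptrend"
    if down_share >= persistence and net < 0:
        return "downtrend"
    return "mixed"
```

Grouping per-window metrics by such a label is one way to check whether performance concentrates in trending windows, as the claims above suggest.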
Where Pith is reading between the lines
- Financial applications using VLMs may require regime-specific safeguards or hybrid statistical components to handle mixed conditions.
- The benchmark approach could extend to other visual time-series tasks to check for similar trend-only behavior.
- Training adjustments focused on diverse market regimes might reduce the observed biases if applied to new models.
Load-bearing premise
The new multi-scale dataset and evaluation framework successfully isolate genuine visual comprehension of candlestick patterns by VLMs without contamination from prior training data or prompt artifacts.
What would settle it
Demonstrating that VLMs achieve balanced confusion matrices and high information coefficients across sideways, volatile, and non-persistent market conditions after controls for data leakage would disprove the identified limitations.
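A minimal sketch of what "balanced confusion matrices" could mean in practice: build a three-class matrix of actual versus predicted direction, then inspect per-class recall and how far the predicted class mix drifts from the actual one. The class names (`down`, `flat`, `up`) and helper names are assumptions for illustration; the paper's exact classes and diagnostics are not listed on this page.

```python
from collections import Counter

LABELS = ("down", "flat", "up")  # hypothetical class names


def confusion_matrix(actual, predicted):
    """Rows are actual classes, columns are predicted classes, both in LABELS order."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in LABELS] for a in LABELS]


def per_class_recall(matrix):
    """Fraction of each actual class that was predicted correctly."""
    return {
        LABELS[i]: (row[i] / sum(row) if sum(row) else float("nan"))
        for i, row in enumerate(matrix)
    }


def prediction_bias(matrix):
    """Predicted share minus actual share per class; positive means over-predicted."""
    total = sum(sum(row) for row in matrix)
    bias = {}
    for i, label in enumerate(LABELS):
        predicted_share = sum(row[i] for row in matrix) / total
        actual_share = sum(matrix[i]) / total
        bias[label] = predicted_share - actual_share
    return bias
```

For example, a model that always answers `up` on a mixed sample gets perfect recall on `up`, zero recall elsewhere, and a large positive bias for `up`, which is the kind of pattern a balanced matrix would rule out.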
Original abstract
Vision-language models(VLMs) are increasingly applied to visual stock price forecasting, yet existing benchmarks inadequately evaluate their understanding of stock price in candlestick charts. First, prior studies fail to isolate VLMs' comprehension of visual inputs genuinely improves predictive performance and whether VLMs truly comprehend candlestick patterns. Further, most existing datasets and evaluation setups are designed around single-period or tabular inputs. However, human analysts strongly rely on multi-scale candlestick charts, where longer-term horizons capture trend direction and shorter-term horizons provide cues for inflection points, making it difficult to systematically assess VLMs' ability to integrate short-term and long-term visual market dynamics. To bridge this gap, we construct a multi-scale candlestick charts dataset and a standardized evaluation framework to assess VLMs' ability to utilize multi-scale visual market signals. Evaluation combines confusion-matrix-based diagnostics with information coefficient(IC) time series metrics and includes XGBoost as a feature-based temporal baseline. Using this dataset, we benchmark representative VLMs and analyze their ability to leverage multi-scale stock price data. Experimental results show that most VLMs perform well only under persistent uptrend or downtrend conditions, while exhibiting weak predictive capability in more common market scenarios. We also identify significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts, indicating inherent limitations in precise temporal reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a multi-scale candlestick chart dataset and standardized evaluation framework to benchmark vision-language models (VLMs) on visual stock price forecasting. It claims that existing benchmarks fail to isolate genuine visual comprehension of candlestick patterns or multi-scale integration, and presents results showing VLMs perform well only under persistent uptrend/downtrend conditions while exhibiting weak predictive capability in common scenarios, significant biases, and limited sensitivity to specified forecast horizons; evaluation uses confusion-matrix diagnostics, information coefficient (IC) metrics, and an XGBoost feature-based temporal baseline.
Significance. If the central empirical claims hold, the work would usefully document limitations in current VLMs for multi-scale visual temporal reasoning in finance and motivate better architectures or training for such tasks. Credit is due for constructing a multi-scale dataset that aligns with human analyst practices (longer horizons for trends, shorter for inflections), including a reproducible XGBoost baseline, and combining classification diagnostics with IC time-series metrics rather than relying on accuracy alone.
major comments (2)
- [Evaluation Framework and Dataset Construction] The central claim that VLMs fail to 'truly read' candlesticks outside persistent trends and exhibit inherent limitations in temporal reasoning depends on the multi-scale dataset and evaluation framework successfully isolating genuine visual comprehension. The manuscript does not report controls for pretraining contamination on public financial imagery (e.g., synthetic/OOD chart generation, semantic-preserving perturbations such as color inversion or scale randomization, or direct comparisons against text-only equivalents of the same price series). Without these, the reported weak performance in common market scenarios and horizon insensitivity could reflect distribution shift or memorized patterns rather than visual reasoning deficits. This issue is load-bearing for the 'truly read' and 'inherent limitations' conclusions (see Abstract and the evaluation framework description).
- [Experimental Results] The XGBoost baseline is described as a feature-based temporal comparator, but it does not address VLM-specific memorization risks on real stock chart images. Adding a text-only VLM ablation or a vision-only control would strengthen the isolation of visual comprehension effects.
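The text-only comparison suggested in these comments could be set up roughly as follows: serialize the same daily and weekly bars that would be drawn as candlesticks into plain text, so the model sees identical price information without an image. The field names, the layout of each line, and the function names are assumptions for illustration only.

```python
def series_to_text(bars, label):
    """Render OHLCV bars as one text line per bar, oldest first.

    Each bar is a dict with keys date, open, high, low, close, volume
    (a hypothetical schema).
    """
    lines = [f"{label} OHLCV bars (oldest first):"]
    for bar in bars:
        lines.append(
            f"{bar['date']}: O={bar['open']:.2f} H={bar['high']:.2f} "
            f"L={bar['low']:.2f} C={bar['close']:.2f} V={bar['volume']}"
        )
    return "\n".join(lines)


def build_text_only_prompt(daily, weekly, horizon_days):
    """Combine both time scales and the forecast horizon into one text prompt."""
    return "\n\n".join(
        [
            series_to_text(weekly, "Weekly"),
            series_to_text(daily, "Daily"),
            f"Forecast horizon: {horizon_days} trading days.",
        ]
    )
```

Running the same model on this prompt and on the chart image for identical windows would give a paired comparison of textual versus visual input.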
minor comments (1)
- [Abstract] Abstract contains a minor typographical issue: 'models(VLMs)' should read 'models (VLMs)'.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important considerations for strengthening the isolation of visual reasoning effects in our evaluation. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [Evaluation Framework and Dataset Construction] The central claim that VLMs fail to 'truly read' candlesticks outside persistent trends and exhibit inherent limitations in temporal reasoning depends on the multi-scale dataset and evaluation framework successfully isolating genuine visual comprehension. The manuscript does not report controls for pretraining contamination on public financial imagery (e.g., synthetic/OOD chart generation, semantic-preserving perturbations such as color inversion or scale randomization, or direct comparisons against text-only equivalents of the same price series). Without these, the reported weak performance in common market scenarios and horizon insensitivity could reflect distribution shift or memorized patterns rather than visual reasoning deficits. This issue is load-bearing for the 'truly read' and 'inherent limitations' conclusions (see Abstract and the evaluation framework description).
  Authors: We agree that explicit controls for pretraining contamination and distribution shift would provide stronger support for attributing performance differences to visual reasoning limitations rather than memorization. Our current framework uses real historical stock data rendered as standard candlestick charts and compares against an XGBoost model trained on numerical temporal features extracted from the same series, which helps separate visual from non-visual temporal modeling. However, we did not include OOD perturbations or text-only equivalents in the submitted version. In revision, we will add a dedicated limitations subsection discussing potential contamination risks and include new experiments with text-only prompts that describe the identical price series in natural language, allowing direct comparison of visual versus textual input for the same forecasting task. This addresses the load-bearing concern without changing the core empirical observations.
  Revision: partial
- Referee: [Experimental Results] The XGBoost baseline is described as a feature-based temporal comparator, but it does not address VLM-specific memorization risks on real stock chart images. Adding a text-only VLM ablation or a vision-only control would strengthen the isolation of visual comprehension effects.
  Authors: The XGBoost baseline is intended as a non-visual temporal reference using engineered features from the price series, not as a direct control for VLM memorization. We acknowledge that it does not fully isolate VLM-specific image memorization effects. To strengthen this, the revised manuscript will include a text-only VLM ablation in which the same multi-scale price information is provided via textual descriptions rather than chart images. A pure vision-only control is less applicable to standard VLMs, which integrate vision and language components by design; we will instead reference ablation studies from the broader VLM literature on vision-encoder contributions and note this as a direction for future work. These additions will better isolate the role of visual chart comprehension.
  Revision: yes
Circularity Check
No circularity: purely empirical benchmark with no derivation chain
Full rationale
The paper constructs a new multi-scale candlestick dataset and standardized evaluation framework (confusion matrices, IC metrics, XGBoost baseline) then reports direct experimental results on VLM performance across trend conditions. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology. All claims reduce to observable benchmark outputs rather than reducing to inputs by construction. This matches the expected non-circular outcome for an empirical evaluation study.
A Construction of Numerical Inputs
Numerical time-series data are constructed to support baseline models and to ensure a fair and controlled comparison with the visual modality. Raw OHLCV reco...
Decompose charts:
- Red/green candle clusters: body/wick strength, reversal signals
- Moving averages: price position, crossovers, support/resistance
  (5MA = black, 20MA = blue, 90MA = purple)
- Volume: key levels, spikes, and trends
- Inflection points: trend shifts, momentum changes
Cross-timeframe link:
- Connect short-term (daily) and long-term (weekly) signals
Rules:
- Output ONLY <score>NUM</score>, where NUM is [-0.5, 1.0] with 3 decimal places.
- +0.001 to +1.000: upward trend
- -0.500 to -0.001: downward trend
- 0.000: unclear trend
- No default positive bias
- One unique score per stock/date
- No additional ...
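The output rules in this prompt fragment lend themselves to a strict parser. The sketch below accepts a reply only if it consists of a single `<score>NUM</score>` tag with exactly three decimal places and a value in [-0.5, 1.0], then maps the sign to the direction the rules describe. The names `parse_score` and `direction` are hypothetical, and rejecting any surrounding text is an assumption about how strictly "Output ONLY" should be enforced.

```python
import re

# Exactly one score tag, optional sign, exactly three decimal places.
SCORE_RE = re.compile(r"^\s*<score>([+-]?\d+\.\d{3})</score>\s*$")


def parse_score(reply):
    """Return the score as a float, or None if the reply breaks the format rules."""
    match = SCORE_RE.match(reply)
    if match is None:
        return None
    value = float(match.group(1))
    if not -0.5 <= value <= 1.0:
        return None  # outside the range stated in the prompt
    return value


def direction(score):
    """Map a valid score to the trend label given in the prompt rules."""
    if score > 0:
        return "up"
    if score < 0:
        return "down"
    return "unclear"
```

Counting how often `parse_score` returns `None` gives a simple format-compliance rate that can be reported alongside forecasting metrics.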