When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning

Chi-Min Chan; Jiahao Li; Jiaming Ji; Juntao Dai; Sirui Han; Sitong Fang; Wenjing Cao; Xuyao Wang; Yaodong Yang; Yike Guo

arxiv: 2505.20214 · v2 · submitted 2025-05-26 · 💻 cs.AI


Pith reviewed 2026-05-19 12:41 UTC · model grok-4.3

classification 💻 cs.AI

keywords inverse scaling law · truthfulness · multimodal reasoning · slow-thinking · depth-first search · breadth-first search · ambiguous inputs · reasoning models


The pith

Slow-thinking multimodal models fabricate more false details than fast ones when visuals are incomplete or misleading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that slower reasoning does not always produce truer answers in tasks involving images and text. When inputs are ambiguous or misleading, models that engage in extended thinking tend to create plausible-sounding but incorrect justifications for their conclusions. The authors build a dataset of 5,000 progressively complex prompts judged by humans to demonstrate this pattern consistently across models. A reader might care because it questions the benefit of more compute for accuracy in real-world uncertain scenarios like visual question answering.

Core claim

Slow-thinking models are more prone to fabricating plausible yet false details to justify untruthful reasoning when confronted with incomplete or misleading visual inputs. Analysis of a 5,000-sample hierarchical prompt dataset reveals that slower reasoning models tend to follow depth-first search thinking, persistently exploring flawed premises, while faster chat models favor breadth-first search inference and show greater caution under uncertainty.

What carries the argument

The inverse scaling law of truthfulness in multimodal reasoning, manifested through depth-first search in slow models versus breadth-first in fast models.

If this is right

Slow-thinking models are fragile when facing ambiguous multimodal inputs despite success in structured domains.
DFS reasoning causes persistent exploration of flawed premises.
Faster models demonstrate greater caution under uncertainty.
The vulnerability highlights a need for better handling of incomplete visual data in reasoning systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

AI developers may benefit from capping reasoning steps for uncertain visual inputs to avoid fabricated justifications.
The finding raises questions about the general reliability of increased computation in ambiguous settings.
Further experiments could test if this inverse law appears in other modalities or tasks.

Load-bearing premise

The human-annotated 5,000-sample hierarchical prompt dataset successfully isolates the effect of reasoning depth on truthfulness without confounding from prompt design or model differences.

What would settle it

If a new evaluation with standardized prompts and known ground truth shows no increase in fabricated details from slow-thinking models compared to fast ones, the inverse scaling claim would be challenged.
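One way to make this test concrete is a two-proportion z-test on fabrication rates. The sketch below is illustrative only: the function name and the counts are hypothetical, the paper does not specify this test, and it assumes each response is independently labeled as containing a fabricated detail or not.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1/n1: fabricated responses / total responses for slow-thinking models.
    x2/n2: the same counts for fast chat models.
    Returns (z, p_value) using the pooled standard error.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, not taken from the paper.
z, p = two_proportion_z_test(x1=60, n1=100, x2=40, n2=100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A non-significant result on standardized prompts with known ground truth would be the kind of evidence that challenges the claim; a significant one would not by itself rule out the training confounds raised in the referee report.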

Figures

Figures reproduced from arXiv: 2505.20214 by Chi-Min Chan, Jiahao Li, Jiaming Ji, Juntao Dai, Sirui Han, Sitong Fang, Wenjing Cao, Xuyao Wang, Yaodong Yang, Yike Guo.

**Figure 1.** Evaluation landscape of MLLMs on TRUTHFULVQA. We normalized inverse variance of accuracy across levels, capturing stability under misleading hierarchical prompts. The performance of chat models scales with model size. However, performance declines when models are fine-tuned for slower reasoning, revealing an inverse scaling trend in multimodal reasoning.

**Figure 2.** Overview and pipeline of the hierarchical TRUTHFULVQA framework. The dataset was constructed with contributions from 50 human annotators. Images gathered from online sources are paired with hierarchically structured, human-written question sets designed to probe multiple forms of untruthfulness.

**Figure 3.** Comprehensive evaluation of 50+ models on TRUTHFULVQA. a. Grouped bar chart comparing the normalized LAL across level transitions for three pairs of reasoning and chat models. b. Violin plots of model accuracy at the three levels, illustrating the distribution within each level. c. Density map of mean versus variance of model accuracy across the three levels. d. Heat map of mainstream MLLMs’ accuracy acros…

**Figure 4.** Quantitative experiments on the thinking paradigm. a, Expected Calibration Error (ECE) of nine models, sorted ascending. Each dot marks one model; light blue denotes chat models and dark blue denotes reasoning models. The shaded band groups the reasoning models, and the value of each model’s ECE is printed next to its dot (lower is better). b, Accuracy of five chat models under standard prompting (light bl…

**Figure 5.** Win rate. Comparison of GPT-4o, Llama4-Maverick and Qwen2.5-VL-72B, evaluated by Gemini1.5, TruthfulJudge and human.

**Figure 6.** The Classification of TRUTHFULVQA

**Figure 7.** Interface for pairwise preference annotation

**Figure 8.** Interface for absolute scoring annotation

**Figure 9.** Pairwise win rates of mainstream MLLMs.
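The Figure 4 caption reports Expected Calibration Error (ECE). As a reminder of what that metric measures, here is a minimal sketch of the common equal-width-bin formulation. The bin count, the binning convention, and the sample data are assumptions for illustration; the paper's exact ECE setup is not stated in this page.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1], one per prediction.
    correct: booleans, True when the prediction was right.
    Bins are [k/n_bins, (k+1)/n_bins), with confidence 1.0 placed in the last bin
    (an assumption; other conventions use half-open bins on the other side).
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece

# Hypothetical predictions, not data from the paper.
confs = [0.95, 0.9, 0.85, 0.6, 0.55]
hits = [True, False, True, True, False]
print(round(expected_calibration_error(confs, hits), 4))
```

A lower value means stated confidence tracks observed accuracy more closely, which is why the caption notes that lower is better.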

Original abstract

Reasoning models have attracted increasing attention for their ability to tackle complex tasks, embodying the System II (slow thinking) paradigm in contrast to System I (fast, intuitive responses). Yet a key question remains: Does slower reasoning necessarily lead to more truthful answers? Our findings suggest otherwise. We conduct the first systematic study of the inverse scaling law in slow-thinking paradigms for multimodal reasoning. We find that when confronted with incomplete or misleading visual inputs, slow-thinking models are more prone to fabricating plausible yet false details to justify untruthful reasoning. To analyze this behavior, we construct a 5,000-sample hierarchical prompt dataset annotated by 50 human participants. The prompts progressively increase in complexity, revealing a consistent pattern: slower reasoning models tend to follow depth-first search (DFS) thinking, persistently exploring flawed premises, while faster chat models favor breadth-first search (BFS) inference, showing greater caution under uncertainty. These findings reveal a critical vulnerability of reasoning models: while effective in structured domains such as math, their DFS-style reasoning becomes fragile when confronted with ambiguous, multimodal inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims an inverse scaling law of truthfulness in multimodal reasoning: slow-thinking (reasoning) models are more prone than fast-thinking (chat) models to fabricating plausible but false details when visual inputs are incomplete or misleading. The authors support this via a new 5,000-sample hierarchical prompt dataset annotated by 50 humans, observing that reasoning models follow depth-first search (DFS) patterns that persist on flawed premises while chat models use breadth-first search (BFS) and exhibit greater caution under uncertainty.

Significance. If the central empirical pattern holds after methodological verification, the result would be significant for multimodal AI and reasoning model design. It challenges the assumption that deeper reasoning improves reliability and identifies a concrete vulnerability (DFS-style persistence on ambiguous inputs) that could inform safer model development. The scale of the human-annotated hierarchical dataset is a clear strength for behavioral analysis.

major comments (3)

[Methods] Methods section: The 5,000-sample dataset is central to the inverse scaling claim, yet the manuscript provides no details on inter-annotator agreement (e.g., Fleiss' kappa), resolution of disagreements among the 50 annotators, or statistical controls for prompt difficulty and linguistic confounds. This leaves the ground-truth truthfulness labels unverified.
[Experimental Setup] Experimental Setup: The comparison between reasoning models and chat models does not control for post-training differences (e.g., CoT-specific fine-tuning versus standard alignment). Without matched base models or ablation of training regime, it is unclear whether the observed fabrication tendency arises from reasoning depth or from training disparities.
[Results] Results: The reported consistent pattern across complexity levels lacks statistical significance tests, confidence intervals, or baseline comparisons against non-reasoning multimodal models. This weakens the load-bearing claim that slower thinking produces reliably lower truthfulness.
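The first major comment asks for inter-annotator agreement such as Fleiss' kappa. For readers unfamiliar with the statistic, the sketch below computes it from a table of rating counts. The function name and the toy ratings are hypothetical, and it assumes every item is rated by the same number of annotators, which the page does not confirm for this dataset.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same total.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("all ratings fall in one category; kappa is undefined")
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical example: 4 items, 3 raters, labels (truthful, fabricated).
ratings = [[3, 0], [2, 1], [0, 3], [1, 2]]
print(round(fleiss_kappa(ratings), 3))
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, so reporting the statistic would speak directly to whether the truthfulness labels are reliable.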

minor comments (2)

[Abstract] Abstract: The phrase 'inverse scaling law' is introduced without a formal definition or explicit contrast to existing scaling literature on reasoning or truthfulness.
[Introduction] Introduction: The mapping of model behavior to DFS versus BFS search should be clarified with concrete examples of inference traces to avoid conflation with classical search algorithms.
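For reference, the classical algorithms the second minor comment alludes to differ only in the order in which nodes are expanded. The sketch below runs both on a small hypothetical tree of candidate interpretations; the tree and node names are invented for illustration and are not inference traces from the paper.

```python
from collections import deque

# Hypothetical tree: each key maps to its child nodes.
TREE = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
    "A1": [], "A2": [], "B1": [],
}

def dfs_order(tree, start):
    """Depth-first: follow one branch to its end before trying a sibling."""
    order, stack = [], [start]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so the leftmost child is expanded first.
        stack.extend(reversed(tree[node]))
    return order

def bfs_order(tree, start):
    """Breadth-first: visit every node at one depth before going deeper."""
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order

print(dfs_order(TREE, "root"))  # ['root', 'A', 'A1', 'A2', 'B', 'B1']
print(bfs_order(TREE, "root"))  # ['root', 'A', 'B', 'A1', 'A2', 'B1']
```

In the analogy, depth-first order commits to branch A and its consequences before B is ever considered, while breadth-first order surveys both A and B before elaborating either. Whether model reasoning traces actually map onto these orders is the point the referee asks the authors to substantiate.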

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important areas for strengthening our manuscript. We address each major comment below and outline the revisions we will make.


Referee: [Methods] Methods section: The 5,000-sample dataset is central to the inverse scaling claim, yet the manuscript provides no details on inter-annotator agreement (e.g., Fleiss' kappa), resolution of disagreements among the 50 annotators, or statistical controls for prompt difficulty and linguistic confounds. This leaves the ground-truth truthfulness labels unverified.

Authors: We agree that these methodological details are necessary to establish the reliability of the annotations. In the revised manuscript, we will add a dedicated subsection in the Methods section reporting Fleiss' kappa for inter-annotator agreement, describing the disagreement resolution process (initial independent annotation followed by group discussion for consensus), and including regression-based controls for prompt difficulty and linguistic confounds. These additions will directly address the verification of ground-truth labels.

Revision: yes

Referee: [Experimental Setup] Experimental Setup: The comparison between reasoning models and chat models does not control for post-training differences (e.g., CoT-specific fine-tuning versus standard alignment). Without matched base models or ablation of training regime, it is unclear whether the observed fabrication tendency arises from reasoning depth or from training disparities.

Authors: This is a substantive concern regarding potential confounds. Our study compares representative deployed models embodying the slow-thinking versus fast-thinking paradigms, but we recognize that differences in post-training may play a role. We will revise the Experimental Setup and Limitations sections to explicitly discuss this issue and clarify that isolating the effect would require controlled ablations on open-source models, which we note as an important direction for future work. We do not claim the results isolate reasoning depth from all training factors.

Revision: partial

Referee: [Results] Results: The reported consistent pattern across complexity levels lacks statistical significance tests, confidence intervals, or baseline comparisons against non-reasoning multimodal models. This weakens the load-bearing claim that slower thinking produces reliably lower truthfulness.

Authors: We concur that additional statistical rigor and baselines would strengthen the presentation of results. In the revised Results section, we will report statistical significance tests (including p-values from appropriate tests such as chi-squared or ANOVA across complexity levels), 95% confidence intervals for key metrics, and comparisons against additional non-reasoning multimodal baselines (e.g., standard vision-language models without explicit reasoning). These changes will better support the central claim.

Revision: yes
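The 95% confidence intervals promised here could be computed in several ways. A minimal sketch using the Wilson score interval for a per-level accuracy follows. The choice of the Wilson interval, the function name, and the counts are assumptions for illustration; the rebuttal does not say which interval the authors would use.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Hypothetical correct-answer counts at three prompt levels (not from the paper).
levels = {"level 1": (420, 500), "level 2": (365, 500), "level 3": (290, 500)}
for name, (hits, total) in levels.items():
    low, high = wilson_interval(hits, total)
    print(f"{name}: accuracy {hits / total:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Non-overlapping intervals between a reasoning model and its chat counterpart at the same level would be a simple visual check, though a formal test across levels would still be needed to support the trend claim.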

Circularity Check

0 steps flagged

Empirical observational study with no derivation chain or self-referential reductions


The paper conducts an empirical behavioral study by constructing a new 5,000-sample hierarchical prompt dataset, obtaining human annotations from 50 participants, and observing model responses to incomplete or misleading visual inputs. It reports patterns such as slow-thinking models favoring DFS-style reasoning and fabricating details, contrasted with faster models using BFS inference. No equations, fitted parameters, uniqueness theorems, or ansatzes are presented; the central claims rest on direct experimental observations rather than any step that reduces by construction to prior inputs, self-citations, or renamed known results. The study is self-contained against external benchmarks via the newly collected annotations and model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the constructed prompt hierarchy and the reliability of human truthfulness judgments; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: human participants can reliably distinguish fabricated details from truthful reasoning in model outputs.
The study uses annotations from 50 humans to label truthfulness across the dataset.

pith-pipeline@v0.9.0 · 5751 in / 1169 out tokens · 61722 ms · 2026-05-19T12:41:55.503444+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and embed_strictMono_of_one_lt
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "slower reasoning models tend to follow depth-first search (DFS) thinking, persistently exploring flawed premises, while faster chat models favor breadth-first search (BFS) inference"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We find that when confronted with incomplete or misleading visual inputs, slow-thinking models are more prone to fabricating plausible yet false details"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

