Asking What Matters: Reward-Driven Clarification for Software Engineering Tasks

Graham Neubig; Sanidhya Vijayvargiya; Vijay Viswanathan

arxiv: 2604.14624 · v1 · submitted 2026-04-16 · 💻 cs.SE · cs.AI


Pith reviewed 2026-05-10 11:25 UTC · model grok-4.3


keywords: clarification questions · software engineering · reinforcement learning · task resolution · user answerability · Shapley attribution · large language models


The pith

By grounding reinforcement learning rewards in task relevance and user answerability, an 8B-parameter model achieves GPT-5-level resolution of underspecified software engineering issues with 41% fewer clarifying questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how AI assistants can ask better clarifying questions for incomplete software engineering tasks. It uses Shapley attribution to find which categories of missing information most affect task success, and distributional comparisons of answerable versus non-answerable questions to see what users can realistically provide. These two properties, task relevance and user answerability, are turned into rewards for training a small clarification module called CLARITI. This module matches the resolution rate of a much larger model, GPT-5, while generating fewer questions, suggesting that empirically grounded rewards can improve clarification efficiency.

Core claim

The authors identify two key properties of effective clarification through analysis: task relevance (which information predicts success) and user answerability (what users can realistically provide). They operationalize these as multi-stage RL rewards to train CLARITI, an 8B-parameter model that matches GPT-5's resolution rate on underspecified issues while generating 41% fewer questions. This demonstrates that grounding reward design in empirical analysis of information impact improves clarification efficiency in software engineering tasks.
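The 41% figure can be checked against the average question counts quoted in the Figure 1 caption (3.0 versus 5.1). The short sketch below does that arithmetic; reading 3.0 as CLARITI's average and 5.1 as GPT-5's is an inference from the caption's wording, and the function name is illustrative only.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is smaller than `baseline`."""
    return (baseline - new) / baseline


# Average questions per issue, as quoted in the Figure 1 caption.
# Assumption: 5.1 is the GPT-5 baseline and 3.0 is CLARITI.
gpt5_avg_questions = 5.1
clariti_avg_questions = 3.0

reduction = relative_reduction(gpt5_avg_questions, clariti_avg_questions)
print(f"{reduction:.1%}")  # prints 41.2%
```

The rounded value agrees with the "41% fewer questions" stated in the core claim.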

What carries the argument

CLARITI, an 8B-parameter clarification module trained with multi-stage reinforcement learning rewards derived from task relevance (via Shapley attribution) and user answerability (via distributional comparisons on simulated users).
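To make the task-relevance idea concrete, here is a minimal sketch of exact Shapley attribution over information categories. It is not the paper's analysis (Figure 2 reports SHAP values computed from observed task outcomes); the value function `success_rate` and all of its numbers are hypothetical, chosen only to show how a category's average marginal contribution to task success is computed.

```python
from itertools import combinations
from math import factorial

# Information categories an issue may or may not contain (illustrative names).
CATEGORIES = ["error_info", "repro_steps", "expected_behavior", "version_env"]


def success_rate(present: frozenset) -> float:
    """Hypothetical resolution rate given which categories are present."""
    rate = 0.10
    if "error_info" in present:
        rate += 0.15
    if "repro_steps" in present:
        rate += 0.10
    if {"error_info", "repro_steps"} <= present:
        rate += 0.05  # toy interaction: the two are worth more together
    if "expected_behavior" in present:
        rate += 0.05
    return rate  # version_env contributes nothing in this toy setup


def shapley_values(players, value):
    """Exact Shapley values: weighted average marginal contribution of each player."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Probability that exactly these k players precede p in a random order.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


phi = shapley_values(CATEGORIES, success_rate)
for name, v in sorted(phi.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {v:.3f}")
```

In this toy setup the attributions sum to the gap between the fully specified and fully empty issue (0.45 - 0.10), which is the efficiency property that makes Shapley values a natural way to rank categories by relevance.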

Load-bearing premise

The assumption that properties identified from Shapley values and simulated user distributions can be directly turned into effective RL rewards that work well for real users without losing important details.

What would settle it

Running CLARITI on a set of real software engineering tasks with actual human users and measuring if it still matches large model resolution rates while asking fewer questions, or if real-user responses deviate significantly from simulations.

Figures

Figures reproduced from arXiv: 2604.14624 by Graham Neubig, Sanidhya Vijayvargiya, Vijay Viswanathan.

**Figure 1.** Our trained clarification model, CLARITI, achieves GPT-5-level performance (36.80%) with 41% fewer average questions (3.0 vs 5.1) by prioritizing task relevance and user answerability, demonstrating effective clarification at low user burden.

**Figure 2.** Mean absolute SHAP values with 95% bootstrap confidence intervals measuring the association between each information category and task success.
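As a rough illustration of the statistic named in this caption, the sketch below computes a mean absolute attribution and a 95% percentile-bootstrap confidence interval for a single category. The per-task SHAP values are made up, and the resampling settings (2,000 resamples, percentile method) are assumptions rather than details taken from the paper.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    stats = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-task SHAP values for one information category.
shap_values = [0.12, -0.08, 0.15, 0.03, -0.11, 0.09, 0.14, -0.02, 0.07, 0.10]
abs_vals = [abs(v) for v in shap_values]

point = mean(abs_vals)
lo, hi = bootstrap_ci(abs_vals)
print(f"mean |SHAP| = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Taking absolute values before averaging measures the strength of association regardless of sign, which matches the caption's wording.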

**Figure 3.** Task success as a function of the number of clarification questions asked. Performance plateaus as question count increases, while the proportion of answerable questions declines.

**Figure 4.** Multi-stage reward pipeline with progressive filtering. Generated clarification sets flow through four sequential stages: (1) Non-Redundancy filters generations with high number of questions answerable from the underspecified issue (threshold r ≥ 0.5), (2) Diversity filters generations with generic questions similar across different issues (threshold r ≥ 0.5), (3) Answerability scores whether users can ans…
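The progressive filtering described in this caption can be sketched as a gated reward function. This is a hypothetical reading, not the paper's formulation: the caption is truncated, so the gate direction (a generation passes a stage when its score r is at least 0.5), the partial credit for passing only the first gate, and the way the last two stages are combined are all assumptions made for illustration.

```python
from dataclasses import dataclass

GATE_THRESHOLD = 0.5  # threshold quoted in the caption; pass direction is assumed


@dataclass
class StageScores:
    """Per-generation scores in [0, 1] for the four stages (hypothetical inputs)."""
    non_redundancy: float  # stage 1: questions not already answered by the issue
    diversity: float       # stage 2: questions not generic across issues
    answerability: float   # stage 3: can a user realistically answer them
    utility: float         # stage 4: utility and specificity


def staged_reward(s: StageScores) -> float:
    """Gate on stages 1 and 2, then score stages 3 and 4 (assumed combination)."""
    if s.non_redundancy < GATE_THRESHOLD:
        return 0.0   # filtered at stage 1
    if s.diversity < GATE_THRESHOLD:
        return 0.25  # passed stage 1 only (assumed partial credit)
    # Passed both gates: base credit plus equally weighted later-stage scores.
    return 0.5 + 0.25 * s.answerability + 0.25 * s.utility


print(staged_reward(StageScores(0.9, 0.8, 0.6, 0.4)))  # 0.75
```

Gating early stages means a generation full of redundant or generic questions cannot earn reward from later stages, which is one plausible way to obtain the progressive filtering the caption describes.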

**Figure 5.** Mean reward progression during GRPO training. The reward increases steadily, indicating successful policy optimization toward generating non-redundant, novel, answerable, and useful clarification questions. (a) Stage 1: Redundancy (b) Stage 2: Novelty (c) Stage 3: Answerability (d) Stage 4: Utility & Specificity
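For readers unfamiliar with GRPO, its core step is to score each sampled generation relative to the other generations drawn for the same prompt. The sketch below shows that group-relative normalization under stated assumptions: the reward values are invented, and the use of the population standard deviation with a small epsilon is one common convention, not a detail confirmed by the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four clarification sets generated for one issue.
rewards = [0.0, 0.25, 1.0, 0.75]
adv = group_relative_advantages(rewards)
for r, a in zip(rewards, adv):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Generations that beat their group's average receive positive advantages and are reinforced, so a rising mean reward curve indicates the policy is shifting toward higher-scoring question sets.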

**Figure 6.** Stage-wise reward progression during training. Each stage captures a different quality dimension of clarification questions, and the combined optimization across all stages drives the policy toward high-quality question generation.

Original abstract

Humans often specify tasks incompletely, so assistants must know when and how to ask clarifying questions. However, effective clarification remains challenging in software engineering tasks as not all missing information is equally valuable, and questions must target information users can realistically provide. We study clarification in real software engineering tasks by quantifying which types of information most affect task success and which questions elicit useful responses from simulated users. Using Shapley attribution and distributional comparisons, we identify two key properties of effective clarification: task relevance (which information predicts success) and user answerability (what users can realistically provide). We operationalize these properties as multi-stage reinforcement learning rewards to train CLARITI, an 8B-parameter clarification module, that matches GPT-5's resolution rate on underspecified issues while generating 41% fewer questions. Our results suggest that grounding reward design in empirical analysis of information impact and user answerability improves clarification efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CLARITI grounds its RL rewards in Shapley analysis of software engineering task information and in simulated-user answerability, cutting questions by 41% while matching GPT-5's resolution rate, but the simulation-to-real gap is unaddressed.


The core contribution here is turning two empirically derived measures, task relevance (from Shapley attribution) and user answerability (from distributional comparisons on simulated users), into multi-stage RL rewards for an 8B clarification model. That produces the reported result of matching GPT-5's resolution rate on underspecified SE issues while asking 41% fewer questions. The method is a clear step past generic uncertainty or entropy rewards because it starts from an empirical breakdown of which missing pieces actually move task success and which ones simulated users can supply.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces CLARITI, an 8B-parameter clarification module for software engineering tasks. It uses Shapley attribution on task-success data and distributional comparisons on simulated users to identify task relevance and user answerability properties, which are then operationalized as multi-stage RL rewards. The resulting model is claimed to match GPT-5's resolution rate on underspecified issues while generating 41% fewer questions.

Significance. If the simulation-to-real transfer holds, the work offers a data-driven method for efficient clarification in AI coding assistants, reducing unnecessary questions without loss of resolution. The empirical grounding of rewards via Shapley analysis and answerability distributions is a methodological strength that could influence reward design in other interactive SE agents.

major comments (2)

[Evaluation] Evaluation section: the 41% reduction claim and GPT-5 comparison rest entirely on interactions with the same simulated users used to compute Shapley values and answerability distributions. No cross-validation, ablation on real engineers, or human-subject study is reported, so it is unclear whether the learned policy preserves resolution rates outside the simulator.
[§4] §4 (RL training): the multi-stage reward formulation derived from Shapley and distributional analysis is described at a high level, but the precise weighting between relevance and answerability terms, the RL algorithm details (e.g., PPO hyperparameters, reward scaling), and training curves or variance across seeds are not provided, making it impossible to assess whether the reported efficiency gain is robust.

minor comments (3)

[§3] The abstract and §3 refer to 'simulated users' without specifying the simulation model, prompt templates, or diversity of the simulated engineer population, which affects interpretability of the Shapley results.
[Results] Table or figure reporting the 41% reduction should include error bars, number of tasks, and statistical significance test against the GPT-5 baseline.
[Methodology] Notation for the multi-stage reward (e.g., how task relevance and answerability are combined across stages) is introduced without an explicit equation, complicating reproduction.
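One way the error bars requested in the [Results] comment could be produced is a paired bootstrap over per-task outcomes, sketched below. The outcome lists are invented for illustration and the function name is hypothetical; nothing here reflects the paper's actual data or the analysis the authors ran.

```python
import random


def paired_bootstrap_diff(a, b, n_resamples=5000, seed=0):
    """95% percentile interval for mean(a) - mean(b), resampling tasks jointly."""
    assert len(a) == len(b), "outcomes must be paired per task"
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        # Draw task indices with replacement so each pair stays together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples) - 1]


# Hypothetical per-task resolution outcomes (1 = resolved) on the same tasks.
clariti_resolved = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
gpt5_resolved = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

low, high = paired_bootstrap_diff(clariti_resolved, gpt5_resolved)
print(f"95% CI for resolution-rate difference: [{low:+.2f}, {high:+.2f}]")
```

An interval that contains zero would be consistent with the claim that the two systems resolve issues at a similar rate; the same resampling could be applied to per-task question counts to put error bars on the 41% reduction.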

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and indicate the changes made in the revised version.


Referee: [Evaluation] Evaluation section: the 41% reduction claim and GPT-5 comparison rest entirely on interactions with the same simulated users used to compute Shapley values and answerability distributions. No cross-validation, ablation on real engineers, or human-subject study is reported, so it is unclear whether the learned policy preserves resolution rates outside the simulator.

Authors: We agree that the reported results, including the 41% reduction and GPT-5 comparison, are obtained exclusively within the simulated user model that was also used to compute the Shapley values and answerability distributions. This design enables controlled, large-scale, and repeatable experiments that isolate the contribution of the empirically derived rewards. The manuscript scopes its claims to this simulated setting, as described in the abstract and evaluation sections. To address the concern regarding generalization, we have added a new Limitations subsection in the Discussion that explicitly notes the simulation-to-real gap and identifies human-subject studies with real engineers as an important direction for future work. This revision clarifies the current scope without overstating the results.

revision: partial

Referee: [§4] §4 (RL training): the multi-stage reward formulation derived from Shapley and distributional analysis is described at a high level, but the precise weighting between relevance and answerability terms, the RL algorithm details (e.g., PPO hyperparameters, reward scaling), and training curves or variance across seeds are not provided, making it impossible to assess whether the reported efficiency gain is robust.

Authors: We appreciate the request for greater implementation detail to support reproducibility and assessment of robustness. In the revised manuscript we have expanded Section 4 to include the precise weighting coefficients between the task-relevance and user-answerability reward terms, the complete PPO configuration (including hyperparameters and reward scaling), and we have added training curves together with performance variance across multiple random seeds to a new appendix. These additions directly respond to the comment and allow readers to evaluate the stability of the efficiency gains.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent empirical analysis


The paper first performs Shapley attribution and distributional comparisons on simulated-user data to identify task relevance and user answerability properties, then operationalizes those properties as RL rewards for training the CLARITI model. This sequence does not reduce any claimed prediction or result to its own inputs by construction, nor does it rely on self-citation, self-definition, or renaming of fitted parameters. The final performance comparison (matching GPT-5 resolution rate with 41% fewer questions) is presented as an outcome of the trained policy rather than a quantity that was statistically forced by the reward design itself. No equations or load-bearing steps in the provided text exhibit the required explicit reduction to prior fitted values or self-referential loops. The derivation remains self-contained against its stated empirical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides limited detail; no explicit free parameters, axioms, or invented entities are identifiable beyond the trained model artifact CLARITI itself.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
cs.AI 2026-05 unverdicted novelty 7.0

π-Bench is a new evaluation suite that jointly measures proactivity and task completion in AI agents across sustained multi-turn workflows containing hidden intents and cross-session continuity.

π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
cs.AI 2026-05 unverdicted novelty 7.0

π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.
