StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation
Pith reviewed 2026-05-10 11:51 UTC · model grok-4.3
The pith
Reformulating coding problems as natural language narratives improves LLM code generation by guiding structured reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Converting code problems into three-part narratives consisting of a task overview, constraints, and example test cases, selected according to algorithm and genre, supplies the structured context that lets language models reason more effectively and generate higher-quality code.
What carries the argument
The StoryCoder narrative reformulation that turns a raw problem statement into a coherent natural-language story built from task overview, constraints, and test cases, guided by algorithm choice and genre.
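The three-part structure described above can be pictured as a small data container. This is only an illustrative sketch: the class name `Narrative`, its field names, the section titles, and the example story are all hypothetical and are not taken from the StoryCoder code release.

```python
from dataclasses import dataclass


@dataclass
class Narrative:
    """Hypothetical container for the three narrative components."""
    algorithm: str       # algorithm category that guides the story
    genre: str           # narrative genre chosen to fit the algorithm
    task_overview: str
    constraints: str
    example_tests: str

    def render(self) -> str:
        """Join the three components into one prompt, in a fixed order."""
        sections = [
            ("Task Overview", self.task_overview),
            ("Constraints", self.constraints),
            ("Example Test Cases", self.example_tests),
        ]
        return "\n\n".join(f"{title}\n{body.strip()}" for title, body in sections)


# Invented example problem, for illustration only.
story = Narrative(
    algorithm="Sorting and Searching",
    genre="detective mystery",
    task_overview="A detective must find the one locker number that appears twice.",
    constraints="There are at most 100000 lockers, numbered from 1 to 100000.",
    example_tests="Input: 1 3 2 3 -> Output: 3",
)
prompt = story.render()
```

In this sketch the algorithm and genre are stored alongside the text but not rendered, reflecting the description that they guide how the story is written rather than appearing as separate sections.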
If this is right
- Models shift toward correct algorithmic choices instead of defaulting to incorrect ones.
- Implementation errors decrease because constraints and examples are presented in connected prose.
- Generated code becomes more modular and easier to maintain.
- The same structured-representation benefit appears across model scales and architectures.
- Structured problem representation matters independently of how large or capable the underlying model is.
Where Pith is reading between the lines
- The same narrative approach could be tested on non-code reasoning tasks such as math word problems to check whether coherence helps there too.
- Genre alignment may prove useful for tailoring prompts in other domains where problems have natural story-like structures.
- If narrative coherence is the active ingredient, automated ways to generate such stories without manual genre selection could be developed and measured.
Load-bearing premise
The performance gains stem specifically from the narrative structure and genre alignment rather than from simply making the prompt longer or adding more details.
What would settle it
Running the same experiments with length-matched prompts that lack narrative coherence and genre alignment but contain equivalent information, then observing whether the accuracy gains disappear.
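One way such a control could be built is sketched below, assuming the narrative's facts are available as separate strings. The function names are hypothetical, and whitespace splitting is a crude stand-in for the model's real tokenizer.

```python
def bullet_control(overview: str, constraints: list[str], tests: list[str]) -> str:
    """Present the same facts as a flat bullet list, with no connecting story."""
    lines = [f"- Task: {overview}"]
    lines += [f"- Constraint: {c}" for c in constraints]
    lines += [f"- Example: {t}" for t in tests]
    return "\n".join(lines)


def length_ratio(prompt_a: str, prompt_b: str) -> float:
    """Ratio of whitespace-token counts (assumption: a proxy for real tokens)."""
    return len(prompt_a.split()) / len(prompt_b.split())


def is_length_matched(prompt_a: str, prompt_b: str, tolerance: float = 0.1) -> bool:
    """True when the two prompts differ in length by at most `tolerance`."""
    return abs(length_ratio(prompt_a, prompt_b) - 1.0) <= tolerance


control = bullet_control(
    "Find the duplicated locker number.",
    ["At most 100000 lockers."],
    ["Input: 1 3 2 3 -> Output: 3"],
)
```

If the accuracy gain survives when `control` replaces the narrative at matched length, length and content explain it; if the gain disappears, narrative form is the more likely cause.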
Original abstract
Effective code generation requires both model capability and a problem representation that carefully structures how models reason and plan. Existing approaches augment reasoning steps or inject specific structure into how models think, but leave scattered problem conditions unchanged. Inspired by the way humans organize fragmented information into coherent explanations, we propose StoryCoder, a narrative reformulation framework that transforms code generation questions into coherent natural language narratives, providing richer contextual structure than simple rephrasings. Each narrative consists of three components: a task overview, constraints, and example test cases, guided by the selected algorithm and genre. Experiments across 11 models on HumanEval, LiveCodeBench, and CodeForces demonstrate consistent improvements, with an average gain of 18.7% in zero-shot pass@10. Beyond accuracy, our analyses reveal that narrative reformulation guides models toward correct algorithmic strategies, reduces implementation errors, and induces a more modular code structure. The analyses further show that these benefits depend on narrative coherence and genre alignment, suggesting that structured problem representation is important for code generation regardless of model scale or architecture. Our code is available at https://github.com/gu-ni/StoryCoder.
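The abstract does not say how pass@10 is estimated. A common choice for HumanEval-style evaluation is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The sketch below assumes that estimator; whether StoryCoder uses it is not confirmed by the text here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem
    c: samples that pass all tests
    k: budget of attempts being scored
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 20 samples of which 2 pass, pass@10 is 1 - (10/20) * (9/19), about 0.763.
example = pass_at_k(n=20, c=2, k=10)
```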
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes StoryCoder, a framework that reformulates code-generation problems into coherent natural-language narratives consisting of a task overview, constraints, and example test cases, with the narrative guided by a chosen algorithm and genre. It reports that this approach yields consistent gains over standard zero-shot prompting, with an average 18.7% improvement in pass@10 across 11 models evaluated on HumanEval, LiveCodeBench, and CodeForces. Additional qualitative and quantitative analyses claim that the narratives steer models toward correct algorithmic strategies, reduce implementation errors, and produce more modular code, with benefits depending on narrative coherence and genre alignment.
Significance. If the reported gains can be shown to arise specifically from narrative structure rather than from increased prompt length or the inclusion of test cases, the work would meaningfully advance prompt-engineering techniques for code generation by highlighting the value of coherent, genre-aligned problem representations. The broad evaluation across models and benchmarks plus the public release of code are clear strengths that support reproducibility.
major comments (3)
- §4 (Experimental Evaluation): The 18.7% average pass@10 improvement is measured exclusively against standard zero-shot prompts; no ablation matches total token count or presents the identical test cases and constraints in a non-narrative format (e.g., bullet list). Without these controls, the causal effect of narrative reformulation is not isolated, which directly undermines the central claim that the observed benefits stem from coherence and genre alignment rather than incidental factors.
- §4.1–4.3 (Results and Analyses): No statistical significance tests, confidence intervals, or details on exact data splits, baseline prompt lengths, or content-matched controls are reported. This absence makes it impossible to assess whether the average gain is robust or whether post-hoc choices of narrative wording or test-case selection drive the results.
- §5 (Ablation and Qualitative Analysis): The claim that benefits “depend on narrative coherence and genre alignment” is supported only by qualitative inspection and a limited set of coherence/genre variants; quantitative metrics (e.g., token-count-matched baselines or genre-mismatched controls) are not provided, leaving the dependence on narrative form unproven.
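A genre-mismatched control of the kind requested in the §5 comment could be set up roughly as follows. This is a sketch under assumptions: the function name, the genre pool, and the idea of sampling one alternative genre per problem are illustrative and do not describe the paper's procedure.

```python
import random


def pick_mismatched_genre(aligned: str, pool: list[str], rng: random.Random) -> str:
    """Return a genre from `pool` that differs from the aligned genre."""
    candidates = [g for g in pool if g != aligned]
    if not candidates:
        raise ValueError("pool needs at least one genre other than the aligned one")
    return rng.choice(candidates)


# Hypothetical genre pool; the real set of genres is not given in this text.
GENRES = ["detective mystery", "space opera", "fairy tale", "heist thriller"]

rng = random.Random(0)  # fixed seed so the control assignment is reproducible
conditions = [
    {"aligned": genre, "mismatched": pick_mismatched_genre(genre, GENRES, rng)}
    for genre in GENRES
]
```

Holding the algorithm, constraints, and test cases fixed while swapping only the genre would let any accuracy difference be attributed to genre alignment.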
minor comments (2)
- [Abstract / §1] The abstract and introduction would benefit from a brief explicit statement of the total token budgets used for StoryCoder versus baseline prompts.
- [Tables/Figures] Figure captions and table headers should clarify whether pass@10 values are averaged over all problems or only solved problems.
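The averaging question in the last minor comment matters because the two conventions can give very different headline numbers. A minimal sketch with invented per-problem scores (the function name and the scores are hypothetical):

```python
def mean_pass_rate(per_problem: list[float], solved_only: bool = False) -> float:
    """Average per-problem pass@k over all problems or only solved ones."""
    if solved_only:
        values = [p for p in per_problem if p > 0]
    else:
        values = list(per_problem)
    if not values:
        raise ValueError("no problems to average over")
    return sum(values) / len(values)


# Four hypothetical problems, two of which were never solved.
scores = [1.0, 0.5, 0.0, 0.0]
over_all = mean_pass_rate(scores)                       # 0.375
over_solved = mean_pass_rate(scores, solved_only=True)  # 0.75
```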
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects for strengthening the causal claims in our work. We address each major comment below and commit to incorporating the suggested analyses and controls in a revised version of the manuscript.
- Referee: §4 (Experimental Evaluation): The 18.7% average pass@10 improvement is measured exclusively against standard zero-shot prompts; no ablation matches total token count or presents the identical test cases and constraints in a non-narrative format (e.g., bullet list). Without these controls, the causal effect of narrative reformulation is not isolated, which directly undermines the central claim that the observed benefits stem from coherence and genre alignment rather than incidental factors.
  Authors: We agree that additional controls are necessary to isolate the effect of narrative structure from potential confounds such as prompt length or the mere inclusion of test cases. Our current results demonstrate consistent improvements across diverse models and benchmarks, but we acknowledge the need for more rigorous ablations. In the revised manuscript, we will introduce new experiments that match total token counts and present the same information (task overview, constraints, and test cases) in a non-narrative bullet-point format. These controls will be added to §4 to better support the attribution to narrative reformulation.
  Revision: yes
- Referee: §4.1–4.3 (Results and Analyses): No statistical significance tests, confidence intervals, or details on exact data splits, baseline prompt lengths, or content-matched controls are reported. This absence makes it impossible to assess whether the average gain is robust or whether post-hoc choices of narrative wording or test-case selection drive the results.
  Authors: We recognize the importance of statistical rigor and transparency in reporting. While some details on data splits and prompt construction are provided in the appendix, we will expand the main text in §4.1–4.3 to include statistical significance tests (e.g., paired t-tests), 95% confidence intervals for the pass@10 metrics, exact baseline prompt lengths, and clearer descriptions of content-matched controls. This will allow readers to better evaluate the robustness of the reported gains.
  Revision: yes
- Referee: §5 (Ablation and Qualitative Analysis): The claim that benefits “depend on narrative coherence and genre alignment” is supported only by qualitative inspection and a limited set of coherence/genre variants; quantitative metrics (e.g., token-count-matched baselines or genre-mismatched controls) are not provided, leaving the dependence on narrative form unproven.
  Authors: The current support for the dependence on coherence and genre alignment relies on qualitative examples and a small number of variants. To address this, we will augment §5 with quantitative ablations, including token-count-matched baselines and genre-mismatched controls (e.g., using a mismatched genre while keeping other elements fixed). These additions will provide stronger quantitative evidence for the role of narrative form and will be included in the revised manuscript.
  Revision: yes
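The confidence intervals promised in the responses above could be computed in several ways. The sketch below uses a paired percentile bootstrap over per-problem differences rather than the paired t-test the authors mention; the function name, the resample count, and the sample scores are assumptions made for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_ci(
    baseline: list[float],
    treated: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for treated - baseline."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    # Pair scores problem by problem so that problem difficulty cancels out.
    diffs = [t - b for b, t in zip(baseline, treated)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample problems with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Hypothetical per-problem pass@10 scores under the two prompt styles.
base_scores = [0.2, 0.5, 0.0, 0.9, 0.4, 0.1]
story_scores = [0.4, 0.6, 0.1, 0.9, 0.7, 0.3]
point, low, high = paired_bootstrap_ci(base_scores, story_scores)
```

An interval that excludes zero would suggest the gain is unlikely to be resampling noise, although it says nothing about whether narrative form or prompt length produced it.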
Circularity Check
No circularity: purely empirical claims with no derivation chain
The paper proposes StoryCoder as a narrative reformulation technique and supports its value solely through zero-shot experiments on HumanEval, LiveCodeBench, and CodeForces across 11 models, reporting average pass@10 gains. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Claims about guiding algorithmic strategies or inducing modularity are presented as post-hoc observations from the same runs rather than quantities defined in terms of themselves. No self-citations are used to justify uniqueness theorems or ansatzes, and the method is not shown to reduce to its inputs by construction. The evaluation is externally falsifiable via the benchmarks and code release, satisfying the criteria for a self-contained empirical result.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Large language models benefit from coherent, structured natural-language input when performing multi-step reasoning tasks such as code generation.
- domain assumption: The chosen algorithm and genre can be used to guide narrative construction without introducing new errors.
Reference graph
Works this paper leans on
- [4] Review the major categories of coding test algorithms:
  - Graph Algorithms
  - Dynamic Programming
  - Greedy Algorithms
  - Sorting and Searching
  - String Algorithms
  - Data Structures
  - Mathematics and Number Theory
  - Simulation and Implementation
- [5] Decide which algorithm category the given problem most closely belongs to. Then, select a narrative genre that naturally aligns with the chosen algorithm.
  ### Output Format:
  You must write the output in the exact following order with the specified headers:
  - Algorithm Category: (one of the categories above)
  - Narrative Genre: (a fitting genre of your choi...
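A model response that follows the output format in the excerpt above could be parsed along these lines. The sketch covers only the two headers visible in the truncated excerpt; the function name `parse_selection` and the strict category check are assumptions, not part of the released code.

```python
import re

# Categories copied from the prompt excerpt.
CATEGORIES = (
    "Graph Algorithms",
    "Dynamic Programming",
    "Greedy Algorithms",
    "Sorting and Searching",
    "String Algorithms",
    "Data Structures",
    "Mathematics and Number Theory",
    "Simulation and Implementation",
)


def parse_selection(output: str) -> dict[str, str]:
    """Extract the 'Algorithm Category' and 'Narrative Genre' fields."""
    fields = {}
    for header in ("Algorithm Category", "Narrative Genre"):
        # Allow leading list markers such as "- " before the header.
        pattern = rf"^\W*{re.escape(header)}[ \t]*:[ \t]*(.+)$"
        match = re.search(pattern, output, re.MULTILINE)
        if match is None:
            raise ValueError(f"missing header: {header}")
        fields[header] = match.group(1).strip()
    if fields["Algorithm Category"] not in CATEGORIES:
        raise ValueError(f"unknown category: {fields['Algorithm Category']}")
    return fields


reply = "- Algorithm Category: Graph Algorithms\n- Narrative Genre: heist thriller\n"
selection = parse_selection(reply)
```

Rejecting unknown categories is a deliberate choice in this sketch: it surfaces responses that drift from the requested format instead of silently passing them to the narrative-writing step.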