
arxiv: 2605.21408 · v1 · pith:MWXXPYVFnew · submitted 2026-05-20 · 📊 stat.ME

TCARD: Nearly Balanced Two-Level Designs with Treatment Cardinality Constraints with an Application to LLM Prompt Engineering

Pith reviewed 2026-05-21 03:11 UTC · model grok-4.3

classification 📊 stat.ME
keywords two-level designs · treatment cardinality constraints · nearly balanced designs · generalized word-length pattern · balanced concurrence deviation · coordinate exchange algorithm · experimental design · LLM prompt engineering

The pith

Nearly balanced two-level designs under treatment cardinality constraints minimize the first two components of the generalized word-length pattern.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines two-level designs where each treatment includes a fixed number k of factors, known as TCARDs. Since exact balanced incomplete block designs cannot always be found, it focuses on nearly balanced versions. The authors prove that these nearly balanced TCARDs minimize the first two parts of the generalized word-length pattern. They also propose the Balanced Concurrence Deviation criterion to construct such designs by enforcing balanced factor appearances and uniform pairwise overlaps. This criterion links to several standard optimality ideas and is optimized using an efficient coordinate-exchange method, with examples in large language model prompt tuning.

Core claim

Nearly balanced TCARDs, whose design matrices have constant row sums equal to k, achieve the minimal values of the first two components of the generalized word-length pattern. Projection quality depends on each factor appearing equally often across treatments and every pair of factors appearing together the same number of times. The Φ_BCD objective is introduced to minimize deviations from these two regularities and is shown to relate to (M,S)-optimality, the centered UE(s²) criterion, and Bayesian D-optimality.

What carries the argument

The Φ_BCD criterion, which penalizes replication imbalance and concurrence dispersion to produce nearly balanced designs.
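
The page does not give the exact form of Φ_BCD, so the following is only a minimal sketch of a criterion in the same spirit: a weighted sum of the squared deviations of replications and of pairwise concurrences from their means. The function name `bcd_score`, the weights `w_rep` and `w_con`, and the squared-deviation form are assumptions made for illustration, not the authors' definition.

```python
from itertools import combinations

import numpy as np


def rows_to_design(blocks, p):
    """Build a 0/1 design matrix from a list of factor-index tuples (one per run)."""
    X = np.zeros((len(blocks), p), dtype=int)
    for i, block in enumerate(blocks):
        X[i, list(block)] = 1
    return X


def bcd_score(X, w_rep=1.0, w_con=1.0):
    """Hypothetical BCD-style score (assumed form, not the paper's formula)."""
    G = X.T @ X                                # diagonal: replications; off-diagonal: concurrences
    r = np.diag(G)
    lam = G[np.triu_indices_from(G, k=1)]      # one entry per factor pair
    rep_imbalance = np.sum((r - r.mean()) ** 2)
    con_dispersion = np.sum((lam - lam.mean()) ** 2)
    return w_rep * rep_imbalance + w_con * con_dispersion


# Two designs with n = 6 runs, p = 4 factors, k = 2 factors per run.
balanced = rows_to_design(list(combinations(range(4), 2)), p=4)   # every pair exactly once
clustered = rows_to_design([(0, 1)] * 3 + [(2, 3)] * 3, p=4)      # same replications, uneven pairs

print(bcd_score(balanced))    # 0.0
print(bcd_score(clustered))   # 12.0
```

Both designs have every factor replicated three times, so the difference in score comes entirely from the concurrence term, which is the part a replication-only check would miss.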

Load-bearing premise

That good projection behavior is determined by balanced factor replications and uniform pairwise concurrences.

What would settle it

Constructing a TCARD for specific n, p, k values that has unbalanced replications or non-uniform concurrences yet still achieves lower values in the first two generalized word-length pattern components than the nearly balanced ones.

Figures

Figures reproduced from arXiv: 2605.21408 by Kexin Xie, Ryan Lekivetz, Xinwei Deng.

Figure 1: Nearly balanced TCARD of Example 1 for (p, k, n) = (6, 3, 7). (a) The Gram matrix X⊤X (equivalently, the concurrence matrix Λ padded by the replications on the diagonal). Its diagonal entries (3, 3, 3, 4, 4, 4) are the replications r_j, and its off-diagonal entries are the pairwise concurrences λ_jℓ ∈ {1, 2} = {κ, κ + 1}. (b) Concurrence-excess graph G(X): vertices coloured by replication class, and edges j…
Figure 2: Aliasing metrics under the main comparison: …
Figure 3: Projection-based information quality (average of …
Figure 4: Relative improvement over the greedy-rep-pair baseline in F1, precision, recall, and MSE across experimental settings (…
Figure 5: Design-quality diagnostic comparing two models of the same response. Each cell shows …
Figure 6: Movement from baseline accuracy for each design, averaged across …
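
The Gram-matrix reading in the Figure 1 caption can be checked on a small design. The seven runs below are a hypothetical design built for this illustration, not the paper's Example 1 (which is not reproduced on this page); it has the same parameters (p, k, n) = (6, 3, 7) and happens to show the same pattern of replications and concurrences, up to the ordering of factors.

```python
import numpy as np

# Hypothetical 7-run design: each tuple lists the 3 factors (0-indexed) included in a run.
blocks = [(0, 1, 3), (1, 2, 4), (2, 3, 5), (0, 3, 4), (0, 4, 5), (1, 3, 5), (0, 1, 2)]

p = 6
X = np.zeros((len(blocks), p), dtype=int)
for i, block in enumerate(blocks):
    X[i, list(block)] = 1

G = X.T @ X                                    # Gram matrix of the 0/1 design
replications = np.diag(G)                      # r_j: number of runs containing factor j
concurrences = G[np.triu_indices(p, k=1)]      # lambda_jl: runs containing both j and l

print(X.sum(axis=1).tolist())                  # [3, 3, 3, 3, 3, 3, 3]
print(sorted(replications.tolist()))           # [3, 3, 3, 4, 4, 4]
print(sorted(set(concurrences.tolist())))      # [1, 2]
```

The replications must sum to nk = 21 and the concurrences to n·k(k−1)/2 = 21, so with 6 factors and 15 pairs neither can be constant; values differing by at most one are the closest attainable.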
Original abstract

Modern experimental designs often face the so-called treatment cardinality constraint, which is the constraint on the number of included factors in each treatment. Experiments with such constraints are commonly encountered in engineering simulation, AI system tuning, and large-scale system verification. This calls for the development of adequate designs to enable statistical efficiency for modeling and analysis within feasible constraints. In this work, we study two-level designs under this $k$-treatment cardinality constraint (TCARD), where the design matrix $\mathbf{X} \in \{0,1\}^{n \times p}$ has constant row sums equal to $k$. Although TCARDs are closely related to balanced incomplete block designs (BIBDs), exact BIBD structure is unavailable for many practical $(n,p,k)$ combinations. This leads to the notion of nearly balanced TCARDs, which we prove minimize the first two components of the generalized word-length pattern. We also show that good projection behavior in this setting is governed by two count-based regularities: balanced factor replications and uniform pairwise concurrences. Motivated by this characterization, we then propose the Balanced Concurrence Deviation ($\Phi_{\mathrm{BCD}}$), a model-free objective that jointly penalizes replication imbalance and concurrence dispersion. We further show that this criterion is closely connected to classical optimality principles, including $(M,S)$-optimality, centered $\mathrm{UE}(s^2)$ criterion, and Bayesian $D$-optimality. To construct designs minimizing $\Phi_{\mathrm{BCD}}$, we develop a coordinate-exchange (CE) algorithm with efficient incremental updates, together with a simulation-based procedure for calibrating the criterion weights to the intended downstream task. Numerical experiments confirm that the proposed method compares favorably with existing alternatives across a range of problem sizes and constraint strengths.
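
A rough sketch of how a row-sum-preserving exchange search might look is given below. It is not the authors' algorithm: flipping a single coordinate would break the constraint, so this sketch assumes the move is a swap of one included and one excluded factor within a row, it recomputes the score from scratch instead of using incremental updates, and it reuses an assumed squared-deviation score in place of the paper's Φ_BCD. The names `score` and `swap_exchange` and all parameters are hypothetical.

```python
import numpy as np


def score(X, w_rep=1.0, w_con=1.0):
    """Assumed BCD-style score: dispersion of replications plus dispersion of concurrences."""
    G = X.T @ X
    r = np.diag(G)
    lam = G[np.triu_indices_from(G, k=1)]
    return w_rep * np.sum((r - r.mean()) ** 2) + w_con * np.sum((lam - lam.mean()) ** 2)


def swap_exchange(n, p, k, seed=0, max_passes=50):
    """Local search over n x p 0/1 designs with every row sum fixed at k."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n, p), dtype=int)
    for i in range(n):
        X[i, rng.choice(p, size=k, replace=False)] = 1   # random feasible start
    best = score(X)
    for _ in range(max_passes):
        improved = False
        for i in range(n):
            for j in np.flatnonzero(X[i] == 1):          # factors currently in run i
                for l in np.flatnonzero(X[i] == 0):      # factors currently out of run i
                    X[i, j], X[i, l] = 0, 1              # tentative swap keeps the row sum
                    s = score(X)
                    if s < best - 1e-12:
                        best, improved = s, True
                        break                            # keep the swap
                    X[i, j], X[i, l] = 1, 0              # revert
        if not improved:
            break                                        # no improving swap left
    return X, best


X, best = swap_exchange(n=7, p=6, k=3)
print(X.sum(axis=1).tolist())   # [3, 3, 3, 3, 3, 3, 3]
```

The search stops at a local optimum of the assumed score, so several random starts (different `seed` values) would normally be compared.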

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces nearly balanced two-level designs under treatment cardinality constraints (TCARDs), where the n × p design matrix X has entries in {0,1} and constant row sums equal to k. It proves that such nearly balanced TCARDs minimize the first two components of the generalized word-length pattern. The authors characterize desirable projection properties via two count-based regularities—balanced factor replications and uniform pairwise concurrences—and propose the Balanced Concurrence Deviation criterion Φ_BCD that jointly penalizes replication imbalance and concurrence dispersion. Connections are established to (M,S)-optimality, centered UE(s²), and Bayesian D-optimality. A coordinate-exchange algorithm with incremental updates is developed to minimize Φ_BCD, together with a simulation-based weight calibration procedure. Numerical experiments and an application to LLM prompt engineering are presented.

Significance. If the central proofs and characterizations hold, the work supplies a practical, model-free route to efficient designs for experiments subject to cardinality constraints that arise in simulation, system verification, and AI tuning. The explicit links to classical optimality criteria and the efficient construction algorithm constitute clear strengths; the LLM application illustrates relevance beyond traditional DOE settings.

major comments (2)
  1. [Abstract and §2] The claim that good projection behavior is governed by balanced factor replications and uniform pairwise concurrences is used to motivate Φ_BCD. With constant row sums equal to k, however, factors are coupled within each row; this global constraint can induce higher-order dependencies in projections onto factor subsets that pairwise concurrence counts alone may not control. An explicit argument or small-scale counter-example showing that the two regularities remain sufficient under the cardinality constraint would be needed to secure the connection to optimality principles.
  2. [§3, Theorem 1] The proof that nearly balanced TCARDs minimize the first two generalized word-length pattern components is load-bearing for the subsequent development. The derivation should be checked to confirm that it fully incorporates the row-sum constraint rather than treating the counts as independent; any implicit assumption that higher-order terms vanish under the two regularities needs explicit statement.
minor comments (2)
  1. [§4] Notation: The precise definition of the weights inside Φ_BCD and the simulation-based calibration procedure would benefit from a dedicated algorithmic box or pseudocode to improve reproducibility.
  2. [§5] Figures 2–4: The boxplots comparing Φ_BCD against baselines would be clearer if the number of independent replications and the exact performance metric (e.g., average projection variance) were stated in the captions.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The two major comments raise important points about the motivation for our criterion and the details of the central proof. We address each below and will revise the manuscript accordingly to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract and §2] The claim that good projection behavior is governed by balanced factor replications and uniform pairwise concurrences is used to motivate Φ_BCD. With constant row sums equal to k, however, factors are coupled within each row; this global constraint can induce higher-order dependencies in projections onto factor subsets that pairwise concurrence counts alone may not control. An explicit argument or small-scale counter-example showing that the two regularities remain sufficient under the cardinality constraint would be needed to secure the connection to optimality principles.

    Authors: We agree that the row-sum constraint introduces coupling among factors and that an explicit verification of sufficiency is warranted. In the revised manuscript we will add a short subsection in §2 containing a small-scale numerical example (n=6, p=4, k=2) that compares designs satisfying the two regularities against alternatives that violate them. The example will demonstrate that, under the constant row-sum constraint, the first two generalized word-length pattern components (and the associated projection discrepancies) are indeed minimized precisely when the replication and concurrence regularities hold, with no residual higher-order effects appearing in the low-order projections relevant to our optimality arguments. revision: yes

  2. Referee: [§3, Theorem 1] The proof that nearly balanced TCARDs minimize the first two generalized word-length pattern components is load-bearing for the subsequent development. The derivation should be checked to confirm that it fully incorporates the row-sum constraint rather than treating the counts as independent; any implicit assumption that higher-order terms vanish under the two regularities needs explicit statement.

    Authors: We have re-examined the proof of Theorem 1. The derivation begins from the definition of the generalized word-length pattern for a 0-1 matrix with fixed row sums equal to k and expresses the relevant inner products directly in terms of the constrained row totals; the counts are therefore not treated as independent. To make this transparent, the revised proof will include an additional paragraph that (i) explicitly substitutes the row-sum constraint into the expressions for the first two word-length components and (ii) states that, once the nearly-balanced conditions are imposed, all higher-order contributions to these components are identically zero by the algebraic identity used in the proof. We believe these clarifications will fully address the concern while leaving the result unchanged. revision: yes
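
The (n=6, p=4, k=2) case mentioned in the first response can be explored numerically. The sketch below assumes a common convention for two-level designs: recode 0/1 as −1/+1 and take B_m as the sum of squared m-factor J-characteristics divided by n². The manuscript's own normalization is not shown on this page and may differ, and the two designs compared are chosen here for illustration, not taken from the paper.

```python
from itertools import combinations

import numpy as np


def first_two_gwlp(X):
    """Return (B1, B2) under the assumed -1/+1 recoding and 1/n^2 normalization."""
    X = np.asarray(X)
    n, p = X.shape
    Z = 2 * X - 1                              # 0/1 -> -1/+1
    J1 = Z.sum(axis=0)                         # one-factor J-characteristics
    J2 = (Z.T @ Z)[np.triu_indices(p, k=1)]    # two-factor J-characteristics
    return np.sum(J1 ** 2) / n**2, np.sum(J2 ** 2) / n**2


def rows_to_design(blocks, p):
    X = np.zeros((len(blocks), p), dtype=int)
    for i, block in enumerate(blocks):
        X[i, list(block)] = 1
    return X


balanced = rows_to_design(list(combinations(range(4), 2)), p=4)   # each pair once
clustered = rows_to_design([(0, 1)] * 3 + [(2, 3)] * 3, p=4)      # two pairs, three times each

b1_bal, b2_bal = first_two_gwlp(balanced)
b1_clu, b2_clu = first_two_gwlp(clustered)
print(b1_bal, b1_clu)   # 0.0 0.0
print(b2_clu)           # 6.0
```

Under this convention the balanced design gives B2 = 2/3 against 6 for the clustered one, while both have B1 = 0 because every factor appears in half of the runs. The row-sum constraint fixes the sum of the two-factor J-characteristics at n((2k − p)² − p)/2 = −12 for both designs, so only their dispersion differs.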

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper introduces nearly balanced TCARDs after noting the unavailability of exact BIBDs for many (n,p,k) tuples, then proves these minimize the first two generalized word-length pattern components and shows projection behavior is governed by balanced replications plus uniform pairwise concurrences. These are standard count-based characterizations in design theory; the subsequent definition of Φ_BCD as a penalty on imbalance and dispersion is motivated by but not identical to those counts, and the paper separately connects Φ_BCD to (M,S)-optimality, centered UE(s²), and Bayesian D-optimality via explicit arguments rather than by re-labeling. The simulation-based weight calibration is a practical tuning step for the LLM application and does not retroactively define the optimality claims. No quoted step equates a claimed result to its own inputs by construction, and no load-bearing premise collapses to a self-citation whose content is unverified outside the present work. The derivation therefore remains self-contained against external design-theoretic benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Central claims rest on the unavailability of exact BIBDs for practical (n,p,k), the count-based regularities governing projection, and simulation calibration of weights in Φ_BCD; no new physical entities postulated.

free parameters (1)
  • weights in Φ_BCD
    A simulation-based procedure for calibrating the criterion weights to the intended downstream task
axioms (1)
  • domain assumption: TCARDs are closely related to balanced incomplete block designs (BIBDs)
    Stated as background in the abstract



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We also show that good projection behavior in this setting is governed by two count-based regularities: balanced factor replications and uniform pairwise concurrences. ... propose the Balanced Concurrence Deviation (Φ_BCD), a model-free objective that jointly penalizes replication imbalance and concurrence dispersion.

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean absolute_floor_iff_bare_distinguishability echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    nearly balanced TCARDs, which we prove minimize the first two components of the generalized word-length pattern. ... B1 is minimized if and only if NB1 ... B2 is minimized if and only if NB2

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
