
arXiv: 2607.02512 · v1 · pith:EMOUFTYGnew · submitted 2026-07-02 · 💻 cs.LG · cs.AI · cs.CL

Program-as-Weights: A Programming Paradigm for Fuzzy Functions

Pith reviewed 2026-07-04 03:30 UTC · model glm-5.2

classification 💻 cs.LG · cs.AI · cs.CL
keywords fuzzy functions · neural compilation · LoRA · parameter-efficient fine-tuning · hypernetworks · small language models · local inference · program-as-weights

The pith

0.6B model matches 32B by compiling fuzzy functions into weights

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes that fuzzy functions — tasks like log triage, JSON repair, or intent classification that resist clean rule-based code — can be compiled into small neural artifacts rather than outsourced to a large language model API on every call. The central mechanism is Program-as-Weights (PAW): a 4B-parameter neural compiler reads a natural-language function specification and emits two things — a clean pseudo-program (a paraphrased description with examples) and a LoRA adapter (a small weight patch). These are loaded into a frozen 0.6B-parameter interpreter that then executes the function locally. The compiler is invoked once per function definition; all subsequent calls run offline on a lightweight model. The paper trains the LoRA compiler on FuzzyBench, a 10-million-example synthetic dataset spanning 800+ task categories, and shows that the resulting 0.6B interpreter executing PAW programs achieves 73.78% exact match on FuzzyBench, outperforming direct prompting of a 32B model (68.70%) at roughly one-fiftieth the inference memory. Quantized to 4-bit, the system runs at 30 tokens per second on a MacBook M3 with a 430 MB shared base and 23 MB per-program adapter. The paper also demonstrates that swapping the text compiler for a vision-language compiler extends the same paradigm to image-conditioned tasks without changing the interpreter.
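The deployment figures quoted above (a 430 MB shared base and a 23 MB adapter per program) suggest a simple footprint model for a library of compiled functions. The sketch below works through that arithmetic. It assumes, for illustration only, that sizes add linearly and that nothing else is shipped; the function name is hypothetical and not taken from the paper.

```python
SHARED_BASE_MB = 430  # 4-bit quantized interpreter, as quoted in the summary above
ADAPTER_MB = 23       # per-program adapter, as quoted in the summary above


def deployment_footprint_mb(num_programs: int) -> int:
    """Size of one shared base plus `num_programs` adapters, in MB.

    Assumes sizes add linearly (an illustrative assumption, not a claim
    from the paper).
    """
    if num_programs < 0:
        raise ValueError("num_programs must be non-negative")
    return SHARED_BASE_MB + ADAPTER_MB * num_programs


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(f"{n} program(s): {deployment_footprint_mb(n)} MB")
```

Under this model the marginal cost of each additional function is the adapter alone, which is the point of sharing one frozen interpreter across programs.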

Core claim

The paper's central claim is that a trained 4B-parameter compiler can, in a single forward pass, produce a LoRA adapter that specializes a frozen 0.6B-parameter language model to perform a specific fuzzy function at a level matching direct prompting of a 32B model, while the compiled artifact is a 23 MB file that runs offline. The compiler generates the adapter by reading a specification and a pseudo-program (a clean restatement with examples), extracting hidden states from learned prefix tokens, and projecting them through a shared-basis LoRA mapper into mixing coefficients over 64 rank-64 LoRA bases per module type. The paper finds that the simplest mapper design, mean-pooling hidden to…

What carries the argument

Program-as-Weights (PAW): a compiler-interpreter pair where a 4B neural compiler emits a hybrid program (discrete pseudo-program + continuous LoRA adapter) from a natural-language specification, and a frozen 0.6B interpreter executes it. The LoRA mapper uses mean-pooled hidden states from the compiler projected into mixing coefficients over shared learnable bases (64 bases, rank 64), injecting ~38.5M parameters per function. FuzzyBench-10M: a synthetic dataset of 10M (specification, input, output) triples across 800+ fuzzy task categories, generated by gpt-5.2 with a verified test split.
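The mapper described above (mean-pool the prefix hidden states, pass them through an MLP, then mix shared bases) can be sketched in a few lines of NumPy. This is an assumed reading of the description, not the paper's implementation: the ReLU activation, the MLP width, the separate coefficient heads for the A and B factors, and all names (`shared_basis_lora`, `toy_inputs`, `head_A`, `head_B`) are hypothetical. Only T = 64 prefix tokens, N = 64 bases, and rank r = 64 follow the text; the other dimensions are toy sizes.

```python
import numpy as np


def shared_basis_lora(H, params, bases_A, bases_B):
    """Map prefix hidden states H of shape (T, d) to one LoRA update (d_out, d_in).

    Assumed pipeline: mean-pool -> residual MLP -> two linear heads that
    produce mixing coefficients over shared bases.
    """
    h = H.mean(axis=0)                           # mean-pool over prefix positions -> (d,)
    hidden = np.maximum(h @ params["W1"], 0.0)   # ReLU is an assumption
    z = h + hidden @ params["W2"]                # residual MLP trunk -> (d,)
    alpha_A = z @ params["head_A"]               # mixing coefficients for A bases -> (N,)
    alpha_B = z @ params["head_B"]               # mixing coefficients for B bases -> (N,)
    A = np.tensordot(alpha_A, bases_A, axes=1)   # weighted sum of N bases -> (r, d_in)
    B = np.tensordot(alpha_B, bases_B, axes=1)   # weighted sum of N bases -> (d_out, r)
    return B @ A                                 # low-rank weight update


def toy_inputs(seed=0, T=64, d=32, N=64, r=64, d_in=128, d_out=128):
    """Random stand-ins for learned parameters; d, d_in, d_out are toy sizes."""
    rng = np.random.default_rng(seed)
    params = {
        "W1": rng.normal(scale=0.1, size=(d, 4 * d)),
        "W2": rng.normal(scale=0.1, size=(4 * d, d)),
        "head_A": rng.normal(scale=0.1, size=(d, N)),
        "head_B": rng.normal(scale=0.1, size=(d, N)),
    }
    bases_A = rng.normal(size=(N, r, d_in))
    bases_B = rng.normal(size=(N, d_out, r))
    H = rng.normal(size=(T, d))
    return H, params, bases_A, bases_B


if __name__ == "__main__":
    delta_W = shared_basis_lora(*toy_inputs())
    print(delta_W.shape)
```

In this sketch the per-function output is only the two coefficient vectors; the bases are shared across all functions, which is one way to read the "shared learnable bases" bottleneck.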

If this is right

  • If PAW-style compilation works broadly, developers could replace per-input LLM API calls in their codebases with locally-executable neural functions — gaining reproducibility, offline capability, and ~50x memory savings.
  • The compiler-interpreter split suggests a new software engineering workflow: large models are invoked once at build time to produce small, versioned, distributable artifacts, while small models serve as a fixed runtime.
  • The finding that GPT-2 124M achieves 54% with compiler-generated LoRAs suggests the approach could push fuzzy-function capability into extremely small models suitable for browser or edge deployment.
  • The modality-generalization result (swapping only the compiler for a vision-language model) implies the paradigm is not text-specific and could extend to audio, video, or other modalities as compiler backbones improve.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the FuzzyBench distribution is systematically narrower or more learnable than real-world fuzzy tasks, the 73.78% vs 68.70% gap may reflect distributional artifacts rather than a genuine capability transfer. Controlled evaluation on held-out real-world task distributions would be needed to confirm production readiness.
  • The coupled compiler-interpreter constraint (switching interpreters requires retraining the compiler) may limit the paradigm's practical adoption: as base models improve rapidly, the cost of retraining compilers could erode the efficiency gains.
  • The finding that simpler LoRA mapper designs outperform more expressive variants is consistent with a regularization effect — the shared-basis bottleneck may prevent overfitting to individual specifications, but this hypothesis is not tested in the paper.
  • The 96.09% ceiling set by the data-generating model suggests that PAW's current performance headroom may be bounded more by training data quality than by compiler architecture, implying that better data sources could yield further gains without architectural changes.

Load-bearing premise

Both the training data and the evaluation benchmark for PAW are generated by the same model family (gpt-5.2), with verification by a smaller model from the same family. If this synthetic distribution does not reflect the fuzzy tasks developers actually encounter, or if its output conventions are systematically easier for a small model to learn than real-world tasks would be, the headline performance comparison may not transfer to production use.

What would settle it

If PAW programs compiled from specifications drawn from a held-out, human-curated distribution of real developer fuzzy tasks (rather than gpt-5.2-generated ones) perform no better than direct prompting of the 0.6B base model, the compiler's contribution would be shown to depend on distributional artifacts in the training data rather than genuine task-generalization capability.

Figures

Figures reproduced from arXiv: 2607.02512 by Liliana Hotsko, Pengyu Nie, Stuart Shieber, Wentao Zhang, Woojeong Kim, Yuntian Deng.

Figure 1: Overview of the Program-as-Weights paradigm. Top: compile once in the cloud. A natural-language description of a fuzzy function (here, “classify if this is urgent”) is fed to a neural compiler, which produces a neural program. Bottom: run locally. A small frozen neural interpreter loads the compiled program and runs the user’s input (“Need your signature by EOD!”) to produce the output (“urgent”). The comp…
Figure 2: Text-to-LoRA instantiation of PAW (Section 3.2). Left. The trained LoRA compiler reads the function specification, the pseudo-program produced by an off-the-shelf prompted pseudo compiler Cp (not depicted), and a fixed sequence of learned prefix tokens; it emits prefix-position hidden states H. Middle. The LoRA mapper mean-pools H, passes it through an MLP, and projects into mixing coefficients that compos…
Figure 3: FuzzyBench-10M task-family distribution. 29 incremental thematic versions are mapped to 7 high-level families. “Core text processing & NLP” is the largest family because the v1 base layer (2.5M examples; 277 base categories) covers parsing, classification, NER, coreference, and sentiment; the remaining 7.5M examples spread across the other six families. safety/verification. The full per-version timeline (2…
Figure 4: Developer interface. Left: the compiler translates a natural-language specification into a neural program. Right: the interpreter loads this program and exposes it as a local function
Figure 5: Step 1: Compile a program from natural language. The user specifies a fuzzy function in natural language. Image inputs are also supported
Figure 6: Step 2: Interactively test the compiled program. Users can provide test inputs and inspect the corresponding outputs, enabling rapid validation and refinement before download.
Figure 7: Step 3: Execute the program locally via Python. Once compiled, the program can be loaded and invoked through a simple Python API; subsequent execution requires no internet access. B FuzzyBench Construction Prompts Figures 8 to 10 show the prompts used to generate the natural-language specifications. Half of the specifications are generated without exemplar examples (…
Figure 8: System prompt for generating specifications.
Figure 9: User prompt for generating specifications (no exemplar examples).
Figure 10: User prompt for generating specifications, with exemplar input/output pairs.
Figure 12: User prompt for generating input/output examples given a specification.
Figure 14: Compiler prompt, examples style. Used by the off-the-shelf reference compiler (Qwen3-4B-Instruct-2507) to generate the rollouts used during training. {pseudo_program} [INPUT] {task_input} [END_INPUT]
Figure 15: Interpreter prompt, minimal style. Role: PAW-Compiler. You will see an image. Produce a self-contained text representation of the image that a *blind* interpreter can later use to answer arbitrary questions about it. The interpreter sees only your output and never the image. Coverage requirements (apply all that exist in the image): - Transcribe every piece of legible text verbatim, in quotes, with its lo…
Figure 16: Compiler prompt for image-conditioned specifications.
Figure 17: Interpreter prompt for image-conditioned specifications.
Figure 18: Prefix-tuning precursor architecture (Section 3.3). (a) Compile. The user describes a fuzzy function (e.g., “extract the final answer”); the trained prefix compiler reads the description plus a handful of representative I/O examples and produces a per-example KV prefix, the “neural binary” that constitutes the compiled program. (b) Interpret. A small frozen interpreter loads the compiled KV prefix into i…
Figure 19: A library of compiled PAW programs. Three example natural-language function specifications (“Classify message urgency”, “Fix malformed JSON”, “Remove personal information”; left) are each compiled into a separate neural program (middle): a discrete pseudo-program in a fixed format plus a continuous per-example LoRA (depicted as red, blue, green adapters). At deployment time (right), all three programs are…
Figure 20: The Alien-Taboo case-study UI. The player describes the secret word (here, “moon”) in free text without using any of the listed taboo words (night, orbit, lunar, full); the alien “Zog” (a one-PAW-function compiled program) must guess the word from the description. Each player turn is served by a 0.6B Qwen3 PAW interpreter on a small server, with one PAW program (and per-program LoRA adapter) per languag…
Original abstract

Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter. A 0.6B Qwen3 interpreter executing PAW programs matches the performance of direct prompting of Qwen3-32B, while using roughly one fiftieth of the inference memory and running at 30 tokens/s on a MacBook M3. PAW reframes the foundation model from a per-input problem solver into a tool builder: invoked once per function definition, it produces a small reusable artifact whose subsequent calls per function application are cheap and offline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 7 minor

Summary. This paper introduces Program-as-Weights (PAW), a paradigm in which a natural-language specification of a 'fuzzy function' is compiled by a 4B-parameter neural compiler into a LoRA adapter (plus a discrete pseudo-program), which is then executed by a frozen 0.6B-parameter interpreter. The authors train the compiler on FuzzyBench-10M, a new 10M-example synthetic dataset they construct using gpt-5.2. The headline result is that the 0.6B PAW interpreter achieves 73.78% exact match on FuzzyBench, outperforming direct zero-shot prompting of Qwen3-32B (68.70%) at ~50x less memory. The paper also presents ablations (architectural variants, compiler-vs-no-compiler, noise robustness, quantization), multimodal extensions (swapping the compiler for a VLM), and five qualitative case studies.

Significance. The paper presents a well-engineered system with a clear and appealing conceptual framing: reframing the foundation model as a per-function tool builder rather than a per-input problem solver. The compiler-interpreter abstraction is clean, and the demonstration that a 4B compiler can emit LoRA adapters that specialize a 0.6B interpreter for arbitrary fuzzy functions is a meaningful contribution. Strengths include: (1) controlled ablations showing the compiler-generated LoRA substantially outperforms fixed LoRAs and full fine-tuning on the same base (Table 5); (2) thorough quantization sweeps demonstrating practical on-device deployment viability (Table 8, Appendix K); (3) release of code, a public demo, and a large-scale dataset; (4) the multimodal generalization experiment (Table 3) showing the abstraction holds when only the compiler is swapped. The work is relevant to the growing literature on hypernetworks, PEFT generation, and small-model deployment.

major comments (3)
  1. §6, Table 2: The headline claim that '0.6B PAW matches Qwen3-32B' is drawn exclusively from FuzzyBench (73.78% vs 68.70%), a benchmark the authors constructed. Table 2 also reports four external benchmarks (YouTube, SMS, Yelp, IMDB) where PAW (0.6B) underperforms Qwen3-32B on all four (e.g., YouTube 90.40% vs 93.60%, SMS 80.77% vs 89.04%). The abstract and introduction do not acknowledge this pattern; they present only the FuzzyBench comparison. The paper should explicitly state that the headline advantage holds on FuzzyBench but not on the external benchmarks, and discuss what this implies about the transferability of the claim.
  2. §5, §6: The FuzzyBench training data and test set are both generated by gpt-5.2. The test set is filtered for agreement between gpt-5-mini and gpt-5.2, but both the training labels and the verification standard come from the same model family. The PAW compiler is trained on 10M examples from this distribution (unseen specs, but same generative process), while Qwen3-32B is evaluated zero-shot. A system specialized to a distribution outperforming a generalist on that distribution is expected; the question is whether the distribution reflects real-world fuzzy functions. The paper should add a discussion of this confound and clarify that the FuzzyBench comparison is not an apples-to-apples evaluation of general capability.
  3. §6, Table 2: The choice of Qwen3-32B as the comparison point for the headline claim is favorable. Table 2 shows that gpt-oss-20B achieves 85.45% on FuzzyBench (zero-shot), substantially above PAW's 73.78%. The paper does not discuss this gap. While gpt-oss-20B is larger, it is an open-weight model that could also be quantized for local use. The paper should address why the 32B comparison is the most informative one, or at minimum acknowledge the gpt-oss-20B result.
minor comments (7)
  1. Table 2: The 'PS' (per-program shipping size) column reports 23 MB for PAW across all benchmarks, but the text in §9 mentions ~430 MB shared base plus 23 MB per-program. The table header or caption should clarify whether PS includes only the adapter or the total deployment footprint.
  2. §3.2, Eq. (3): The notation for mixing coefficients alpha^{A,B}_{l,m,n} uses a single superscript A,B but the summation in Eq. (3) uses separate alpha^A and alpha^B. Clarify whether these are distinct heads or a single head producing both.
  3. Table 1 vs Table 2: Table 1 reports Text-to-LoRA r=64 accuracy as 0.657, while Table 2 reports PAW (Qwen3 0.6B) at 0.7378. The text states Table 1 is at 'controlled comparison scale' but the relationship between these two numbers (same architecture, different training data scale?) should be stated explicitly.
  4. §7, Table 4: The default mapper accuracy is 0.6223, but Table 2 reports 0.7378 for the same configuration. Presumably Table 4 uses a subset or earlier checkpoint; this should be noted in the Table 4 caption.
  5. Appendix I, Table 12: The compiler-scaling study is labeled 'inconclusive' and uses only 0.6M training examples at epoch 1. This is fine as exploratory data, but the caption should note that the main results use 10M examples and 3 epochs, so these numbers are not directly comparable.
  6. §9: The five case studies are qualitative walkthroughs without controlled comparisons against Qwen3-32B on the same tasks. This is acceptable for illustration, but the text should state explicitly that these are demonstrations, not evaluations, so readers do not over-interpret the 93% ToolCall-15 score as a head-to-head result.
  7. Figure 3: The task-family percentages sum to 100% and the example counts (2.95M + 1.80M + 1.50M + 1.25M + 1.25M + 0.75M + 0.50M = 10.0M) match the stated dataset size. However, the figure caption says '29 incremental thematic versions are mapped to 7 high-level families' without explaining whether categories can belong to multiple families. Clarify whether these are post-deduplication counts.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for a careful and fair reading of the manuscript. All three major comments identify legitimate issues with how the headline results are framed relative to what the data actually shows. We agree with the substance of each point and will revise the manuscript accordingly. Below we address each in turn.

Point-by-point responses
  1. Referee: §6, Table 2: The headline claim that '0.6B PAW matches Qwen3-32B' is drawn exclusively from FuzzyBench, a benchmark the authors constructed. On the four external benchmarks (YouTube, SMS, Yelp, IMDB), PAW (0.6B) underperforms Qwen3-32B on all four. The abstract and introduction do not acknowledge this pattern.

    Authors: The referee is correct. The current abstract and introduction present the FuzzyBench comparison without qualifying that the advantage does not hold on the four external benchmarks. This is a framing problem we will fix. Specifically, we will: (1) revise the abstract to state that the 0.6B PAW interpreter matches Qwen3-32B on FuzzyBench but underperforms it on external benchmarks, making clear that the headline comparison is distribution-specific; (2) add a paragraph in §6 explicitly noting the cross-benchmark pattern, namely that PAW trails Qwen3-32B on YouTube (90.40% vs 93.60%), SMS (80.77% vs 89.04%), Yelp (95.82% vs 98.11%), and IMDB (90.64% vs 94.64%), and discussing what this implies: PAW's advantage comes from compiler-generated specialization to the FuzzyBench task distribution, and on narrower, well-defined external tasks where a 32B model's general capability suffices, the specialization benefit does not overcome the capacity gap. We agree this qualification should have been in the original submission.
    revision: yes

  2. Referee: §5, §6: FuzzyBench training and test data are both generated by gpt-5.2. The PAW compiler is trained on 10M examples from this distribution while Qwen3-32B is evaluated zero-shot. A system specialized to a distribution outperforming a generalist on that distribution is expected.

    Authors: This is a fair and important point. We will add a dedicated discussion in §6 acknowledging this confound explicitly. The key facts are: (a) both FuzzyBench training labels and the test-set verification standard come from the gpt-5.2 model family, so the PAW compiler is trained on data drawn from the same distribution it is evaluated on (though test specifications are unseen); (b) Qwen3-32B is evaluated zero-shot with no distribution-specific adaptation; (c) therefore the FuzzyBench comparison is not an apples-to-apples evaluation of general capability; it measures whether compiler-generated specialization to a task distribution can substitute for raw model scale on that distribution. We will state this plainly. We note that the external benchmarks (YouTube, SMS, Yelp, IMDB), where PAW does not have this distributional advantage, provide a partial corrective: PAW underperforms Qwen3-32B on all four, which is consistent with the referee's expectation. We will also note in §5 (as we already do in Appendix N) that broader external validation is in progress and that the five case studies in §9 are an initial step toward real-world validation beyond the synthetic distribution.
    revision: yes

  3. Referee: §6, Table 2: gpt-oss-20B achieves 85.45% on FuzzyBench zero-shot, substantially above PAW's 73.78%. The paper does not discuss this gap. The 32B comparison is favorable.

    Authors: We agree the paper should address the gpt-oss-20B result. We will add discussion in §6 acknowledging that gpt-oss-20B (85.45%) substantially outperforms PAW (73.78%) on FuzzyBench in zero-shot prompting, and that this gap is real and significant. We will also explain our choice of Qwen3-32B as the primary comparison point: the comparison is within the same model family (Qwen3), which isolates the effect of PAW specialization versus parameter scale without confounding by architecture or training-data differences. This makes the 0.6B-vs-32B comparison cleaner as a controlled test of whether compiler-generated adapters can substitute for scale within a family. However, we concede that this rationale was not stated in the paper and that the gpt-oss-20B result weakens the practical deployment argument: a user who can run a 20B model locally would get better FuzzyBench performance from direct prompting than from PAW. We will state this caveat plainly. The deployment argument for PAW is strongest in the regime where the user cannot or will not run a 20B model (e.g., the ~430 MB quantized interpreter running at 30 tok/s on a MacBook M3), but we should not imply that PAW dominates all local alternatives. We will revise the framing accordingly.
    revision: partial
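The cross-benchmark pattern discussed in the report and rebuttal can be checked directly from the quoted accuracies. The sketch below tabulates the PAW minus Qwen3-32B gap in percentage points, using only the figures quoted above; the helper names `gaps` and `paw_wins` are illustrative.

```python
# (PAW 0.6B, Qwen3-32B) accuracy in percent, as quoted in the report and rebuttal
SCORES = {
    "FuzzyBench": (73.78, 68.70),
    "YouTube": (90.40, 93.60),
    "SMS": (80.77, 89.04),
    "Yelp": (95.82, 98.11),
    "IMDB": (90.64, 94.64),
}


def gaps(scores):
    """Return PAW minus Qwen3-32B, in percentage points, per benchmark."""
    return {name: round(paw - ref, 2) for name, (paw, ref) in scores.items()}


def paw_wins(scores):
    """List the benchmarks on which PAW scores strictly higher."""
    return [name for name, gap in gaps(scores).items() if gap > 0]


if __name__ == "__main__":
    for name, gap in gaps(SCORES).items():
        print(f"{name:<11} {gap:+.2f} pp")
    print("PAW ahead on:", paw_wins(SCORES))
```

On these numbers PAW leads only on FuzzyBench (+5.08 points) and trails on all four external benchmarks, with the largest deficit on SMS (-8.27 points).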

Circularity Check

0 steps flagged

No formal circularity found; the evaluated system is architecturally distinct from the data generator, and concerns are about external validity rather than self-referential construction.

Full rationale

The paper's central claim (0.6B PAW interpreter matches Qwen3-32B on FuzzyBench) is an empirical benchmark result, not a first-principles derivation or prediction. The training objective (Eq. 4) is standard supervised likelihood. The LoRA mapper (Eq. 3) is a standard shared-basis linear combination. FuzzyBench is generated by gpt-5.2, but the trained and evaluated system (Qwen3-4B compiler, Qwen3-0.6B interpreter) is a different model family, test specifications are held out (80/10/10 split), and the paper explicitly acknowledges the synthetic-data limitation. No load-bearing argument depends on a self-citation chain, no uniqueness theorem is invoked, and no ansatz is smuggled from prior self-authored work. The concerns about FuzzyBench being self-constructed and the external benchmarks showing PAW underperforming Qwen3-32B are external validity issues, not circularity. The paper transparently reports both wins and losses. Score 1 reflects the minor concern that the benchmark is author-constructed, but this does not constitute formal circularity.
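The rationale relies on test specifications being held out under an 80/10/10 split. One way to guarantee that no specification leaks across splits is to assign the split per specification rather than per example. The sketch below does this with a deterministic hash; the hashing scheme, the `spec_id` field, and the function names are assumptions for illustration, and the paper's actual split procedure is not described in this summary.

```python
import hashlib


def split_for_spec(spec_id: str) -> str:
    """Deterministically assign a specification to train/val/test (about 80/10/10)."""
    digest = hashlib.sha256(spec_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10  # ten roughly equal buckets
    if bucket < 8:
        return "train"
    return "val" if bucket == 8 else "test"


def split_examples(examples):
    """Group (spec_id, input, output) triples by the split of their specification."""
    splits = {"train": [], "val": [], "test": []}
    for spec_id, x, y in examples:
        splits[split_for_spec(spec_id)].append((spec_id, x, y))
    return splits


if __name__ == "__main__":
    demo = [(f"spec-{i}", f"in-{j}", f"out-{j}") for i in range(1000) for j in range(3)]
    sizes = {name: len(rows) for name, rows in split_examples(demo).items()}
    print(sizes)
```

Because every example inherits the split of its specification, a held-out test specification never contributes training examples, which is the property the circularity check depends on.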

Axiom & Free-Parameter Ledger

6 free parameters · 4 axioms · 3 invented entities

The free parameters are architectural and training choices selected empirically on validation data, not fitted to the test set. The axioms are domain assumptions about the representativeness of synthetic data and the sufficiency of PEFT methods. The invented entities (FuzzyBench, LoRA mapper, pseudo-program) are all described in reproducible detail and ablated. The main risk is that the benchmark is self-constructed and self-verified, which creates a soft circularity in the evaluation even though the training and evaluation models are from different families.

free parameters (6)
  • LoRA rank r = 64
    Chosen by empirical comparison (Table 1: r=18 gives 56.5%, r=64 gives 65.7%). Not fitted to the test set but selected on validation.
  • Number of shared bases N = 64
    Stated in Section 3.2 as the default; chosen empirically.
  • Prefix token count T = 64
    Stated in Section 3.2; the number of learned prefix tokens fed to the compiler.
  • Learning rate = 2e-5
    Stated in Appendix G; standard for fine-tuning a 4B model.
  • LoRA mapper MLP architecture = single residual MLP trunk
    Selected over more expressive alternatives (Table 4); the simplest design was strongest.
  • Compiler depth-aligned layers L = one per interpreter layer, spaced uniformly by depth ratio
    Stated in Section 3.2; architectural choice for extracting hidden states.
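The last entry above (one compiler layer per interpreter layer, spaced uniformly by depth ratio) can be illustrated with a small index-selection helper. This is a sketch of one plausible rule, not the paper's: the rounding scheme and the function name are assumptions, and the layer counts in the usage example (36 and 28) are assumed values for illustration rather than figures from this summary.

```python
def depth_aligned_layers(num_compiler_layers: int, num_interpreter_layers: int) -> list[int]:
    """For each interpreter layer, pick the compiler layer at the same relative depth.

    Indices are 0-based. The first and last interpreter layers map to the
    first and last compiler layers; the rest are spaced uniformly.
    """
    if num_compiler_layers < 1 or num_interpreter_layers < 1:
        raise ValueError("layer counts must be positive")
    if num_interpreter_layers == 1:
        return [num_compiler_layers - 1]
    scale = (num_compiler_layers - 1) / (num_interpreter_layers - 1)
    return [round(i * scale) for i in range(num_interpreter_layers)]


if __name__ == "__main__":
    # Assumed layer counts for illustration only.
    print(depth_aligned_layers(36, 28))
```

When the compiler is deeper than the interpreter, this rule skips some compiler layers but never reuses one, so each interpreter layer reads hidden states from a distinct compiler depth.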
axioms (4)
  • domain assumption: A frozen 0.6B language model, when injected with a compiler-generated LoRA adapter and a pseudo-program, can approximate fuzzy functions well enough to match a 32B model's direct prompting.
    This is the core empirical assumption tested by the paper. It is not proven from first principles; it is validated (or not) by the experimental results in Table 2.
  • domain assumption: FuzzyBench's 10M gpt-5.2-generated examples are representative of the fuzzy functions developers actually encounter.
    Invoked implicitly throughout Section 5 and all main results. The paper argues for breadth (800+ categories, 29 thematic versions) but provides no comparison to real-world task distributions beyond the five qualitative case studies.
  • domain assumption: The verified test set (gpt-5-mini and gpt-5.2 agreement) provides a fair evaluation standard.
    Invoked in Section 5 for test set construction. Both models are from the same family as the data generator, which could bias the evaluation toward patterns that family produces.
  • domain assumption: Standard PEFT methods (LoRA, prefix-tuning) are sufficient as the continuous program form.
    Invoked in Section 3; the paper instantiates two PEFT methods and notes that 'future PEFTs possibly better still' (Section 1).
invented entities (3)
  • FuzzyBench-10M dataset (independent evidence)
    purpose: Training and evaluation data for the compiler; 10M (spec, input, output) triples across 800+ fuzzy task categories.
    Released publicly; the paper provides construction prompts (Appendix B) and per-version breakdown (Appendix F). The test set is verified by model agreement. However, the dataset is entirely synthetic and self-constructed.
  • LoRA mapper, shared-basis mixing architecture (independent evidence)
    purpose: Converts compiler hidden states into per-example LoRA weights via learned shared bases and mixing coefficients (eq. 3).
    The architecture is described in detail (Section 3.2) and ablated against alternatives (Table 4). It makes falsifiable predictions about which architectural choices help.
  • Pseudo-program, discrete component (independent evidence)
    purpose: A clean restatement of the user specification plus examples, generated by an off-the-shelf 4B model, fed to the interpreter alongside the LoRA.
    Tested via ablation (Table 7): removing the pseudo-program degrades performance, especially under noisy specifications. The prompt template is in Appendix C.


Reference graph

Works this paper leans on

94 extracted references · 16 canonical work pages · 3 internal anchors

  1. [1]

    Langley , title =

    P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

  2. [2]

    T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

  3. [3]

    M. J. Kearns , title =

  4. [4]

    Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

  5. [5]

    R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

  6. [6]

    Suppressed for Anonymity , author=

  7. [7]

    Newell and P

    A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

  8. [8]

    A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

  9. [9]

    FANT o M : A Benchmark for Stress-testing Machine Theory of Mind in Interactions

    Kim, Hyunwoo and Sclar, Melanie and Zhou, Xuhui and Bras, Ronan and Kim, Gunhee and Choi, Yejin and Sap, Maarten. FANT o M : A Benchmark for Stress-testing Machine Theory of Mind in Interactions. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.890

  10. [10]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Li, Xiang Lisa and Liang, Percy. Prefix-Tuning: Optimizing Continuous Prompts for Generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.353

  11. [11]

    Proceedings of the 38th International Conference on Machine Learning , pages =

    Learning Transferable Visual Models From Natural Language Supervision , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =

  12. [12]

    2021 , eprint=

    Evaluating Large Language Models Trained on Code , author=. 2021 , eprint=

  13. [13]

    Mankowitz and Esme Sutherland Robson and Pushmeet Kohli and Nando de Freitas and Koray Kavukcuoglu and Oriol Vinyals , title =

    Yujia Li and David Choi and Junyoung Chung and Nate Kushman and Julian Schrittwieser and Rémi Leblond and Tom Eccles and James Keeling and Felix Gimeno and Agustin Dal Lago and Thomas Hubert and Peter Choy and Cyprien de Masson d’Autume and Igor Babuschkin and Xinyun Chen and Po-Sen Huang and Johannes Welbl and Sven Gowal and Alexey Cherepanov and James M...

  14. [14]

    Locating and Editing Factual Associations in GPT

    Kevin Meng and David Bau and Alex J Andonian and Yonatan Belinkov. Locating and Editing Factual Associations in GPT. 2022

  15. [15]

    Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. 2022

  16. [16]

    Text-to-LoRA: Instant Transformer Adaption

    Rujikorn Charakorn and Edoardo Cetin and Yujin Tang and Robert Tjarko Lange. Text-to-LoRA: Instant Transformer Adaption. 2025

  17. [17]

    Generative Adapter: Contextualizing Language Models in Parameters with A Single Forward Pass

    Generative Adapter: Contextualizing Language Models in Parameters with A Single Forward Pass. The Thirteenth International Conference on Learning Representations

  18. [18]

    Continual Learning with Hypernetworks

    Continual Learning with Hypernetworks. International Conference on Learning Representations

  19. [19]

    Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks

    Karimi Mahabadi, Rabeeh and Ruder, Sebastian and Dehghani, Mostafa and Henderson, James. Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Pap...

  20. [20]

    Language Models are Few-Shot Learners

    Language Models are Few-Shot Learners. 2020

  21. [21]

    Neural Turing Machines

    Neural Turing Machines. 2014

  22. [22]

    Neural Programmer-Interpreters

    Neural Programmer-Interpreters. 2016

  23. [23]

    Thinking Like Transformers

    Thinking Like Transformers. 2021

  24. [24]

    A neural compiler

    Frédéric Gruau and Jean-Yves Ratajszczak and Gilles Wiber. A neural compiler. 1995. doi:10.1016/0304-3975(94)00200-3

  25. [25]

    Small Language Models are the Future of Agentic AI

    Small Language Models are the Future of Agentic AI. 2025

  26. [26]

    Rubio Manzano, Clemente. AI Commun. October 2012

  27. [27]

    The ALCHEmist: Automated Labeling 500x CHEaper than LLM Data Annotators

    Huang, Tzu-Heng and Cao, Catherine and Bhargava, Vaishnavi and Sala, Frederic. The ALCHEmist: Automated Labeling 500x CHEaper than LLM Data Annotators. doi:10.52202/079017-2003

  28. [28]

    Deng, Yuntian and Kanervisto, Anssi and Ling, Jeffrey and Rush, Alexander M. Proceedings of the 34th International Conference on Machine Learning - Volume 70. 2017

  29. [29]

    Deitke, Matt and Clark, Christopher and Lee, Sangho and Tripathi, Rohun and Yang, Yue and Park, Jae Sung and Salehi, Mohammadreza and Muennighoff, Niklas and Lo, Kyle and Soldaini, Luca and Lu, Jiasen and Anderson, Taira and Bransom, Erin and Ehsani, Kiana and Ngo, Huong and Chen, YenSung and Patel, Ajay and Yatskar, Mark and Callison-Burch, Chris and Hea...

  30. [30]

    MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks

    MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks. 2025

  31. [31]

    Qwen3 Technical Report

    Qwen3 Technical Report. 2025

  32. [32]

    AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model

    AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model. 2025

  33. [33]

    Qwen2.5 Technical Report

    Qwen2.5 Technical Report. 2025

  34. [34]

    Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches

    Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches. 2025

  35. [35]

    Large Language Models are Human-Level Prompt Engineers

    Large Language Models are Human-Level Prompt Engineers. The Eleventh International Conference on Learning Representations

  36. [36]

    Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Bochao and Feng, Bei ...

  37. [37]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024

  38. [38]

    Williams, Ronald J. Mach. Learn. May 1992. doi:10.1007/BF00992696

  39. [39]

    Policy Gradient Methods for Reinforcement Learning with Function Approximation

    Sutton, Richard S and McAllester, David and Singh, Satinder and Mansour, Yishay. Policy Gradient Methods for Reinforcement Learning with Function Approximation

  40. [40]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild. 2025

  41. [41]

    Jieyu Zhang and Yue Yu and Yinghao Li and Yujing Wang and Yaming Yang and Mao Yang and Alexander Ratner. 2021

  42. [42]

    Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

    Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation. arXiv preprint arXiv:2502.14846. 2025

  43. [43]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. arXiv preprint arXiv:2409.17146

  44. [44]

    Olmo 3

    Olmo 3. 2025

  45. [45]

    Python 3.14.2

  46. [46]

    Image-to-Markup Generation with Coarse-to-Fine Attention

    Image-to-Markup Generation with Coarse-to-Fine Attention. Proceedings of the 34th International Conference on Machine Learning. 2017

  47. [47]

    Markup-to-Image Diffusion Models with Scheduled Sampling

    Markup-to-Image Diffusion Models with Scheduled Sampling. The Eleventh International Conference on Learning Representations

  48. [48]

    HybridFlow: A Flexible and Efficient RLHF Framework

    HybridFlow: A Flexible and Efficient RLHF Framework. 2024

  49. [49]

    HyperSteer: Activation Steering at Scale with Hypernetworks

    HyperSteer: Activation Steering at Scale with Hypernetworks. 2025

  50. [50]

    Binding Language Models in Symbolic Languages

    Binding Language Models in Symbolic Languages. The Eleventh International Conference on Learning Representations (ICLR)

  51. [51]

    Learning to Generate Task-Specific Adapters from Task Description

    Ye, Qinyuan and Ren, Xiang. Learning to Generate Task-Specific Adapters from Task Description. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2021. doi:10.18653/v1/2021.acl-short.82

  52. [52]

    HINT: Hypernetwork Instruction Tuning for Efficient Zero- and Few-Shot Generalisation

    Ivison, Hamish and Bhagia, Akshita and Wang, Yizhong and Hajishirzi, Hannaneh and Peters, Matthew. HINT: Hypernetwork Instruction Tuning for Efficient Zero- and Few-Shot Generalisation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.631

  53. [53]

    Phang, Jason and Mao, Yi and He, Pengcheng and Chen, Weizhu. 2023

  54. [54]

    Learning to Compress Prompts with Gist Tokens

    Learning to Compress Prompts with Gist Tokens. Advances in Neural Information Processing Systems

  55. [55]

    Li, Yichuan and Ma, Xiyao and Lu, Sixing and Lee, Kyumin and Liu, Xiaohu and Guo, Chenlei. 2024

  56. [56]

    Language Models are Unsupervised Multitask Learners

    Language Models are Unsupervised Multitask Learners. 2019

  57. [57]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI. gpt-oss-120b & gpt-oss-20b Model Card. arXiv:2508.10925

  58. [58]

    Gerganov, Georgi et al. 2023

  59. [59]

    Qwen3-VL Technical Report

    Qwen3-VL Technical Report. 2025

  60. [60]

    Singh, Amanpreet and Natarajan, Vivek and Shah, Meet and Jiang, Yu and Chen, Xinlei and Batra, Dhruv and Parikh, Devi and Rohrbach, Marcus. Towards VQA Models That Can Read

  61. [61]

    Ha, David and Dai, Andrew M. and Le, Quoc V. 2017

  62. [62]

    Parameter-Efficient Transfer Learning for NLP

    Houlsby, Neil and Giurgiu, Andrei and Jastrzebski, Stanislaw and Morrone, Bruna and de Laroussilhe, Quentin and Gesmundo, Andrea and Attariyan, Mona and Gelly, Sylvain. Parameter-Efficient Transfer Learning for NLP. 2019

  63. [63]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    The Power of Scale for Parameter-Efficient Prompt Tuning. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)

  64. [64]

    Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng Lam and Du, Zhengxiao and Yang, Zhilin and Tang, Jie. 2022

  65. [65]

    Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

    Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. Advances in Neural Information Processing Systems

  66. [66]

    Compacter: Efficient Low-Rank Hypercomplex Adapter Layers

    Compacter: Efficient Low-Rank Hypercomplex Adapter Layers. Advances in Neural Information Processing Systems

  67. [67]

    Pfeiffer, Jonas and Kamath, Aishwarya and R. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL)

  68. [68]

    Liu, Shih-Yang and Wang, Chien-Yi and Yin, Hongxu and Molchanov, Pavlo and Wang, Yu-Chiang Frank and Cheng, Kwang-Ting and Chen, Min-Hung. 2024

  69. [69]

    Huang, Chengsong and Liu, Qian and Lin, Bill Yuchen and Pang, Tianyu and Du, Chao and Lin, Min. 2024

  70. [70]

    Zhang, Qingru and Chen, Minshuo and Bukharin, Alexander and Karampatziakis, Nikos and He, Pengcheng and Cheng, Yu and Chen, Weizhu and Zhao, Tuo. 2023

  71. [71]

    Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke. 2023

  72. [72]

    Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan. 2023

  73. [73]

    Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song. 2024

  74. [74]

    Charakorn, Rujikorn and Cetin, Edoardo and Uesaka, Shinnosuke and Lange, Robert Tjarko. arXiv:2602.15902

  75. [75]

    Latent Context Compilation: Distilling Long Context into Compact Portable Memory

    Latent Context Compilation: Distilling Long Context into Compact Portable Memory. 2026

  76. [76]

    Trojan, Bartosz and G. 2026

  77. [77]

    SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass

    Liu, Yewei and Wang, Xiyuan and Mao, Yansheng and Gelberg, Yoav and Maron, Haggai and Zhang, Muhan. SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass. arXiv:2602.06358

  78. [78]

    Finetuned Language Models are Zero-Shot Learners

    Finetuned Language Models are Zero-Shot Learners. The Tenth International Conference on Learning Representations (ICLR)

  79. [79]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Multitask Prompted Training Enables Zero-Shot Task Generalization. The Tenth International Conference on Learning Representations (ICLR)

  80. [80]

    Cross-Task Generalization via Natural Language Crowdsourcing Instructions

    Cross-Task Generalization via Natural Language Crowdsourcing Instructions. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL). 2022

Showing first 80 references.