arxiv: 2512.19701 · v2 · pith:KE5SEBCCnew · submitted 2025-12-08 · 💻 cs.LG · cs.AI

LASER: Language Model Regression for Semi-Structured Workflow Resource and Runtime Estimation

Pith reviewed 2026-05-21 17:23 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM regression · resource estimation · runtime prediction · semi-structured data · workflow scheduling · cloud computing · GitHub Actions

The pith

Fine-tuning language models on text-serialized workflow configurations produces accurate multi-target predictions of resource use and runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to predict compute needs and execution time for cloud jobs whose settings include shell commands, tool parameters, and dependency graphs. Standard machine-learning methods must flatten these details into tables, which discards relationships the authors say matter. LASER instead writes the entire job description as text, trains a language model to output the numbers, and adds two practical fixes: scientific notation so the model can handle values that differ by many orders of magnitude, and constrained decoding that forces every generated answer to be valid while cutting inference time by more than 30 percent. On chip-design workloads and a new benchmark built from more than half a million GitHub Actions runs, the resulting models beat both human experts and the best tabular baselines, and accuracy improves as models and data grow larger.

Core claim

LASER fine-tunes large language models on serialized representations of semi-structured workflow job configurations to perform multi-target regression for resource consumption and runtime. Scientific notation output encoding handles targets across many orders of magnitude, while constrained decoding with prefix filling guarantees output validity and reduces latency by over 30 percent. Full-attention fine-tuning outperforms sliding-window approaches on long contexts. The method is validated on large-scale chip design workloads and the GHARuntime benchmark derived from 580,000+ GitHub Actions runs, where it surpasses human experts and state-of-the-art tabular ML baselines and exhibits clear model- and data-scaling behavior.

What carries the argument

The LASER framework that converts semi-structured workflow configurations into plain text, fine-tunes an LLM for numerical regression, and applies scientific-notation encoding together with constrained decoding to produce valid multi-target outputs.
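
To make the scientific-notation encoding concrete, here is a minimal Python sketch. It illustrates the general idea under stated assumptions and is not the paper's implementation: the mantissa width, the `name=value` layout, and the target names `cpu_seconds` and `peak_mem_bytes` are all hypothetical.

```python
from typing import Mapping


def encode_target(value: float, mantissa_digits: int = 3) -> str:
    """Render a positive target as a fixed-width scientific-notation string."""
    if value <= 0:
        raise ValueError("this sketch only handles positive targets")
    # Python's 'e' format always emits a sign and at least two exponent digits.
    return f"{value:.{mantissa_digits}e}"


def decode_target(text: str) -> float:
    """Parse the generated string back into a number."""
    return float(text)


def encode_targets(targets: Mapping[str, float]) -> str:
    """Join several targets into one output string (layout is an assumption)."""
    return " ".join(f"{name}={encode_target(v)}" for name, v in targets.items())


# Hypothetical targets that differ by several orders of magnitude.
example = encode_targets({"cpu_seconds": 12345.678, "peak_mem_bytes": 3.2e9})
print(example)  # cpu_seconds=1.235e+04 peak_mem_bytes=3.200e+09
```

With a fixed number of mantissa digits, every target costs the same number of output characters and the rounding error stays bounded in relative terms (at most about 0.05 percent with three mantissa decimals), whether the value is a fraction of a second or billions of bytes.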

If this is right

  • Cloud schedulers can allocate resources more tightly because the predicted values are closer to actual consumption.
  • Engineers no longer need to hand-craft features that flatten commands and graphs into fixed vectors.
  • Prediction quality continues to rise when larger models or more historical runs are used.
  • The same text-based regression approach can be applied to other workflow systems that store jobs as hierarchical or command-rich objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Serialization plus constrained decoding may become a standard pattern for any regression task whose inputs are naturally expressed as text or trees.
  • Public release of the GHARuntime benchmark could let other groups test whether the same gains appear outside chip design and GitHub Actions.
  • Full-attention fine-tuning on long job histories might also improve models that forecast queue wait times or failure probabilities.

Load-bearing premise

Writing the job configuration as text keeps enough of its original meaning and structure for the language model to learn accurate numerical predictions.

What would settle it

Measure prediction error on a held-out set of workflows after deliberately removing or randomizing dependency-graph information during the serialization step; a large accuracy drop would indicate the claim does not hold.
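
The ablation described above could be set up roughly as follows. This is a sketch only: the paper's serialization format is not shown on this page, so the `job:` / `run:` / `needs:` text layout, the `serialize_workflow` function, and the example jobs are hypothetical.

```python
import random
from typing import Dict, List

Job = Dict[str, object]  # expects a "run" string and a "needs" list


def serialize_workflow(jobs: Dict[str, Job], mode: str = "keep", seed: int = 0) -> str:
    """Serialize jobs to text, optionally dropping or randomizing dependencies.

    mode="keep"      writes the true dependency edges
    mode="drop"      removes all dependency information
    mode="randomize" replaces each job's dependencies with the same number
                     of randomly chosen other jobs
    """
    if mode not in ("keep", "drop", "randomize"):
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)  # seeded so the ablated dataset is reproducible
    names = sorted(jobs)
    lines: List[str] = []
    for name in names:
        spec = jobs[name]
        needs = list(spec.get("needs", []))
        if mode == "drop":
            needs = []
        elif mode == "randomize":
            others = [n for n in names if n != name]
            needs = sorted(rng.sample(others, k=min(len(needs), len(others))))
        lines.append(f"job: {name}")
        lines.append(f"  run: {spec['run']}")
        lines.append(f"  needs: {', '.join(needs) if needs else '-'}")
    return "\n".join(lines)


# Hypothetical three-job workflow.
example_jobs = {
    "build": {"run": "make -j8", "needs": []},
    "test": {"run": "pytest -x", "needs": ["build"]},
    "deploy": {"run": "./deploy.sh", "needs": ["build", "test"]},
}
print(serialize_workflow(example_jobs, mode="drop"))
```

Fine-tuning and evaluating once per mode on the same held-out split would give the three error figures the test calls for. Randomizing while preserving each job's dependency count keeps the text length roughly constant, so a change in error is less likely to come from a shorter input alone.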

Original abstract

Accurate prediction of resource consumption and runtime for cloud workflow jobs is critical for scheduling efficiency, yet remains challenging due to the semi-structured nature of job configurations -- comprising shell commands, tool-specific parameters, dependency graphs, and hierarchical metadata. Traditional ML approaches require brittle feature engineering to flatten this rich information into fixed-size vectors, losing critical semantic context. We present LASER, a framework that fine-tunes LLMs on serialized workflow job configurations for multi-target resource and runtime regression. To address the challenges of numerical regression via generation, we introduce scientific notation output encoding for targets spanning multiple orders of magnitude, and constrained decoding with prefix filling to enforce output validity while reducing inference latency by over 30%. We further show that full-attention fine-tuning improves accuracy over sliding-window LLMs on long job contexts. Validated on large-scale chip design workloads, and GHARuntime, a new public benchmark derived from 580,000+ GitHub Actions runs across 27,000+ repositories, LASER outperforms human experts and SOTA tabular ML baselines, with clear model- and data-scaling behavior, establishing a new paradigm for LLM-based regression on semi-structured workflow data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LASER, a framework that fine-tunes LLMs on text-serialized representations of semi-structured workflow configurations (shell commands, dependency graphs, hierarchical metadata) for multi-target regression of resource consumption and runtime. It proposes scientific notation output encoding and constrained decoding with prefix filling to enable valid numerical generation while cutting inference latency by over 30%. Full-attention fine-tuning is shown to outperform sliding-window approaches on long contexts. Evaluation on proprietary large-scale chip design workloads and the new public GHARuntime benchmark (derived from 580k+ GitHub Actions runs across 27k+ repositories) claims outperformance versus human experts and SOTA tabular ML baselines, together with model- and data-scaling trends.

Significance. If the quantitative claims are substantiated, the work offers a practical alternative to brittle feature engineering for semi-structured workflow data, with direct relevance to cloud scheduling and CI/CD optimization. The release of GHARuntime constitutes a reusable community resource. The constrained-decoding technique provides a measurable engineering contribution. Observed scaling behavior, if robust, would support further investment in LLM-based regression for this domain.

major comments (2)
  1. [§5.2, Table 3] §5.2 and Table 3: the central claim that serialization preserves semantic context sufficient for superior regression is load-bearing, yet the manuscript provides no ablation in which tabular baselines receive explicitly parsed graph or tree features (e.g., adjacency matrices or hierarchical encodings) derived from the same dependency metadata. Without this control, performance differences cannot be confidently attributed to the serialization strategy rather than model capacity or benchmark construction.
  2. [§4.1, §6] §4.1 and §6: the reported outperformance over human experts and tabular baselines lacks error bars, standard deviations across random seeds or data splits, and statistical significance tests. Given that the abstract asserts clear superiority on both chip-design and GHARuntime workloads, these omissions weaken the evidential basis for the scaling and superiority conclusions.
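
For the control requested in major comment 1, explicit graph features could be derived from the same dependency metadata and handed to a tabular model. The following is a hypothetical sketch, not something taken from the manuscript: the `graph_features` function, the chosen features (adjacency matrix, in/out degree, dependency depth), and the example workflow are assumptions, and the code assumes the dependency graph is acyclic.

```python
from typing import Dict, List


def graph_features(needs: Dict[str, List[str]]) -> Dict[str, object]:
    """Turn a job -> prerequisites mapping into explicit graph features.

    Assumes every prerequisite is itself a key of `needs` and that the
    graph has no cycles (the depth recursion would not terminate otherwise).
    """
    names = sorted(needs)
    index = {name: i for i, name in enumerate(names)}
    n = len(names)

    # adjacency[i][j] == 1 means job i must finish before job j starts.
    adjacency = [[0] * n for _ in range(n)]
    for job, deps in needs.items():
        for dep in deps:
            adjacency[index[dep]][index[job]] = 1

    in_degree = {name: sum(row[index[name]] for row in adjacency) for name in names}
    out_degree = {name: sum(adjacency[index[name]]) for name in names}

    # Depth = length of the longest prerequisite chain leading to the job.
    depth: Dict[str, int] = {}

    def longest_chain(job: str) -> int:
        if job not in depth:
            depth[job] = 1 + max((longest_chain(d) for d in needs[job]), default=-1)
        return depth[job]

    for name in names:
        longest_chain(name)

    return {"names": names, "adjacency": adjacency,
            "in_degree": in_degree, "out_degree": out_degree, "depth": depth}


# Hypothetical workflow: deploy waits on build and test, test waits on build.
features = graph_features({"build": [], "test": ["build"], "deploy": ["build", "test"]})
print(features["depth"])
```

Per-job scalars such as degree and depth fit directly into a fixed-width feature table. The adjacency matrix varies in size with the workflow, so a tabular baseline would need padding or aggregation before using it.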
minor comments (2)
  1. [§3.3] §3.3: the description of prefix filling in constrained decoding would benefit from a short pseudocode listing or explicit token-level example to clarify how validity is enforced without introducing bias.
  2. [Figure 4] Figure 4: the scaling curves would be easier to interpret if the x-axis were labeled with exact model sizes or parameter counts rather than generic indices.
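
As an illustration of the kind of token-level example minor comment 1 asks for, the sketch below shows one plausible reading of constrained decoding with prefix filling. It is not the authors' algorithm. It works on characters rather than real tokenizer tokens, uses greedy selection, and `toy_scorer` is a stand-in for a language model; the template layout and the field name `runtime_s` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

DIGITS = "0123456789"
Slot = Tuple[str, str]  # ("lit", forced text) or ("any", allowed characters)
Scorer = Callable[[str], Dict[str, float]]  # text so far -> score per character


def make_template(field: str, mantissa_digits: int = 3) -> List[Slot]:
    """Template for `field=D.DDDe+DD` style output (sign may be + or -)."""
    return (
        [("lit", field + "="), ("any", "123456789"), ("lit", ".")]
        + [("any", DIGITS)] * mantissa_digits
        + [("lit", "e"), ("any", "+-"), ("any", DIGITS), ("any", DIGITS)]
    )


def constrained_decode(prompt: str, template: List[Slot], score_next: Scorer) -> Tuple[str, int]:
    """Fill literal slots directly and query the scorer only at free slots."""
    out, model_calls = "", 0
    for kind, chars in template:
        if kind == "lit":
            out += chars  # prefix filling: forced text, no model call needed
            continue
        scores = score_next(prompt + out)
        model_calls += 1
        # Masking: only characters allowed in this slot compete.
        out += max(chars, key=lambda c: scores.get(c, float("-inf")))
    return out, model_calls


def toy_scorer(text: str) -> Dict[str, float]:
    """Stand-in for an LLM; deliberately prefers the invalid character 'x'."""
    scores = {c: 1.0 if int(c) == len(text) % 10 else 0.0 for c in DIGITS}
    scores.update({"x": 5.0, "+": 0.2, "-": 0.1})
    return scores


out, calls = constrained_decode("job: build\n", make_template("runtime_s"), toy_scorer)
print(out, calls)
```

In this toy setup the output is 19 characters long but the scorer is called only 7 times, because the field name, `=`, `.`, and `e` are filled in without a forward pass. Skipping those steps is one way a latency reduction could arise, and the mask is what keeps the preferred but invalid `x` out of the result. Whether masking shifts the distribution over valid outputs, the bias question the referee raises, is not addressed by this sketch.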

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point-by-point below, providing clarifications and committing to revisions where appropriate to strengthen the evidential basis of our claims.

Point-by-point responses
  1. Referee: [§5.2, Table 3] §5.2 and Table 3: the central claim that serialization preserves semantic context sufficient for superior regression is load-bearing, yet the manuscript provides no ablation in which tabular baselines receive explicitly parsed graph or tree features (e.g., adjacency matrices or hierarchical encodings) derived from the same dependency metadata. Without this control, performance differences cannot be confidently attributed to the serialization strategy rather than model capacity or benchmark construction.

    Authors: We agree that an explicit ablation study comparing tabular baselines augmented with parsed graph and tree features would better isolate the benefits of our serialization approach. In the revised version, we will conduct and report such experiments on both the chip-design and GHARuntime datasets, incorporating features like adjacency matrices and hierarchical encodings into the tabular models for a fairer comparison. revision: yes

  2. Referee: [§4.1, §6] §4.1 and §6: the reported outperformance over human experts and tabular baselines lacks error bars, standard deviations across random seeds or data splits, and statistical significance tests. Given that the abstract asserts clear superiority on both chip-design and GHARuntime workloads, these omissions weaken the evidential basis for the scaling and superiority conclusions.

    Authors: We acknowledge the importance of reporting variability and statistical rigor. In the revision, we will include error bars representing standard deviations across multiple random seeds and data splits in Table 3 and relevant figures. Additionally, we will perform and report statistical significance tests, such as paired t-tests, to substantiate the outperformance claims over baselines and experts. revision: yes
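
The reporting promised in response 2 can be sketched with the Python standard library. This is an assumption-laden illustration: the function names are hypothetical, and the numbers are placeholders chosen for easy arithmetic, not results from the paper. Turning the t statistic into a p-value needs a Student t distribution, which the standard library does not provide.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation, e.g. of an error metric across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(errors_a: Sequence[float], errors_b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched runs.

    errors_a[i] and errors_b[i] must come from the same seed or data split.
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-seed errors (NOT values from the paper).
baseline_errors = [1.0, 2.0, 3.0]
model_errors = [0.0, 1.0, 1.0]

print(mean_std(baseline_errors))                           # (2.0, 1.0)
print(paired_t_statistic(baseline_errors, model_errors))   # t is about 4.0 with 2 degrees of freedom
```

Pairing by seed or split matters because it removes split-to-split variation that both methods share, which an unpaired comparison would count as noise.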

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks and baselines

Full rationale

The paper presents LASER as an empirical framework for fine-tuning LLMs on serialized semi-structured workflow data, with supporting techniques for numerical regression output. Its strongest claims involve outperformance on large-scale chip design workloads and the newly introduced GHARuntime benchmark (derived from 580k+ external GitHub Actions runs), plus comparisons to human experts and SOTA tabular ML baselines. No equations, self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided abstract or reader summary. The serialization choice and output encodings are methodological proposals evaluated against independent data, not tautological constructions. This leaves the derivation chain self-contained against external validation points.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions of supervised fine-tuning and the premise that text serialization retains predictive signal; no new physical entities or ad-hoc constants are introduced in the abstract.

axioms (1)
  • [domain assumption] LLM fine-tuning on serialized text can capture semantic context from workflow configurations.
    Central premise stated in the abstract when contrasting with brittle feature engineering.

