LASER: Language Model Regression for Semi-Structured Workflow Resource and Runtime Estimation

Ajay Mohindra; Boxun Xu; Peng Li; Shengke Zhou; Yunjie Zhang; Yuxuan Yin

arxiv: 2512.19701 · v2 · pith:KE5SEBCCnew · submitted 2025-12-08 · 💻 cs.LG · cs.AI

Yuxuan Yin, Shengke Zhou, Yunjie Zhang, Ajay Mohindra, Boxun Xu, Peng Li

Pith reviewed 2026-05-21 17:23 UTC · model grok-4.3

keywords LLM regression · resource estimation · runtime prediction · semi-structured data · workflow scheduling · cloud computing · GitHub Actions

The pith

Fine-tuning language models on text-serialized workflow configurations produces accurate multi-target predictions of resource use and runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to predict compute needs and execution time for cloud jobs whose settings include shell commands, tool parameters, and dependency graphs. Standard machine-learning methods must flatten these details into tables, which discards relationships the authors say matter. LASER instead writes the entire job description as text, trains a language model to output the numbers, and adds two practical fixes: scientific notation so the model can handle values that differ by many orders of magnitude, and constrained decoding that forces every generated answer to be valid while cutting inference time by more than 30 percent. On chip-design workloads and a new benchmark built from more than half a million GitHub Actions runs, the resulting models beat both human experts and the best tabular baselines, and accuracy improves as models and data grow larger.

Core claim

LASER fine-tunes large language models on serialized representations of semi-structured workflow job configurations to perform multi-target regression for resource consumption and runtime. Scientific notation output encoding handles targets across many orders of magnitude, while constrained decoding with prefix filling guarantees output validity and reduces latency by over 30 percent. Full-attention fine-tuning outperforms sliding-window approaches on long contexts. The method is validated on large-scale chip design workloads and the GHARuntime benchmark derived from 580,000+ GitHub Actions runs, where it surpasses human experts and state-of-the-art tabular ML baselines and exhibits clear model- and data-scaling behavior.
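To make the scientific notation idea concrete, here is a minimal sketch of one way a target could be split into sign, mantissa, and exponent tokens. The function names `encode_sci` and `decode_sci`, the three-digit mantissa, and the `E+4`-style exponent token are assumptions for illustration; the paper's actual token vocabulary is not described on this page.

```python
def encode_sci(value: float, digits: int = 3) -> list[str]:
    """Split a number into a sign token, mantissa digit tokens, and one exponent token."""
    sign = "-" if value < 0 else "+"
    # Python's 'e' format yields e.g. '1.23e+04' for 12345.0 with two decimals.
    mantissa, exponent = f"{abs(value):.{digits - 1}e}".split("e")
    return [sign, *mantissa.replace(".", ""), f"E{int(exponent):+d}"]


def decode_sci(tokens: list[str]) -> float:
    """Invert encode_sci: rebuild the float from its tokens."""
    sign, *mantissa_digits, exp_token = tokens
    fraction = "".join(mantissa_digits[1:]) or "0"
    # exp_token looks like 'E+4'; drop the leading 'E' to get a signed exponent.
    return float(f"{sign}{mantissa_digits[0]}.{fraction}e{exp_token[1:]}")


runtime_tokens = encode_sci(12345.0)   # a large target, e.g. seconds
memory_tokens = encode_sci(-0.00042)   # a tiny target with a sign
roundtrip = decode_sci(runtime_tokens)
```

Under this scheme every target costs the same number of output tokens whether it is 1e-9 or 1e12, which is the property that makes wide-range regression by generation tractable. The price is rounding to the chosen mantissa length.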

What carries the argument

The LASER framework that converts semi-structured workflow configurations into plain text, fine-tunes an LLM for numerical regression, and applies scientific-notation encoding together with constrained decoding to produce valid multi-target outputs.
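The decoding step can be pictured with a toy sketch. It assumes a fixed output template in which field names and separators are forced tokens (prefix filling, no model call) and numeric positions are restricted to a small allowed set (constrained decoding). The template layout, `score_fn`, and `toy_scores` are hypothetical stand-ins for a real model's logits, not the paper's implementation.

```python
from typing import Callable

DIGITS = [str(d) for d in range(10)]
EXPONENTS = [f"E{e:+d}" for e in range(-9, 10)]


def build_template(targets: list[str]) -> list:
    """A str slot is a forced token; a list slot is the set of allowed tokens."""
    slots: list = []
    for name in targets:
        slots.append(f"{name}=")      # forced: field name
        slots.append(["+", "-"])      # free: sign
        slots.extend([DIGITS] * 3)    # free: three mantissa digits
        slots.append(EXPONENTS)       # free: exponent
        slots.append(";")             # forced: separator
    return slots


def constrained_decode(score_fn: Callable[[list[str]], dict], targets: list[str]):
    """Greedy decoding that can only emit tokens the template allows."""
    output, model_calls = [], 0
    for slot in build_template(targets):
        if isinstance(slot, str):
            output.append(slot)       # prefix filling: skip the model entirely
            continue
        scores = score_fn(output)     # one model call per free position
        model_calls += 1
        # Mask: only tokens in the allowed set compete, whatever else scores high.
        output.append(max(slot, key=lambda tok: scores.get(tok, float("-inf"))))
    return output, model_calls


def toy_scores(prefix: list[str]) -> dict:
    """Hypothetical model whose top choice ('hello') is never valid."""
    return {"hello": 9.0, "+": 1.0, "-": 0.5, "7": 2.0, "E+3": 3.0}


tokens, calls = constrained_decode(toy_scores, ["runtime", "memory"])
```

In this toy run the model is consulted for 10 of the 14 emitted tokens and the invalid `hello` token can never appear. A real system would still need the forced tokens in its attention cache, so the actual latency saving depends on how those are batched; the 30 percent figure is the paper's, not something this sketch reproduces.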

If this is right

Cloud schedulers can allocate resources more tightly because the predicted values are closer to actual consumption.
Engineers no longer need to hand-craft features that flatten commands and graphs into fixed vectors.
Prediction quality continues to rise when larger models or more historical runs are used.
The same text-based regression approach can be applied to other workflow systems that store jobs as hierarchical or command-rich objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Serialization plus constrained decoding may become a standard pattern for any regression task whose inputs are naturally expressed as text or trees.
Public release of the GHARuntime benchmark could let other groups test whether the same gains appear outside chip design and GitHub Actions.
Full-attention fine-tuning on long job histories might also improve models that forecast queue wait times or failure probabilities.

Load-bearing premise

Writing the job configuration as text keeps enough of its original meaning and structure for the language model to learn accurate numerical predictions.
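As a rough illustration of what "writing the job configuration as text" could mean, the sketch below flattens a nested configuration into indented lines. The `serialize` function, the field names, and the example job are all hypothetical; the paper's serialization function ψ may use a different layout.

```python
def serialize(config: dict, indent: int = 0) -> list[str]:
    """Render a nested job configuration as indented 'key: value' lines."""
    pad = "  " * indent
    lines = []
    for key, value in config.items():
        if isinstance(value, dict):
            # Nested metadata keeps its hierarchy through indentation.
            lines.append(f"{pad}{key}:")
            lines.extend(serialize(value, indent + 1))
        elif isinstance(value, list):
            # Lists (e.g. upstream dependencies) become one item per line.
            lines.append(f"{pad}{key}:")
            lines.extend(f"{pad}  - {item}" for item in value)
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


# Hypothetical job; names and flags are invented for illustration.
job = {
    "name": "synth_block_a",
    "command": "run_tool --effort high --threads 8",
    "params": {"effort": "high", "threads": 8},
    "depends_on": ["lint", "elaborate"],
}
text = "\n".join(serialize(job))
```

The shell command survives verbatim and the dependency list stays attached to its job, which is exactly the information a fixed-width feature vector tends to drop. Whether a model actually exploits that structure is the open question this premise rests on.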

What would settle it

Measure prediction error on a held-out set of workflows after deliberately removing or randomizing dependency-graph information during the serialization step; a large accuracy drop would indicate the claim does not hold.
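A minimal sketch of how that perturbation could be set up follows. It assumes each job is a dict with a `depends_on` list, and it scores predictions with a mean absolute log10 error; both the data layout and the metric are assumptions chosen for illustration, not the paper's protocol.

```python
import math
import random


def randomize_dependencies(jobs: list[dict], seed: int = 0) -> list[dict]:
    """Reassign dependency lists across jobs, leaving every other field intact."""
    rng = random.Random(seed)
    deps = [job["depends_on"] for job in jobs]
    rng.shuffle(deps)  # shuffles the new list only; the input jobs are untouched
    return [{**job, "depends_on": d} for job, d in zip(jobs, deps)]


def mean_abs_log10_error(predicted: list[float], actual: list[float]) -> float:
    """Average error in orders of magnitude (requires positive values)."""
    return sum(
        abs(math.log10(p) - math.log10(a)) for p, a in zip(predicted, actual)
    ) / len(actual)


# Hypothetical held-out jobs.
jobs = [
    {"name": "lint", "depends_on": []},
    {"name": "elaborate", "depends_on": ["lint"]},
    {"name": "synth", "depends_on": ["lint", "elaborate"]},
]
perturbed = randomize_dependencies(jobs, seed=1)
```

The experiment would then serialize both `jobs` and `perturbed`, run the same fine-tuned model on each, and compare the two error values. A gap near zero would suggest the model is ignoring the graph.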

Original abstract

Accurate prediction of resource consumption and runtime for cloud workflow jobs is critical for scheduling efficiency, yet remains challenging due to the semi-structured nature of job configurations -- comprising shell commands, tool-specific parameters, dependency graphs, and hierarchical metadata. Traditional ML approaches require brittle feature engineering to flatten this rich information into fixed-size vectors, losing critical semantic context. We present LASER, a framework that fine-tunes LLMs on serialized workflow job configurations for multi-target resource and runtime regression. To address the challenges of numerical regression via generation, we introduce scientific notation output encoding for targets spanning multiple orders of magnitude, and constrained decoding with prefix filling to enforce output validity while reducing inference latency by over 30%. We further show that full-attention fine-tuning improves accuracy over sliding-window LLMs on long job contexts. Validated on large-scale chip design workloads, and GHARuntime, a new public benchmark derived from 580,000+ GitHub Actions runs across 27,000+ repositories, LASER outperforms human experts and SOTA tabular ML baselines, with clear model- and data-scaling behavior, establishing a new paradigm for LLM-based regression on semi-structured workflow data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

LASER fine-tunes LLMs for multi-target regression on serialized workflow data and releases a useful new benchmark, but the claimed edge over tabular baselines rests on thin controls for what the text format actually preserves.

The main takeaway is that this work shows LLMs can be adapted for numerical prediction on semi-structured job descriptions without heavy feature engineering, using scientific notation for wide-ranging targets and constrained decoding to keep outputs valid and cut latency. They also built GHARuntime from over half a million GitHub Actions runs, which looks like a solid public resource for this kind of task. On chip design workloads the method reportedly beats both human experts and standard tabular models, with some scaling trends visible as models and data grow. That combination of a new benchmark and a practical regression setup is the clearest contribution here.

The serialization approach is straightforward and avoids the usual flattening step, which is a reasonable starting point for preserving command and dependency context. Full attention during fine-tuning on longer contexts is a sensible choice and seems to help.

The soft spot is the lack of direct evidence that the text format itself is what drives the gains. The stress-test point holds up: without an ablation that feeds the tabular baselines parsed graph or tree features instead of flat vectors, it is difficult to separate the effect of the LLM from model size or benchmark quirks. The abstract claims outperformance but the full paper would need clear error bars, data-split details, and statistical tests to make the central comparison convincing. Minor issues include the usual questions around how much the constrained decoding influences results versus the base model.

This paper is aimed at researchers and engineers working on cloud scheduling, CI systems, and large-scale workflow optimization. Anyone already experimenting with LLMs on structured or semi-structured inputs will get something out of the output encoding and decoding tricks, and the new benchmark is worth checking regardless of the modeling claims.

I would send it for peer review. The benchmark and the concrete application are enough to justify referee time, even if the comparisons need tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LASER, a framework that fine-tunes LLMs on text-serialized representations of semi-structured workflow configurations (shell commands, dependency graphs, hierarchical metadata) for multi-target regression of resource consumption and runtime. It proposes scientific notation output encoding and constrained decoding with prefix filling to enable valid numerical generation while cutting inference latency by over 30%. Full-attention fine-tuning is shown to outperform sliding-window approaches on long contexts. Evaluation on proprietary large-scale chip design workloads and the new public GHARuntime benchmark (derived from 580k+ GitHub Actions runs across 27k+ repositories) claims outperformance versus human experts and SOTA tabular ML baselines, together with model- and data-scaling trends.

Significance. If the quantitative claims are substantiated, the work offers a practical alternative to brittle feature engineering for semi-structured workflow data, with direct relevance to cloud scheduling and CI/CD optimization. The release of GHARuntime constitutes a reusable community resource. The constrained-decoding technique provides a measurable engineering contribution. Observed scaling behavior, if robust, would support further investment in LLM-based regression for this domain.

major comments (2)

[§5.2, Table 3] §5.2 and Table 3: the central claim that serialization preserves semantic context sufficient for superior regression is load-bearing, yet the manuscript provides no ablation in which tabular baselines receive explicitly parsed graph or tree features (e.g., adjacency matrices or hierarchical encodings) derived from the same dependency metadata. Without this control, performance differences cannot be confidently attributed to the serialization strategy rather than model capacity or benchmark construction.
[§4.1, §6] §4.1 and §6: the reported outperformance over human experts and tabular baselines lacks error bars, standard deviations across random seeds or data splits, and statistical significance tests. Given that the abstract asserts clear superiority on both chip-design and GHARuntime workloads, these omissions weaken the evidential basis for the scaling and superiority conclusions.
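The control requested in the first major comment can be sketched briefly. Assuming the dependency metadata is available as a mapping from each job to its upstream jobs and forms a DAG, the code below derives an adjacency matrix plus per-job in-degree, out-degree, and depth that a tabular baseline could consume. `graph_features` and the example graph are hypothetical.

```python
def graph_features(depends_on: dict[str, list[str]]):
    """Build an adjacency matrix and simple per-job graph features from a DAG."""
    nodes = sorted(depends_on)
    index = {name: i for i, name in enumerate(nodes)}

    # adjacency[i][j] == 1 means job i must finish before job j starts.
    adjacency = [[0] * len(nodes) for _ in nodes]
    for job, parents in depends_on.items():
        for parent in parents:
            adjacency[index[parent]][index[job]] = 1

    depth: dict[str, int] = {}

    def longest_path(job: str) -> int:
        """Length of the longest chain of upstream jobs (0 for a root)."""
        if job not in depth:
            depth[job] = 1 + max(
                (longest_path(p) for p in depends_on[job]), default=-1
            )
        return depth[job]

    features = {
        name: {
            "in_degree": len(depends_on[name]),
            "out_degree": sum(adjacency[index[name]]),
            "depth": longest_path(name),
        }
        for name in nodes
    }
    return adjacency, features


# Hypothetical three-job pipeline.
adjacency, features = graph_features(
    {"lint": [], "elaborate": ["lint"], "synth": ["lint", "elaborate"]}
)
```

Appending columns like these to the flat feature vector gives the tabular model access to the same dependency information the serialized text carries. That is the comparison needed before crediting the gains to serialization itself.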

minor comments (2)

[§3.3] §3.3: the description of prefix filling in constrained decoding would benefit from a short pseudocode listing or explicit token-level example to clarify how validity is enforced without introducing bias.
[Figure 4] Figure 4: the scaling curves would be easier to interpret if the x-axis were labeled with exact model sizes or parameter counts rather than generic indices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point-by-point below, providing clarifications and committing to revisions where appropriate to strengthen the evidential basis of our claims.

Referee: [§5.2, Table 3] §5.2 and Table 3: the central claim that serialization preserves semantic context sufficient for superior regression is load-bearing, yet the manuscript provides no ablation in which tabular baselines receive explicitly parsed graph or tree features (e.g., adjacency matrices or hierarchical encodings) derived from the same dependency metadata. Without this control, performance differences cannot be confidently attributed to the serialization strategy rather than model capacity or benchmark construction.

Authors: We agree that an explicit ablation study comparing tabular baselines augmented with parsed graph and tree features would better isolate the benefits of our serialization approach. In the revised version, we will conduct and report such experiments on both the chip-design and GHARuntime datasets, incorporating features like adjacency matrices and hierarchical encodings into the tabular models for a fairer comparison.
revision: yes

Referee: [§4.1, §6] §4.1 and §6: the reported outperformance over human experts and tabular baselines lacks error bars, standard deviations across random seeds or data splits, and statistical significance tests. Given that the abstract asserts clear superiority on both chip-design and GHARuntime workloads, these omissions weaken the evidential basis for the scaling and superiority conclusions.

Authors: We acknowledge the importance of reporting variability and statistical rigor. In the revision, we will include error bars representing standard deviations across multiple random seeds and data splits in Table 3 and the relevant figures. Additionally, we will perform and report statistical significance tests, such as paired t-tests, to substantiate the outperformance claims over baselines and experts.
revision: yes
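The reporting promised in this response amounts to a few lines of arithmetic. The sketch below computes mean and sample standard deviation across seeds and a paired t statistic on per-seed differences, using only the standard library. The error values are placeholder numbers invented for illustration, not results from the paper.

```python
import math
import statistics


def summarize(errors: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across seeds (the error-bar values)."""
    return statistics.mean(errors), statistics.stdev(errors)


def paired_t_statistic(baseline: list[float], model: list[float]) -> float:
    """t statistic for per-seed paired differences (baseline minus model)."""
    diffs = [b - m for b, m in zip(baseline, model)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


# Placeholder per-seed errors, NOT taken from the paper.
baseline_errors = [0.30, 0.32, 0.31, 0.29, 0.33]
model_errors = [0.25, 0.26, 0.27, 0.24, 0.28]

baseline_mean, baseline_std = summarize(baseline_errors)
model_mean, model_std = summarize(model_errors)
t_stat = paired_t_statistic(baseline_errors, model_errors)
degrees_of_freedom = len(baseline_errors) - 1
```

Turning the statistic into a p-value needs the t distribution with `degrees_of_freedom`; `scipy.stats.ttest_rel` does both steps in one call if SciPy is available. Pairing by seed or split matters here, because it removes variance shared by both methods on the same data.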

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks and baselines

The paper presents LASER as an empirical framework for fine-tuning LLMs on serialized semi-structured workflow data, with supporting techniques for numerical regression output. Its strongest claims involve outperformance on large-scale chip design workloads and the newly introduced GHARuntime benchmark (derived from 580k+ external GitHub Actions runs), plus comparisons to human experts and SOTA tabular ML baselines. No equations, self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided abstract or reader summary. The serialization choice and output encodings are methodological proposals evaluated against independent data, not tautological constructions. This leaves the derivation chain self-contained against external validation points.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions of supervised fine-tuning and the premise that text serialization retains predictive signal; no new physical entities or ad-hoc constants are introduced in the abstract.

axioms (1)

domain assumption: LLM fine-tuning on serialized text can capture semantic context from workflow configurations
Central premise stated in the abstract when contrasting with brittle feature engineering.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We define a serialization function ψ(·) that transforms the structured job configuration X into a sequence of input tokens x_job = ψ(X). ... P_θ(y | x_job) = ∏_t P_θ(y_t | x_job, y_<t)

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_high_calibrated_iff unclear

We introduce scientific notation representation ... dedicated tokens for the sign, mantissa, and exponent

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
