LASER: Language Model Regression for Semi-Structured Workflow Resource and Runtime Estimation
Pith reviewed 2026-05-21 17:23 UTC · model grok-4.3
The pith
Fine-tuning language models on text-serialized workflow configurations produces accurate multi-target predictions of resource use and runtime.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LASER fine-tunes large language models on serialized representations of semi-structured workflow job configurations to perform multi-target regression for resource consumption and runtime. Scientific notation output encoding handles targets across many orders of magnitude, while constrained decoding with prefix filling guarantees output validity and reduces latency by over 30 percent. Full-attention fine-tuning outperforms sliding-window approaches on long contexts. The method is validated on large-scale chip design workloads and the GHARuntime benchmark derived from 580,000+ GitHub Actions runs, where it surpasses human experts and state-of-the-art tabular ML baselines and exhibits clear model- and data-scaling behavior.
What carries the argument
The LASER framework that converts semi-structured workflow configurations into plain text, fine-tunes an LLM for numerical regression, and applies scientific-notation encoding together with constrained decoding to produce valid multi-target outputs.
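To make the scientific-notation idea concrete, here is a minimal sketch of one way a target value could be split into sign, mantissa, and exponent tokens. The token layout, the three-digit mantissa, and the names `encode_sci` and `decode_sci` are assumptions made for illustration; the paper's actual vocabulary is not reproduced here.

```python
def encode_sci(value: float, mantissa_digits: int = 3) -> list[str]:
    """Split a number into sign, mantissa-digit, and exponent tokens (hypothetical layout)."""
    if value == 0:
        return ["+"] + ["0"] * mantissa_digits + ["E+", "0", "0"]
    sign = "+" if value > 0 else "-"
    # Python's 'e' format yields d.dd...e+XX with correct rounding.
    mantissa, exponent = f"{abs(value):.{mantissa_digits - 1}e}".split("e")
    digits = mantissa.replace(".", "")
    exp = int(exponent)
    exp_sign = "E+" if exp >= 0 else "E-"
    return [sign, *digits, exp_sign, *f"{abs(exp):02d}"]


def decode_sci(tokens: list[str], mantissa_digits: int = 3) -> float:
    """Inverse of encode_sci: rebuild the float from its tokens."""
    sign = tokens[0]
    digits = tokens[1:1 + mantissa_digits]
    exp_sign = "-" if tokens[1 + mantissa_digits] == "E-" else "+"
    exp_digits = "".join(tokens[2 + mantissa_digits:])
    return float(f"{sign}{digits[0]}.{''.join(digits[1:])}e{exp_sign}{exp_digits}")
```

With this layout a runtime of 12345 seconds and a memory fraction of 0.00052 both occupy the same small number of tokens, which is the property that matters when targets span many orders of magnitude. The cost is precision: only the leading mantissa digits survive the round trip.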
If this is right
- Cloud schedulers can allocate resources more tightly because the predicted values are closer to actual consumption.
- Engineers no longer need to hand-craft features that flatten commands and graphs into fixed vectors.
- Prediction quality continues to rise when larger models or more historical runs are used.
- The same text-based regression approach can be applied to other workflow systems that store jobs as hierarchical or command-rich objects.
Where Pith is reading between the lines
- Serialization plus constrained decoding may become a standard pattern for any regression task whose inputs are naturally expressed as text or trees.
- Public release of the GHARuntime benchmark could let other groups test whether the same gains appear outside chip design and GitHub Actions.
- Full-attention fine-tuning on long job histories might also improve models that forecast queue wait times or failure probabilities.
Load-bearing premise
Writing the job configuration as text keeps enough of its original meaning and structure for the language model to learn accurate numerical predictions.
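The premise is easier to probe with a concrete serializer in hand. The sketch below flattens a nested job configuration into `path: value` lines; the function name `serialize_job`, the line format, and every field in the example job are hypothetical and are not taken from the paper.

```python
def serialize_job(config, prefix=""):
    """Flatten a nested job configuration into 'path: value' text lines."""
    lines = []
    if isinstance(config, dict):
        for key, value in config.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            lines.extend(serialize_job(value, child))
    elif isinstance(config, (list, tuple)):
        for index, item in enumerate(config):
            lines.extend(serialize_job(item, f"{prefix}[{index}]"))
    else:
        # Leaf value: keep the full path so hierarchy survives as text.
        lines.append(f"{prefix}: {config}")
    return lines


# Illustrative job only; field names and values are made up.
job = {
    "name": "synth_block_a",
    "command": "run_tool --effort high --threads 8",
    "depends_on": ["lint", "elaborate"],
    "meta": {"team": "cpu", "priority": 2},
}
text = "\n".join(serialize_job(job))
```

Keeping the full path on every line is one simple way to preserve hierarchy in plain text. Whether a model recovers graph structure from such lines is exactly what the premise assumes.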
What would settle it
Measure prediction error on a held-out set of workflows after deliberately removing or randomizing dependency-graph information during the serialization step; a large accuracy drop would indicate the claim does not hold.
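A rough sketch of how that ablation could be set up follows. It assumes each job is a dict with hypothetical `name` and `depends_on` fields, and it leaves the model itself out: the idea is to serialize the original and ablated jobs, predict on both, and compare an error metric such as the `mean_absolute_error` helper shown.

```python
import copy
import random


def ablate_dependencies(jobs, mode="drop", seed=0):
    """Return a copy of jobs with dependency info removed or randomized."""
    rng = random.Random(seed)
    ablated = copy.deepcopy(jobs)  # never mutate the held-out set itself
    names = [job["name"] for job in ablated]
    for job in ablated:
        if mode == "drop":
            job["depends_on"] = []
        elif mode == "randomize":
            # Keep the number of edges but point them at random other jobs.
            others = [n for n in names if n != job["name"]]
            k = min(len(job["depends_on"]), len(others))
            job["depends_on"] = rng.sample(others, k)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return ablated


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predictions and ground truth."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Illustrative jobs only.
jobs = [
    {"name": "lint", "depends_on": []},
    {"name": "build", "depends_on": ["lint"]},
    {"name": "test", "depends_on": ["lint", "build"]},
]
```

The `randomize` mode keeps each job's edge count, so a drop in accuracy would point at the identity of the dependencies rather than at a change in input length.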
Original abstract
Accurate prediction of resource consumption and runtime for cloud workflow jobs is critical for scheduling efficiency, yet remains challenging due to the semi-structured nature of job configurations -- comprising shell commands, tool-specific parameters, dependency graphs, and hierarchical metadata. Traditional ML approaches require brittle feature engineering to flatten this rich information into fixed-size vectors, losing critical semantic context. We present LASER, a framework that fine-tunes LLMs on serialized workflow job configurations for multi-target resource and runtime regression. To address the challenges of numerical regression via generation, we introduce scientific notation output encoding for targets spanning multiple orders of magnitude, and constrained decoding with prefix filling to enforce output validity while reducing inference latency by over 30%. We further show that full-attention fine-tuning improves accuracy over sliding-window LLMs on long job contexts. Validated on large-scale chip design workloads, and GHARuntime, a new public benchmark derived from 580,000+ GitHub Actions runs across 27,000+ repositories, LASER outperforms human experts and SOTA tabular ML baselines, with clear model- and data-scaling behavior, establishing a new paradigm for LLM-based regression on semi-structured workflow data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LASER, a framework that fine-tunes LLMs on text-serialized representations of semi-structured workflow configurations (shell commands, dependency graphs, hierarchical metadata) for multi-target regression of resource consumption and runtime. It proposes scientific notation output encoding and constrained decoding with prefix filling to enable valid numerical generation while cutting inference latency by over 30%. Full-attention fine-tuning is shown to outperform sliding-window approaches on long contexts. Evaluation on proprietary large-scale chip design workloads and the new public GHARuntime benchmark (derived from 580k+ GitHub Actions runs across 27k+ repositories) claims outperformance versus human experts and SOTA tabular ML baselines, together with model- and data-scaling trends.
Significance. If the quantitative claims are substantiated, the work offers a practical alternative to brittle feature engineering for semi-structured workflow data, with direct relevance to cloud scheduling and CI/CD optimization. The release of GHARuntime constitutes a reusable community resource. The constrained-decoding technique provides a measurable engineering contribution. Observed scaling behavior, if robust, would support further investment in LLM-based regression for this domain.
major comments (2)
- [§5.2, Table 3] §5.2 and Table 3: the central claim that serialization preserves semantic context sufficient for superior regression is load-bearing, yet the manuscript provides no ablation in which tabular baselines receive explicitly parsed graph or tree features (e.g., adjacency matrices or hierarchical encodings) derived from the same dependency metadata. Without this control, performance differences cannot be confidently attributed to the serialization strategy rather than model capacity or benchmark construction.
- [§4.1, §6] §4.1 and §6: the reported outperformance over human experts and tabular baselines lacks error bars, standard deviations across random seeds or data splits, and statistical significance tests. Given that the abstract asserts clear superiority on both chip-design and GHARuntime workloads, these omissions weaken the evidential basis for the scaling and superiority conclusions.
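The control requested in the first major comment could start from features like the ones below. This is only a sketch under assumed field names (`name`, `depends_on`): it builds an adjacency matrix from dependency metadata and derives per-job degree counts that a tabular baseline could consume.

```python
def adjacency_matrix(jobs):
    """Row i, column j is 1 when job i depends on job j."""
    index = {job["name"]: i for i, job in enumerate(jobs)}
    matrix = [[0] * len(jobs) for _ in jobs]
    for job in jobs:
        for dep in job.get("depends_on", []):
            matrix[index[job["name"]]][index[dep]] = 1
    return matrix


def degree_features(matrix):
    """Per-job tuple: (number of dependencies, number of dependents)."""
    n = len(matrix)
    return [
        (sum(matrix[i]), sum(matrix[row][i] for row in range(n)))
        for i in range(n)
    ]


# Illustrative jobs only.
jobs = [
    {"name": "lint", "depends_on": []},
    {"name": "build", "depends_on": ["lint"]},
    {"name": "test", "depends_on": ["lint", "build"]},
]
```

Degree counts are the simplest fixed-size summary of a graph. Richer encodings (depth, path lengths, learned embeddings) would make the comparison stricter.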
minor comments (2)
- [§3.3] §3.3: the description of prefix filling in constrained decoding would benefit from a short pseudocode listing or explicit token-level example to clarify how validity is enforced without introducing bias.
- [Figure 4] Figure 4: the scaling curves would be easier to interpret if the x-axis were labeled with exact model sizes or parameter counts rather than generic indices.
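The token-level example asked for in the §3.3 comment might look like the toy decoder below. It reflects one plausible reading of "constrained decoding with prefix filling", not the paper's implementation: literal parts of the output template are written without querying the model, and at the remaining positions the choice is restricted to an allowed token set. The template, `constrained_decode`, and `toy_scores` are all hypothetical.

```python
DIGITS = set("0123456789")

# Strings are forced text; sets mark positions where the model chooses
# among the allowed tokens only.
TEMPLATE = ["runtime=", {"+", "-"}, DIGITS, ".", DIGITS, DIGITS,
            "E", {"+", "-"}, DIGITS, DIGITS]


def constrained_decode(score_fn, template):
    """Greedy decode under a template; score_fn(prefix) returns token -> score."""
    output, model_calls = "", 0
    for slot in template:
        if isinstance(slot, str):
            output += slot  # prefix filling: forced text, no model call
            continue
        scores = score_fn(output)  # one model call per free position
        model_calls += 1
        # Masking: only allowed tokens compete, so the result is always valid.
        output += max(sorted(slot), key=lambda tok: scores.get(tok, float("-inf")))
    return output, model_calls


def toy_scores(prefix):
    """Stand-in for a language model that prefers an invalid token 'x'."""
    return {"x": 9.0, "+": 2.0, "7": 1.0, "3": 0.5}


result, calls = constrained_decode(toy_scores, TEMPLATE)
```

In this toy run the model is queried 7 times for a 17-character output, and the invalid token never appears even though it has the highest raw score. Skipped calls of this kind are the sort of saving that could account for a latency reduction. Whether masking biases the predicted values is the open question the referee raises.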
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point-by-point below, providing clarifications and committing to revisions where appropriate to strengthen the evidential basis of our claims.
- Referee: [§5.2, Table 3] §5.2 and Table 3: the central claim that serialization preserves semantic context sufficient for superior regression is load-bearing, yet the manuscript provides no ablation in which tabular baselines receive explicitly parsed graph or tree features (e.g., adjacency matrices or hierarchical encodings) derived from the same dependency metadata. Without this control, performance differences cannot be confidently attributed to the serialization strategy rather than model capacity or benchmark construction.
  Authors: We agree that an explicit ablation study comparing tabular baselines augmented with parsed graph and tree features would better isolate the benefits of our serialization approach. In the revised version, we will conduct and report such experiments on both the chip-design and GHARuntime datasets, incorporating features like adjacency matrices and hierarchical encodings into the tabular models for a fairer comparison.
  Revision: yes
- Referee: [§4.1, §6] §4.1 and §6: the reported outperformance over human experts and tabular baselines lacks error bars, standard deviations across random seeds or data splits, and statistical significance tests. Given that the abstract asserts clear superiority on both chip-design and GHARuntime workloads, these omissions weaken the evidential basis for the scaling and superiority conclusions.
  Authors: We acknowledge the importance of reporting variability and statistical rigor. In the revision, we will include error bars representing standard deviations across multiple random seeds and data splits in Table 3 and the relevant figures. Additionally, we will perform and report statistical significance tests, such as paired t-tests, to substantiate the outperformance claims over baselines and experts.
  Revision: yes
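For readers unfamiliar with the promised test, a paired t-test compares two methods evaluated on the same seeds or splits by looking at the per-pair differences. The sketch below computes only the t statistic and degrees of freedom with the standard library; the function name is hypothetical, and no values from the paper are used.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """t statistic and degrees of freedom for paired samples (e.g. per-seed errors)."""
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1
```

A library routine such as `scipy.stats.ttest_rel` would also return the p-value. With only a handful of seeds the test has little power, so reporting the per-seed differences alongside it is usually more informative.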
Circularity Check
No significant circularity; empirical claims rest on external benchmarks and baselines
The paper presents LASER as an empirical framework for fine-tuning LLMs on serialized semi-structured workflow data, with supporting techniques for numerical regression output. Its strongest claims involve outperformance on large-scale chip design workloads and the newly introduced GHARuntime benchmark (derived from 580k+ external GitHub Actions runs), plus comparisons to human experts and SOTA tabular ML baselines. No equations, self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided abstract or reader summary. The serialization choice and output encodings are methodological proposals evaluated against independent data, not tautological constructions. This leaves the derivation chain self-contained against external validation points.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM fine-tuning on serialized text can capture semantic context from workflow configurations
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We define a serialization function ψ(·) that transforms the structured job configuration X into a sequence of input tokens x_job = ψ(X). ... P_θ(y | x_job) = ∏ P_θ(y_t | x_job, y_<t)
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We introduce scientific notation representation ... dedicated tokens for the sign, mantissa, and exponent
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.