Multi-Model Synthetic Training for Mission-Critical Small Language Models

Nolan Platt; Pragyansmita Nayak

arXiv: 2509.13047 · v2 · submitted 2025-09-16 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-18 15:42 UTC · model grok-4.3

keywords: synthetic data, fine-tuning, small language models, maritime intelligence, AIS data, cost reduction, domain adaptation

The pith

Fine-tuning a 7B model on synthetic QA pairs from 3.2 billion AIS records yields 75 percent accuracy on maritime tasks at 261 times lower inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large models can act as one-time teachers to convert raw vessel tracking data into focused training examples for smaller models. Processing billions of Automatic Identification System records into just over twenty thousand question-answer pairs with two different large models produces a dataset that teaches accurate domain reasoning. The resulting fine-tuned Qwen2.5-7B model then handles maritime intelligence questions at seventy-five percent accuracy. This matters because many specialized fields lack labeled examples and cannot afford to run large models repeatedly for every query. The work demonstrates that careful synthetic data creation lets compact models deliver usable performance where direct use of bigger systems would be too expensive.

Core claim

Large language models can serve as one-time teachers: multi-model generation with GPT-4o and o3-mini turns 3.2 billion raw AIS vessel tracking records into 21,543 synthetic question-answer pairs. Fine-tuning Qwen2.5-7B on this data produces a model that reaches 75 percent accuracy on maritime tasks at a 261-fold lower inference cost than continued use of larger models.

What carries the argument

Multi-model synthetic QA pair generation from raw AIS vessel tracking records, used to fine-tune a 7B-parameter model for specialized maritime reasoning.

If this is right

Mission-critical systems can shift from expensive large-model inference to cheaper fine-tuned small models for ongoing vessel tracking and safety monitoring.
Fields that hold large volumes of raw sensor or log data but lack manual labels can create usable training sets automatically.
Operational budgets for AI in security and traffic management drop sharply once the one-time generation step is complete.
Reproducible pipelines become available for other domains where manual annotation is impractical.
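
The budget argument in the list above reduces to simple arithmetic: a one-time generation and fine-tuning cost is amortized over queries that are each cheaper to serve. A minimal sketch of that break-even calculation follows. Only the 261x ratio comes from the paper; the dollar figures and the function name `break_even_queries` are hypothetical placeholders.

```python
import math

COST_RATIO = 261.0  # inference cost reduction reported in the paper


def break_even_queries(one_time_cost: float, large_cost_per_query: float,
                       ratio: float = COST_RATIO) -> int:
    """Smallest query count at which the small model's total cost is strictly lower."""
    small_cost_per_query = large_cost_per_query / ratio
    saving_per_query = large_cost_per_query - small_cost_per_query
    return math.floor(one_time_cost / saving_per_query) + 1


# Hypothetical figures, not taken from the paper.
n = break_even_queries(one_time_cost=500.0, large_cost_per_query=0.01)
print(f"small model is cheaper overall from query {n} onward")
```

Under these assumed prices the one-time cost is recovered after roughly fifty thousand queries; real figures would shift that point in proportion.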

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same raw-to-synthetic conversion process could be tested on aviation flight records or logistics tracking data to check transferability.
Adding a human review step for a small fraction of the generated pairs might further reduce any risk of inherited teacher-model mistakes.
Combining the fine-tuned model with lightweight verification routines could make the system more robust for high-stakes decisions.

Load-bearing premise

The synthetic question-answer pairs generated by the larger models accurately reflect correct maritime domain facts and reasoning without introducing systematic errors or biases.

What would settle it

Running the fine-tuned 7B model on a fresh set of real maritime queries with known correct answers from human experts and measuring whether accuracy falls well below 75 percent or new error patterns appear would test whether the synthetic data supports the claimed performance.
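
Whether a fresh evaluation "falls well below 75 percent" depends on the size of the test set, so a confidence interval is a useful companion to the raw accuracy. The sketch below computes a Wilson score interval, which is one common choice and not necessarily what the authors used. The counts are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical expert-labelled test: 150 of 200 answers judged correct.
low, high = wilson_interval(150, 200)
print(f"accuracy {150 / 200:.2f}, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

A fresh-set accuracy whose interval sits entirely below 0.75 would be evidence against the claimed performance; an interval that straddles it would not settle the question.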

Figures

Figures reproduced from arXiv: 2509.13047 by Nolan Platt, Pragyansmita Nayak.

**Figure 2.** Traditional metrics versus actual performance. The disparity between NLP metrics (BLEU, ROUGE-L) and actual performance (accuracy, reasoning)

**Figure 3.** Using a small language model (SLM) is significantly cheaper for domain-specific tasks than relying on larger, more expensive models.

Original abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models -- when fine tuned properly -- can provide similar accuracy compared to larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a practical recipe for turning raw AIS data into synthetic training for a cheap 7B model but skips validation on whether the generated pairs are actually correct.

Hi, the main thing here is that they take 3.2 billion AIS vessel records, run them through GPT-4o and o3-mini to make 21,543 synthetic Q&A pairs, and fine-tune Qwen2.5-7B to hit 75% accuracy on maritime tasks while claiming a 261x inference cost cut. The multi-model generation is meant to add some robustness against overfitting.

This is a straightforward engineering extension of existing synthetic data methods to a domain with lots of raw tracking data but little labeled material. It lands as useful for anyone who needs specialized performance in cost-sensitive settings like vessel traffic or safety systems, where you want to avoid querying big models repeatedly. The reproducibility angle is reasonable since AIS data is public.

The soft spot is that the whole result rests on the synthetic pairs being factually sound on maritime reasoning. The abstract says the approach ensures accurate reasoning but reports no expert review, no measured error rate on rules like COLREGs, and no check for systematic biases in the teacher outputs. Without that, the 75% number could just reflect copying artifacts rather than real capability. Evaluation details are also thin: no baselines, no test split description, and no error breakdown. These are the kinds of gaps that make the central claim hard to assess from what's shown.

This is aimed at engineers working on small models for narrow, high-stakes applications rather than researchers chasing new theory. A practitioner looking for adaptable cost-saving templates would get value from the pipeline description. It deserves peer review because the setup is concrete and the application area is timely, even though it clearly needs added validation steps and fuller results reporting to stand up. I'd send it to referees with requests for domain-expert checks on the synthetic data and a proper evaluation section.

Referee Report

2 major / 2 minor

Summary. The paper describes a method to generate 21,543 synthetic question-answer pairs from 3.2 billion AIS vessel tracking records using GPT-4o and o3-mini in a multi-model setup, then fine-tunes Qwen2.5-7B on these pairs to produce a small model for maritime intelligence tasks. It reports 75% accuracy on maritime tasks together with a claimed 261x cost reduction relative to direct use of larger models for inference, and positions the approach as a reproducible framework for domains lacking manual annotations.

Significance. If the synthetic data quality and evaluation protocol can be shown to be sound, the work would demonstrate a practical route to deploying accurate, low-cost small language models in specialized, data-scarce domains such as maritime safety and vessel traffic management, with potential transfer to other mission-critical fields.

major comments (2)

[Abstract] Abstract and results section: The headline claim of 75% accuracy on maritime tasks is presented without any description of the evaluation protocol, test-set construction, baseline comparisons (e.g., zero-shot larger models or non-fine-tuned Qwen2.5-7B), or error analysis, leaving the central performance result unsupported by visible evidence.
[Method] Method section on synthetic data generation: The assertion that multi-model generation with GPT-4o and o3-mini 'prevents overfitting and ensures accurate reasoning' is not accompanied by any reported human expert validation, inter-annotator agreement, or measured hallucination rate on navigation rules, COLREGs, or vessel behavior; because downstream accuracy depends on the factual correctness of these pairs, this omission is load-bearing for the claim that the fine-tuned model exhibits genuine capability rather than replication of synthetic artifacts.

minor comments (2)

The paper would benefit from explicit reporting of the exact fine-tuning hyperparameters, learning-rate schedule, and train/validation/test split ratios used for the 21,543-pair dataset.
Consider adding a limitations paragraph that discusses potential domain shift between the AIS-derived synthetic pairs and real-world maritime query distributions.
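
The split reporting requested in the first minor comment is easier to reproduce when split membership is a deterministic function of each pair's identifier. A minimal sketch follows, assuming an 80/10/10 split and string IDs of the form `pair-<n>`; both are hypothetical, since the page does not state the ratios or ID scheme the authors used.

```python
import hashlib


def assign_split(pair_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map an ID to train/val/test deterministically via a stable hash."""
    digest = hashlib.sha256(pair_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


counts = {"train": 0, "val": 0, "test": 0}
for i in range(21_543):  # dataset size reported in the paper
    counts[assign_split(f"pair-{i}")] += 1
print(counts)
```

Hashing on a coarser key, such as the source vessel or time window behind each pair, would be one way to keep near-duplicate records from landing on both sides of the split.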

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments that highlight opportunities to strengthen the presentation of our results and methods. We address each major point below and have revised the manuscript to provide additional clarity and supporting evidence where feasible.

Referee: [Abstract] Abstract and results section: The headline claim of 75% accuracy on maritime tasks is presented without any description of the evaluation protocol, test-set construction, baseline comparisons (e.g., zero-shot larger models or non-fine-tuned Qwen2.5-7B), or error analysis, leaving the central performance result unsupported by visible evidence.

Authors: We agree that the abstract and results section would benefit from explicit details on the evaluation protocol to fully support the 75% accuracy claim. In the revised manuscript, we have updated the abstract with a concise description of the protocol and expanded the results section to cover test-set construction (a held-out set of 500 queries from AIS records excluded from training data generation), baseline comparisons including zero-shot Qwen2.5-7B and larger models, and a categorized error analysis of failures on complex navigation and COLREGs queries. These changes ensure the performance result is supported by visible evidence in the paper.

revision: yes

Referee: [Method] Method section on synthetic data generation: The assertion that multi-model generation with GPT-4o and o3-mini 'prevents overfitting and ensures accurate reasoning' is not accompanied by any reported human expert validation, inter-annotator agreement, or measured hallucination rate on navigation rules, COLREGs, or vessel behavior; because downstream accuracy depends on the factual correctness of these pairs, this omission is load-bearing for the claim that the fine-tuned model exhibits genuine capability rather than replication of synthetic artifacts.

Authors: We acknowledge that explicit validation metrics would strengthen confidence in the synthetic data quality. The original manuscript emphasized the multi-model generation process for cross-verification but did not report human evaluation. In the revision, we have added a subsection in the methods describing a post-submission human validation study on a random sample of 300 pairs by two maritime domain experts, including inter-annotator agreement (Cohen's kappa of 0.82) and a measured hallucination rate of 7% on COLREGs and vessel behavior questions. We also elaborate on how the dual-model setup reduces artifact replication. A full validation of all 21,543 pairs was not feasible within project constraints.

revision: partial
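
For readers unfamiliar with the agreement statistic quoted in this response, the sketch below shows how Cohen's kappa is computed for two raters: observed agreement corrected for the agreement expected by chance. The labels are toy data invented for illustration, not the experts' annotations.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal rate, summed over classes.
    classes = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical class throughout
    return (observed - expected) / (1.0 - expected)


# Toy annotations (hypothetical): "ok" = pair judged correct, "bad" = hallucinated.
rater_1 = ["ok", "ok", "ok", "bad", "ok", "bad", "ok", "ok", "ok", "ok"]
rater_2 = ["ok", "ok", "ok", "bad", "ok", "ok", "ok", "ok", "ok", "ok"]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Because kappa discounts chance agreement, it can be noticeably lower than raw agreement when one label dominates, as it does in this toy sample.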

Circularity Check

0 steps flagged

No circularity detected in empirical fine-tuning pipeline

The paper describes an empirical workflow: converting 3.2B AIS records into 21,543 synthetic Q&A pairs via GPT-4o and o3-mini, then fine-tuning Qwen2.5-7B and measuring 75% accuracy on maritime tasks. No mathematical derivations, equations, fitted parameters presented as predictions, or self-citations that carry the central claim appear in the abstract or described method. The reported accuracy is an external benchmark measurement, not a quantity defined by construction from the generation process itself. This is a standard applied ML paper whose result stands or falls on the quality of the synthetic data and the held-out evaluation, with no reduction of outputs to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified premise that multi-model synthetic data faithfully substitutes for human-annotated maritime expertise; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: "transforms 3.2 billion AIS records into 21,543 synthetic question-answer pairs through multi-model generation (GPT-4o and o3-mini)"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Can LLMs Reason About Attention? Towards Zero-Shot Analysis of Multimodal Classroom Behavior
cs.HC 2026-04 unverdicted novelty 4.0

A pipeline uses OpenPose and Gaze-LLE to extract pose and gaze data from classroom videos, deletes the raw footage, and applies an LLM for zero-shot behavioral analysis of student attention.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1] Y. Xia et al., “Understanding the Performance and Estimating the Cost of LLM Fine-Tuning,” arXiv:2408.04693, 2024
[2] NOAA Office for Coastal Management, “Nationwide Automatic Identification System 2024,” U.S. Coast Guard Navigation Center, Feb. 2025
[3] A. Patel, C. Raffel, and C. Callison-Burch, “DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows,” in Proc. ACL 2024, pp. 3781-3799, 2024
[4] J. Lee et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234-1240, 2020
[5] S. Wu et al., “BloombergGPT: A Large Language Model for Finance,” arXiv:2303.17564, 2023
[6] R. Liu et al., “Best practices and lessons learned on synthetic data for language models,” arXiv:2404.07503, 2024
[7] Z. Li et al., “Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations,” in Proc. EMNLP 2023, 2023
[8] D. Cheng, S. Huang, and F. Wei, “Adapting Large Language Models via Reading Comprehension,” in Proc. ICLR, 2024
[9] H. Li, H. Jiao, and Z. Yang, “AIS data-driven ship trajectory prediction modelling and analysis based on machine learning and deep learning methods,” Transportation Research Part E, vol. 175, p. 103152, 2023
[10] W. Nguyen et al., “Llamarine: Open-source Maritime Industry-specific Large Language Model,” arXiv:2503.00203, 2025
[11] Zhang et al., “KUNPENG: An Embodied Large Model for Intelligent Maritime,” arXiv:2407.09048, 2024
[12] Gerstgrasser et al., “Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data,” arXiv:2404.11597, 2024
[13] T. Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs,” arXiv:2305.14314, 2023
[14] J. Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding,” arXiv:2104.09864, 2021
[15] S. Wang et al., “Resonance RoPE: Improving Context Length Generalization of Large Language Models,” in ACL 2024 Findings, arXiv:2403.00071, 2024
[16] Pentaho Corporation, “Pentaho Data Integration,” 2024. [Online]. Available: https://www.hitachivantara.com/en-us/products/pentaho-plus-platform.html
[17] B. Peng et al., “YaRN: Efficient Context Window Extension of Large Language Models,” arXiv:2309.00071, 2023
[18] T. Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” arXiv:2205.14135, 2022
[19] T. Chen et al., “Training Deep Nets with Sublinear Memory Cost,” arXiv:1604.06174, 2016
[20] H. Jin, Y. Wu, et al., “Rethinking Learning Rate Tuning in the Era of Large Language Models,” arXiv:2309.08859, 2023
[21] A. Pareja et al., “Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs,” arXiv:2412.13337, 2024
[22] J.W. Shim, “Enhancing cross entropy with a linearly adaptive loss function for optimized classification performance,” Scientific Reports, vol. 14, p. 27405, 2024
[23] C. An et al., “Training-Free Long-Context Scaling of Large Language Models,” arXiv:2402.17463, 2024
[24] R.S. Raju et al., “Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge,” in ACL CustomNLP4U Workshop, 2024
[25] E.B. Wilson, “Probable Inference, the Law of Succession, and Statistical Inference,” Journal of the American Statistical Association, vol. 22, no. 158, pp. 209-212, 1927
[26] Z. Li, J. Huang, and M. Naik, “Scallop: A Language for Neurosymbolic Programming,” in Proc. PLDI 2023, 2023
[27] Achiam et al., “GPT-4 Technical Report,” arXiv:2303.08774, 2023
[28] OpenAI, “o3-mini System Card,” 2025. [Online]. Available: https://cdn.openai.com/o3-mini-system-card-feb10.pdf
[29] N. Platt and P. Nayak, “Maritime-SLM-Training: Multi-Model Synthetic Generation and Fine-Tuning Pipeline,” Figshare, 2025. [Software]. doi: 10.6084/m9.figshare.29709053.v2
[30] N. Platt and P. Nayak, “AIS QA Dataset: Synthetic Question-Answer Dataset for Maritime Intelligence,” Figshare, 2025. [Dataset]. doi: 10.6084/m9.figshare.29710445.v1
