arxiv: 2605.19394 · v1 · pith:NCB355APnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI

EmbGen: Teaching with Reassembled Corpora

Pith reviewed 2026-05-20 06:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords synthetic data generation · domain adaptation · embedding similarity · entity reassembly · QA pair generation · instruction tuning · heterogeneous corpora · binary accuracy

The pith

EmbGen generates higher-quality synthetic QA data by decomposing domain corpora into entity-description pairs and reassembling them via embedding similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to reduce the cost of adapting small instruction-tuned models to specialized domains by creating better synthetic training examples from existing corpora. Current synthetic data pipelines tend to homogenize outputs and miss dependencies that span multiple passages or documents. EmbGen addresses this by first breaking a corpus into entity-description pairs, then using embedding similarity to infer semantic clusters and reassemble the pairs along them. It next generates question-answer pairs through proximity, intra-cluster, and inter-cluster sampling, each guided by cluster-specific prompts. On the most semantically heterogeneous dataset, the paper reports relative gains in Binary Accuracy (a composite of Factual Accuracy and Completeness) of 12.5% at a 5 million token budget and 88.9% at a 20 million token budget over the strongest baseline, while staying competitive on the less heterogeneous datasets.

Core claim

EmbGen is a synthetic data pipeline that decomposes a domain corpus into entity-description pairs, reassembles them into clusters based on embedding similarity to capture cross-passage structure, and produces QA pairs via proximity, intra-cluster, and inter-cluster sampling with specialized system prompts. It is evaluated against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets of differing heterogeneity under fixed 5M and 20M token budgets, using lexical overlap, LLM-as-judge, and Binary Accuracy metrics. On the most heterogeneous dataset, it improves Binary Accuracy by 12.5 percent at 5M tokens and 88.9 percent at 20M tokens, both relative to the strongest baseline.
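Because the headline numbers are relative gains, their size depends heavily on the baseline score. The sketch below shows only the arithmetic. The scores in it are hypothetical and chosen for illustration; they are not values reported in the paper, and `relative_gain` is a name introduced here.

```python
def relative_gain(baseline: float, candidate: float) -> float:
    """Relative improvement of candidate over baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (candidate - baseline) / baseline


# Hypothetical scores (NOT from the paper): the same 0.08 absolute gain
# measured against a small baseline and against a larger one.
low_base = relative_gain(0.09, 0.17)
high_base = relative_gain(0.64, 0.72)
print(f"small baseline: {low_base:.1f}%  larger baseline: {high_base:.1f}%")
```

The same absolute difference yields a much larger percentage when the baseline is small, which is why absolute scores matter when reading a relative claim.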

What carries the argument

Reassembly of entity-description pairs into semantic clusters using embedding similarity, followed by proximity, intra-cluster, and inter-cluster sampling for QA generation.
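A minimal sketch of how such reassembly and sampling could look, under stated assumptions: the embeddings are toy 2-D vectors, the clustering is a simple greedy threshold rule used as a stand-in (the page does not say which clustering algorithm the paper uses), and every function name and the `0.8` threshold are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_clusters(vectors, threshold=0.8):
    """Stand-in clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # lists of indices; the first index of each list is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def proximity_pair(vectors, anchor):
    """Pair the anchor with its nearest neighbour, ignoring cluster labels."""
    others = [j for j in range(len(vectors)) if j != anchor]
    return anchor, max(others, key=lambda j: cosine(vectors[anchor], vectors[j]))


def intra_cluster_pair(clusters, rng):
    """Two distinct entities drawn from the same cluster."""
    eligible = [c for c in clusters if len(c) >= 2]
    if not eligible:
        raise ValueError("no cluster has two or more members")
    return tuple(rng.sample(rng.choice(eligible), 2))


def inter_cluster_pair(clusters, rng):
    """One entity from each of two different clusters (needs >= 2 clusters)."""
    first, second = rng.sample(clusters, 2)
    return rng.choice(first), rng.choice(second)


# Toy embeddings for four entity descriptions (illustrative values only).
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
clusters = greedy_clusters(vectors)
rng = random.Random(0)
print("clusters:", clusters)
print("proximity:", proximity_pair(vectors, 0))
print("intra-cluster:", intra_cluster_pair(clusters, rng))
print("inter-cluster:", inter_cluster_pair(clusters, rng))
```

Each sampled pair of indices would then be handed to a teacher model together with a prompt for its cluster. That prompting step is omitted here.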

If this is right

  • Synthetic data can capture cross-passage and cross-document dependencies that existing pipelines miss.
  • Gains from the reassembly approach grow with larger token budgets on heterogeneous data.
  • The method delivers competitive results on lower-heterogeneity datasets under the same fixed token budgets.
  • Fixed-budget comparisons isolate the effect of data composition quality rather than data volume.
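One way a fixed-budget comparison can be enforced is to truncate each method's output at the same token count. The sketch below assumes whitespace tokenization and a hypothetical helper `take_within_budget`; the paper's actual tokenizer and truncation rule are not described on this page.

```python
def take_within_budget(qa_pairs, budget_tokens, count_tokens=None):
    """Keep QA pairs in order until adding the next one would exceed the budget."""
    # Assumption: whitespace splitting stands in for a real tokenizer.
    count_tokens = count_tokens or (lambda text: len(text.split()))
    kept, used = [], 0
    for question, answer in qa_pairs:
        cost = count_tokens(question) + count_tokens(answer)
        if used + cost > budget_tokens:
            break
        kept.append((question, answer))
        used += cost
    return kept, used


pairs = [
    ("What is A?", "A is x."),
    ("What is B?", "B is y."),
    ("What is C?", "C is z."),
]
kept, used = take_within_budget(pairs, budget_tokens=13)
print(len(kept), "pairs kept using", used, "tokens")
```

Applying the same cut-off to every pipeline means any remaining difference in scores comes from what the tokens contain rather than how many there are.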

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition-and-reassembly step could be applied to generate other synthetic formats such as dialogues or reasoning traces.
  • Direct end-to-end fine-tuning experiments would provide a stronger test than the proxy metric alone.
  • Embedding-based clustering may offer a general way to reduce homogenization in any teacher-generated synthetic corpus.

Load-bearing premise

Higher scores on the composed Binary Accuracy metric under fixed token budgets will correspond to better downstream supervised fine-tuning performance of small instruction-tuned models.

What would settle it

Generate training sets with EmbGen and the baselines, fine-tune the same small model on each set, and measure actual task performance on a held-out domain-specific benchmark to test whether the reported accuracy gains appear in the final model.

Figures

Figures reproduced from arXiv: 2605.19394 by Andrea Nicastro, Anna Leontjeva, Arun K Lenin, Kai Rouse.

Figure 1: EmbGen decomposition-and-reassembling pipeline.
Figure 2: Clustering and distribution diagnostics for POPQA
Figure 3: Clustering and distribution diagnostics for SQuAD
Figure 4: Clustering and distribution diagnostics for
Figure 5: System and user prompts used in entity extraction.
Figure 6: System and user prompts used in entity consolida
Figure 7: System and user prompts used in entity contradic
Figure 8: System and user prompts used in cluster context
Figure 9: System and user prompts used in context consoli
Figure 10: System and user prompts used in cluster-specific
Figure 11: Base system prompt used as the foundation for
Figure 12: System and user prompts used in proximity-based
Figure 13: System and user prompts used in multi-group QA
Figure 14: System and user prompts used in single-entity
Figure 15: System and user prompts used for LLM-as-a-judge
Figure 16: System prompt used for POPQA evaluation QA
Figure 17: System prompt used for WikiText evaluation QA
Figure 18: Prompt used for SQuAD-20 topic selection.
Original abstract

Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which is expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). We use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composed metric combining Factual Accuracy and Completeness for evaluation. EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at 5M and 88.9% at 20M tokens budget, relative to the strongest baseline, while remaining competitive across other datasets with lower heterogeneity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EmbGen, a synthetic data generation pipeline that decomposes a domain corpus into entity-description pairs, reassembles them via embedding similarity to capture semantic structure, and generates QA pairs through proximity, intra-cluster, and inter-cluster sampling with cluster-specialized prompts. It evaluates this approach against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets of varying semantic heterogeneity under fixed 5M and 20M token budgets, using lexical overlap, LLM-as-judge rubrics, and Binary Accuracy (a composite of Factual Accuracy and Completeness), claiming relative improvements of 12.5% at 5M tokens and 88.9% at 20M tokens on the most heterogeneous dataset while remaining competitive elsewhere.

Significance. If the proxy metrics used for evaluation correlate with improved downstream performance, EmbGen could advance efficient domain adaptation of small instruction-tuned models by producing more diverse synthetic QA data that better preserves cross-passage and cross-document dependencies, reducing reliance on expensive human-curated SFT datasets.

major comments (2)
  1. [Evaluation] Evaluation section: The central claim that EmbGen yields superior synthetic data for SFT is supported only by proxy metrics (Binary Accuracy, lexical overlap, LLM-as-judge); the manuscript reports no experiments that actually perform supervised fine-tuning on the generated corpora and measure resulting model performance (e.g., accuracy or F1 on held-out domain queries), leaving the link between higher proxy scores and better adapted models untested.
  2. [Abstract and Results] Abstract and §4 (Results): The reported 12.5% and 88.9% relative Binary Accuracy gains on the most heterogeneous dataset lack accompanying details on dataset characteristics (size, domain, heterogeneity quantification), exact baseline implementations, statistical significance testing, or controls for prompt engineering variations, weakening the data-to-claim connection.
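As an illustration of the kind of significance check the second comment asks for, here is a minimal paired bootstrap over per-example binary scores. It is a sketch under assumptions: the function name, the resample count, and the toy scores are hypothetical, and nothing here reflects a test the paper actually ran.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of resamples in which system A does NOT outscore system B."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping A/B scores paired.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy per-example pass/fail scores (1 = judged correct), illustrative only.
system_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
system_b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
print("share of resamples where A is not better:", paired_bootstrap(system_a, system_b))
```

A small returned fraction suggests the observed gap is unlikely to be a sampling artefact of the evaluation set; a large one suggests the gap is fragile.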
minor comments (2)
  1. [Evaluation] The exact formula or weighting scheme for the composed Binary Accuracy metric (Factual Accuracy plus Completeness) should be stated explicitly, including any normalization or thresholds applied.
  2. [Results] Figure captions and table headers could more clearly indicate the token budget and dataset heterogeneity level for each reported result to improve readability.
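To make the first minor comment concrete, the sketch below shows one possible reading of a composed metric, assuming an answer counts as correct only when it is judged both factually accurate and complete. This conjunction rule is an assumption made for illustration; the page does not state the paper's actual formula, and the `Judgement` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    factually_accurate: bool
    complete: bool


def binary_accuracy(judgements):
    """Share of answers passing BOTH checks (assumed composition rule)."""
    if not judgements:
        return 0.0
    passed = sum(1 for j in judgements if j.factually_accurate and j.complete)
    return passed / len(judgements)


# Toy judgements, illustrative only.
sample = [
    Judgement(True, True),
    Judgement(True, False),
    Judgement(False, True),
    Judgement(True, True),
]
print(binary_accuracy(sample))
```

Other compositions, such as a weighted sum with a threshold, would give different scores on the same judgements, which is why the comment asks for the rule to be stated explicitly.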

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's constructive feedback on our manuscript. We have carefully considered each major comment and provide our responses below, along with indications of planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The central claim that EmbGen yields superior synthetic data for SFT is supported only by proxy metrics (Binary Accuracy, lexical overlap, LLM-as-judge); the manuscript reports no experiments that actually perform supervised fine-tuning on the generated corpora and measure resulting model performance (e.g., accuracy or F1 on held-out domain queries), leaving the link between higher proxy scores and better adapted models untested.

    Authors: We agree that direct evaluation through supervised fine-tuning (SFT) on the generated data and subsequent measurement of model performance on held-out queries would provide the most compelling evidence for the superiority of EmbGen. The current work focuses on developing and evaluating the synthetic data generation pipeline using established proxy metrics that have been used in prior work on synthetic data for instruction tuning. We acknowledge this as a limitation and, in the revised manuscript, will include a new subsection in the Discussion or Limitations section explicitly stating that downstream SFT experiments are left for future work. We will also elaborate on why the chosen proxies (particularly Binary Accuracy) are expected to correlate with SFT performance based on their design to measure factual accuracy and completeness.
    revision: partial

  2. Referee: [Abstract and Results] Abstract and §4 (Results): The reported 12.5% and 88.9% relative Binary Accuracy gains on the most heterogeneous dataset lack accompanying details on dataset characteristics (size, domain, heterogeneity quantification), exact baseline implementations, statistical significance testing, or controls for prompt engineering variations, weakening the data-to-claim connection.

    Authors: We will revise the abstract and Section 4 to provide additional details. Specifically, we will include the sizes and domains of the three datasets, a measure of semantic heterogeneity (such as the variance in embedding similarities or number of clusters formed), more precise descriptions of how the baselines were implemented (including any shared prompt templates), results of statistical significance tests for the reported gains, and a note on controlling for prompt variations by using consistent prompting strategies across methods. These additions will strengthen the connection between the data and our claims.
    revision: yes
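The rebuttal mentions variance in embedding similarities as a candidate heterogeneity measure. A minimal sketch of that idea follows, assuming cosine similarity over all pairs and population variance; the function names and toy vectors are hypothetical, and the measure the authors would actually adopt is not specified.

```python
import math
from itertools import combinations
from statistics import pvariance


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_variance(vectors):
    """Population variance of all pairwise cosine similarities."""
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return pvariance(sims) if sims else 0.0


# Toy embeddings, illustrative only.
homogeneous = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
heterogeneous = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(similarity_variance(homogeneous), similarity_variance(heterogeneous))
```

A corpus whose items all point the same way scores zero, while one that mixes unrelated topics scores higher. Whether this tracks the paper's notion of heterogeneity would need to be checked against its datasets.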

Circularity Check

0 steps flagged

No circularity: procedural pipeline evaluated on external baselines

The paper presents EmbGen as a procedural pipeline that decomposes a corpus into entity-description pairs, reassembles them via embedding similarity, and generates QA pairs through proximity/intra-cluster/inter-cluster sampling with specialized prompts. Evaluation relies on lexical overlap, LLM-as-judge rubric, and Binary Accuracy (Factual Accuracy + Completeness) under fixed token budgets, compared directly against independent external baselines (EntiGraph, InstructLab, Knowledge-Instruct) on separate datasets of varying heterogeneity. No equations, fitted parameters, self-definitional quantities, or load-bearing self-citations appear in the derivation or claims; results are reported as empirical improvements relative to those baselines rather than reductions to inputs by construction. The central claims therefore remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that embedding similarity reliably reflects semantic structure suitable for reassembly and that cluster-specialized prompts improve QA quality; no free parameters or invented entities are mentioned.

axioms (1)
  • Domain assumption: Embedding similarity can be used to infer semantic structure for reassembling entity-description pairs across passages or documents.
    This premise underpins the reassembly step and is invoked to address limitations of existing pipelines.

pith-pipeline@v0.9.0 · 5749 in / 1342 out tokens · 57567 ms · 2026-05-20T06:30:10.179824+00:00 · methodology

