
arxiv: 2606.29243 · v1 · pith:NWNTAANWnew · submitted 2026-06-28 · 💻 cs.LG

KrishokChat: A Citation-Grounded Dataset and Benchmark for Bengali Agricultural Advisory

Pith reviewed 2026-06-30 08:26 UTC · model grok-4.3

classification 💻 cs.LG
keywords Bengali agricultural dataset · citation-grounded QA · instruction tuning · farmer benchmark · knowledge nodes · RAG knowledge base · crop advisory · low-resource language

The pith

KrishokChat is a Bengali agricultural dataset where every training instance carries a verified citation from official manuals, ensuring full provenance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces KrishokChat, which it describes as the first citation-grounded dataset for Bengali crop advisory. It extracts 290 knowledge nodes from 129 manuals and expands them into 139,200 fine-tuning pairs; with 5,300 chemical safety and 1,000 adversarial safety instances added, the total is 145,500 QA pairs across 18 crop categories, each carrying a citation header. Fine-tuning improves response structure, but standalone models still struggle to generalize exact chemical dosages. The authors therefore position the dataset as a verified knowledge base for retrieval-augmented systems in low-resource settings where farmers need trustworthy advice.

Core claim

We present KrishokChat, the first citation-grounded Bengali agricultural instruction-tuning dataset for crop advisory in low-resource settings. We establish a foundation of 290 hierarchical Knowledge Nodes, extracting disease symptoms, management practices, chemical dosages, and verbatim citations from 129 domain-filtered agricultural manuals. Every training instance inherits a verified citation header, guaranteeing 100% citation provenance. Using a Partitioned Seed Generation Matrix, these nodes are expanded into 139,200 supervised fine-tuning pairs, and augmented with 5,300 chemical safety and 1,000 adversarial safety instances, yielding 145,500 QA pairs across 18 crop categories. To evalu

What carries the argument

290 hierarchical Knowledge Nodes extracted from 129 agricultural manuals, each carrying verbatim citations that propagate to all generated QA pairs via the Partitioned Seed Generation Matrix.

If this is right

  • Fine-tuning on KrishokChat improves structured formatting of agricultural advisory responses.
  • Standalone models still struggle with exact chemical dosage generalization even after fine-tuning.
  • The dataset functions best as a verified knowledge base for retrieval-augmented generation rather than for parametric memorization alone.
  • The Farmer Benchmark of 1,001 real queries offers a realistic evaluation of advisory performance across 18 crop categories.
  • Addition of 5,300 chemical safety and 1,000 adversarial safety instances addresses risks in generated advice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The citation provenance approach could extend to other low-resource language domains to reduce hallucination in specialized advice.
  • Pairing the dataset with retrieval systems might mitigate the observed dosage generalization failures in standalone models.
  • The benchmark could serve as a template for testing AI tools in additional agricultural or health domains where source traceability matters.
  • Widespread adoption might support development of mobile apps that deliver traceable crop guidance to farmers in Bengali-speaking regions.

Load-bearing premise

The 129 domain-filtered agricultural manuals are accurate, complete, and free of contradictions, and the extraction process into 290 knowledge nodes preserves all necessary details without introducing errors or losing citation links.

What would settle it

A manual audit that identifies any QA pair whose citation header references a manual section lacking the stated chemical dosage or symptom description.
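
An automated pre-filter could narrow down which pairs a human auditor looks at first. The following is a minimal sketch, not the paper's procedure: the record fields (`id`, `section_id`, `dosage`), the section keys, and the toy data are all hypothetical, and plain substring matching stands in for whatever comparison a real audit would use.

```python
def flag_for_audit(qa_pairs, manual_sections):
    """Return ids of QA pairs whose stated dosage is absent from the cited section."""
    flagged = []
    for pair in qa_pairs:
        # A citation pointing at an unknown section yields empty text and is flagged.
        section_text = manual_sections.get(pair["section_id"], "")
        if pair["dosage"] not in section_text:
            flagged.append(pair["id"])
    return flagged


# Hypothetical toy data, invented for illustration only.
manual_sections = {
    "rice/blast/3.2": "Spray the recommended fungicide at 2 g/L at first symptoms.",
}
qa_pairs = [
    {"id": "qa-1", "section_id": "rice/blast/3.2", "dosage": "2 g/L"},
    {"id": "qa-2", "section_id": "rice/blast/3.2", "dosage": "5 g/L"},
    {"id": "qa-3", "section_id": "rice/blast/9.9", "dosage": "2 g/L"},
]

print(flag_for_audit(qa_pairs, manual_sections))  # ['qa-2', 'qa-3']
```

A single flagged pair that survives manual inspection would be enough to contradict the 100% provenance claim; an empty result from this filter would not confirm it, since substring matching cannot detect paraphrased or unit-converted dosages.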

Original abstract

We present KrishokChat, the first citation-grounded Bengali agricultural instruction-tuning dataset for crop advisory in low-resource settings. We establish a foundation of 290 hierarchical Knowledge Nodes, extracting disease symptoms, management practices, chemical dosages, and verbatim citations from 129 domain-filtered agricultural manuals. Every training instance inherits a verified citation header, guaranteeing 100% citation provenance. Using a Partitioned Seed Generation Matrix, these nodes are expanded into 139,200 supervised fine-tuning pairs, and augmented with 5,300 chemical safety and 1,000 adversarial safety instances, yielding 145,500 QA pairs across 18 crop categories. To evaluate real-world performance, we introduce the Farmer Benchmark, comprising 1,001 authentic farmer queries curated from field surveys and digital portals. Empirical evaluation on Gemma-4-E2B reveals that while fine-tuning on KrishokChat vastly improves structured formatting, standalone models still struggle with exact chemical dosage generalization. This highlights the dataset's true value as a verified knowledge base for retrieval-augmented generation (RAG) rather than mere parametric memorization. All data, code, and benchmarks are released under CC-BY-4.0.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces KrishokChat, the first citation-grounded Bengali agricultural instruction-tuning dataset. It extracts disease symptoms, management practices, chemical dosages, and verbatim citations from 129 domain-filtered manuals to form 290 hierarchical Knowledge Nodes. These are expanded via a Partitioned Seed Generation Matrix into 139,200 supervised fine-tuning pairs, augmented with 5,300 chemical safety and 1,000 adversarial instances to reach 145,500 QA pairs across 18 crops. The work claims every instance inherits a verified citation header guaranteeing 100% provenance. It also releases the Farmer Benchmark of 1,001 authentic farmer queries and reports that fine-tuning Gemma-4-E2B on the dataset improves structured formatting while standalone models struggle with exact chemical dosage generalization, positioning the dataset primarily as a verified knowledge base for RAG.

Significance. If the extraction fidelity and citation-provenance guarantee hold, the dataset would constitute a meaningful resource for low-resource-language agricultural advisory systems, particularly by enabling RAG over parametric memorization. The public release of data, code, and benchmarks under CC-BY-4.0 is a clear strength that supports reproducibility and downstream use.

major comments (3)
  1. [Abstract and §3] (dataset construction): The central claim that 'every training instance inherits a verified citation header, guaranteeing 100% citation provenance' is unsupported. No verification mechanism (manual review, automated checks, sampling protocol, or inter-annotator agreement) is described for header attachment, either during extraction from the 129 manuals into the 290 nodes or during the Partitioned Seed Generation Matrix expansion to 139,200 pairs.
  2. [§3] (extraction pipeline): The weakest assumption is that the 129 manuals are accurate, complete, and free of contradictions, and that the mapping to 290 nodes preserves all details without introducing errors or dropping citation links. It receives no quantitative validation, error analysis, or fidelity metrics, which leaves the 100% provenance guarantee an untested pipeline assumption.
  3. [§4] (evaluation): The statements that fine-tuning 'vastly improves structured formatting' and that models 'still struggle with exact chemical dosage generalization' are presented without quantitative results, baselines, or error breakdowns, which prevents assessment of the claimed performance patterns or the benchmark's utility.
minor comments (1)
  1. [Abstract] The abstract states 139,200 pairs from the matrix plus 6,300 safety/adversarial instances to reach 145,500; a brief consistency check or explicit breakdown table would clarify the arithmetic.
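
The consistency check requested in the minor comment is short enough to write out. The figures below are the ones stated in the abstract; the variable names are illustrative.

```python
# Counts as stated in the abstract.
sft_pairs = 139_200          # pairs produced by the Partitioned Seed Generation Matrix
chemical_safety = 5_300      # chemical safety instances
adversarial_safety = 1_000   # adversarial safety instances

safety_total = chemical_safety + adversarial_safety
dataset_total = sft_pairs + safety_total

print(safety_total)   # 6300
print(dataset_total)  # 145500
```

The stated components do sum to the reported 145,500, so the arithmetic is consistent; an explicit breakdown table would still spare readers the check.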

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which identifies key areas where the manuscript's claims require additional methodological detail and empirical support. We address each major comment below and commit to revisions that strengthen the paper without altering its core contributions.

Point-by-point responses
  1. Referee: [Abstract and §3] (dataset construction): The central claim that 'every training instance inherits a verified citation header, guaranteeing 100% citation provenance' is unsupported. No verification mechanism (manual review, automated checks, sampling protocol, or inter-annotator agreement) is described for header attachment, either during extraction from the 129 manuals into the 290 nodes or during the Partitioned Seed Generation Matrix expansion to 139,200 pairs.

    Authors: We agree the verification process must be described explicitly. Citation headers were attached verbatim by domain experts during manual node construction from the source manuals; the Partitioned Seed Generation Matrix then carries these headers forward unchanged. We will add a dedicated subsection in §3 describing the manual extraction protocol, expert review steps, and post-construction sampling for header fidelity. This will directly support the provenance claim.

    Revision: yes

  2. Referee: [§3] (extraction pipeline): The weakest assumption is that the 129 manuals are accurate, complete, and free of contradictions, and that the mapping to 290 nodes preserves all details without introducing errors or dropping citation links. It receives no quantitative validation, error analysis, or fidelity metrics, which leaves the 100% provenance guarantee an untested pipeline assumption.

    Authors: This observation is correct; the manuscript treats the official manuals as authoritative without quantitative fidelity checks on the node mapping. We will revise §3 and add a limitations paragraph acknowledging the assumption and its implications. If feasible, we will include a small-scale manual fidelity sample; otherwise we will moderate the '100% guarantee' phrasing to reflect direct extraction from authoritative sources.

    Revision: partial

  3. Referee: [§4] (evaluation): The statements that fine-tuning 'vastly improves structured formatting' and that models 'still struggle with exact chemical dosage generalization' are presented without quantitative results, baselines, or error breakdowns, which prevents assessment of the claimed performance patterns or the benchmark's utility.

    Authors: We acknowledge the evaluation lacks the requested quantitative detail. Internal experiments on the Farmer Benchmark produced concrete metrics (formatting adherence rates and dosage exact-match scores) that were summarized qualitatively. We will expand §4 with tables reporting these metrics, baseline comparisons, and error breakdowns to enable full assessment of the results.

    Revision: yes
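
The rebuttal names a "dosage exact-match" score without defining it. One plausible reading is sketched below; it is an assumption, not the authors' metric. The regular expression, the small ASCII unit list, and the example sentences are hypothetical, and Bengali-script numerals or units would need their own normalization step.

```python
import re

# Hypothetical pattern: an amount, a unit, a slash, and a "per" unit, e.g. "2 g/L".
DOSAGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(g|ml|kg)\s*/\s*(l|ha)", re.IGNORECASE)


def extract_dosages(text):
    """Return the set of normalized (amount, unit, per) tuples found in the text."""
    return {
        (float(amount), unit.lower(), per.lower())
        for amount, unit, per in DOSAGE_RE.findall(text)
    }


def dosage_exact_match(predictions, references):
    """Fraction of responses whose dosage set equals the reference dosage set."""
    hits = sum(
        extract_dosages(pred) == extract_dosages(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# Invented examples: the first prediction matches its reference, the second does not.
references = ["Apply 2 g/L at first symptoms.", "Use 1.5 ml/L weekly."]
predictions = ["Spray at 2 g / L early.", "Use 2.5 ml/L weekly."]

print(dosage_exact_match(predictions, references))  # 0.5
```

Comparing sets of normalized tuples ignores surrounding wording and spacing, so only the numeric dosage is scored. Under this definition a response and reference that both contain no dosage count as a match, which a real evaluation might prefer to exclude.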

Circularity Check

0 steps flagged

No circularity: dataset construction with no derivation or fitted predictions

Full rationale

The paper constructs and releases a Bengali agricultural QA dataset from 129 manuals via knowledge nodes and a generation matrix. No equations, model fits, or predictions appear that could reduce to inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The provenance guarantee is an assertion about the pipeline rather than a mathematical reduction, and the work is self-contained as data release. This matches the default expectation of no significant circularity for non-derivational papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the reliability of the source manuals and the fidelity of the extraction and expansion process; these are domain assumptions rather than derived quantities.

axioms (2)
  • domain assumption: The 129 agricultural manuals are accurate and authoritative sources for crop symptoms, management practices, and chemical dosages.
    All knowledge nodes and citations are extracted directly from these manuals.
  • ad hoc to paper: The Partitioned Seed Generation Matrix expands the 290 nodes into consistent, non-contradictory QA pairs while preserving citation headers.
    This is the stated mechanism for scaling from nodes to 139,200 pairs.
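
The second axiom can be pictured as a Cartesian product in which each generated pair copies its parent node's header. The sketch below is an illustration under assumptions: the class names, field names, and placeholder strings are hypothetical, the 32 thematic seeds and 15 query registers come from the appendix fragment quoted in the reference graph on this page, and the actual text generation step is not modeled.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class KnowledgeNode:
    node_id: str
    citation_header: str  # verbatim citation attached at extraction time


@dataclass(frozen=True)
class QAPair:
    node_id: str
    seed: str
    register: str
    citation_header: str  # copied unchanged from the parent node


def expand(nodes, seeds, registers):
    """Cross every node with every (seed, register) cell, copying the header."""
    return [
        QAPair(node.node_id, seed, register, node.citation_header)
        for node, seed, register in product(nodes, seeds, registers)
    ]


# Placeholder inputs sized to the counts reported for the dataset.
nodes = [KnowledgeNode(f"node-{i}", f"[Source: manual-{i}]") for i in range(290)]
seeds = [f"seed-{i}" for i in range(32)]
registers = [f"register-{i}" for i in range(15)]

pairs = expand(nodes, seeds, registers)
headers = {node.node_id: node.citation_header for node in nodes}

print(len(pairs))  # 139200
print(all(p.citation_header == headers[p.node_id] for p in pairs))  # True
```

The product 290 × 32 × 15 reproduces the reported 139,200 pairs. Copying a header guarantees only that each pair points at the node's source; whether the generated answer text still agrees with that source is the part the axiom leaves unverified.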

pith-pipeline@v0.9.1-grok · 5740 in / 1286 out tokens · 40609 ms · 2026-06-30T08:26:33.968497+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 8 canonical work pages

  1. [1]

    Bangladesh: Data and statistics,

    World Bank, “Bangladesh: Data and statistics,” https://data.worldbank.org/country/bangladesh, 2024

  2. [2]

    Annual report 2023,

    Department of Agricultural Extension, “Annual report 2023,” Ministry of Agriculture, Government of Bangladesh, Tech. Rep., 2023

  3. [3]

    BEnQA: A question answering and reasoning benchmark for bengali and english,

    S. Shafayat, H. M. Hasan, M. R. C. Mahim, R. A. Putri, J. Thorne, and A. Oh, “BEnQA: A question answering and reasoning benchmark for bengali and english,” arXiv preprint arXiv:2403.10900, 2024

  4. [4]

    BanglaQuAD: A bengali open-domain question answering dataset,

    M. R. A. H. Rony, S. K. Shaha, R. A. Hasan, S. K. Dey, A. H. Rafi, A. H. Sirajee, and J. Lehmann, “BanglaQuAD: A bengali open-domain question answering dataset,” arXiv preprint arXiv:2410.10229, 2024

  5. [5]

    AgriGPT: a Large Language Model Ecosystem for Agriculture

    B. Yang, Y. Zhang, L. Feng, Y. Chen, J. Zhang, X. Xu, others, and S. Li, “AgriGPT: A large language model ecosystem for agriculture,” arXiv preprint arXiv:2508.08632, 2025

  6. [6]

    AgroGPT: Efficient agricultural vision-language model with expert tuning,

    M. Awais, A. H. S. A. Alharthi, A. Kumar, H. Cholakkal, and R. M. Anwer, “AgroGPT: Efficient agricultural vision-language model with expert tuning,” in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, February 2025, pp. 5687–5696

  7. [7]

    Farmer.Chat: Scaling AI-powered agricultural services for smallholder farmers,

    N. Singh, J. Wang’ombe, N. Okanga, T. Zelenska, J. Repishti, S. Mishra, others, and A. Nambi, “Farmer.Chat: Scaling AI-powered agricultural services for smallholder farmers,” arXiv preprint arXiv:2409.08916, 2024

  8. [8]

    KrishokBondhu: A retrieval-augmented voice-based agricultural advisory call center for bengali farmers,

    M. R. Ameen, A. Islam, F. Aktar, and M. S. Rafat, “KrishokBondhu: A retrieval-augmented voice-based agricultural advisory call center for bengali farmers,” in 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN). IEEE, April 2026, pp. 1–6

  9. [9]

    AgriLLM: harnessing transformers for farmer queries,

    K. Didwania, P. Seth, A. Kasliwal, and A. Agarwal, “AgriLLM: harnessing transformers for farmer queries,” in Proceedings of the Third Workshop on NLP for Positive Impact, November 2024, pp. 179–187

  10. [10]

    AgroLLM: Connecting farmers and agricultural practices through large language models for enhanced knowledge transfer and practical application,

    D. J. S. Ravindran, I. Skarga-Bandurova, S. V, M. Awais, and M. S, “AgroLLM: Connecting farmers and agricultural practices through large language models for enhanced knowledge transfer and practical application,” AgriEngineering, vol. 8, no. 1, p. 38, 2026

  11. [11]

    AgriBERT: Knowledge-infused agricultural language models for matching food and nutrition,

    S. Rezayi, Z. Liu, Z. Wu, C. Dhakal, B. Ge, C. Zhen, others, and S. Li, “AgriBERT: Knowledge-infused agricultural language models for matching food and nutrition,” in IJCAI, vol. 2022, no. 2, July 2022, p. 3

  12. [12]

    Self-Instruct: Aligning language models with self-generated instructions,

    Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-Instruct: Aligning language models with self-generated instructions,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 2023, pp. 13484–13508

  13. [13]

    Stanford Alpaca: An instruction-following LLaMA model,

    R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, others, and T. B. Hashimoto, “Stanford Alpaca: An instruction-following LLaMA model,” Stanford University, Tech. Rep., March 2023

  14. [14]

    WizardLM: Empowering large pre-trained language models to follow complex instructions,

    C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, others, and D. Jiang, “WizardLM: Empowering large pre-trained language models to follow complex instructions,” in International Conference on Learning Representations, vol. 2024, May 2024, pp. 30745–30766

  15. [15]

    Leveraging synthetic data for question answering with multilingual LLMs in the agricultural domain,

    R. Kaur, A. S. Bhankhar, J. S. Salh, S. Rajput, K. Mahendra, B. Berwal, others, and S. Ranathunga, “Leveraging synthetic data for question answering with multilingual LLMs in the agricultural domain,” arXiv preprint arXiv:2507.16974, 2025

  16. [16]

    Data diversity matters for robust instruction tuning,

    A. Bukharin, S. Li, Z. Wang, J. Yang, B. Yin, X. Li, others, and H. Jiang, “Data diversity matters for robust instruction tuning,” in Findings of the Association for Computational Linguistics: EMNLP 2024, November 2024, pp. 3411–3425

  17. [17]

    Evaluating the diversity and quality of LLM generated content,

    A. Shypula, S. Li, B. Zhang, V. Padmakumar, K. Yin, and O. Bastani, “Evaluating the diversity and quality of LLM generated content,” arXiv preprint arXiv:2504.12522, 2025

  18. [18]

    Measuring data diversity for instruction tuning: A systematic analysis and a reliable metric,

    Y. Yang, Y. Nan, J. Ye, S. Dou, X. Wang, S. Li, others, and X. J. Huang, “Measuring data diversity for instruction tuning: A systematic analysis and a reliable metric,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 2025, pp. 18530–18549

  19. [19]

    Cost-efficient cross-lingual retrieval-augmented generation for low-resource languages: A case study in bengali agricultural advisory,

    M. A. Hossain, N. Subhan, M. R. Mahi, and J. F. Nabila, “Cost-efficient cross-lingual retrieval-augmented generation for low-resource languages: A case study in bengali agricultural advisory,” arXiv preprint arXiv:2601.02065, 2026

  20. [20]

    Gemini 3.1 Flash-Lite — google deepmind,

    Google DeepMind, “Gemini 3.1 Flash-Lite — google deepmind,” https://deepmind.google/models/gemini/flash-lite/, 2025

  21. [21]

    GPT-5.5: Next-generation language model,

    OpenAI, “GPT-5.5: Next-generation language model,” https://openai.com/blog/gpt-5-5, 2026

  22. [22]

    A coefficient of agreement for nominal scales,

    J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960

  23. [23]

    QLoRA: Efficient finetuning of quantized LLMs,

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” Advances in Neural Information Processing Systems, vol. 36, pp. 10088–10115, 2023

  24. [24]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,

    L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, others, and I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623, 2023

  25. [25]

    DeepSeek-V4: Towards highly efficient million-token context intelligence,

    A. Xu, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, others, and S. Wu, “DeepSeek-V4: Towards highly efficient million-token context intelligence,” arXiv preprint arXiv:2606.19348, 2026

  26. [26]

    A simple sequentially rejective multiple test procedure,

    S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian Journal of Statistics, pp. 65–70, 1979.

    Appendix A: Bengali Agricultural Glossary. A. Glossary Composition. The complete 1,705-term agricultural glossary comprises four categories: • Disease terms (668): Common and scientific names of crop diseases in English with their ...

  27. [27]

    Layout-aware PDF parsing (Marker framework)

  28. [28]

    AST-based semantic extraction (header isolation, span extraction)

  29. [29]

    Semantic boundary enforcement (100–500 token range)

  30. [30]

    Cryptographic lineage injection (SHA-256 hashing)

  31. [31]

    Automated quality validation (5 deterministic gates)

  32. [32]

    LLM-in-the-loop semantic verification (gpt-5.5)

  33. [33]

    Four-stage QC (format, length, redundancy, chemical)

    Contextual & epistemic tagging (spatiotemporal + corroboration)

    c) Stage 3: PSGM Instruction Synthesis: 290 Knowledge Nodes × 32 thematic seeds × 15 query registers. Four-stage QC (format, length, redundancy, chemical). Final yield: 139,200 SFT pairs + 5,300 chemical safety + 1,000 adversarial safety = 145,500 total.

    F. Failure Analysis and Error Taxono...

  34. [34]

    Cross-document dosage contradictions (42%): Different agencies recommended different application rates for the same chemical on the same crop. These cases were escalated to an expert adjudication protocol, in which an agronomist reviewed conflicting sources and selected the DAE-recommended dosage, informing the 400-ingredient chemical whitelist.

  35. [35]

    apply 2 g/L

    Context-collapsed chemical applications (31%): A management practice was extracted without its chemical context (e.g., “apply 2 g/L” without specifying the active ingredient), making the node unusable for safety-critical advisory

  36. [36]

    Layout-induced symptom–treatment mismatches (18%): Table parsing errors caused treatment recommendations from one row to be paired with symptom descriptions from an adjacent row in multi-column layouts

  37. [37]

    Figure 3 visualizes the rejection taxonomy

    Translation glossary failures (9%): Low-frequency English technical terms not covered by the 1,705-term glossary, resulting in Bengali transliterations that diverged from DAE conventions. Figure 3 visualizes the rejection taxonomy.

    Figure 3. Distribution of failure modes among 145 rejected candidate nodes. These patterns informed the chemical whitelist, ...