pith. sign in

arxiv: 2606.05522 · v1 · pith:L76AZT4Fnew · submitted 2026-06-03 · 💻 cs.SD · cs.AI· eess.AS

Exploring LLMs for South Asian Music Understanding and Generation

Pith reviewed 2026-06-28 04:18 UTC · model grok-4.3

classification 💻 cs.SD cs.AIeess.AS
keywords large language modelsSouth Asian classical musicragatalamusic understandingmusic generationcultural AI
0
0 comments X

The pith

Frontier LLMs achieve 85-90% accuracy on South Asian classical music understanding but only 40% stylistic faithfulness in generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts the first systematic evaluation of large language models on South Asian classical music, which relies on raga and tala structures rather than Western harmony. It introduces a 504-question benchmark to test understanding of raga grammar, cultural knowledge, and notation, where top models score 85-90% while most open-source ones score 23-40%. For generation, a five-level prompting setup shows that even the best model achieves stylistic faithfulness only 40% of the time. This demonstrates that structural correctness and cultural style are separate skills for LLMs in this domain.

Core claim

We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga, tala-based melodic constraints that impose fundamentally different structural principles from Western harmony-driven music. For music understanding evaluation, we introduce a 504-question-answer benchmark spanning raga grammar, cultural knowledge, and symbolic notation reasoning, evaluating 33 LLMs where frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful ou

What carries the argument

504-question benchmark for understanding and five-level controlled prompting framework for generation, grounded in Hindustani classical theory and Bengali classical forms.

If this is right

  • Frontier models significantly outperform open-source models on South Asian music understanding tasks.
  • Stylistic faithfulness in music generation remains a distinct and harder objective than structural validity.
  • Current LLMs face an open challenge in culturally grounded music modeling for low-resource traditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • LLM training data likely underrepresents South Asian musical traditions compared to Western ones.
  • The benchmark and prompting methods could be extended to assess other regional music systems.

Load-bearing premise

The 504-question benchmark and five-level prompting framework accurately measure competence in raga- and tala-based traditions without cultural bias or incomplete coverage of the musical grammar.

What would settle it

If a new benchmark with broader coverage shows open-source models performing closer to frontier ones or if generation faithfulness exceeds 60% for top models, the performance gap and distinct objectives claim would be weakened.

Figures

Figures reproduced from arXiv: 2606.05522 by Faria Binte Kader, Mohtasim Hadi Rafi, Santu Karmaker, Shah Wasif Sajjad.

Figure 1
Figure 1. Figure 1: LLMs’ South Asian Musical Capabilities Evaluation Pipeline [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Heatmap of 33 model accuracy scores across three South Asian music understanding subtasks. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Heatmaps of automatic evaluation metrics across models and prompt levels. Rows correspond [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Pearson and Spearman correlations between each automatic evaluation metric and human Style [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Pearson correlation between Music understanding tasks vs Human evaluation crite￾ria scores on the generation task for the top 9 models. MTU=Music Theory Understand￾ing, MGK=Music General Knowledge, MC=Music Continuation; the x-axis presents structure, emo￾tion, adherence to instruction, harmoniousness, genre, and style accuracy, respectively. 5.2.3 Limitations in Automatic Evaluation (Kader and Karmaker, 2… view at source ↗
Figure 7
Figure 7. Figure 7: Number of questions per sub-theme B Prompts for Music Generation [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
read the original abstract

Recent advancements in Large Language Models (LLMs) have shown promising results in music understanding and generation tasks. However, existing works remain confined to Western tonal traditions, offering little insight into whether current LLMs can handle structurally distinct low-resource musical traditions. We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga, tala-based melodic constraints that impose fundamentally different structural principles from Western harmony-driven music. We ground our evaluation in Hindustani classical theory and Bengali classical forms, including Rabindra and Nazrul Sangeet -- representative low-resource traditions within South Asian classical music. For music understanding evaluation, we introduce a 504-question-answer benchmark spanning raga grammar, cultural knowledge, and symbolic notation reasoning, evaluating 33 LLMs where frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful outputs only 40% of the time. These results reveal that structural validity and stylistic faithfulness in music generation are distinct objectives and highlight an open challenge for culturally grounded music modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to deliver the first systematic evaluation of LLMs on South Asian classical music (Hindustani and Bengali traditions including Rabindra and Nazrul Sangeet). It introduces a 504-question benchmark spanning raga grammar, cultural knowledge, and symbolic notation, reports that frontier models (e.g., Gemini 2.5 Pro) reach 85-90% accuracy while most open-source models score 23-40%, and uses a five-level controlled prompting framework for generation that yields only 40% stylistically faithful outputs even for the strongest model, concluding that structural validity and stylistic faithfulness are distinct objectives.

Significance. If the benchmark and scoring protocols prove robust, the work supplies the first quantitative baseline for LLM performance on raga- and tala-governed traditions, exposing a clear performance gap between frontier and open-source models and isolating an open modeling challenge for culturally grounded music generation.

major comments (2)
  1. [Abstract and evaluation-design paragraphs] Abstract and evaluation-design paragraphs: the headline accuracy figures (85-90% frontier vs. 23-40% open-source) and the 40% faithfulness rate are load-bearing for the central claims, yet the manuscript supplies no description of question sourcing, coverage audit of raga grammar (swara rules, pakad, vadi-samvadi), tala cycles, inter-rater agreement, or leakage checks; without these the measured competence cannot be distinguished from benchmark artifacts.
  2. [Generation framework section] Generation framework section: the claim that structural validity and stylistic faithfulness are distinct objectives rests on the five-level prompting results, but the manuscript does not define the scoring rubric for 'stylistically faithful,' report expert validation process, or provide statistical comparison between the two metrics; this leaves the distinction unsupported.
minor comments (1)
  1. [Abstract] The abstract states the benchmark is 'grounded in Hindustani classical theory and Bengali classical forms' without indicating how this grounding shaped question selection or answer keys.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional methodological details where needed to strengthen the presentation of our benchmark and results.

read point-by-point responses
  1. Referee: [Abstract and evaluation-design paragraphs] Abstract and evaluation-design paragraphs: the headline accuracy figures (85-90% frontier vs. 23-40% open-source) and the 40% faithfulness rate are load-bearing for the central claims, yet the manuscript supplies no description of question sourcing, coverage audit of raga grammar (swara rules, pakad, vadi-samvadi), tala cycles, inter-rater agreement, or leakage checks; without these the measured competence cannot be distinguished from benchmark artifacts.

    Authors: We agree that the current manuscript would benefit from expanded methodological transparency on the benchmark construction. In the revised version, we will add a dedicated subsection under the evaluation design that details: question sourcing from standard references in Hindustani and Bengali music theory; a coverage audit confirming inclusion of swara rules, pakad, vadi-samvadi relations, and tala structures; inter-rater agreement metrics obtained from expert reviewers; and leakage checks performed via manual curation to avoid overlap with common training corpora. These additions will directly address the concern without changing the reported accuracy figures. revision: yes

  2. Referee: [Generation framework section] Generation framework section: the claim that structural validity and stylistic faithfulness are distinct objectives rests on the five-level prompting results, but the manuscript does not define the scoring rubric for 'stylistically faithful,' report expert validation process, or provide statistical comparison between the two metrics; this leaves the distinction unsupported.

    Authors: We acknowledge the need for greater precision in defining and validating the stylistic faithfulness metric. The revised generation framework section will include: an explicit scoring rubric for stylistic faithfulness (covering raga-specific melodic adherence, tala alignment, and idiomatic phrasing); a description of the expert validation process involving domain specialists in Hindustani and Bengali traditions; and statistical comparisons (such as agreement rates or correlation coefficients) between structural validity and stylistic faithfulness scores. This will provide stronger empirical support for the distinction between the two objectives. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark study with no derivations or self-referential predictions

full rationale

The paper is an empirical evaluation introducing a 504-question benchmark and five-level prompting framework for assessing LLMs on South Asian classical music. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. Central claims rest on direct model accuracy measurements (85-90% for frontier models) and generation faithfulness rates rather than any chain that reduces to its own inputs by construction. The work is therefore self-contained as a benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are introduced or relied upon beyond standard assumptions in LLM benchmarking.

pith-pipeline@v0.9.1-grok · 5766 in / 1181 out tokens · 31718 ms · 2026-06-28T04:18:03.599532+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 3 linked inside Pith

  1. [1]

    In ISMIR, pages 29–34

    A comparative study of indian and west- ern music forms. In ISMIR, pages 29–34. Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, and 1 others. 2023. Mu- siclm: Generating music from text. arXiv preprint arXiv:2301.11325. Yogesh Agrawal, Aniruddha Dut...

  2. [2]

    Preprint, arXiv:2605.26322

    Omnitom: Benchmarking theory of mind in llms via explicit belief modeling . Preprint, arXiv:2605.26322. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107....

  3. [3]

    arXiv preprint arXiv:2407.21783

    The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Dirk Groeneveld, Iz Beltagy, Evan Walsh, Ak- shita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, and 1 others. 2024. Olmo: Accelerating the science of language models. In Proceedings of the 62nd annual meeting of the association for computational ...

  4. [4]

    arXiv preprint arXiv:2209.15352

    Audiogen: Textually guided audio gen- eration. arXiv preprint arXiv:2209.15352. Klaus Krippendorff. 2011. Computing krippen- dorff’s alpha-reliability. Solomon Kullback. 1951. Kullback-leibler diver- gence. Winston E. Langley. 2009. Kazi Nazrul Islam: The Voice of Poetry and the Struggle for Human Wholeness. Nazrul Institute, Dhaka, Bangladesh. Jean-Marie...

  5. [5]

    Advances in neural information processing systems, 35:3843–3857

    Solving quantitative reasoning problems with language models. Advances in neural information processing systems, 35:3843–3857. Jiajia Li, Lu Yang, Mingni Tang, Cong Chen, Zuchao Li, Ping Wang, and Hai Zhao. 2024. The music maestro or the musically challenged, a massive music evaluation benchmark for large language models . Preprint, arXiv:2406.15885. Jiaw...

  6. [6]

    Preprint, arXiv:2305.11430

    Teler: A general taxonomy of llm prompts for benchmarking complex tasks . Preprint, arXiv:2305.11430. Pasin Sawaengsawangarom, Suparoek Phongoen, and Papis Wongchaisuwat. 2025. Deep learn- ing for music genre classification: A case study of thai music. In Proceedings of the 2025 9th International Conference on Control Engineering and Artificial Intelligen...

  7. [7]

    arXiv preprint arXiv:2503.08638

    Yue: Scaling open foundation models for long-form music generation. arXiv preprint arXiv:2503.08638. Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, Shangda Wu, Tianhao Shen, Ge Zhang, Yuhang Wu, Cong Liu, Ziya Zhou, and 1 others. 2024. Chatmusician: Understanding and generating music intrinsically with llm. arXiv preprint arXiv:2402.16153. Ruibin Yuan, Yi...

  8. [8]

    [[Answer: C]]

    for music generation in ABC notation. The meta-data that was provided were lyrics, theme, time signature from the original songs. Since the official Bengali Swaralipi notation system represents pitches using solfège sylla- bles (সা, ের, গা, মা, পা, ধা, িন) relative to a tonic without specifying an absolute pitch, a mapping convention was required during th...

  9. [9]

    Ensure the melody adheres to a Indian classical music style

  10. [10]

    It should sound like a [Rabindra/Nazrul] Song

  11. [11]

    Use Scale C and [time signature] time signature

  12. [12]

    Strictly generate just notations, avoid any comments and extra line

  13. [13]

    Cover the lyrical theme: [theme], ensuring the output spans ∼2–3 minutes in play length

  14. [14]

    Audio files

    The music should feature instruments used in indian classical music. [Lyrics] Level 4 – Significant Details + User Ex- pectation Level 3 prompt + A good output should: • Be musically coherent and playable. • Clearly reflect the theme [theme]. • Maintain stylistic authenticity of Indian classical music. • Be structured in valid ABC notation syntax. Lyrics:...

  15. [15]

    Audio files

    F older structure: Inside the “Audio files” folder, there are 9 subfolders. Each sub- folder corresponds to a different large language model (LLM) and contains 20 audio files gen- erated by that model. The names of these 9 subfolders exactly match the sheet names in the Excel file

  16. [16]

    Folder 1

    Matching folders and sheets: In an- notator#.xlsx, there are 9 sheets, each named after one of the subfolders. When you work on a specific folder, make sure you are also filling out the sheet with the same name. Example: If you are annotating audios from the folder “Folder 1”, fill in the sheet named “Folder 1”

  17. [17]

    song no

    Understanding the Excel file: Each row in a sheet represents one audio file. The “song no. ” column specifies the exact file name you need to work on. The full file path is shown as: folder_name/song_name.wav. Ex- ample: “Folder 1/5.wav” means the file is in- side “Folder 1”

  18. [18]

    song no

    Annotation procedure: Find the audio file mentioned in the “song no. ” column. Read the prompt in the same row of the Excel sheet, this describes what the model was asked to generate. Listen carefully to the correspond- ing audio file. Fill in the remaining columns in that row based on the instructions given be- low. Note: Even if the lyrics are well-know...