PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 13:53 UTC · model grok-4.3
The pith
PEEM uses a nine-criteria rubric scored by an LLM to jointly evaluate prompts and responses, aligning with standard accuracy and enabling up to 11.7-point accuracy gains via feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a structured nine-axis rubric evaluated by an LLM produces scores and rationales that accurately reflect prompt and response quality, correlate strongly with conventional accuracy metrics, and can be used directly as feedback to optimize prompts in a zero-shot setting, achieving substantial improvements over baselines.
What carries the argument
The PEEM nine-axis rubric: three prompt criteria (clarity/structure, linguistic quality, fairness) and six response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), each scored 1-5 by an LLM evaluator that also produces criterion-specific rationales.
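A minimal sketch of how such a rubric and its outputs might be represented, assuming a generic judge callable that returns a score and rationale per criterion; the names, prompt wording, and interface below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a PEEM-style rubric and evaluator interface (not the paper's code).
from dataclasses import dataclass
from typing import Callable, Dict

PROMPT_AXES = ("clarity_structure", "linguistic_quality", "fairness")
RESPONSE_AXES = ("accuracy", "coherence", "relevance", "objectivity", "clarity", "conciseness")

@dataclass
class AxisResult:
    score: int       # 1-5 Likert score for one criterion
    rationale: str   # criterion-specific natural-language justification

def score_with_rubric(
    judge: Callable[[str], Dict[str, object]],  # any LLM call returning {"score", "rationale"}
    prompt: str,
    response: str,
) -> Dict[str, AxisResult]:
    """Rate every rubric axis; prompt axes see the prompt, response axes see the response."""
    results: Dict[str, AxisResult] = {}
    for axis in PROMPT_AXES + RESPONSE_AXES:
        target = prompt if axis in PROMPT_AXES else response
        instruction = (
            f"Rate the following text on the criterion '{axis}' from 1 (poor) to 5 (excellent).\n"
            f"Return JSON with fields 'score' and 'rationale'.\n\nText:\n{target}"
        )
        raw = judge(instruction)
        results[axis] = AxisResult(score=int(raw["score"]), rationale=str(raw["rationale"]))
    return results
```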
If this is right
- PEEM accuracy scores preserve model rankings across tasks with high correlation to ground truth accuracy.
- Prompts can be optimized iteratively using only PEEM feedback without any labeled data or training.
- The evaluation remains consistent when different LLMs serve as judges and stable under meaning-preserving changes to prompts.
- Semantic adversarial attacks cause measurable drops in PEEM scores while paraphrases do not.
Where Pith is reading between the lines
- This method could be applied to create more transparent prompt engineering tools for non-experts.
- Future work might test whether adding domain-specific criteria further improves optimization for specialized tasks.
- The reliance on LLM judges suggests potential for hybrid systems combining automated and human review.
Load-bearing premise
The LLM judge applies the rubric in a reliable, unbiased way that truly captures the intended quality aspects without hallucinations or systematic errors.
What would settle it
A replication study on new benchmarks where PEEM scores show low correlation with actual accuracy or where the feedback loop produces no accuracy improvement would falsify the central claims.
Figures
Original abstract
Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output (i) scalar scores on a 1-5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman rho about 0.97, Pearson r about 0.94, p < 0.001). A multi-evaluator study with four models shows consistent relative judgments (pairwise rho = 0.68-0.85), supporting evaluator-agnostic deployment. Beyond alignment, PEEM captures complementary linguistic failure modes and remains informative under prompt perturbations: prompt-quality trends track downstream accuracy under iterative rewrites, semantic adversarial manipulations induce clear score degradation, and meaning-preserving paraphrases yield high stability (robustness rate about 76.7-80.6%). Finally, using only PEEM scores and rationales as feedback, a zero-shot prompt rewriting loop improves downstream accuracy by up to 11.7 points, outperforming supervised and RL-based prompt-optimization baselines. Overall, PEEM provides a reproducible, criterion-driven protocol that links prompt formulation to response behavior and enables systematic diagnosis and optimization of LLM interactions.
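A minimal sketch of how the reported alignment between the accuracy axis and conventional accuracy could be checked, assuming per-(benchmark, model) pairs of mean PEEM accuracy-axis scores and measured accuracies have already been collected; the data values below are placeholders, not figures from the paper.

```python
# Sketch: alignment between PEEM accuracy-axis scores and conventional accuracy
# (placeholder data; one entry per benchmark-model pair).
from scipy.stats import pearsonr, spearmanr

peem_accuracy_axis = [4.1, 3.2, 4.6, 2.8, 3.9]       # mean 1-5 rubric scores
benchmark_accuracy = [0.78, 0.55, 0.86, 0.41, 0.70]  # conventional accuracy on the same pairs

rho, rho_p = spearmanr(peem_accuracy_axis, benchmark_accuracy)
r, r_p = pearsonr(peem_accuracy_axis, benchmark_accuracy)
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3g}), Pearson r = {r:.2f} (p = {r_p:.3g})")
```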
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PEEM, a unified 9-axis rubric (3 prompt criteria: clarity/structure, linguistic quality, fairness; 6 response criteria: accuracy, coherence, relevance, objectivity, clarity, conciseness) evaluated by an LLM judge that outputs 1-5 Likert scores plus criterion-specific rationales. Across 7 benchmarks and 5 task models, the accuracy axis aligns with conventional accuracy (aggregate Spearman ρ ≈ 0.97, Pearson r ≈ 0.94, p < 0.001) while preserving rankings; a multi-evaluator study shows LLM-LLM consistency (pairwise ρ = 0.68-0.85); prompt perturbations and semantic attacks produce expected score changes; and a zero-shot rewriting loop using only PEEM scores/rationales as feedback improves downstream accuracy by up to 11.7 points, outperforming supervised and RL baselines.
Significance. If the LLM judge produces faithful, unbiased scores and rationales, PEEM supplies an interpretable, criterion-driven protocol that links prompt formulation to response behavior and supports systematic diagnosis plus optimization. The reported accuracy-axis alignment and the rewriting-loop gains constitute concrete, reproducible evidence of utility beyond black-box metrics. The multi-evaluator consistency further supports evaluator-agnostic use. These strengths would make the work a useful contribution to prompt-engineering methodology provided the human-grounding gap is addressed.
major comments (2)
- [Multi-Evaluator Study] The reported pairwise Spearman ρ = 0.68-0.85 establishes only LLM-LLM agreement. No human inter-annotator agreement, correlation with human judgments, or other validation is provided for the eight non-accuracy axes. Because the rewriting-loop claim (up to +11.7 accuracy points) rests on the rationales supplying genuine quality signals, the absence of human grounding is load-bearing and must be remedied before the optimization results can be accepted at face value.
- [Prompt Rewriting Experiment] The zero-shot loop that feeds PEEM scores and rationales back to the model is presented as outperforming supervised and RL baselines, yet the manuscript supplies no details on the exact evaluator prompt template, temperature settings, or controls for judge-model bias (e.g., whether the same model family is used for both evaluation and rewriting). Without these, it remains possible that observed gains are artifacts of self-reinforcement rather than external prompt-quality information.
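To make the self-reinforcement concern concrete, here is a minimal sketch of a zero-shot rewriting loop in which the judge and the rewriter are distinct callables; the function names, stopping rule, and round budget are assumptions rather than the paper's protocol.

```python
# Sketch of a zero-shot prompt-rewriting loop with an explicitly separate judge
# and rewriter (illustrative; not the paper's protocol).
from typing import Callable, Dict, Tuple

Scores = Dict[str, int]       # per-axis 1-5 scores
Rationales = Dict[str, str]   # per-axis natural-language feedback

def rewrite_loop(
    prompt: str,
    evaluate: Callable[[str], Tuple[Scores, Rationales]],  # judge model (different family)
    rewrite: Callable[[str, Scores, Rationales], str],     # rewriter model
    max_rounds: int = 5,
) -> str:
    candidate = prompt
    best_prompt, best_mean = prompt, float("-inf")
    for _ in range(max_rounds):
        scores, rationales = evaluate(candidate)
        mean_score = sum(scores.values()) / len(scores)
        if mean_score <= best_mean:   # stop once rubric feedback no longer improves
            break
        best_prompt, best_mean = candidate, mean_score
        candidate = rewrite(candidate, scores, rationales)
    return best_prompt
```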
minor comments (2)
- [Robustness Analysis] The robustness rate (76.7-80.6%) under meaning-preserving paraphrases is cited in the abstract, but its exact definition, formula, and per-axis breakdown are not stated in the main text; this should be added for reproducibility (one plausible form is sketched after this list).
- [Methods] The paper would benefit from an explicit table or appendix listing the precise system prompt and few-shot examples (if any) used by the PEEM evaluator LLM, together with the exact Likert-scale anchoring instructions.
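Because the paper does not state the robustness-rate formula, the sketch below encodes only one plausible reading: the fraction of meaning-preserving paraphrases whose per-axis PEEM scores stay within a small tolerance of the original prompt's scores. The definition and tolerance are hypothetical.

```python
# Hypothetical robustness-rate computation; the paper's exact formula is not stated.
# A paraphrase counts as stable if every axis score moves by at most `tol` points.
from typing import Dict, List

def robustness_rate(
    original_scores: Dict[str, int],
    paraphrase_scores: List[Dict[str, int]],
    tol: int = 1,
) -> float:
    if not paraphrase_scores:
        return 0.0
    stable = sum(
        1
        for para in paraphrase_scores
        if all(abs(para[axis] - original_scores[axis]) <= tol for axis in original_scores)
    )
    return stable / len(paraphrase_scores)
```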
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: Multi-Evaluator Study: The reported pairwise Spearman ρ = 0.68-0.85 establishes only LLM-LLM agreement. No human inter-annotator agreement, correlation with human judgments, or validation is provided for the eight non-accuracy axes. Because the rewriting-loop claim (up to +11.7 accuracy points) rests on the rationales supplying genuine quality signals, the absence of human grounding is load-bearing and must be remedied before the optimization results can be accepted at face value.
Authors: We agree that the current multi-evaluator study demonstrates only LLM-LLM consistency and that direct human validation for the non-accuracy axes is absent. While the accuracy axis is grounded via strong alignment with conventional (human-annotated) accuracy, the other axes rely on rubric-based LLM judgments without explicit human correlation. In the revised manuscript we will add a human annotation study on a representative subset of examples across the benchmarks, reporting inter-annotator agreement and Pearson/Spearman correlations between PEEM scores and human judgments for all nine axes. This will also provide direct evidence that the rationales used in the rewriting loop convey genuine quality signals. Revision: yes.
Referee: Prompt Rewriting Experiment: The zero-shot loop that feeds PEEM scores and rationales back to the model is presented as outperforming supervised and RL baselines, yet the manuscript supplies no details on the exact evaluator prompt template, temperature settings, or controls for judge-model bias (e.g., whether the same model family is used for both evaluation and rewriting). Without these, it remains possible that observed gains are artifacts of self-reinforcement rather than external prompt-quality information.
Authors: We acknowledge that the original manuscript omitted key implementation details. In the revision we will include the complete evaluator prompt template in the appendix, state that all evaluations used temperature 0 for determinism, and clarify the model separation: the judge model belongs to a different family from both the task models and the rewriting model. We will also add an ablation that repeats the rewriting loop with an alternative judge model to confirm that gains are not due to self-reinforcement. Revision: yes.
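As a rough sketch of how the human study promised in the first response could be scored, assuming two annotators rate the same items on the 1-5 scale for a given axis: quadratic-weighted kappa for inter-annotator agreement and Spearman correlation against PEEM scores are one reasonable choice, not the authors' stated plan.

```python
# Sketch: inter-annotator agreement and PEEM-human correlation for a single axis
# (placeholder ratings; not results from the paper).
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

annotator_a = [4, 3, 5, 2, 4, 3]   # 1-5 ratings from annotator A
annotator_b = [4, 2, 5, 2, 3, 3]   # 1-5 ratings from annotator B
peem_scores = [4, 3, 5, 1, 4, 3]   # PEEM scores for the same items

kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
human_mean = [(a + b) / 2 for a, b in zip(annotator_a, annotator_b)]
rho, p = spearmanr(peem_scores, human_mean)
print(f"Weighted kappa = {kappa:.2f}; Spearman rho vs. human mean = {rho:.2f} (p = {p:.3g})")
```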
Circularity Check
No significant circularity; claims rest on external benchmarks
Full rationale
The paper defines the 9-axis rubric independently and validates the accuracy axis through direct correlation with conventional accuracy metrics across 7 benchmarks (Spearman rho ~0.97). The zero-shot rewriting loop uses PEEM scores/rationales only as input feedback; success is measured by independent downstream accuracy gains (up to +11.7 points) on the same benchmarks, not by re-evaluating with PEEM itself. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. LLM-LLM consistency (rho 0.68-0.85) is reported as supporting evidence but is not the foundation of the accuracy alignment or rewriting results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM-based evaluators can produce reliable, rubric-grounded scores and rationales for both prompts and responses.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output scalar scores on a 1-5 Likert scale and criterion-specific natural-language rationales."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman ρ ≈ 0.97, Pearson r ≈ 0.94, p < 0.001)."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.