From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
Pith reviewed 2026-05-10 14:03 UTC · model grok-4.3
The pith
Vibe-testing LLMs involves personalizing both test prompts and judgment criteria, and formalizing this process can change which model is preferred in coding benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Based on survey data and in-the-wild reports, vibe-testing is formalized as a two-part process where users personalize the prompts used to test models and the subjective criteria for judging responses. A proof-of-concept evaluation pipeline implements this by generating personalized prompts and comparing outputs with user-aware criteria. On coding benchmarks, this combination changes which model is preferred, indicating that vibe-testing plays a significant role in real-world model selection beyond standard benchmarks.
What carries the argument
The two-part formalization of vibe-testing, consisting of personalized prompt generation for testing and user-aware subjective criteria for evaluation.
If this is right
- Combining personalized prompts and user-aware evaluation alters model preferences on coding benchmarks.
- This suggests formalized vibe-testing can bridge the gap between benchmark scores and real-world usefulness.
- The approach supports systematic and reproducible analysis of informal evaluations.
- User workflows can be better reflected in model comparisons through personalization.
Where Pith is reading between the lines
- Evaluation frameworks for LLMs could incorporate user personalization to better predict practical adoption.
- Further work might apply this to other domains beyond coding to test generalizability.
- Model developers might adjust training based on common patterns identified through formalized vibe-tests.
Load-bearing premise
The two-part formalization of personalizing what to test and how to judge accurately represents the essential elements of real-world vibe-testing.
What would settle it
Applying the personalized prompt and user-aware evaluation pipeline to the same coding benchmarks and finding no change in which model is preferred would challenge the claim that this approach captures the practical impact of vibe-testing.
Figures
read the original abstract
Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on ``vibe-testing'': informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes user 'vibe-testing' of LLMs via a survey of evaluation practices and a corpus of in-the-wild model comparison reports. It formalizes vibe-testing as a two-part process in which users personalize both the choice of test cases and the subjective criteria for judging outputs. A proof-of-concept pipeline is introduced that generates personalized prompts and applies user-aware criteria via an LLM judge. Experiments on coding benchmarks show that this combination can alter which model is preferred relative to standard evaluation, suggesting the formalization captures aspects of real-world usage.
Significance. If the two-part formalization is shown to be faithful to user behavior and the observed preference shifts prove robust, the work could help bridge the gap between aggregate benchmarks and individualized user experience. The empirical resources (survey + report corpus) and the explicit pipeline constitute reusable contributions that other researchers could extend or stress-test.
major comments (2)
- [Section 5] Section 5 (experimental evaluation): the claim that personalized prompts plus user-aware criteria change model preference is load-bearing for the central thesis, yet the manuscript supplies no survey sample size, no statistical significance tests, no effect-size reporting, and no controls (e.g., comparison against non-personalized baselines or blinded human validation of the LLM judge outputs). Without these, it is impossible to determine whether the observed shift reflects genuine vibe-testing or prompt-engineering artifacts.
- [Section 4] Section 4 (formalization): the two-part model (personalize what to test + how to judge) is derived from the survey and reports, but the paper provides no independent validation that the operationalized user-aware subjective criteria match the criteria real users actually employ. A human-subject study comparing the pipeline's judgments against users' own ratings on the same tasks would be required to support the claim that the formalization accurately captures vibe-testing.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one quantitative detail (e.g., survey N or number of reports analyzed) so readers can immediately gauge the scale of the empirical grounding.
- [Section 4.2] Notation for the user-aware criteria (e.g., how subjectivity is encoded in the judge prompt) should be made fully explicit, perhaps with a short pseudocode listing in Section 4.2.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key areas where the empirical support for our claims can be strengthened. We respond to each major comment below, indicating planned revisions where feasible.
read point-by-point responses
-
Referee: [Section 5] Section 5 (experimental evaluation): the claim that personalized prompts plus user-aware criteria change model preference is load-bearing for the central thesis, yet the manuscript supplies no survey sample size, no statistical significance tests, no effect-size reporting, and no controls (e.g., comparison against non-personalized baselines or blinded human validation of the LLM judge outputs). Without these, it is impossible to determine whether the observed shift reflects genuine vibe-testing or prompt-engineering artifacts.
Authors: We agree that the experimental evaluation in Section 5 requires more rigorous statistical reporting to substantiate the observed preference shifts. We will revise the section to explicitly include the survey sample size and methodology details (currently presented in Section 3), add statistical significance tests for the model ranking changes, report effect sizes, and incorporate controls such as comparisons to non-personalized prompt baselines. A full blinded human validation of the LLM judge outputs would require new data collection and is not feasible in this revision; we will instead note this as a limitation and outline plans for future validation. These changes will be incorporated into the revised manuscript. revision: partial
-
Referee: [Section 4] Section 4 (formalization): the two-part model (personalize what to test + how to judge) is derived from the survey and reports, but the paper provides no independent validation that the operationalized user-aware subjective criteria match the criteria real users actually employ. A human-subject study comparing the pipeline's judgments against users' own ratings on the same tasks would be required to support the claim that the formalization accurately captures vibe-testing.
Authors: The two-part formalization is derived from the empirical analysis of the survey and in-the-wild report corpus, as described in Sections 3 and 4. We acknowledge that an independent human-subject study would offer stronger confirmation that the operationalized criteria align with real user judgments. However, conducting such a study involves substantial new participant recruitment and experimental design that exceeds the scope of the current proof-of-concept work. We will revise the manuscript to expand the discussion of this limitation in Section 4 and the conclusions, while clarifying the grounding of the formalization in the collected resources. revision: no
- Conducting a dedicated human-subject study to independently validate the user-aware subjective criteria by comparing pipeline judgments against real users' own ratings on the same tasks
Circularity Check
No circularity: formalization from external survey/reports; experiments are independent proof-of-concept
full rationale
The paper derives its two-part formalization (personalizing what to test and how to judge) directly from analysis of an external user survey and in-the-wild model comparison reports, then implements a proof-of-concept pipeline whose outputs are evaluated empirically on coding benchmarks. No equations, fitted parameters, or self-citations reduce the observed preference shifts to inputs by construction; the results are presented as empirical findings rather than tautological restatements. The derivation chain is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Vibe-testing consists of personalizing both the choice of test cases and the judgment criteria
Reference graph
Works this paper leans on
-
[1]
Lmunit: Fine-grained evaluation with natural language unit tests, 2024
URLhttps://api.semanticscholar.org/CorpusId:269804692. Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaˇs, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Alicia Parrish, Hannah Rose Kirk, et al. Dataperf: Benchmarks for data-centric ai development.Advances in Neural Information Processing Systems, 36:5320–5347, 2023. Meta. Llama 3.3 mod...
-
[2]
No explanations, no comments, no markdown fences
Output must be a single valid JSON object. No explanations, no comments, no markdown fences
-
[3]
Root object has exactly one key:"changes by field". . . . Now, produce the JSON for all fields of the given user profile in exactly this format. Verify the output is a valid JSON object! Figure 15:Change-identification prompt.To operationalize a persona profile into actionable prompt edits, the LLM proposes 2–3 concrete modification options for a fixed se...
-
[4]
Carefully read the original prompt and the list of changes
-
[5]
Rewrite the prompt to apply all changes cohesively
-
[6]
The new prompt must lead to the same solution
DO NOT alter the core requirements of the programming task. The new prompt must lead to the same solution
-
[8]
Your output MUST only contain the prompt text and nothing else. Figure 16:Personalized prompt composition.Given an original benchmark prompt and the selected modifications, the LLM generates a personalized version that preserves the underlying programming task. The prompt is written in the persona voice (first person), avoids explicit references to the pr...
-
[9]
The length should not be longer than 2–4 short sentences
-
[10]
Your output MUST only contain the prefix text and nothing else. Figure 17:Prompt for HumanEval+ prefix composition.For HumanEval+ style prompts that include code context and docstrings, only a short persona prefix is produced and concatenated to the original prompt, avoiding perturbation of code formatting while still injecting persona-relevant framing. P...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.