From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Eliya Habba; Gabriel Stanovsky; Itay Itzhak; Yonatan Belinkov

arxiv: 2604.14137 · v2 · submitted 2026-04-15 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-10 14:03 UTC · model grok-4.3

keywords: vibe-testing, LLM evaluation, personalized prompts, user-aware criteria, coding benchmarks, informal evaluation, model comparison, benchmarking

The pith

Vibe-testing LLMs involves personalizing both test prompts and judgment criteria, and formalizing this process can change which model is preferred in coding benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how users evaluate large language models through informal vibe-testing rather than standard benchmarks alone. Analysis of a user survey and real-world model comparison reports shows that this practice consists of personalizing both the tasks assigned to models and the criteria used to assess their outputs. The authors introduce a pipeline that generates such personalized prompts and applies user-aware subjective judgment to compare models. Experiments on coding benchmarks demonstrate that this combination can shift which model ranks highest. The work aims to turn ad hoc user experiences into a more structured, analyzable form that better reflects practical usefulness.

Core claim

Based on survey data and in-the-wild reports, vibe-testing is formalized as a two-part process where users personalize the prompts used to test models and the subjective criteria for judging responses. A proof-of-concept evaluation pipeline implements this by generating personalized prompts and comparing outputs with user-aware criteria. On coding benchmarks, this combination changes which model is preferred, indicating that vibe-testing plays a significant role in real-world model selection beyond standard benchmarks.

What carries the argument

The two-part formalization of vibe-testing, consisting of personalized prompt generation for testing and user-aware subjective criteria for evaluation.
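A minimal Python sketch of this two-part structure, under stated assumptions: `UserProfile`, `personalize_prompt`, `compare`, and `toy_score` are hypothetical names chosen for illustration, not the paper's code. In the paper both steps are performed by LLMs; here a simple rule-based function stands in for the judge so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical profile holding the two halves of a vibe-test."""
    input_prefs: tuple[str, ...]   # what to test: how this user frames tasks
    output_prefs: tuple[str, ...]  # how to judge: criteria this user cares about


def personalize_prompt(task: str, profile: UserProfile) -> str:
    """Prepend the user's framing; the task text itself is left unchanged."""
    framing = " ".join(profile.input_prefs)
    return f"{framing}\n\nTask: {task}" if framing else f"Task: {task}"


def compare(resp_a: str, resp_b: str, profile: UserProfile,
            score: Callable[[str, str], float]) -> dict[str, str]:
    """Return a per-criterion verdict: 'A', 'B', or 'tie'."""
    verdicts = {}
    for criterion in profile.output_prefs:
        a, b = score(resp_a, criterion), score(resp_b, criterion)
        verdicts[criterion] = "A" if a > b else "B" if b > a else "tie"
    return verdicts


def toy_score(response: str, criterion: str) -> float:
    """Stand-in for an LLM judge (assumption): crude keyword and length rules."""
    if criterion == "concise":
        return -float(len(response))
    if criterion == "explains reasoning":
        return float("because" in response.lower())
    return 0.0


beginner = UserProfile(
    input_prefs=("I am new to Python.",),
    output_prefs=("concise", "explains reasoning"),
)
prompt = personalize_prompt("Sum a list of integers.", beginner)
verdicts = compare("Use sum(xs).", "Use sum(xs) because it is built in.",
                   beginner, toy_score)
```

Keeping the scorer as a parameter means the same comparison loop could be driven by an LLM judge or by human ratings without changing the rest of the sketch.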

If this is right

Combining personalized prompts and user-aware evaluation alters model preferences on coding benchmarks.
This suggests formalized vibe-testing can bridge the gap between benchmark scores and real-world usefulness.
The approach supports systematic and reproducible analysis of informal evaluations.
User workflows can be better reflected in model comparisons through personalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluation frameworks for LLMs could incorporate user personalization to better predict practical adoption.
Further work might apply this to other domains beyond coding to test generalizability.
Model developers might adjust training based on common patterns identified through formalized vibe-tests.

Load-bearing premise

The two-part formalization of personalizing what to test and how to judge accurately represents the essential elements of real-world vibe-testing.

What would settle it

Applying the personalized prompt and user-aware evaluation pipeline to the same coding benchmarks and finding no change in which model is preferred would challenge the claim that this approach captures the practical impact of vibe-testing.
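This check can be sketched in Python as a comparison of per-dimension win rates under the two prompt forms. The verdict format, the function names, and the choice to ignore ties are assumptions made for illustration; the paper's own aggregation may differ.

```python
from collections import defaultdict

# A verdict is (dimension, winner), where winner is "A", "B", or "tie".
Verdict = tuple[str, str]


def win_rates(verdicts: list[Verdict]) -> dict[str, float]:
    """Share of decided (non-tie) comparisons won by model A, per dimension."""
    wins: dict[str, int] = defaultdict(int)
    decided: dict[str, int] = defaultdict(int)
    for dimension, winner in verdicts:
        if winner == "tie":
            continue  # assumption: ties carry no preference signal
        decided[dimension] += 1
        wins[dimension] += winner == "A"
    return {d: wins[d] / decided[d] for d in decided}


def flipped_dimensions(original: list[Verdict],
                       personalized: list[Verdict]) -> list[str]:
    """Dimensions where the preferred model differs between the two prompt forms."""
    before, after = win_rates(original), win_rates(personalized)
    shared = before.keys() & after.keys()
    # A flip means the win rate crosses 0.5 from one side to the other.
    return sorted(d for d in shared
                  if (before[d] - 0.5) * (after[d] - 0.5) < 0)


original = [("clarity", "A"), ("clarity", "A"), ("clarity", "B"),
            ("correctness", "A")]
personalized = [("clarity", "B"), ("clarity", "B"), ("clarity", "A"),
                ("correctness", "A")]
flips = flipped_dimensions(original, personalized)
```

An empty result on the same benchmarks would correspond to the outcome described above, in which personalization leaves the preferred model unchanged.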

Figures

Figures reproduced from arXiv: 2604.14137 by Eliya Habba, Gabriel Stanovsky, Itay Itzhak, Yonatan Belinkov.

**Figure 1.** Anatomy of a “vibe-test”. In practice, users evaluate LLMs by “vibe-testing” them: writing personalized prompts that test specific behaviors and judging models’ responses using personal subjective criteria. We analyze recurring patterns of vibe-testing in real-world user comparisons, formalize them into a two-part structure, and present a proof-of-concept pipeline for automated vibe-testing. Example take…

**Figure 2.** Benchmarks vs. vibe-testing in practice. Left: What benchmarks miss. Survey participants selected real-world qualities that benchmarks fail to capture (multi-select), including workflow and style fit, handling ambiguity, stability, clarity, and trust. Right: How users test models. Common strategies include trying tasks from one’s own workflow, side-by-side comparisons, probing style, stress-testing ambigui…

**Figure 3.** Automatic Vibe-Testing Pipeline: Given a user description, the pipeline (A) constructs a user profile P (composed of input preferences P_in and output preferences P_out), (B) rewrites benchmark samples into personalized prompts aligned with P_in, and (C) compares responses using P_out to produce per-dimension head-to-head model comparisons.

**Figure 4.** Personalization changes model preferences. Head-to-head win rates for GPT-5.1 vs. GPT-OSS-20B on MBPP+, broken down by dimensions. Left: original benchmark prompts. Right: persona-specific rewrites averaged over four personas. Several dimensions favor different models depending on the prompt form, showing that benchmark prompts can mask user-relevant differences beyond correctness. Pairwise judging and agg…

**Figure 5.** Example of expertise-level personalized prompts generated by GPT-5.1 for an MBPP+ coding task. Given a single original problem statement, the pipeline produces personalized prompts for the user profiles: Beginner, Intermediate, AI Researcher, and Advanced. Each of them reflects different assumptions about prior knowledge, desired explanation depth, and code style preferences.

**Figure 6.** Example of expertise-level personalized prompts generated by Qwen3-32B for an MBPP+ coding task. Given a single original problem statement, the pipeline produces personalized prompts for the user profiles: Beginner, Intermediate, AI Researcher, and Advanced. Each reflects different assumptions about prior knowledge, desired explanation depth, and code style preferences.

**Figure 7.** Personalization changes model preferences for GPT-5.1 vs. GPT-4o. Head-to-head win rates on MBPP+, broken down by dimensions. Left: original benchmark prompts. Right: persona-specific rewrites averaged over four personas. Several dimensions favor different models depending on the prompt form, showing that benchmark prompts can mask user-relevant differences beyond correctness.

**Figure 8.** Personalization changes model preferences for Gemini-3-Pro vs. Gemma-3-4B. Head-to-head win rates on MBPP+, broken down by dimensions. Left: original benchmark prompts. Right: persona-specific rewrites averaged over four personas. Several dimensions favor different models depending on the prompt form, showing that benchmark prompts can mask user-relevant differences beyond correctness.

**Figure 9.** Personalization changes model preferences for Qwen3-32B vs. Qwen3-14B. Head-to-head win rates on MBPP+, broken down by dimensions. Left: original benchmark prompts. Right: persona-specific rewrites averaged over four personas. Several dimensions favor different models depending on the prompt form, showing that benchmark prompts can mask user-relevant differences beyond correctness.

**Figure 10.** Full preamble and informed-consent text shown to survey participants before the …

**Figure 11.** LLM prompt used for vibe-test extraction and labeling from YouTube transcripts and Reddit threads. Minor formatting adjustments were made between YouTube and Reddit to reflect available metadata.

**Figure 12.** LLM prompt used for dimension-based re-annotation. Each vibe-test instance is re-annotated with the fixed dimension set.

**Figure 13.** LLM prompt used for the final consistency check and gap analysis. Provided together with the current definitions and the consolidated JSON as inputs.

**Figure 14.** Persona-parsing prompt. Given a short natural-language user description, the LLM produces a structured JSON profile describing input and output preferences. The model must output a single JSON object and nothing else.

**Figure 15.** Change-identification prompt. To operationalize a persona profile into actionable prompt edits, the LLM proposes 2–3 concrete modification options for a fixed set of fields, while explicitly disallowing changes that alter the task itself. This stage outputs a single JSON object with a list of changes keyed by profile fields.

**Figure 16.** Personalized prompt composition. Given an original benchmark prompt and the selected modifications, the LLM generates a personalized version that preserves the underlying programming task. The prompt is written in the persona voice (first person), avoids explicit references to the profile schema, and is constrained to a short length.

**Figure 17.** Prompt for HumanEval+ prefix composition. For HumanEval+ style prompts that include code context and docstrings, only a short persona prefix is produced and concatenated to the original prompt, avoiding perturbation of code formatting while still injecting persona-relevant framing.

**Figure 18.** Semantic-preservation verifier prompt. To ensure personalized prompts remain faithful to the original benchmark intent, the verifier checks (i) whether the end goal is identical and (ii) whether the ground-truth solution set is preserved. The verifier returns two booleans and an error string if either check fails.
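The Figure 18 caption describes a verifier that returns two booleans and an error string. A hedged Python sketch of how such output might be parsed and used to filter rewrites follows. The JSON key names (`same_end_goal`, `same_solution_set`, `error`) and the function names are hypothetical; the captions do not give the actual schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifierResult:
    same_end_goal: bool      # check (i): is the end goal identical?
    same_solution_set: bool  # check (ii): is the ground-truth solution set preserved?
    error: str               # explanation when a check fails, else ""


def parse_verifier_output(raw: str) -> VerifierResult:
    """Parse the verifier's JSON reply, rejecting anything off-schema."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("verifier output must be a JSON object")
    goal = data.get("same_end_goal")
    solutions = data.get("same_solution_set")
    if not isinstance(goal, bool) or not isinstance(solutions, bool):
        raise ValueError("both checks must be JSON booleans")
    error = data.get("error", "")
    if error is None:
        error = ""
    if not isinstance(error, str):
        raise ValueError("error must be a string")
    return VerifierResult(goal, solutions, error)


def keep_rewrite(result: VerifierResult) -> bool:
    """Keep a personalized prompt only if both preservation checks pass."""
    return result.same_end_goal and result.same_solution_set


reply = ('{"same_end_goal": true, "same_solution_set": false, '
         '"error": "changed return type"}')
result = parse_verifier_output(reply)
```

Strict type checks matter here because an LLM may emit `"true"` as a string, which would otherwise be treated as a passing check.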

Original abstract

Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on “vibe-testing”: informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper formalizes vibe-testing into personalized prompts and criteria, then shows it can flip model rankings on coding benchmarks, but the experiments stay too high-level to pin down the effect.

The main takeaway is that this work turns an informal user habit into a two-part structure (personalizing the test cases and the judgment rules), then runs a pipeline that actually changes which model wins on coding benchmarks. That preference shift is the concrete result they highlight. They ground the formalization in a survey of user practices plus scraped comparison reports from blogs and social media, which gives the idea some empirical footing instead of pure speculation. The approach does a clean job of showing why standard benchmarks miss workflow-specific judgments that matter to real users. The pipeline itself is simple enough to replicate as a proof of concept.

The soft spots sit in the execution details. The abstract gives no sample size for the survey, no description of how the user-aware criteria were extracted or validated against actual human judgments, and no stats on the size or significance of the preference change. Without controls for prompt artifacts or a check that the automated criteria track real user preferences, the shift could be driven more by implementation choices than by the formalization itself.

This is aimed at people working on LLM evaluation who want to move past fixed leaderboards toward something more user-contextual. A reader looking for a new framing or ideas for user-aware benchmarks will get value from it. It deserves a serious referee because the direction is practical and the core claim is falsifiable with more data, even though the current version needs tighter methods and validation to stand on its own. I'd send it to review and ask for the missing survey details, effect sizes, and a human validation step on the criteria.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes user 'vibe-testing' of LLMs via a survey of evaluation practices and a corpus of in-the-wild model comparison reports. It formalizes vibe-testing as a two-part process in which users personalize both the choice of test cases and the subjective criteria for judging outputs. A proof-of-concept pipeline is introduced that generates personalized prompts and applies user-aware criteria via an LLM judge. Experiments on coding benchmarks show that this combination can alter which model is preferred relative to standard evaluation, suggesting the formalization captures aspects of real-world usage.

Significance. If the two-part formalization is shown to be faithful to user behavior and the observed preference shifts prove robust, the work could help bridge the gap between aggregate benchmarks and individualized user experience. The empirical resources (survey + report corpus) and the explicit pipeline constitute reusable contributions that other researchers could extend or stress-test.

major comments (2)

[Section 5] Section 5 (experimental evaluation): the claim that personalized prompts plus user-aware criteria change model preference is load-bearing for the central thesis, yet the manuscript supplies no survey sample size, no statistical significance tests, no effect-size reporting, and no controls (e.g., comparison against non-personalized baselines or blinded human validation of the LLM judge outputs). Without these, it is impossible to determine whether the observed shift reflects genuine vibe-testing or prompt-engineering artifacts.
[Section 4] Section 4 (formalization): the two-part model (personalize what to test + how to judge) is derived from the survey and reports, but the paper provides no independent validation that the operationalized user-aware subjective criteria match the criteria real users actually employ. A human-subject study comparing the pipeline's judgments against users' own ratings on the same tasks would be required to support the claim that the formalization accurately captures vibe-testing.
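One simple form the requested significance test could take is an exact two-sided sign test on the decided (non-tie) head-to-head verdicts, sketched below in Python. This is an illustrative choice, not a test the paper reports, and `sign_test_p_value` is a hypothetical helper name. It asks whether one model's share of wins differs from a coin flip.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial test of H0: each decided pair is a 50/50 coin flip."""
    if wins_a < 0 or wins_b < 0:
        raise ValueError("win counts must be non-negative")
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decided comparisons, so no evidence either way
    k = max(wins_a, wins_b)
    # Probability of a result at least as lopsided as k wins out of n, one tail.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def win_rate_effect(wins_a: int, wins_b: int) -> float:
    """Simple effect size: model A's win rate minus the 0.5 chance level."""
    n = wins_a + wins_b
    return wins_a / n - 0.5 if n else 0.0


p_lopsided = sign_test_p_value(8, 2)
p_balanced = sign_test_p_value(5, 5)
```

Testing whether the preference shifts between original and personalized prompts would need a paired design over the same benchmark items; the sign test above only covers preference within a single condition.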

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one quantitative detail (e.g., survey N or number of reports analyzed) so readers can immediately gauge the scale of the empirical grounding.
[Section 4.2] Notation for the user-aware criteria (e.g., how subjectivity is encoded in the judge prompt) should be made fully explicit, perhaps with a short pseudocode listing in Section 4.2.
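As an illustration of the kind of listing the referee asks for, the Python sketch below shows one way user-specific criteria could be written into a pairwise judge prompt. The wording of the prompt and the name `build_judge_prompt` are assumptions; this is not the paper's judge prompt, which is not reproduced on this page.

```python
def build_judge_prompt(task: str, resp_a: str, resp_b: str,
                       output_prefs: list[str]) -> str:
    """Assemble a pairwise judge prompt that lists the user's own criteria."""
    if not output_prefs:
        raise ValueError("at least one user criterion is required")
    # Number the criteria so the judge can answer per criterion.
    criteria = "\n".join(f"{i}. {pref}"
                         for i, pref in enumerate(output_prefs, start=1))
    return (
        "You are judging two answers on behalf of a specific user.\n"
        f"The user cares about:\n{criteria}\n\n"
        f"Task:\n{task}\n\n"
        f"Response A:\n{resp_a}\n\n"
        f"Response B:\n{resp_b}\n\n"
        "For each numbered criterion, answer with A, B, or tie."
    )


judge_prompt = build_judge_prompt(
    task="Sum a list of integers.",
    resp_a="Use sum(xs).",
    resp_b="def total(xs: list[int]) -> int: return sum(xs)",
    output_prefs=["short answers", "type hints"],
)
```

Listing criteria explicitly and asking for one verdict per criterion is what makes per-dimension win rates possible downstream.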

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which identify key areas where the empirical support for our claims can be strengthened. We respond to each major comment below, indicating planned revisions where feasible.

Referee: [Section 5] Section 5 (experimental evaluation): the claim that personalized prompts plus user-aware criteria change model preference is load-bearing for the central thesis, yet the manuscript supplies no survey sample size, no statistical significance tests, no effect-size reporting, and no controls (e.g., comparison against non-personalized baselines or blinded human validation of the LLM judge outputs). Without these, it is impossible to determine whether the observed shift reflects genuine vibe-testing or prompt-engineering artifacts.

Authors: We agree that the experimental evaluation in Section 5 requires more rigorous statistical reporting to substantiate the observed preference shifts. We will revise the section to explicitly include the survey sample size and methodology details (currently presented in Section 3), add statistical significance tests for the model ranking changes, report effect sizes, and incorporate controls such as comparisons to non-personalized prompt baselines. A full blinded human validation of the LLM judge outputs would require new data collection and is not feasible in this revision; we will instead note this as a limitation and outline plans for future validation. These changes will be incorporated into the revised manuscript.

revision: partial

Referee: [Section 4] Section 4 (formalization): the two-part model (personalize what to test + how to judge) is derived from the survey and reports, but the paper provides no independent validation that the operationalized user-aware subjective criteria match the criteria real users actually employ. A human-subject study comparing the pipeline's judgments against users' own ratings on the same tasks would be required to support the claim that the formalization accurately captures vibe-testing.

Authors: The two-part formalization is derived from the empirical analysis of the survey and in-the-wild report corpus, as described in Sections 3 and 4. We acknowledge that an independent human-subject study would offer stronger confirmation that the operationalized criteria align with real user judgments. However, conducting such a study involves substantial new participant recruitment and experimental design that exceeds the scope of the current proof-of-concept work. We will revise the manuscript to expand the discussion of this limitation in Section 4 and the conclusions, while clarifying the grounding of the formalization in the collected resources.

revision: no

standing simulated objections not resolved

Conducting a dedicated human-subject study to independently validate the user-aware subjective criteria by comparing pipeline judgments against real users' own ratings on the same tasks
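If such a study were run, agreement between the pipeline's verdicts and users' own verdicts on the same items could be summarized with Cohen's kappa, which corrects raw agreement for chance. The Python sketch below is a generic implementation under the assumption that both sides produce one label per item ("A", "B", or "tie"); it is not part of the paper.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(judge_counts[label] * human_counts[label]
                   for label in judge_counts) / n ** 2
    if expected == 1.0:
        return 1.0  # convention: both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


pipeline_verdicts = ["A", "A", "B", "B"]
user_verdicts = ["A", "A", "B", "A"]
kappa = cohens_kappa(pipeline_verdicts, user_verdicts)
```

In this toy example raw agreement is 0.75, but because half of that agreement would be expected by chance, kappa comes out at 0.5.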

Circularity Check

0 steps flagged

No circularity: formalization from external survey/reports; experiments are independent proof-of-concept

The paper derives its two-part formalization (personalizing what to test and how to judge) directly from analysis of an external user survey and in-the-wild model comparison reports, then implements a proof-of-concept pipeline whose outputs are evaluated empirically on coding benchmarks. No equations, fitted parameters, or self-citations reduce the observed preference shifts to inputs by construction; the results are presented as empirical findings rather than tautological restatements. The derivation chain is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical observations from a survey and in-the-wild reports plus the assumption that the derived two-part model can be operationalized faithfully in a pipeline; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: Vibe-testing consists of personalizing both the choice of test cases and the judgment criteria.
Extracted from analysis of survey responses and model comparison reports as the basis for formalization.

pith-pipeline@v0.9.0 · 5524 in / 1245 out tokens · 45314 ms · 2026-05-10T14:03:45.680817+00:00 · methodology
