Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting

Meincke, L · 2025 · arXiv 2506.07142

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents

cs.CR · 2026-01-26 · unverdicted · novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.

Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline

cs.IR · 2026-05-22 · unverdicted · novelty 6.0

Paraphrase Jaccard similarity of 0.135-0.288 falls below the 0.50-0.61 same-prompt rerun baseline on OpenAI and Anthropic models, showing prompt wording dominates buyer intent in commercial recommendations.

The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure

cs.CL · 2026-04-03 · accept · novelty 5.0

PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt concepts.

Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models

cs.CL · 2026-03-18 · unverdicted · novelty 3.0

Zero-shot prompting reaches 59% accuracy at moderate temperatures while chain-of-thought prompting excels at temperature extremes on Olympiad-level math problems, with extended reasoning gains scaling to 14.3x at high temperature.

citing papers explorer

GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents · cs.CR · 2026-01-26 · unverdicted · none · ref 121
Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline · cs.IR · 2026-05-22 · unverdicted · none · ref 12
The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure · cs.CL · 2026-04-03 · accept · none · ref 56
Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models · cs.CL · 2026-03-18 · unverdicted · none · ref 9
