User study with 20 novices using ChatGPT identifies recurring AI visualization errors, user prompting issues, trust factors, and collaboration patterns, with distinct failure modes observed on Gemini and Claude.
30 Pith papers cite this work. Polarity classification is still indexing.
roles
background 4
representative citing papers
Introduces a cooperative Recuse Signal for LLM agents and reports 100% recusal in a pilot when the signal is present versus 100% task completion without it.
LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.
SEED is a structural encoding framework using typed actor-flow graphs to describe, evaluate novelty of, and generate experimental designs for AI-enabled science under feasibility and governance constraints.
ALU uses public data to suppress unlearning cost quadratically while characterizing distribution mismatch effects, enabling mass unlearning with maintained utility.
Agency in sustained human-AI chatbot conversations emerges as co-constructed turn by turn through boundary-setting and intention-steering, organized in a new 3-by-4 framework of actors and actions.
RLMF uses quality of model self-judgments to refine RL rankings and select training data, achieving SOTA faithful calibration while preserving accuracy and outperforming standard RL by up to 63%.
The paper defines five AI system categories for public administration and reports that 55% of 91 recent papers leave the system type underspecified while 31% study one type but motivate with another.
In Civilization V self-play, LLMs escalate to nuclear authorization and three prompt interventions do not reliably prevent it, revealing failure pathways where ethical reasoning either fails to surface, fails to appear when prompted, or fails to override strategic factors.
LMs systematically inflate expressed certainty during rewriting, affecting up to 75% of outputs with a 1.5-2x bias toward increasing rather than decreasing certainty, and the effect compounds over iterations.
Exploratory interview study with 17 developers identifies four forms of emergent oversight work for software agents and documents situated challenges and heuristics.
A new framework quantifies faithful confidence expression in large reasoning models by comparing linguistic decisiveness to token probabilities, hidden states, and response consistency, revealing it as a persistent challenge.
LM agents' changeable modules prevent persistent identity and sanction sensitivity, making reputation mechanisms structurally inapplicable and requiring protocol-based behavioral harnesses instead.
An STS case study of MLB's Automated Ball-Strike System reveals that clear rules still require complex sociotechnical translation and calls for practice-based evaluation of automated enforcement systems.
Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.
Repeated interaction with deontologically or utilitarian-programmed LLMs caused lasting shifts in human moral inclinations and policy attitudes toward the embedded principles.
No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms for detect/enforce/recover.
Two linked user studies find that LLM rationale correctness and certainty framing affect trust and decision confidence while presentation format does not, and incorrect rationales increase gaze attention and pupil size.
Empirical analysis of 1,524 AI incident reports shows 83% arise from worker-AI trait misalignments, with 74% of those traceable to developers prioritizing efficiency over precision or personalization.
Mixed-methods study finds AI assistance linked to higher textual overlap with suggestions in writing tasks, and a reflective interface prototype increases user awareness of AI incorporation.
The EU AI Act narrows accountability for multi-agent AI in critical infrastructure by excluding safety components from key explanation and impact assessment rights, and the paper proposes AgentGov-SC, a three-layer architecture with 25 measures to address this through traceability to existing AI and
A literature review concludes that pursuing consensus in data annotation creates biased AI by dismissing subjective disagreements and enforcing geographic hegemony, and proposes mapping diversity instead.
Proposes cryptographic certificates of validity by translating logical policy predicates into succinct proof systems for verifying AI agent actions.
Literature review synthesizing evidence on user skepticism, verification, and reliance with hallucinating AI advisors, noting that output-related cues like warnings show weak effects and that content category has not been experimentally varied.
Vibe Visualizing: How Visualization Novices Try (and Fail) to Generate and Interpret Visualizations with Conversational AI