A Categorical Archive of ChatGPT Failures

Borji, Ali · Apr 2023 · arXiv 2302.03494

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Consistency Training while Mitigating Obfuscation via Rate Matching

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.

U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning

cs.AI · 2026-05-04 · unverdicted · novelty 5.0

U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.

Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis

cs.CL · 2026-03-02 · unverdicted · novelty 5.0

Prompt framing significantly shifts LLM choices toward risk-averse options in a threshold voting task even when the prompts are logically equivalent.

Assessing, Exploiting, and Mitigating Syntactic Robustness Failures in LLM-Based Code Generation

cs.SE · 2024-04-01 · unverdicted · novelty 5.0

LLM code generation lacks syntactic robustness on math-formula prompts, but formula-reduction pre-processing raises it from 54.05% to 74.42%.

TrustLLM: Trustworthiness in Large Language Models

cs.CL · 2024-01-10 · unverdicted · novelty 5.0

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

cs.AI · 2023-08-10 · accept · novelty 5.0

Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

How Secure is Code Generated by ChatGPT?

cs.CR · 2023-04-19 · unverdicted · novelty 4.0

ChatGPT often generates code vulnerable to attacks even when prompted to produce secure code.

Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering

cs.CY · 2026-05-08 · unverdicted · novelty 3.0

LLM graders achieve substantial human agreement on math and science MCAS items but vary on ELA, performing best as sources of formative narrative feedback rather than summative numerical scores.

Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts

cs.AI · 2024-03-06 · unverdicted · novelty 3.0

AI and NLP applied to educational artifacts within the Instructional Core Framework can identify advantages for teacher coaching, student support, and personalized learning.

citing papers explorer

Showing 8 of 8 citing papers after filters.

Consistency Training while Mitigating Obfuscation via Rate Matching

cs.CL · 2026-06-01 · unverdicted · none · ref 46

U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning

cs.AI · 2026-05-04 · unverdicted · none · ref 11

Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis

cs.CL · 2026-03-02 · unverdicted · none · ref 5

Assessing, Exploiting, and Mitigating Syntactic Robustness Failures in LLM-Based Code Generation

cs.SE · 2024-04-01 · unverdicted · none · ref 30

TrustLLM: Trustworthiness in Large Language Models

cs.CL · 2024-01-10 · unverdicted · none · ref 214

How Secure is Code Generated by ChatGPT?

cs.CR · 2023-04-19 · unverdicted · none · ref 21

Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering

cs.CY · 2026-05-08 · unverdicted · none · ref 83

Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts

cs.AI · 2024-03-06 · unverdicted · none · ref 6
