U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
A categorical archive of chatgpt failures,
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
Prompt framing significantly shifts LLM choices toward risk-averse options in a threshold voting task even when the prompts are logically equivalent.
LLM code generation lacks syntactic robustness on math-formula prompts, but formula-reduction pre-processing raises it from 54.05% to 74.42%.
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
ChatGPT often generates code vulnerable to attacks even when prompted to produce secure code.
AI and NLP applied to educational artifacts within the Instructional Core Framework can identify advantages for teacher coaching, student support, and personalized learning.
citing papers explorer
-
U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
-
Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis
Prompt framing significantly shifts LLM choices toward risk-averse options in a threshold voting task even when the prompts are logically equivalent.
-
Assessing, Exploiting, and Mitigating Syntactic Robustness Failures in LLM-Based Code Generation
LLM code generation lacks syntactic robustness on math-formula prompts, but formula-reduction pre-processing raises it from 54.05% to 74.42%.
-
TrustLLM: Trustworthiness in Large Language Models
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
-
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
-
How Secure is Code Generated by ChatGPT?
ChatGPT often generates code vulnerable to attacks even when prompted to produce secure code.
-
Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts
AI and NLP applied to educational artifacts within the Instructional Core Framework can identify advantages for teacher coaching, student support, and personalized learning.