Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 2
representative citing papers
A masked-token hit-rate comparison method detects pretraining data membership in black-box LLMs with performance comparable to white-box approaches.
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
citing papers explorer
- Prompt Framing Distorts Count-Based Evaluation of LLM Error Detection: Evidence from Numeric Anchoring
  Anchored prompts inflate count-based F1 by up to 0.79 in LLM error detection while raising span-aware ERRANT F0.5 by only 0.04 on average.
- MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models
  A masked-token hit-rate comparison method detects pretraining data membership in black-box LLMs with performance comparable to white-box approaches.