Conceptual steganography encodes covert information in high-level reasoning patterns within LM chains-of-thought, remaining robust to paraphrase defenses while preserving reasoning utility.
Large Language Models are Better Reasoners with Self-Verification
6 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background 1
Citation polarities: background 1
Representative citing papers
Schützen is a German-Bulgarian LLM safety dataset showing pronounced cross-language differences in model safety behavior.
A new benchmark shows LLM first-answer accuracy on procedural arithmetic drops from 63% (5 steps) to 20% (95 steps) due to execution failures like skipped steps and premature answers.
ThinkBooster supplies a modular library, joint performance-efficiency benchmark, and deployable proxy for test-time compute scaling of LLM reasoning on math and coding tasks.
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.