Large Language Models are Better Reasoners with Self-Verification
6 Pith papers cite this work.
citing papers explorer
- Conceptual Steganography
Conceptual steganography encodes covert information in high-level reasoning patterns within LM chains-of-thought, remaining robust to paraphrase defenses while preserving reasoning utility.
- Schützen: Evaluating LLM Safety in Bulgarian and German Contexts
Schützen is a German-Bulgarian LLM safety dataset showing pronounced cross-language differences in model safety behavior.
- When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
A new benchmark shows LLM first-answer accuracy on procedural arithmetic drops from 63% (5 steps) to 20% (95 steps) due to execution failures like skipped steps and premature answers.
- ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning
ThinkBooster supplies a modular library, joint performance-efficiency benchmark, and deployable proxy for test-time compute scaling of LLM reasoning on math and coding tasks.
- A Survey on LLM-as-a-Judge
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.