s2n-bignum-bench is a new benchmark requiring LLMs to synthesize HOL Light proofs for real-world low-level cryptographic assembly code.
miniCTX : Neural theorem proving with (long-) contexts
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4representative citing papers
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
ImProver is an LLM agent using Chain-of-States, error-correction, and retrieval to rewrite Lean proofs for arbitrary user-defined optimization criteria like shortness and readability.
Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.
citing papers explorer
-
s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs
s2n-bignum-bench is a new benchmark requiring LLMs to synthesize HOL Light proofs for real-world low-level cryptographic assembly code.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
ImProver: Agent-Based Automated Proof Optimization
ImProver is an LLM agent using Chain-of-States, error-correction, and retrieval to rewrite Lean proofs for arbitrary user-defined optimization criteria like shortness and readability.
-
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.