Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157
10 Pith papers cite this work, alongside 3,350 external citations. Polarity classification is still indexing.
citing papers explorer
- ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis
  LLM reasoning traces can be compiled into reusable symbolic solvers that achieve high accuracy on program synthesis benchmarks at zero inference cost and transfer to other domains.
- How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews
  AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries especially controversial ones, and are less consistent across repeated or slightly edited queries.
- Evaluating Plan Compliance in Autonomous Programming Agents
  Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade it more than no plan.
- Evaluating LLM Agents on Automated Software Analysis Tasks
  A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
- A Large-Scale Empirical Study of AI-Generated Code in Real-World Repositories
  A large-scale study of real-world repositories finds that AI-generated code differs from human-written code in complexity, structural traits, defect indicators, and commit-level activity patterns.
- Spectral Vision Transformer for Efficient Tokenization with Limited Data
  A spectral vision transformer achieves equitable or superior performance with fewer parameters than standard ViTs, CNNs, and other models by using spectral projections for tokenization in limited-data medical imaging.
- SOCpilot: Verifying Policy Compliance for LLM-Assisted Incident Response
  SOCpilot supplies a fixed verifier and public artifact that removes 466 non-compliant approval-gated actions from LLM plans on 200 real incidents while preserving task recall.
- Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy
  TREX detects rectal cancer local regrowth from longitudinal endoscopy image pairs with 97% sensitivity and enables early prediction 3-12 months before clinical confirmation, outperforming baselines.
- Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes
  Crowdsourced judgments reliably flag authentic videos but frequently miss manipulations and struggle to identify whether changes are audio-only, video-only, or both.
- An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code
  Locally deployed LLMs achieve 43-45% accuracy on Python bug detection but frequently produce only partial identifications of problematic code regions.