A closer look at logical reasoning with LLMs: The choice of tool matters, 2024
3 Pith papers cite this work.
Citing papers by year: 2025 (3)
citing papers explorer
- PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
PuzzleWorld benchmark reveals state-of-the-art AI models solve only 18% of complex puzzlehunt problems with 40% stepwise accuracy, matching novices but trailing enthusiasts, while fine-tuning on traces yields modest gains.
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.
- LLM-Assisted Tool for Joint Generation of Formulas and Functions in Rule-Based Verification of Map Transformations
LLM-assisted pipeline jointly generates logical formulas and executable predicates for rule-based verification of HD map transformations in CommonRoad, evaluated on synthetic bridge and slope scenarios.