Hello gpt-4o, 2024
5 Pith papers cite this work.
citing papers explorer
- We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
  WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
- Narrative-Driven Paper-to-Slide Generation via ArcDeck
  ArcDeck models paper-to-slide generation as narrative reconstruction using discourse parsing and multi-agent refinement, plus a new ArcBench benchmark, to improve flow and coherence over direct summarization.
- SPaCe: Unlocking Sample-Efficient Large Language Models Training With Self-Pace Curriculum Learning
  SPaCe uses semantic clustering to shrink training sets and a multi-armed bandit to adaptively select samples, matching or beating baselines on reasoning benchmarks with up to 100x fewer examples.
- Long Context Transfer from Language to Vision
  Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
- Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems
  STILL-2 uses imitation of distilled long-form thoughts, multi-rollout exploration on difficult problems, and iterative self-improvement of the dataset to train reasoning models that reach competitive performance on three challenging benchmarks.