Iterative solvers in layer-wise model merging act as spectral regularizers on an ill-posed interference operator; closed-form SWUDI and adaptive SWUDI-A match or exceed SOTA merging accuracy with 28-72x wall-clock speedup.
hub Canonical reference
G-LLaVA: Solving Geometric Problem with Multi- Modal Large Language Model
Canonical reference. 71% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.
A procedural engine generates 200k+ synthetic geometry diagrams to fine-tune VLMs for referring image segmentation on abstract diagrams, yielding 49% IoU and 85% Buffered IoU with Florence-2 versus under 1% zero-shot.
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
BiNSGPS proposes bidirectional neuro-symbolic interaction where an MLLM adviser uses symbolic solver feedback to rectify formal representations and propose hypotheses for geometry problem solving.
TRON supplies 520 rule-verifiable online visual reasoning environments across five ability buckets that generate unlimited training instances for RL post-training, yielding consistent gains on ten external multimodal benchmarks for three vision-language models.
Zamba2-VL is a family of 1.2B–7B hybrid Mamba2-transformer vision-language models that match leading transformer VLMs on image, reasoning, OCR, grounding and counting benchmarks while delivering roughly 10x lower time-to-first-token.
Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.
AutoTool uses dual-mode RL to let MLLMs adaptively choose tool use or text-only reasoning, reporting 21.8% accuracy gain on V* and 44.9% efficiency gain on POPE versus baselines.
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
MathFlow decouples perception and inference stages in MLLMs for visual math, with a dedicated perception model delivering gains on the FlowVerse benchmark when paired with existing reasoners.
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.
NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.
CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.
Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.
An integrated survey organizing AI mathematical reasoning into informal, formal, discovery, and technique axes while cataloging benchmarks and assessing failure modes.
citing papers explorer
-
From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.
-
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
AutoTool uses dual-mode RL to let MLLMs adaptively choose tool use or text-only reasoning, reporting 21.8% accuracy gain on V* and 44.9% efficiency gain on POPE versus baselines.
-
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.