hub Baseline reference

Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning

· 2023 · arXiv 2209.14610

Baseline reference. 67% of citing Pith papers use this work as a benchmark or comparison.

13 Pith papers citing it

Baseline 67% of classified citations

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 4 background 2

citation-polarity summary

use dataset 4 background 2

representative citing papers

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

cs.CV · 2026-05-01 · conditional · novelty 8.0 · 2 refs

WildTableBench is the first QA benchmark for naturally occurring table images, where 21 multimodal models were evaluated and only one exceeded 50% accuracy.

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

cs.CL · 2022-11-22 · unverdicted · novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.

From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs

cs.CL · 2026-01-07 · unverdicted · novelty 6.0

FSLR explicitly supervises the initial logical planning step in math problems, boosting LLM accuracy by 3-5% while using 80% fewer training tokens than standard CoT fine-tuning.

LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

cs.CL · 2025-03-10 · unverdicted · novelty 6.0

A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

Multimodal Chain-of-Thought Reasoning in Language Models

cs.CL · 2023-02-02 · accept · novelty 6.0

Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

cs.CV · 2024-07-03 · conditional · novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

ZAYA1-VL-8B Technical Report

cs.CV · 2026-05-08 · unverdicted · novelty 4.0

ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

cs.CV · 2024-04-25 · unverdicted · novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

cs.CV · 2023-09-29 · conditional · novelty 4.0

GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation

cs.CL · 2025-10-08 · unverdicted · novelty 3.0

A survey that categorizes TQA benchmarks and LLM modeling strategies by challenges while identifying underexplored areas such as reinforcement learning.

citing papers explorer

Showing 13 of 13 citing papers.

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild cs.CV · 2026-05-01 · conditional · none · ref 8 · 2 links
WildTableBench is the first QA benchmark for naturally occurring table images, where 21 multimodal models were evaluated and only one exceeded 50% accuracy.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding cs.CV · 2025-04-14 · unverdicted · none · ref 48
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperforming larger models with only 630 vision tokens at 3B scale.
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks cs.CL · 2022-11-22 · unverdicted · none · ref 16
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization cs.AI · 2026-04-22 · unverdicted · none · ref 20
V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
From Implicit to Explicit: Token-Efficient Logical Supervision for Mathematical Reasoning in LLMs cs.CL · 2026-01-07 · unverdicted · none · ref 2
FSLR explicitly supervises the initial logical planning step in math problems, boosting LLM accuracy by 3-5% while using 80% fewer training tokens than standard CoT fine-tuning.
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL cs.CL · 2025-03-10 · unverdicted · none · ref 46
A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 166
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Multimodal Chain-of-Thought Reasoning in Language Models cs.CL · 2023-02-02 · accept · none · ref 25
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output cs.CV · 2024-07-03 · conditional · none · ref 100
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
ZAYA1-VL-8B Technical Report cs.CV · 2026-05-08 · unverdicted · none · ref 163
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites cs.CV · 2024-04-25 · unverdicted · none · ref 74
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) cs.CV · 2023-09-29 · conditional · none · ref 87
GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation cs.CL · 2025-10-08 · unverdicted · none · ref 12
A survey that categorizes TQA benchmarks and LLM modeling strategies by challenges while identifying underexplored areas such as reinforcement learning.

Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer