NVIDIA Nemotron Nano V2 VL

Deshmukh, A · 2025 · arXiv 2511.03929

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

Trustworthy Image Authentication using Forensic Knowledge Graphs

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

Forensic Knowledge Graphs integrate forensic traces, causal dependencies, and scene links via a new authentication network and Iterative Context Refinement to outperform standard detectors and VLMs on detection, localization, and justification.

PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction

cs.CV · 2026-06-17 · unverdicted · novelty 7.0 · 2 refs

PorTEXTO benchmark shows sharp real-world performance drops in pt-PT OCR and finds specialized multilingual data outperforms model size or resolution increases.

RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

RealDocBench supplies 1,356 field-level QA questions over 581 real documents and 1,500 annotated pages, evaluating 18 systems on per-field accuracy, cost, and latency.

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on private CCTV and AccidentBench tasks.

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.

Multimodal Data Curation Through Ranked Retrieval

cs.IR · 2026-05-01 · unverdicted · novelty 7.0

Symmetric Nucleus Subsampling and Expert Embedding Engine reduce modality gaps in multimodal embeddings by over 90% and outperform baselines in data curation for downstream models.

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

cs.CV · 2026-06-17 · unverdicted · novelty 6.0 · 3 refs

Introduces AMALIA-VL, the first open-source instruction-tuned LVLM for European Portuguese, using a high-resolution vision encoder, pt-PT language model, learned connector, and three-stage training on a custom data mix.

LinMU: Multimodal Understanding Made Linear

cs.CV · 2026-01-04 · conditional · novelty 6.0

LinMU achieves linear-complexity multimodal understanding by swapping self-attention for an M-MATE dual-branch block and distilling from a frozen teacher VLM, matching accuracy with up to 2.7x faster TTFT and 9x higher throughput.

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

cs.CV · 2026-05-10 · unverdicted · novelty 5.0

A neuro-symbolic engine generates GeoSym127K, a 127K-question dataset with symbolic ground truths and verified CoT pairs, yielding +22.21% gains on MathVerse Vision-Only after SFT on Qwen3-VL-8B.

citing papers explorer

Showing 11 of 11 citing papers.

DataComp-VLM: Improved Open Datasets for Vision-Language Models
cs.CV · 2026-06-26 · conditional · none · ref 58 · 2 links

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
cs.CV · 2026-04-10 · accept · none · ref 11

Trustworthy Image Authentication using Forensic Knowledge Graphs
cs.CV · 2026-06-22 · unverdicted · none · ref 22

PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction
cs.CV · 2026-06-17 · unverdicted · none · ref 5 · 2 links

RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents
cs.CV · 2026-06-05 · unverdicted · none · ref 17

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
cs.CV · 2026-05-21 · unverdicted · none · ref 10

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
cs.CV · 2026-05-14 · unverdicted · none · ref 83

Multimodal Data Curation Through Ranked Retrieval
cs.IR · 2026-05-01 · unverdicted · none · ref 19

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model
cs.CV · 2026-06-17 · unverdicted · none · ref 37 · 3 links

LinMU: Multimodal Understanding Made Linear
cs.CV · 2026-01-04 · conditional · none · ref 5

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning
cs.CV · 2026-05-10 · unverdicted · none · ref 18
