Mmlongbench: Benchmarking long-context vision-language models effectively and thoroughly

Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman · 2025 · arXiv 2505.10610

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

dataset 3

citation-polarity summary

use dataset 3

representative citing papers

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

cs.AI · 2026-03-24 · unverdicted · novelty 7.0

PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

MMCL-Bench shows that even the strongest frontier multimodal models solve fewer than one-third of tasks requiring recovery and application of visual rules, procedures, and empirical patterns.

SciLens: Multi-modal Scientific Claim Verification with Agentic Entailment and Grounding

cs.CL · 2026-06-18 · unverdicted · novelty 5.0

SciLens introduces an evidence-conditioned atomic entailment framework that grounds claims to modality-specific witnesses in tables and figures, achieving 79.2% macro-F1 on SciClaimEval.

Seed1.8 Model Card: Towards Generalized Real-World Agency

cs.AI · 2026-03-21 · unverdicted · novelty 5.0

Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity

cs.AI · 2026-06-30 · unverdicted · novelty 2.0

Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.

Internalized Reasoning for Long-Context Visual Document Understanding

cs.CV · 2026-03-31

citing papers explorer

Showing 3 of 3 citing papers after filters.

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments cs.AI · 2026-03-24 · unverdicted · none · ref 59
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
Seed1.8 Model Card: Towards Generalized Real-World Agency cs.AI · 2026-03-21 · unverdicted · none · ref 74
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity cs.AI · 2026-06-30 · unverdicted · none · ref 114
Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.

Mmlongbench: Benchmarking long-context vision-language models effectively and thoroughly

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer