hub Canonical reference

Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, et al.

Bordes, F · 2024 · arXiv 2405.17247

Canonical reference. 100% of citing Pith papers cite this work as background.

28 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning

cs.LG · 2026-05-08 · unverdicted · novelty 8.0 · 2 refs

Softmax Transformers implement in-context RL through equivalence to weighted softmax TD updates, with error decay under contraction and parameters as global minimizers of pretraining loss.

Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

With specific linear Transformer parameters, CoT generation equals iterative TD updates, yielding geometric error decay with CoT length until a context-length statistical floor, and those parameters globally minimize the pretraining loss.

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing that most models lag human baselines, especially in transformation and configuration.

Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

Zero-shot VLMs like GPT-5.2 achieve very low error rates on random forgeries in signature verification but perform poorly on skilled forgeries and are harmed by chain-of-thought reasoning.

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0 · 2 refs

Chain of Evidence introduces a retriever-agnostic visual attribution method for iRAG that reasons over document screenshots with VLMs to output precise bounding boxes, outperforming text baselines on Wiki-CoE and SlideVQA.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

Training-Free Semantic Multi-Object Tracking with Vision-Language Models

cs.CV · 2026-04-15 · conditional · novelty 7.0

TF-SMOT composes pretrained vision-language models into a training-free pipeline that reaches state-of-the-art tracking and improved summary quality on the BenSMOT benchmark.

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.

Reflective Dialogue between Teacher and Solver Agents for Video Question Answering

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

A multi-turn reflective dialogue between Teacher and Solver agents constructs richer context from support examples than standard in-context learning, improving video QA on the EgoCross benchmark.

OrganicHAR: Towards Activity Discovery in Organic Settings for Privacy Preserving Sensors Using Efficient Video Analysis

cs.HC · 2026-05-18 · unverdicted · novelty 6.0

OrganicHAR discovers 4-8 activity categories per user from sensor signals, achieves 79% accuracy on coarse activities with ambient sensors alone and cuts VLM queries by 90% by triggering video analysis only at detected pattern moments.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

math.OC · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.

MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

cs.DC · 2026-05-09 · unverdicted · novelty 6.0

MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

cs.CV · 2026-04-22 · unverdicted · novelty 6.0 · 2 refs

Proposes the Modality Translation Protocol with metrics ToS, CoS, FoS and SSC to quantify visual knowledge bottlenecks in VLMs, plus a Divergence Law hypothesis that scaling language models may increase the penalty.

MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

cs.AI · 2025-11-17 · unverdicted · novelty 6.0

MM-Telco creates multimodal benchmarks for telecom and demonstrates that fine-tuned LLMs and VLMs achieve significant performance gains on domain-specific tasks.

SemanticOpt: Towards LLM-Based Semantic Black-Box Optimization

cs.LG · 2025-10-29 · unverdicted · novelty 6.0

SemanticOpt fine-tunes LLMs on structured Bayesian optimization trajectories augmented with natural-language context to jointly use numerical and semantic evidence for black-box optimization.

Benchmarking and Mitigating Sycophancy in Medical Vision Language Models

cs.CV · 2025-09-26 · unverdicted · novelty 6.0 · 2 refs

The paper benchmarks sycophancy in medical VLMs using hierarchical VQA templates and proposes VIPER to filter non-evidence social cues, reducing sycophancy while preserving interpretability.

Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition

cs.CV · 2025-04-09 · unverdicted · novelty 6.0

Introduces FoodNExTDB dataset and EWR metric to benchmark VLMs for food recognition, showing closed-source models achieve over 90% EWR on single-product images but struggle with fine-grained distinctions.

Tokenizing Single-Channel EEG with Time-Frequency Motif Learning

cs.LG · 2025-02-22 · unverdicted · novelty 6.0

TFM-Tokenizer learns a vocabulary of time-frequency motifs from single-channel EEG via a dual-path masked architecture and encodes signals into discrete tokens, reporting up to 11% Cohen's Kappa gains on benchmarks and 14% on ear-EEG sleep staging.

AC3S: Adaptive Conditioning for 3D-Aware Synthetic Data Generation

cs.CV · 2026-06-30 · unverdicted · novelty 5.0

AC3S adds a self-supervised visual prompt modulator to ControlNet diffusion and a multi-agent VLM prompt composer to generate photorealistic images with accurate 2D/3D annotations while avoiding over-conditioning.

MVPruner: Dynamic Token Pruning for Accelerating Multi-view Vision-Language Models in Autonomous Driving

cs.CV · 2026-06-26 · unverdicted · novelty 5.0 · 2 refs

MVPruner is a two-stage adaptive token pruning technique for multi-view VLMs that achieves 87.3% FLOPs reduction and 4.97x prefilling speedup while retaining 98.5% accuracy on DriveLM.

Do Vision-Language Models See Dwarf Galaxies the Way We Do?

astro-ph.IM · 2026-06-05 · unverdicted · novelty 5.0

Zero-shot VLMs reproduce aggregate human annotations on dwarf galaxy detection but exhibit high per-example variability and unreliable self-reported confidence.

When Meaning Travels: A Granular Lens on Hybrid-MoE's Role in Idiomatic Understanding for Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 5.0

HybridMoE with controlled hybridization and idiomatic property signals yields 5-6% gains in figurative language representation for multilingual vision-language models.

MGVQ: Synergizing Multi-dimensional Sensitivity-Aware and Gradient-Hessian Fusion for Vector Quantization

cs.CV · 2026-05-20 · unverdicted · novelty 5.0

MGVQ introduces sensitivity-aware structured mixed-precision VQ and gradient-aware second-order error compensation using Kronecker and Block-LDL decompositions, reporting gains of up to 4.9 points over prior methods at 2-bit on models like InternVL2-26B.
