EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
Title resolution pending
13 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus prior coarse-grained methods.
Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Hard and MMLU-Hard.
FlashFPS accelerates FPS via candidate/iteration pruning and inter-layer caching, delivering 5.16x GPU speedup and 2.69x on accelerators with negligible accuracy loss.
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.
DIRECT uses a three-level multi-agent framework to solve video mashup creation as a multimodal coherency problem, outperforming baselines on a new benchmark.
LLM-ODE integrates large language models into genetic programming to guide symbolic search for governing equations of dynamical systems, outperforming classical GP on 91 test cases in efficiency and solution quality.
DTL-NS introduces hierarchical index trees and LLM inference on item-ID encodings to identify false negatives and perform multi-view hard negative sampling for improved implicit CF recommendation.
IPQA is a new benchmark that measures how well models identify core user intents from history in personalized question answering, finding that performance is poor and declines with greater question complexity.
AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H100 for 1M context serving.
MTServe achieves up to 3.1x speedup for generative recommendation model serving by using hierarchical caches with host RAM and system optimizations while keeping cache hit ratios above 98.5%.
GroupGPT decouples intervention timing from response generation via edge-cloud collaboration for multi-user chats, scoring 4.72/5 on the new MUIR benchmark of 2500 segments while cutting token use by up to 3x and adding privacy sanitization.
Fine-tuned decoder-only LLMs fall into a Semantic Trap on vulnerability detection, achieving high scores on unpaired normal code but failing on paired vulnerable-patched code, semantic perturbations, and gap analysis, while reasoning supervision reduces symptoms at the cost of recall.
citing papers explorer
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
CachePrune: Privacy-Aware and Fine-Grained KV Cache Sharing for Efficient LLM Inference
CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus prior coarse-grained methods.
-
The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate
Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Hard and MMLU-Hard.
-
FlashFPS: Efficient Farthest Point Sampling for Large-Scale Point Clouds via Pruning and Caching
FlashFPS accelerates FPS via candidate/iteration pruning and inter-layer caching, delivering 5.16x GPU speedup and 2.69x on accelerators with negligible accuracy loss.
-
SAGE: A Service Agent Graph-guided Evaluation Benchmark
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.
-
DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing
DIRECT uses a three-level multi-agent framework to solve video mashup creation as a multimodal coherency problem, outperforming baselines on a new benchmark.
-
LLM-ODE: Data-driven Discovery of Dynamical Systems with Large Language Models
LLM-ODE integrates large language models into genetic programming to guide symbolic search for governing equations of dynamical systems, outperforming classical GP on 91 test cases in efficiency and solution quality.
-
Dual-Tree LLM-Enhanced Negative Sampling for Implicit Collaborative Filtering
DTL-NS introduces hierarchical index trees and LLM inference on item-ID encodings to identify false negatives and perform multi-view hard negative sampling for improved implicit CF recommendation.
-
IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering
IPQA is a new benchmark that measures how well models identify core user intents from history in personalized question answering, finding that performance is poor and declines with greater question complexity.
-
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H100 for 1M context serving.
-
MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches
MTServe achieves up to 3.1x speedup for generative recommendation model serving by using hierarchical caches with host RAM and system optimizations while keeping cache hit ratios above 98.5%.
-
GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant
GroupGPT decouples intervention timing from response generation via edge-cloud collaboration for multi-user chats, scoring 4.72/5 on the new MUIR benchmark of 2500 segments while cutting token use by up to 3x and adding privacy sanitization.
-
Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap
Fine-tuned decoder-only LLMs fall into a Semantic Trap on vulnerability detection, achieving high scores on unpaired normal code but failing on paired vulnerable-patched code, semantic perturbations, and gap analysis, while reasoning supervision reduces symptoms at the cost of recall.