AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
Canonical reference
Title resolution pending
Canonical reference. 86% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
representative citing papers
MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains are possible via RL reward shaping.
STRIDE uses a meta-planner for entity-agnostic reasoning skeletons and a supervisor for dependency-aware execution to improve retrieval-augmented multi-hop QA.
Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.
SWD-Bench evaluates repo-level docs through functionality detection, localization, and completion QA tasks on 4170 entries from PRs, showing best docs raise SWE-Agent issue-solving rate by 20%.
SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.
BEAR adds a beam-search-aware regularization to LLM fine-tuning for recommendations that forces positive-item tokens to rank in the top-B candidates at each decoding step to avoid premature pruning.
ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.
IMAGEO-Bench evaluates 10 LLMs on image geolocalization across global street scenes, US POIs, and private images, revealing closed-source model advantages and biases favoring high-resource regions.
TeleCom-Bench reveals LLMs reach 90% on telecom intent and entity tasks but drop to 30% on solution generation and root cause analysis in live network scenarios.
Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.
Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.
VB-Score shows three major LLMs have severe failures in medical entity recognition and factual consistency, with 13.8% lower performance on chronic conditions affecting older and minority groups, indicating condition-based algorithmic discrimination.
RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety benchmarks while cutting redundant retrieval.
A prompting method that forces GPAI models to state SE best practices before deciding reduces prompt-induced cognitive biases by 51% on average across eight tested biases.
Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
GLANCE introduces a bi-loop multi-agent framework with global-local coordination mechanisms that outperforms baselines by up to 33% on music-grounded nonlinear video editing tasks using a new MVEBench benchmark.
By proving test suite coverage is monotone submodular and training LLMs with RL to maximize marginal gains, TestDecision improves branch coverage 38-52% and bug detection up to 95% over base models on ULT and LiveCodeBench.
RelianceScope is a new analytical framework that maps AI reliance into nine engagement patterns across help-seeking and response-use, situated in students' prior knowledge and instructional context, validated on programming course logs.
CKG-LLM uses LLMs to generate executable queries over contract knowledge graphs for detecting access control vulnerabilities and reports superior performance versus existing tools.
A TDA-based framework is proposed to assess LLM reasoning trace quality, showing that topological features predict quality better than standard graph metrics.
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
citing papers explorer
-
AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding
AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
-
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains are possible via RL reward shaping.
-
STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering
STRIDE uses a meta-planner for entity-agnostic reasoning skeletons and a supervisor for dependency-aware execution to improve retrieval-augmented multi-hop QA.
-
From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence
Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.
-
Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development
SWD-Bench evaluates repo-level docs through functionality detection, localization, and completion QA tasks on 4170 entries from PRs, showing best docs raise SWE-Agent issue-solving rate by 20%.
-
User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation
SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.
-
BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models
BEAR adds a beam-search-aware regularization to LLM fine-tuning for recommendations that forces positive-item tokens to rank in the top-B candidates at each decoding step to avoid premature pruning.
-
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
-
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.
-
From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models
IMAGEO-Bench evaluates 10 LLMs on image geolocalization across global street scenes, US POIs, and private images, revealing closed-source model advantages and biases favoring high-resource regions.
-
TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?
TeleCom-Bench reveals LLMs reach 90% on telecom intent and entity tasks but drop to 30% on solution generation and root cause analysis in live network scenarios.
-
Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data
Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.
-
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
-
Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning
Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.
-
Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications
VB-Score shows three major LLMs have severe failures in medical entity recognition and factual consistency, with 13.8% lower performance on chronic conditions affecting older and minority groups, indicating condition-based algorithmic discrimination.
-
RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation
RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety benchmarks while cutting redundant retrieval.
-
Mitigating Prompt-Induced Cognitive Biases in General-Purpose AI for Software Engineering
A prompting method that forces GPAI models to state SE best practices before deciding reduces prompt-induced cognitive biases by 51% on average across eight tested biases.
-
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code
Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
-
GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing
GLANCE introduces a bi-loop multi-agent framework with global-local coordination mechanisms that outperforms baselines by up to 33% on music-grounded nonlinear video editing tasks using a new MVEBench benchmark.
-
TestDecision: Sequential Test Suite Generation via Greedy Optimization and Reinforcement Learning
By proving test suite coverage is monotone submodular and training LLMs with RL to maximize marginal gains, TestDecision improves branch coverage 38-52% and bug detection up to 95% over base models on ULT and LiveCodeBench.
-
RelianceScope: An Analytical Framework for Examining Students' Reliance on Generative AI Chatbots in Problem Solving
RelianceScope is a new analytical framework that maps AI reliance into nine engagement patterns across help-seeking and response-use, situated in students' prior knowledge and instructional context, validated on programming course logs.
-
CKG-LLM: LLM-Assisted Detection of Smart Contract Access Control Vulnerabilities Based on Knowledge Graphs
CKG-LLM uses LLMs to generate executable queries over contract knowledge graphs for detecting access control vulnerabilities and reports superior performance versus existing tools.
-
The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models
A TDA-based framework is proposed to assess LLM reasoning trace quality, showing that topological features predict quality better than standard graph metrics.
-
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
-
EviCare: Enhancing Diagnosis Prediction with Deep Model-Guided Evidence for In-Context Reasoning
EviCare uses deep model-guided evidence to enhance LLM in-context reasoning for accurate diagnosis prediction from EHRs, outperforming baselines by 20.65% on average and 30.97% for novel diagnoses on MIMIC datasets.
-
From Intent to AI Pipelines: A Controlled Agentic Framework for Non-AI Expert Scientists
DDAP is a controlled agentic framework that guides non-experts via four LLM-assisted stages to construct competitive AI pipelines for business, biology, and health domains.
-
To Copilot and Beyond: 22 AI Systems Developers Want Built
Survey of 860 developers reveals 22 desired AI systems for non-coding tasks with explicit constraints on authority, provenance, and quality signals, framed as bounded delegation where AI handles assembly work but not core craft.
-
REFLEX: Self-Refining Explainable Fact-Checking via Verdict-Anchored Style Control
REFLEX improves explainable fact-checking by using verdict-anchored style control and self-disagreement signals to disentangle fact from style in LLM outputs, achieving SOTA results with minimal self-refined samples.
-
Fall into a Pit, Gain in a Wit: Cognitive-Guided Harmful Meme Detection via Misjudgment Risk Pattern Retrieval
PatMD improves harmful meme detection by retrieving misjudgment risk patterns to guide MLLMs, reporting 8.30% average F1 and 7.71% accuracy gains on 6,626 memes across 5 tasks.
-
SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance
SHE is a new RL framework using stepwise hybrid examination rewards to improve reasoning quality and accuracy in large-scale e-commerce query-product relevance prediction.
-
Policy-Grounded Dynamic Facet Suggestions for Job Search
A policy-grounded retrieval-augmented framework with SLM scoring generates real-time personalized facet suggestions that boost engagement and job search outcomes.
-
To Redact, or not to Redact? A Local LLM Approach to Deliberative Process Privilege Classification
A 9B local LLM using chain-of-thought and error-based few-shot prompting classifies deliberative sentences for FOIA redaction with better recall and F2 than prior models and near commercial LLM performance.
-
SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model
SOM uses a Structural Causal Model to create an explicit graph of opponent observation-to-action links, allowing LLMs to reason along those paths for more accurate and stable predictions in multi-agent settings.
-
A Reproducibility Study of Metacognitive Retrieval-Augmented Generation
MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.