Canonical reference

Title resolution pending

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al · 2022

Canonical reference. 86% of citing Pith papers cite this work as background.

34 Pith papers citing it

Background 86% of classified citations

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 6 method 1

citation-polarity summary

background 6 use method 1

representative citing papers

AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 3 refs

MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains are possible via RL reward shaping.

STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering

cs.AI · 2026-04-19 · unverdicted · novelty 7.0

STRIDE uses a meta-planner for entity-agnostic reasoning skeletons and a supervisor for dependency-aware execution to improve retrieval-augmented multi-hop QA.

From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence

cs.SE · 2026-04-10 · conditional · novelty 7.0

Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.

Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

SWD-Bench evaluates repo-level docs through functionality detection, localization, and completion QA tasks on 4170 entries from PRs, showing best docs raise SWE-Agent issue-solving rate by 20%.

User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

cs.IR · 2026-04-04 · unverdicted · novelty 7.0

SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.

BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models

cs.IR · 2026-01-30 · conditional · novelty 7.0

BEAR adds a beam-search-aware regularization to LLM fine-tuning for recommendations that forces positive-item tokens to rank in the top-B candidates at each decoding step to avoid premature pruning.

ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling

cs.AI · 2025-10-16 · unverdicted · novelty 7.0

ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

cs.CR · 2025-10-11 · unverdicted · novelty 7.0

SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.

From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models

cs.CV · 2025-08-03 · unverdicted · novelty 7.0

IMAGEO-Bench evaluates 10 LLMs on image geolocalization across global street scenes, US POIs, and private images, revealing closed-source model advantages and biases favoring high-resource regions.

TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

cs.AI · 2026-05-18 · conditional · novelty 6.0

TeleCom-Bench reveals LLMs reach 90% on telecom intent and entity tasks but drop to 30% on solution generation and root cause analysis in live network scenarios.

Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data

cs.CR · 2026-05-11 · conditional · novelty 6.0

Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.

Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

cs.HC · 2026-04-21 · unverdicted · novelty 6.0

VB-Score shows three major LLMs have severe failures in medical entity recognition and factual consistency, with 13.8% lower performance on chronic conditions affecting older and minority groups, indicating condition-based algorithmic discrimination.

RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety benchmarks while cutting redundant retrieval.

Mitigating Prompt-Induced Cognitive Biases in General-Purpose AI for Software Engineering

cs.SE · 2026-04-18 · unverdicted · novelty 6.0

A prompting method that forces GPAI models to state SE best practices before deciding reduces prompt-induced cognitive biases by 51% on average across eight tested biases.

Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code

cs.SE · 2026-04-13 · unverdicted · novelty 6.0

Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.

GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing

cs.MA · 2026-04-06 · unverdicted · novelty 6.0

GLANCE introduces a bi-loop multi-agent framework with global-local coordination mechanisms that outperforms baselines by up to 33% on music-grounded nonlinear video editing tasks using a new MVEBench benchmark.

TestDecision: Sequential Test Suite Generation via Greedy Optimization and Reinforcement Learning

cs.SE · 2026-04-02 · unverdicted · novelty 6.0

By proving test suite coverage is monotone submodular and training LLMs with RL to maximize marginal gains, TestDecision improves branch coverage 38-52% and bug detection up to 95% over base models on ULT and LiveCodeBench.

RelianceScope: An Analytical Framework for Examining Students' Reliance on Generative AI Chatbots in Problem Solving

cs.HC · 2026-02-18 · unverdicted · novelty 6.0

RelianceScope is a new analytical framework that maps AI reliance into nine engagement patterns across help-seeking and response-use, situated in students' prior knowledge and instructional context, validated on programming course logs.

CKG-LLM: LLM-Assisted Detection of Smart Contract Access Control Vulnerabilities Based on Knowledge Graphs

cs.CR · 2025-12-07 · unverdicted · novelty 6.0

CKG-LLM uses LLMs to generate executable queries over contract knowledge graphs for detecting access control vulnerabilities and reports superior performance versus existing tools.

The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models

cs.AI · 2025-10-23 · unverdicted · novelty 6.0

A TDA-based framework is proposed to assess LLM reasoning trace quality, showing that topological features predict quality better than standard graph metrics.

SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

cs.IR · 2026-04-22 · unverdicted · novelty 5.0

SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.

citing papers explorer

Showing 34 of 34 citing papers.

AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding cs.CV · 2026-05-13 · unverdicted · none · ref 23
AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images cs.CV · 2026-05-12 · unverdicted · none · ref 50 · 3 links
MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains are possible via RL reward shaping.
STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering cs.AI · 2026-04-19 · unverdicted · none · ref 41
STRIDE uses a meta-planner for entity-agnostic reasoning skeletons and a supervisor for dependency-aware execution to improve retrieval-augmented multi-hop QA.
From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence cs.SE · 2026-04-10 · conditional · none · ref 123
Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.
Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development cs.SE · 2026-04-08 · unverdicted · none · ref 41
SWD-Bench evaluates repo-level docs through functionality detection, localization, and completion QA tasks on 4170 entries from PRs, showing best docs raise SWE-Agent issue-solving rate by 20%.
User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation cs.IR · 2026-04-04 · unverdicted · none · ref 33
SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.
BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models cs.IR · 2026-01-30 · conditional · none · ref 65
BEAR adds a beam-search-aware regularization to LLM fine-tuning for recommendations that forces positive-item tokens to rank in the top-B candidates at each decoding step to avoid premature pruning.
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling cs.AI · 2025-10-16 · unverdicted · none · ref 42
ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents cs.CR · 2025-10-11 · unverdicted · none · ref 44
SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior, and outcomes.
From Pixels to Places: A Systematic Benchmark for Evaluating Image Geolocalization Ability in Large Language Models cs.CV · 2025-08-03 · unverdicted · none · ref 38
IMAGEO-Bench evaluates 10 LLMs on image geolocalization across global street scenes, US POIs, and private images, revealing closed-source model advantages and biases favoring high-resource regions.
TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications? cs.AI · 2026-05-18 · conditional · none · ref 58
TeleCom-Bench reveals LLMs reach 90% on telecom intent and entity tasks but drop to 30% on solution generation and root cause analysis in live network scenarios.
Context-Aware Spear Phishing: Generative AI-Enabled Attacks Against Individuals via Public Social Media Data cs.CR · 2026-05-11 · conditional · none · ref 92
Generative AI enables scalable, context-aware spear phishing by extracting profiles from public social media, producing emails that outperform real-world phishing samples in personalization and lower recipient suspicion.
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models cs.CV · 2026-05-08 · unverdicted · none · ref 36
Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning cs.CV · 2026-04-21 · unverdicted · none · ref 65
Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.
Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications cs.HC · 2026-04-21 · unverdicted · none · ref 67
VB-Score shows three major LLMs have severe failures in medical entity recognition and factual consistency, with 13.8% lower performance on chronic conditions affecting older and minority groups, indicating condition-based algorithmic discrimination.
RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation cs.CL · 2026-04-19 · unverdicted · none · ref 42
RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety benchmarks while cutting redundant retrieval.
Mitigating Prompt-Induced Cognitive Biases in General-Purpose AI for Software Engineering cs.SE · 2026-04-18 · unverdicted · none · ref 57
A prompting method that forces GPAI models to state SE best practices before deciding reduces prompt-induced cognitive biases by 51% on average across eight tested biases.
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code cs.SE · 2026-04-13 · unverdicted · none · ref 34
Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
GLANCE: A Global-Local Coordination Multi-Agent Framework for Music-Grounded Non-Linear Video Editing cs.MA · 2026-04-06 · unverdicted · none · ref 44
GLANCE introduces a bi-loop multi-agent framework with global-local coordination mechanisms that outperforms baselines by up to 33% on music-grounded nonlinear video editing tasks using a new MVEBench benchmark.
TestDecision: Sequential Test Suite Generation via Greedy Optimization and Reinforcement Learning cs.SE · 2026-04-02 · unverdicted · none · ref 60
By proving test suite coverage is monotone submodular and training LLMs with RL to maximize marginal gains, TestDecision improves branch coverage 38-52% and bug detection up to 95% over base models on ULT and LiveCodeBench.
RelianceScope: An Analytical Framework for Examining Students' Reliance on Generative AI Chatbots in Problem Solving cs.HC · 2026-02-18 · unverdicted · none · ref 67
RelianceScope is a new analytical framework that maps AI reliance into nine engagement patterns across help-seeking and response-use, situated in students' prior knowledge and instructional context, validated on programming course logs.
CKG-LLM: LLM-Assisted Detection of Smart Contract Access Control Vulnerabilities Based on Knowledge Graphs cs.CR · 2025-12-07 · unverdicted · none · ref 40
CKG-LLM uses LLMs to generate executable queries over contract knowledge graphs for detecting access control vulnerabilities and reports superior performance versus existing tools.
The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models cs.AI · 2025-10-23 · unverdicted · none · ref 17
A TDA-based framework is proposed to assess LLM reasoning trace quality, showing that topological features predict quality better than standard graph metrics.
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition cs.IR · 2026-04-22 · unverdicted · none · ref 40
SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
EviCare: Enhancing Diagnosis Prediction with Deep Model-Guided Evidence for In-Context Reasoning cs.CL · 2026-04-12 · unverdicted · none · ref 39
EviCare uses deep model-guided evidence to enhance LLM in-context reasoning for accurate diagnosis prediction from EHRs, outperforming baselines by 20.65% on average and 30.97% for novel diagnoses on MIMIC datasets.
From Intent to AI Pipelines: A Controlled Agentic Framework for Non-AI Expert Scientists cs.IR · 2026-04-10 · unverdicted · none · ref 64
DDAP is a controlled agentic framework that guides non-experts via four LLM-assisted stages to construct competitive AI pipelines for business, biology, and health domains.
To Copilot and Beyond: 22 AI Systems Developers Want Built cs.SE · 2026-04-09 · unverdicted · none · ref 54
Survey of 860 developers reveals 22 desired AI systems for non-coding tasks with explicit constraints on authority, provenance, and quality signals, framed as bounded delegation where AI handles assembly work but not core craft.
REFLEX: Self-Refining Explainable Fact-Checking via Verdict-Anchored Style Control cs.CL · 2025-11-25 · unverdicted · none · ref 59
REFLEX improves explainable fact-checking by using verdict-anchored style control and self-disagreement signals to disentangle fact from style in LLM outputs, achieving SOTA results with minimal self-refined samples.
Fall into a Pit, Gain in a Wit: Cognitive-Guided Harmful Meme Detection via Misjudgment Risk Pattern Retrieval cs.LG · 2025-10-10 · unverdicted · none · ref 49
PatMD improves harmful meme detection by retrieving misjudgment risk patterns to guide MLLMs, reporting 8.30% average F1 and 7.71% accuracy gains on 6,626 memes across 5 tasks.
SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance cs.AI · 2025-10-09 · unverdicted · none · ref 23
SHE is a new RL framework using stepwise hybrid examination rewards to improve reasoning quality and accuracy in large-scale e-commerce query-product relevance prediction.
Policy-Grounded Dynamic Facet Suggestions for Job Search cs.IR · 2026-05-15 · unverdicted · none · ref 34
A policy-grounded retrieval-augmented framework with SLM scoring generates real-time personalized facet suggestions that boost engagement and job search outcomes.
To Redact, or not to Redact? A Local LLM Approach to Deliberative Process Privilege Classification cs.CL · 2026-05-11 · unverdicted · none · ref 25
A 9B local LLM using chain-of-thought and error-based few-shot prompting classifies deliberative sentences for FOIA redaction with better recall and F2 than prior models and near commercial LLM performance.
SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model cs.AI · 2026-05-08 · unverdicted · none · ref 30
SOM uses a Structural Causal Model to create an explicit graph of opponent observation-to-action links, allowing LLMs to reason along those paths for more accurate and stable predictions in multi-agent settings.
A Reproducibility Study of Metacognitive Retrieval-Augmented Generation cs.IR · 2026-04-21 · unverdicted · none · ref 51
MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer