{"total":14,"items":[{"citing_arxiv_id":"2605.23189","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Empirical Bayes Conformal Prediction for Vision and Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-22T03:17:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical Bayes conformal prediction converts score variability into r-value nonconformity scores that preserve target coverage while reducing inclusion of high-variance false candidates in image classification, CLIP VLMs, and LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20628","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation","primary_cat":"cs.CL","submitted_at":"2026-05-20T02:25:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DPR-BAG generates factually grounded biomedical abstracts from full texts via structured BOMRC decomposition, parallel LLM prompting, and coherence refinement without any model training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15393","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling","primary_cat":"cs.LG","submitted_at":"2026-05-14T20:26:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning 
on difficult examples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06423","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-07T15:29:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PopQuiz Attack infers LLM training data membership by turning examples into quiz questions and measuring answer accuracy, reaching 0.873 average ROC-AUC across six models and outperforming prior methods by 20.6%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06327","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity","primary_cat":"cs.CL","submitted_at":"2026-05-07T14:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03111","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Benchmarking Local Language Models for Social Robots using Edge Devices","primary_cat":"cs.RO","submitted_at":"2026-05-04T19:49:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Benchmarking 25 LLMs on Raspberry Pi hardware shows Granite4 Tiny Hybrid (7B) balances 2.5 tokens/s, 0.90 
tokens/J, and 54.6% MMLU while teaching effectiveness does not require high general knowledge scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23788","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks","primary_cat":"cs.CV","submitted_at":"2026-04-26T16:25:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MIRAGE improves VLM analysis of multi-figure art by inserting a verifiable structured representation of micro-interactions between spatial grounding and narrative output.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of Attention: Visual Attention, Social Cognition, and Individual Differences. Psychological Bulletin133, 4 (2007), 694-724. doi:10.1037/0033-2909.133.4.694 [12] Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. InAdvances in psy- chology. Vol. 52. Elsevier, 139-183. [13] Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X. Wang, and Sadid Hasan. 2024. Does Prompt Formatting Have Any Impact on LLM Performance? doi:10.48550/arXiv.2411.10541 arXiv:2411.10541 [cs]. [14] Roy S. Hessels. 2020. How does gaze to faces support face-to-face interaction? 
A review and perspective.Psychonomic Bulletin & Review27, 5 (2020), 856-881."},{"citing_arxiv_id":"2604.17497","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generative AI Technologies, Techniques & Tensions: A Primer","primary_cat":"cs.CY","submitted_at":"2026-04-19T15:32:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Generative AI systems arise from statistical data processing that produces human-like outputs, creating a mismatch with traditional computer expectations and positioning educational researchers to lead in studying and applying them.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17458","ref_index":138,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval","primary_cat":"cs.AI","submitted_at":"2026-04-19T14:18:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EHRAG constructs structural hyperedges from sentence co-occurrence and semantic hyperedges from entity embedding clusters, then applies hybrid diffusion plus topic-aware PPR to retrieve top-k documents, outperforming baselines on four datasets with linear indexing cost and zero token overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14930","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IE as Cache: Information Extraction Enhanced Agentic Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-16T12:18:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IE-as-Cache framework repurposes information extraction as a 
dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"structures are rarely maintained or updated for iterative reason- ing, limiting their flexibility and scalability for general-purpose LLM reasoning [23]. c)Structured Reasoning Architectures:Recent ad- vances in prompting highlight the critical role of structured information in enhancing LLM reasoning capabilities. Studies show thattemplate structurescan significantly affect reasoning fidelity [24]. The Chain-of-Table paradigm [25] leverages tabular representations for numerical reasoning, while [26] for- malizestable-structured thoughtfor multi-task generalization. However, these methods largely emphasizeoutput formatting and intermediate traces, leaving a gap in optimizing theinput. Our work addresses this gap by utilizing IE as a cognitive"},{"citing_arxiv_id":"2604.16420","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Breaking Validity-Induced Boundaries to Expand Algorithm Search Space: A Two-Stage AST-Based Operator for LLM-Driven Automated Heuristic Evolution","primary_cat":"cs.NE","submitted_at":"2026-04-03T07:35:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A two-stage AST-based crossover and mutation operator with LLM repair expands the search space in LLM-driven heuristic evolution and improves performance on TSP and online bin packing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.20284","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Can LLMs Make (Personalized) Access Control 
Decisions?","primary_cat":"cs.CR","submitted_at":"2025-11-25T13:11:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs reflect users' privacy preferences in access control decisions with up to 86% agreement and can promote safer behavior, but personalization trades off higher individual match for potentially less secure results when users over-permission.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.21850","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Visual Compositional Tuning","primary_cat":"cs.CV","submitted_at":"2025-04-30T17:57:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"COMPACT synthesizes compositional visual instruction data to reduce VIT training data by 90% while achieving 100.2% of full performance across eight multimodal benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.13657","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Why Do Multi-Agent LLM Systems Fail?","primary_cat":"cs.AI","submitted_at":"2025-03-17T19:04:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[58] Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X Wang, and Sadid Hasan. Does prompt formatting have any impact on llm performance?arXiv preprint arXiv:2411.10541, 2024. [59] Yashar Talebirad and Amirhossein Nadiri. 
Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314, 2023. [60] Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023. [61] Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification."}],"limit":50,"offset":0}