{"work":{"id":"008df105-2fdd-45d8-857a-8e35868aecb6","openalex_id":null,"doi":null,"arxiv_id":"2507.06261","raw_key":null,"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","authors":null,"authors_text":"Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al","year":2025,"venue":"cs.CL","abstract":"In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.","external_url":"https://arxiv.org/abs/2507.06261","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-07-04T10:49:46.158493+00:00","pith_arxiv_id":"2507.06261","created_at":"2026-05-08T17:43:52.623161+00:00","updated_at":"2026-07-04T10:49:46.158493+00:00","title_quality_ok":true,"display_title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","render_title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities"},"hub":{"state":{"work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","tier":"mega_hub","tier_reason":"1,000+ Pith inbound or 100,000+ external citations","pith_inbound_count":1004,"external_cited_by_count":null,"distinct_field_count":37,"first_pith_cited_at":"2024-06-12T09:36:52+00:00","last_pith_cited_at":"2026-07-02T17:53:15+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"needed","recognition_status":"needed","updated_at":"2026-07-04T11:56:36.749999+00:00","tier_text":"mega_hub"},"tier":"mega_hub","role_counts":[{"context_role":"background","n":122},{"context_role":"baseline","n":46},{"context_role":"method","n":28},{"context_role":"other","n":8},{"context_role":"dataset","n":3}],"polarity_counts":[{"context_polarity":"background","n":114},{"context_polarity":"baseline","n":47},{"context_polarity":"use_method","n":28},{"context_polarity":"unclear","n":12},{"context_polarity":"support","n":3},{"context_polarity":"use_dataset","n":3}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","claims":[{"claim_text":"In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. G","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T18:33:44.567304+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"1dd80648-3706-4464-ae47-89679affde20","orcid":null,"display_name":"Gheorghe Comanici"},{"id":"d21b2627-45b9-4dd7-9249-aeef44ee6939","orcid":null,"display_name":"Eric Bieber"},{"id":"d8b13b89-0a41-4588-a637-3e08010961ec","orcid":null,"display_name":"Mike Schaekermann"},{"id":"fa69cd7f-32a8-453d-8ccb-05c7b17fdad2","orcid":null,"display_name":"Ice Pasupat"},{"id":"5f9ceda9-fb6a-4600-a730-462fac7b8b5e","orcid":null,"display_name":"Noveen Sachdeva"},{"id":"7f648965-8527-4843-b678-cf08435f4b0e","orcid":null,"display_name":"Inderjit Dhillon"}]},"error":null,"updated_at":"2026-05-13T18:23:33.443309+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-13T18:23:32.775358+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":121},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":85},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":83},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":74},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":72},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":64},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":62},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":50},{"title":"OpenAI GPT-5 System Card","work_id":"ca87689a-0d29-4476-b504-b65dbbb08af4","shared_citers":49},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":42},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":39},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":37},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":36},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":32},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":29},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":28},{"title":"Qwen2.5-Omni Technical Report","work_id":"438f105c-fa9b-44aa-ad52-43acb8045cda","shared_citers":26},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":24},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":24},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":24},{"title":"Qwen3-Omni Technical Report","work_id":"ae43e594-8bab-4471-b6af-92a300f6a048","shared_citers":22},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":21},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":20},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":20}],"time_series":[{"n":3,"year":2025},{"n":368,"year":2026}]},"error":null,"updated_at":"2026-05-13T17:25:56.108545+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"fixed":1,"items":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-13T18:23:32.148408+00:00"},"reader_index":{"job_type":"reader_index","status":"succeeded","result":{"note":"annotated reader requires full-text/OA fetch; shell is wired for mega hubs","status":"reader queued"},"error":null,"updated_at":"2026-07-04T10:36:59.856299+00:00"},"recognition_alignment":{"job_type":"recognition_alignment","status":"succeeded","result":{"modules":["IndisputableMonolith.Gravity.PropagationSpeed","IndisputableMonolith.Foundation.PreTemporalForcingOrder","IndisputableMonolith.Physics.LightConeCausalityFromRS","IndisputableMonolith.Cosmology.EtaBPrefactorDerivation","IndisputableMonolith.Physics.MaxwellEquationsFromRS","IndisputableMonolith.Gravity.BlackHoleEntropyFromLedger","IndisputableMonolith.Thermodynamics.FermiDirac","IndisputableMonolith.Gravity.BlackHoleHorizonStates"],"query_chars":1129},"error":null,"updated_at":"2026-07-04T10:36:59.854834+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","claims":[{"claim_text":"In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. G","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T18:23:32.777251+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","claims":[{"claim_text":"In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. G","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-13T17:25:52.723013+00:00"}},"summary":{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","claims":[{"claim_text":"In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. G","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":121},{"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","shared_citers":85},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":83},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":74},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":72},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":64},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":62},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":50},{"title":"OpenAI GPT-5 System Card","work_id":"ca87689a-0d29-4476-b504-b65dbbb08af4","shared_citers":49},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":42},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":39},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":37},{"title":"InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models","work_id":"fe8637aa-12bc-4434-8d36-9f57b5eebcbe","shared_citers":36},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":32},{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","work_id":"8abcfe4f-e0fb-44b7-9123-448fac95f90a","shared_citers":29},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":28},{"title":"Qwen2.5-Omni Technical Report","work_id":"438f105c-fa9b-44aa-ad52-43acb8045cda","shared_citers":26},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":24},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":24},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":24},{"title":"Qwen3-Omni Technical Report","work_id":"ae43e594-8bab-4471-b6af-92a300f6a048","shared_citers":22},{"title":"LLaVA-OneVision: Easy Visual Task Transfer","work_id":"f5f2452b-f2a9-49ac-b38d-c76e18cdfe49","shared_citers":21},{"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","shared_citers":20},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":20}],"time_series":[{"n":3,"year":2025},{"n":368,"year":2026}]},"authors":[{"id":"d21b2627-45b9-4dd7-9249-aeef44ee6939","orcid":null,"display_name":"Eric Bieber","source":"manual","import_confidence":0.72},{"id":"1dd80648-3706-4464-ae47-89679affde20","orcid":null,"display_name":"Gheorghe Comanici","source":"manual","import_confidence":0.72},{"id":"fa69cd7f-32a8-453d-8ccb-05c7b17fdad2","orcid":null,"display_name":"Ice Pasupat","source":"manual","import_confidence":0.72},{"id":"7f648965-8527-4843-b678-cf08435f4b0e","orcid":null,"display_name":"Inderjit Dhillon","source":"manual","import_confidence":0.72},{"id":"d8b13b89-0a41-4588-a637-3e08010961ec","orcid":null,"display_name":"Mike Schaekermann","source":"manual","import_confidence":0.72},{"id":"5f9ceda9-fb6a-4600-a730-462fac7b8b5e","orcid":null,"display_name":"Noveen Sachdeva","source":"manual","import_confidence":0.72}]},"citers":{"total":1004,"items":[{"citing_arxiv_id":"2607.02490","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-07-02T17:53:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VRRL trains LVLMs for visually grounded self-reflection via prefix masking and buffered roll-ins, yielding higher out-of-distribution accuracy on grounding and navigation tasks than standard RL baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.02269","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-07-02T14:52:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AnyGroundBench is a domain-adaptation benchmark for spatio-temporal video grounding across animal, industry, sports, surgery, and public security domains that finds 15 state-of-the-art VLMs fail in zero-shot and ICL settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.02096","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension","primary_cat":"cs.CV","submitted_at":"2026-07-02T12:32:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01721","ref_index":77,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoRe: Combined Rewards with Vision-Language Model Feedback for Preference-Aligned Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-07-02T05:20:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CoRe combines VLM-designed formal rewards with VLM-labeled residual rewards to produce preference-aligned policies on robotic manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01707","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LASER: A Corrective Lens for LVLMs via Visual Attention Preservation and Sink Suppression","primary_cat":"cs.CV","submitted_at":"2026-07-02T04:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LASER uses Visual Grounding Reward and Sink Suppression Reward to preserve visual attention trajectories and suppress sink tokens, reducing visual forgetting in LVLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01442","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Forgeries to Foundation Models: A Systematic Survey of Identity Document Attack and Detection","primary_cat":"cs.CR","submitted_at":"2026-07-01T20:05:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A systematic survey unifies presentation, digital injection, and GenAI synthesis attacks on identity documents, audits datasets for a reality gap, identifies SDGI in multimodal models, and reports APCER above 25% for top models on synthetic IDs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01425","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agent4cs: A Multi-agent System for Code Summarization in Large Hierarchical Codebases","primary_cat":"cs.AI","submitted_at":"2026-07-01T19:41:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Agent4cs deploys summarization, keyword-extraction, and quality-assurance agents in a bottom-up pipeline that raises semantic consistency by 8% and normalized keyword coverage by up to 38% over structured prompting baselines on seven frontier models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00873","ref_index":8,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How Ethos and Pathos Appeals Resonate in Reader Interpretations of Social Media Messages","primary_cat":"cs.CL","submitted_at":"2026-07-01T12:37:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Analyses of labeled social media sentences and interpretations show 30% divergence in ethos and pathos, greater variability for charged content, and predictive power for audience attitudes toward the author.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00711","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ClarifyCodeBench: Evaluating LLMs on Clarifying Ambiguous Requirements for Code Generation","primary_cat":"cs.SE","submitted_at":"2026-07-01T09:58:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ClarifyCodeBench is a new benchmark with manual annotations and two metrics showing that LLMs strong at code generation are weak at clarifying ambiguous requirements, with performance worsening as ambiguity density rises.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00465","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning","primary_cat":"cs.CV","submitted_at":"2026-07-01T05:34:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StochasT uses stochastic clustering of language tasks into varying turn depths for the same image to improve LVLMs on both single-turn and multi-turn scenarios without discarding data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00407","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Personalization as Inverse Planning: Learning Latent Design Intents for Agentic Slide Generation via Structural Denoising","primary_cat":"cs.AI","submitted_at":"2026-07-01T04:05:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPIRE approximates page-level slide personalization by training agents to denoise corrupted slide structures via collaborative RL, claiming a proof of consistency as a surrogate for inverse planning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00289","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OnPoint: Offline-to-Online Multi-Level Distillation for Point-Supervised Online Temporal Action Localization","primary_cat":"cs.CV","submitted_at":"2026-07-01T00:32:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OnPoint enables point-supervised online temporal action localization by distilling pseudo-segments, class-activation sequences, and anticipatory windows from an offline teacher to an online student.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.32008","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?","primary_cat":"cs.LG","submitted_at":"2026-06-30T17:43:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prediction agreement between open and closed LLMs substantially overstates agreement on attributions and causal reasons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31504","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SimpleSearch-VL: A Simple Recipe for Multimodal Agentic Deep Search","primary_cat":"cs.CV","submitted_at":"2026-06-30T11:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SimpleSearch-VL improves Qwen3-VL multimodal agent baselines by 15.8-16 points on average using 7K total training examples and reaches parity with Gemini-3-Pro on the 30B variant.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31451","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniTac: A Unified Multimodal Model for Cross-Sensor Tactile Understanding and Generation","primary_cat":"cs.RO","submitted_at":"2026-06-30T10:25:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniTac is the first unified multimodal model for cross-sensor tactile understanding and generation, using dual-level representations, two new understanding tasks, and a two-stage training paradigm with sensor-prior sampling to achieve SOTA understanding and realistic cross-sensor generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31338","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models","primary_cat":"cs.SD","submitted_at":"2026-06-30T08:39:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces an OpenMIC-derived multi-axis benchmark sequence showing that high binary instrument QA accuracy fails to predict robust grounding, with models showing position bias, confusable errors, and temporal bias.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31308","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking Large Language Models on Floating-Point Error Classification","primary_cat":"cs.AI","submitted_at":"2026-06-30T08:18:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces InterFLOPBench benchmark and evaluates 14 LLMs on multi-label classification of six floating-point error categories in C code, with top models exceeding 0.88 overall F1 but lower scores on subtle errors like underflow.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31247","ref_index":270,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model","primary_cat":"cs.SD","submitted_at":"2026-06-30T07:24:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30887","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support","primary_cat":"cs.CL","submitted_at":"2026-06-29T20:22:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TheraJudge, trained via preference optimization on human annotations, reaches high clinician agreement (ICC 0.87-0.95) and, when used by TheraAgent, raises human-rated therapeutic quality by 0.43 points on a 5-point scale with 94% recovery of low-quality responses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30783","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Security--Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense","primary_cat":"cs.CR","submitted_at":"2026-06-29T18:11:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30598","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards in-the-wild Egocentric 3D Hand-Object Pose Estimation","primary_cat":"cs.CV","submitted_at":"2026-06-29T17:38:12+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces EPIC-Contact dataset and HOPformer transformer for in-the-wild egocentric 3D hand-object pose estimation, reporting 82.4% success on ARCTIC and doubled success with 75% lower contact error on the new dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30578","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Uncertainty-Aware Generation and Decision-Making Under Ambiguity","primary_cat":"cs.CL","submitted_at":"2026-06-29T17:20:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Uncertainty-aware algorithms based on Bayesian decision theory improve generation utility on tutoring and reviewing tasks while risk-averse methods can degrade performance under high ambiguity, with conformal prediction providing guarantees.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30498","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On the Faithfulness of Post-Hoc Concept Bottleneck Models","primary_cat":"cs.CV","submitted_at":"2026-06-29T16:02:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Post-hoc CBMs produce unfaithful concept projections due to covariate shifts and systematic label noise; new metrics are introduced to measure faithfulness separately from accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30420","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Experience Augmented Policy Optimization for LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2026-06-29T15:05:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EAPO reuses prior RL policy experience adaptively at decision points in LLM rollouts with adapted importance sampling and reports gains over prior RLVR methods on math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30376","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification","primary_cat":"cs.LG","submitted_at":"2026-06-29T14:37:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FlowAWR derives an advantage-weighted rectification for optimal velocity fields in flow models, claiming 2-5x faster convergence than DiffusionNFT on SD3.5-Medium.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30339","ref_index":106,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"REAR: Test-time Preference Realignment through Reward Decomposition","primary_cat":"cs.CL","submitted_at":"2026-06-29T14:17:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"REAR decomposes the reward into question and preference components, rescales their balance, and expresses the result as a linear combination of token log-probabilities for efficient integration with best-of-N and tree search.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30217","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning","primary_cat":"cs.CL","submitted_at":"2026-06-29T12:30:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30116","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Open Problems in Constitutional Preference Reconstruction","primary_cat":"cs.AI","submitted_at":"2026-06-29T10:47:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis across three datasets identifies three open problems in constitutional preference reconstruction and shows that principle refinement raises inter-executor agreement from 73% to 78%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30054","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-06-29T09:45:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29905","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StrucTab: A Structured Optimization Framework for Table Parsing","primary_cat":"cs.CV","submitted_at":"2026-06-29T07:41:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StrucTab achieves SOTA table parsing performance by unifying structural subtasks through sequential reasoning and using decomposed RL rewards in Uni-TabRL, plus a new TableVerse-5K benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29808","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Making Multimodal LLMs Reliable Chart Data Extractors: A Benchmark and Training Framework","primary_cat":"cs.HC","submitted_at":"2026-06-29T05:40:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a benchmark for MLLM-based chart data extraction from unlabeled images and a human-centered training framework that reaches SOTA numerical accuracy with a 7B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29801","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Concept Removal Guidance: Evidence-Calibrated Negative Guidance for Safe Diffusion Sampling","primary_cat":"cs.CV","submitted_at":"2026-06-29T05:28:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CRG adaptively calibrates negative guidance in diffusion models by estimating concept presence from noise predictions at each step to suppress unwanted content while preserving fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29716","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AerialMetric: Benchmarking and Adapting UAV Monocular Metric Depth Estimation in the Real World","primary_cat":"cs.CV","submitted_at":"2026-06-29T02:48:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AerialMetric is a new benchmark dataset and evaluation suite for adapting monocular metric depth estimation models to real-world UAV aerial views.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29682","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Body as Status: Muscularity, Engagement, and Body Image Risk on #GymTok","primary_cat":"cs.CY","submitted_at":"2026-06-29T01:12:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Content analysis of #GymTok videos finds positive associations between muscularity, perceived harm, and engagement metrics, implying TikTok algorithms may amplify muscular ideals and risky behaviors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29531","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MotionAtlas: Detailed Region Captioning for Motion-Centric Videos","primary_cat":"cs.CV","submitted_at":"2026-06-28T17:54:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MotionAtlas supplies a 2,073-question benchmark, a self-bootstrap pipeline yielding 159k captions, and fine-tuned Video-MLLMs that deliver 5.2-point gains over Qwen3-VL-4B on motion tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29473","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MAVIN: Multi-Shot Audio-Visual Generation with Narrative Control","primary_cat":"cs.CV","submitted_at":"2026-06-28T16:01:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MAVIN proposes boundary-aware attention, ID-aware propagation, a multi-agent scripting pipeline, and the MAVINSet dataset as the first framework for multi-shot audio-visual generation with narrative control, claiming SOTA results.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29225","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PolicyGuard: A Dialogue-Grounded Sub-Agent Verifier for Policy Adherence in LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-06-28T06:27:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PolicyGuard is a dialogue-grounded sub-agent verifier that raises PASS4 scores by 6-12 points on an airline benchmark while catching more violations with fewer blocks than argument-level guards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28998","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation","primary_cat":"cs.SE","submitted_at":"2026-06-27T16:22:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Empirical study on five LLMs finds pretrained-to-aligned paths yield bigger gains over baseline than finetuned-to-aligned paths, though absolute accuracy remains lower for pretrained starts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28843","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Heterogeneous Safety Impacts of Benign Multilingual Fine-Tuning","primary_cat":"cs.CL","submitted_at":"2026-06-27T10:05:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Benign multilingual fine-tuning causes language-specific safety drifts with adversarial compliance rates rising up to four-fold, decoupled from capability gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28780","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multimodal Graph RAG for Long-range Visually Rich Document Understanding","primary_cat":"cs.IR","submitted_at":"2026-06-27T07:14:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Multimodal graph RAG with DLVQA benchmark outperforms MMRAG and KG methods on multi-hop document VQA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28757","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models","primary_cat":"cs.CV","submitted_at":"2026-06-27T06:13:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CrashTwin is a new benchmark framework that exposes physical violations in state-of-the-art world models during multi-agent collisions despite high visual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28747","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Supervised Theorem Discovery in a Formal Axiomatic System","primary_cat":"cs.AI","submitted_at":"2026-06-27T05:55:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A self-supervised agent alternates proof search and theorem extraction in a formal system, discovers tens of thousands of theorems, solves human benchmarks, and boosts LLM proof performance when used as lemmas.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30682","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models","primary_cat":"cs.SD","submitted_at":"2026-06-27T03:56:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ALM2Vec learns unified audio embeddings from large audio-language models for text-audio retrieval, instruction-aware retrieval, and other tasks across domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28697","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mitigating Batch Effects in Histopathology via Language-Mediated Robust Embedding Generation","primary_cat":"cs.CV","submitted_at":"2026-06-27T02:44:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GLMP generates robust pathology embeddings by routing histology images through an intermediate textual representation produced by general-purpose MLLMs to mitigate batch effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28237","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors","primary_cat":"cs.RO","submitted_at":"2026-06-26T16:23:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27871","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LocalNav: Distilling Frontier VLMs and Embodied RL for On-Device Object Goal Navigation","primary_cat":"cs.RO","submitted_at":"2026-06-26T09:11:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Distillation from frontier VLMs plus E-RLVR regularization produces a 4B local model that achieves 34.5% SR on OVON while cutting inference latency by 82.8%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27684","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Intuition-Guided Latent Reasoning for LLM-Based Recommendation","primary_cat":"cs.IR","submitted_at":"2026-06-26T03:29:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IntuRec anchors LLM latent reasoning for recommendation by deriving an intuition embedding from top-K candidates via self- and cross-attention to initialize more accurate trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27527","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge","primary_cat":"cs.CV","submitted_at":"2026-06-25T20:19:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LaViD distills LLM conceptual knowledge to vision models via LLM-generated MCQ soft labels, outperforming vision-language distillation baselines on fine-grained benchmarks while improving robustness on spurious correlation datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.27484","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fine-tuning a multimodal large language model for clinician-grade autism behavioral scoring from short home videos","primary_cat":"cs.CV","submitted_at":"2026-06-25T19:06:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-tuning Gemini 2.5 Pro with LoRA on 400 home videos improves per-feature agreement with clinicians by 40% and zero-shot ASD diagnosis F1 by 53% on held-out data, with classifier pipelines reaching 77% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26551","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing","primary_cat":"cs.CV","submitted_at":"2026-06-25T02:57:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26530","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DiARC: Distinguishing Positive and Negative Samples Helps Improving ARC-like Reasoning Ability of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-25T02:10:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiARC improves LLM performance on ARC-like benchmarks by constructing and training on preference pairs from three types of negative samples while keeping demonstrations fixed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.25561","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes","primary_cat":"cs.CR","submitted_at":"2026-06-24T08:37:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.25375","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Text Over Image: Auditing Multimodal Robustness in Synthetic Medical Image Detection","primary_cat":"cs.CV","submitted_at":"2026-06-24T04:08:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLMs for synthetic medical image detection overweight text metadata, flipping authenticity judgments on the same image and dropping accuracy on authentic images by 61.1% on average when an explicit AI-origin tag is present.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24021","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Token-to-Token Alignment of Text Embeddings for Semantic Blending","primary_cat":"cs.CV","submitted_at":"2026-06-22T23:54:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Token-to-Token alignment rephrases prompts into shared structure then matches token embeddings by semantic similarity, making linear interpolation a meaningful operation for blending in text-to-image models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23917","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trustworthy Image Authentication using Forensic Knowledge Graphs","primary_cat":"cs.CV","submitted_at":"2026-06-22T20:29:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Forensic Knowledge Graphs integrate forensic traces, causal dependencies, and scene links via a new authentication network and Iterative Context Refinement to outperform standard detectors and VLMs on detection, localization, and justification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23679","ref_index":204,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Semantic Browsing: Controllable Diversity for Image Generation","primary_cat":"cs.CV","submitted_at":"2026-06-22T17:59:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A technique for controllable diversity in text-to-image generation by inducing structured semantic variations at the prompt level via VLM and agentic workflow.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23595","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPIRAL: Learning to Search and Aggregate","primary_cat":"cs.AI","submitted_at":"2026-06-22T17:02:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SPIRAL is a reinforcement learning framework that jointly optimizes sequential reasoning, parallel trace generation, and aggregation in language models for improved test-time performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23557","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views","primary_cat":"cs.CV","submitted_at":"2026-06-22T16:28:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DR-MV3D decomposes MV3D-VQA into global map construction, question-conditioned view planning, and egocentric grounding, supervised by global consistency and local trajectory rewards optimized via GRPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23270","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BoxCtrl: 3D-Aware Visual Prompting for Geometric Image Editing","primary_cat":"cs.CV","submitted_at":"2026-06-22T12:49:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BoxCtrl introduces colored 3D bounding boxes as visual prompts for geometric image editing, trained first on synthetic data via supervised fine-tuning then refined with reinforcement learning on real data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23061","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MotionHalluc: Diagnosing Kinematic Hallucinations in Fine-Grained Motion Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-22T09:13:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"New benchmark diagnoses directional, attributional, and temporal hallucinations in multimodal motion comparison models and demonstrates gains from explicit measurement verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22995","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-06-22T08:12:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"G2PO transforms linear trajectories into graphs, aggregates identical states for lower-variance value estimates, and uses edge-centric TD standardization, reporting up to 22.2% gains over GRPO on WebShop, ALFWorld, and AppWorld.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22918","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation","primary_cat":"cs.CV","submitted_at":"2026-06-22T06:58:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JudgeFit produces per-VLM physical video evaluation taxonomies that improve held-out accuracy by a mean 32% relative to a single global schema across 16 models from eight families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23754","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Verifiable Foundation Models for Robot Safety","primary_cat":"cs.RO","submitted_at":"2026-06-22T03:10:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FEARL decomposes robot policies into an expressive Controller and a small verifiable Safety module to enable formal verification of safety constraints while retaining foundation-model task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22652","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Markov Chain Approach to Preference Alignment","primary_cat":"cs.LG","submitted_at":"2026-06-21T19:56:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MCHF defines a Markov kernel from pairwise utilities U and proves geometric convergence to its stationary distribution at a rate set by the seminorm measuring non-transitivity of U, with first-order equivalence to RLHF and NLHF solutions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22497","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking Vision-Language Models for Microscopic Plant Image Understanding","primary_cat":"cs.CV","submitted_at":"2026-06-21T13:39:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PlantMicro benchmark shows current VLMs achieve low accuracy (e.g. GPT-5 at 34.93% on pathogen classification) on fine-grained microscopic plant image tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22476","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CVSBench: A Comprehensive Benchmark for Cross-view Spatial Reasoning and Dreaming","primary_cat":"cs.CV","submitted_at":"2026-06-21T12:35:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CVSBench benchmark shows VLMs struggle with cross-view spatial consistency but improve substantially when given 3D scene imagination inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22460","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Music Playlist Captioning at Scale with Large Language Models","primary_cat":"cs.IR","submitted_at":"2026-06-21T12:08:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Deezer deployed an LLM-driven playlist captioning system in 2025 for its Daily Mix recommendations, claiming significant gains in user engagement from the added natural-language descriptions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22385","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MetaPS: Adaptive Programmatic Strategy Selection for Market Agents","primary_cat":"cs.AI","submitted_at":"2026-06-21T08:22:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MetaPS trains models via simulation rollouts to select from programmatic strategy libraries for market agents, yielding better performance than fixed or direct LLM baselines across model sizes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22317","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Curriculum Reinforcement Learning Can Incentivize Reasoning Capacity in LLMs Beyond the Base Model","primary_cat":"cs.LG","submitted_at":"2026-06-21T03:15:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Boundary-aware Curriculum RL raises average pass@256 by 9.8 points over base models and 10.3 points over vanilla RLVR on Qwen, Llama, and DeepSeek families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21937","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Confidence Alignment for LLM Self-Assessment","primary_cat":"cs.CY","submitted_at":"2026-06-20T08:13:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LCAE is introduced as a Rasch-model metric that aligns LLM self-reported confidence with latent error probability derived from ability and item difficulty, shown to improve calibration on a medical dataset across 20 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21734","ref_index":135,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-19T20:43:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21657","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Chehre: An Emoji-Prompted Video Dataset for Perceptually Diverse Facial Expression Recognition","primary_cat":"cs.CV","submitted_at":"2026-06-19T18:01:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Chehre introduces a new emoji-prompted video dataset with multi-annotator labels to benchmark models on dominant and distributional facial expression recognition tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21572","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Robot Critics that Sweat the Small Stuff","primary_cat":"cs.RO","submitted_at":"2026-06-19T16:14:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-tuning VLMs with pairwise progress supervision from policy rollouts improves fine-grained failure detection and boosts robot manipulation success by 11% real-world and 5.9% in simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21408","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Vaani Benchmark V1.0: An Inclusive Multimodal Benchmark Dataset for Hindi","primary_cat":"eess.AS","submitted_at":"2026-06-19T13:20:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vaani Benchmark V1.0 is a multimodal Hindi ASR dataset from 104 districts featuring spontaneous speech recordings in real-world conditions and three independent transcriptions per segment for robust multi-reference evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21337","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams","primary_cat":"cs.LG","submitted_at":"2026-06-19T11:31:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DataClaw0 introduces an agentic data-tailoring paradigm, a 9B model trained on a synthetically generated dataset, and a new benchmark, claiming improved downstream adaptation in video generation, VQA, and GUI navigation under limited data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21228","ref_index":252,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sakana Fugu Technical Report","primary_cat":"cs.LG","submitted_at":"2026-06-19T08:47:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20999","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Inductive Generalization for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-19T00:19:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces an inductive generalization evaluation protocol for manipulation policies and shows that SOTA vision-language-action models fail on progressively harder task variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20970","ref_index":116,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CogniRoute: Learning to Route Social Evidence in Omni-Modal Models","primary_cat":"cs.CV","submitted_at":"2026-06-18T22:17:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CogniRoute adds a cognitive schema and route-aware RL to an omni-modal MoE, reaching 59.38% accuracy on a new 118K-example social video QA benchmark and beating prior baselines by 15-27 points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20881","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Do Intrinsic Rewards Work for Code Reasoning? A Comprehensive Study","primary_cat":"cs.AI","submitted_at":"2026-06-18T19:15:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Empirical evaluation on LiveCodeBench shows certainty-based RLIF yields early gains followed by output shortening and reasoning collapse, providing no advantage for RLVR initialization on code tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20835","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PromptMark: A Prompt-Guided Iterative-Feedback Framework for Source Code Watermarking","primary_cat":"cs.CR","submitted_at":"2026-06-18T18:21:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PromptMark is a black-box prompt-guided iterative-feedback framework that embeds statistically detectable watermarks in LLM-generated source code via naming patterns while preserving functional correctness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20517","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages","primary_cat":"cs.AI","submitted_at":"2026-06-18T17:35:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Multi-LCB extends LiveCodeBench to 12 languages by translating Python tasks, revealing Python overfitting and performance disparities when evaluating 24 LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20436","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-View Decompilation for LLM-Based Malware Classification","primary_cat":"cs.CR","submitted_at":"2026-06-18T16:15:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multi-decompiler prompting improves LLM malware classification F1 by supplying complementary views of the same binary.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20244","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs","primary_cat":"cs.CV","submitted_at":"2026-06-18T13:56:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SPOT-E uses entropy shaping on answer predictions with low-entropy anchors to optimize visual spotlights at test time via GRPO for better VLM performance on evidence-intensive tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20173","ref_index":141,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Qiskit Code Migration with LLMs","primary_cat":"cs.SE","submitted_at":"2026-06-18T12:40:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A taxonomy-guided RAG system with LLMs reduces hallucinations and improves migration suggestions for Qiskit code compared to unconstrained retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19930","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization","primary_cat":"cs.HC","submitted_at":"2026-06-18T08:29:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld using only automatically generated annotation-free data via MobileGym and HiFPO, with ForgeOwl-8B reaching 77.6%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19926","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management","primary_cat":"cs.HC","submitted_at":"2026-06-18T08:26:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemGUI-Agent uses Context-as-Action (ConAct) for proactive context management in long-horizon GUI tasks, trained on the MemGUI-3K dataset to achieve top 8B-model results on MemGUI-Bench and MobileWorld.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19818","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Uncertainty-Aware Reward Modeling for Stable RLHF","primary_cat":"cs.LG","submitted_at":"2026-06-18T05:46:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UARM equips reward models with quantile-based conformal prediction uncertainty and reweights GRPO advantages via heteroscedastic variance decomposition to improve calibration and reduce reward hacking in RLHF.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26132","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Code evolution for link prediction in complex networks","primary_cat":"cs.SI","submitted_at":"2026-06-18T05:46:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Code evolution produces link prediction algorithms with average AUC of 0.915 versus 0.783 for human-designed methods across 580 networks, with better scalability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19584","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Language-Instructed Vision Embeddings for Controllable and Generalizable Perception","primary_cat":"cs.CV","submitted_at":"2026-06-17T20:39:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LIVE uses language to generate task-centric vision embeddings at inference, reducing hallucinations by 34 points on MMVP, outperforming larger VLMs on VQA, and generalizing to unseen tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19073","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework","primary_cat":"cs.CV","submitted_at":"2026-06-17T13:44:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces HOI-Edit benchmark with HOI-Eval metric and SCPE self-correcting framework leveraging I2V models for competitive HOI image editing performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19025","ref_index":92,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs","primary_cat":"cs.LG","submitted_at":"2026-06-17T12:50:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FoMoE partitions expert layers across workers in MoE LLMs, skips non-resident experts, and reports up to 1.42x lower communication than baselines plus 1.4x throughput gains while maintaining stable routing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18988","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection","primary_cat":"cs.AI","submitted_at":"2026-06-17T12:08:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ThinkDeception introduces MLLMs, a multimodal CoT dataset, and VAC-GRPO progressive RL to convert deception detection into interpretable reasoning and claims new SOTA accuracy plus rationale quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18961","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Be Your Own Teacher: Steering Protein Language Models via Unsupervised Reward Optimization","primary_cat":"cs.LG","submitted_at":"2026-06-17T11:42:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Unsupervised rewards combining model uncertainty and semantic consistency allow protein language models to self-steer via SRO and BRO algorithms, outperforming DPO and KTO on out-of-distribution prompts while approaching oracle performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18947","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-06-17T11:30:39+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DSG decouples search grounding from LLM reasoning via an MCP-compatible gateway, nearly matching native accuracy on QA benchmarks at 91% lower cost while preserving output contracts and cutting production costs by over 98%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18709","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment","primary_cat":"cs.CL","submitted_at":"2026-06-17T05:43:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs achieve maximum Spearman correlations of 0.152 (direct) and 0.241 (response-based) with human item discrimination values, showing non-random but unreliable signal for distinguishing student proficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18327","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-CTRL: Self-Consistency Training with Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-06-16T17:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-CTRL uses RL to align LM self-explanations with behavior, boosting bias correlation to R²=0.64 and refusal prediction accuracy to 92% while cutting harm failures to 0.5%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18249","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification","primary_cat":"cs.CV","submitted_at":"2026-06-16T17:59:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniAR uses a shared context-visual tokenizer with bitwise quantization and parallel prediction in an autoregressive framework to unify visual understanding and generation, claiming SOTA on generation and editing tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18154","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Cardiac Electrophysiology Digital Twins Through Agentic Discovery of Hybrid Structure","primary_cat":"cs.AI","submitted_at":"2026-06-16T16:54:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LEADS is an LLM-agent framework that discovers hybrid models for cardiac EP digital twins by treating domain knowledge as an action space, outperforming human-designed and other LLM-based hybrids on synthetic and real data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18147","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning","primary_cat":"cs.AI","submitted_at":"2026-06-16T16:45:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WEQA proposes a query-adaptive agent framework combining LLMs with wearable data tools, achieving 24% higher accuracy than baselines on a benchmark from four open datasets, with gains in expert-rated usefulness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18134","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Grounding Spoken LLMs in Multi-Speaker Audio via Diarization Conditioning","primary_cat":"eess.AS","submitted_at":"2026-06-16T16:34:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dixtral uses diarization conditioning on a Whisper-based encoder within Voxtral to outperform baselines on multi-speaker transcription and match or exceed on QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":100,"offset":0}}