arXiv preprint arXiv:2510.12712 (2025)

4 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

cs.SD · 2026-05-09 · accept · novelty 7.0

WASIL is a released dataset of 8,529 in-the-wild Arabic spoken LLM interactions with audio, ASR hypotheses, responses, explicit like/dislike feedback, answerability annotations, a 2,000-turn MSA and dialect test set, and a reference-free multi-judge LLM evaluation method.

GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

cs.CL · 2026-04-05 · unverdicted · novelty 7.0

GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.

Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution

cs.AI · 2026-05-09 · unverdicted · novelty 5.0

Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to smaller models.

citing papers explorer

WASIL: In-the-Wild Arabic Spoken Interactions with LLMs · cs.SD · 2026-05-09 · accept · none · ref 60
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces · cs.CL · 2026-04-05 · unverdicted · none · ref 17
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR · cs.AI · 2026-05-19 · unverdicted · none · ref 39
Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution · cs.AI · 2026-05-09 · unverdicted · none · ref 7
