WASIL is a released dataset of 8,529 in-the-wild Arabic spoken LLM interactions with audio, ASR hypotheses, responses, explicit like/dislike feedback, answerability annotations, a 2,000-turn MSA and dialect test set, and a reference-free multi-judge LLM evaluation method.
arXiv preprint arXiv:2510.12712 (2025)
4 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (4)
roles: background (2)
polarities: background (2)
citing papers explorer
- WASIL: In-the-Wild Arabic Spoken Interactions with LLMs
- GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
- Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
- Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution
Ace-Skill boosts multimodal agent self-evolution via prioritized rollouts with lazy-decay tracking and semantic knowledge clustering, yielding up to 35% relative gains on tool-use benchmarks and zero-shot transfer to smaller models.