{"total":25,"items":[{"citing_arxiv_id":"2606.01016","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects","primary_cat":"cs.CL","submitted_at":"2026-05-31T05:13:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00851","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning","primary_cat":"cs.SD","submitted_at":"2026-05-30T18:53:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sympatheia introduces a continuous affect-conditioned speech dialogue model and the Sympatheia-18k synthetic dataset, showing improved emotional appropriateness over baselines when speech cues are limited.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21008","ref_index":114,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey of Audio Reasoning in Multimodal Foundation Models","primary_cat":"eess.AS","submitted_at":"2026-05-20T10:44:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms 
while discussing challenges and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"evaluation is to assess whether a model can reason over spoken input rather than merely transcribe it. Existing settings can be divided into content-based reasoning and acoustic-based reasoning, depending on whether inference relies on linguistic content or acoustic signals. Content-based reasoning includes tasks where evidence comes from semantic content, such as spoken QA [113], dialogue understanding [114], [115], and instruction following [116], [117]. The model needs to recover meaning and reason over it to produce correct responses. Instruction following in speech is also a form of content-based reasoning, as it requires understanding and executing semantic intent. Recent settings [113], [116], [117] further introduce content-preserving acoustic variations, such as emotion, speaking rate,"},{"citing_arxiv_id":"2605.20755","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action","primary_cat":"eess.AS","submitted_at":"2026-05-20T05:54:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13841","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents","primary_cat":"cs.SD","submitted_at":"2026-05-13T17:58:52+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"• We create three enterprise benchmark datasets with a total of 213 scenarios 
focused on surfacing voice-specific failure modes. • We show empirical findings on cascade vs. audio-native tradeoffs, perturbation sensitivity, and behavioral consistency across trials. 2 Related Work Many existing voice benchmarks focus on individual components such as STT robustness [5, 2, 6], TTS quality [20, 13], or conversational dynamics [25, 3], rather than the end-to-end behavior of a voice agent. We organize the following discussion around the two challenges introduced above: the fidelity of multi-turn simulation and the comprehensiveness of voice agent quality measurement. Conversation Simulation. Effective voice agent evaluation requires a simulation methodology that faithfully"},{"citing_arxiv_id":"2605.12034","ref_index":59,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","primary_cat":"cs.MM","submitted_at":"2026-05-12T12:16:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06897","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes","primary_cat":"cs.CL","submitted_at":"2026-05-07T19:57:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal 
LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27393","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction","primary_cat":"cs.CL","submitted_at":"2026-04-30T04:05:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"6 24.3 Multi-Image Mantis-Eval 72.8 70.5 74.2 78.3 79.7 MUIRBench 74.5 55.8 64.4 61.9 72.0 MMSI-Bench 12.1 - 11.3 14.2 16.6 Video Video-MME (w/o subs) 75.6 66.0 71.4 70.5 70.4 LVBench 62.2 - 58.0 50.2 50.9 MLVU (M-Avg) 77.8 70.2 78.1 75.2 76.5 LongVideoBench (val) - 62.1 66.4 66.9 66.0 MotionBench - 62.3 59.5 61.7 61.4 spoken question answering on VoiceBench [63], Speech TriviaQA [64], Speech Web Questions [65], and Speech CMMU [66]. For speech generation, we evaluate speech quality, intelligibility, speaker similarity, long-form generation, and emotion/style control using SeedTTS Test [67], LongTTS [68], Expresso [69], and ESD [70]. 
Text Capability. We compare MiniCPM-o 4.5 with its language backbone, Qwen3-Instruct-8B [10],"},{"citing_arxiv_id":"2604.20842","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation","primary_cat":"cs.CL","submitted_at":"2026-04-22T17:59:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00022","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment","primary_cat":"cs.CL","submitted_at":"2026-04-20T00:57:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Curated 50-example subsets of LAM benchmarks, via regression, predict human preferences at 0.98 correlation, outperforming the full benchmark and yielding the open-sourced HUMANS proxy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16659","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs","primary_cat":"cs.CR","submitted_at":"2026-04-17T19:28:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful 
content.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For AF3 and Kimi-Audio, LoRA adapters are applied to all attention and FFN projections; for Qwen2.5-Omni, adapters are applied to all linear layers of the Thinker module. For models with multi-stream architectures (e.g., parallel audio and text generation), loss is computed over all output streams with appropriate masking: L = ∑_{s∈S} [∑_t ℓ_t^{(s)} · m_t^{(s)}] / [∑_t m_t^{(s)} + ϵ] (5), where S denotes the set of output streams and m_t^{(s)} are binary loss masks. Kimi-Audio embedding centering. For Kimi-Audio's WhisperVQEncoder, raw embeddings are dominated by a large global mean component (>99.9% of L2 norm). We center embeddings by subtracting the global mean before computing cosine distances:"},{"citing_arxiv_id":"2604.22821","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use","primary_cat":"cs.SD","submitted_at":"2026-04-17T16:41:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15037","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench","primary_cat":"cs.AI","submitted_at":"2026-04-16T14:06:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art 
multimodal LLMs struggle with over-triggering and context-aware reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2.1. Digital State Construction To simulate realistic and semantically rich digital contexts, which serve to ground user personas and inform potential future actions, we adopt the application state format from OB2 [16]. The generation process begins by randomly selecting a theme from the dialog-topics dataset [17]. Subsequently, we employ Qwen3-Max [18] to synthesize fine-grained digital states conditioned on these themes. These states incorporate implicit cues, such as scheduled appointments (e.g., specific meeting times) or personal constraints (e.g., dietary restrictions or medical conditions). By incorporating these underlying variables, the pipeline provides the necessary contextual foundation for the model to evaluate"},{"citing_arxiv_id":"2604.14604","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection","primary_cat":"cs.CR","submitted_at":"2026-04-16T04:22:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"each category, we instantiate the attack with specific target responses, as summarized in Table 7 in Appendix A. Implementation details. For each target response in Table 7, we randomly select an audio carrier of 15s, which is trained using 100 user instructions, and then tested on a disjoint set of 100 unseen instructions. 
We select an RIR signal from RVB2014 [69] to initialize the convolutional perturbation with a duration of 0.2s. As for the hyperparameters, we set α to 1, β to 50 and κ to 0.015. Training is conducted for 2,000 steps on continuous and hybrid models, and 3,000 steps for discrete models, with a step size of 0.001 and a batch size of 4. The temperature τ = 10 for gradient estimation. In all experiments, we use bfloat16 precision"},{"citing_arxiv_id":"2604.14548","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VoxSafeBench: Not Just What Is Said, but Who, How, and Where","primary_cat":"cs.SD","submitted_at":"2026-04-16T02:24:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2506.04779, 2025. [2] Sonal Kumar, Šimon Sedláček, Vaibhavi Lokegaonkar, Fernando López, Wenyi Yu, Nishit Anand, Hyeonggon Ryu, Lichang Chen, Maxim Plička, Miroslav Hlaváček, et al. Mmau-pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence. arXiv preprint arXiv:2508.13992, 2025. [3] Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. Voicebench: Benchmarking llm-based voice assistants. arXiv preprint arXiv:2410.17196, 2024. [4] Yangzhuo Li, Shengpeng Ji, Yifu Chen, Tianle Liang, Haorong Ying, Yule Wang, Junbo Li, Jun Fang, and Zhou Zhao. 
Wavbench: Benchmarking reasoning, colloquialism, and paralinguistics"},{"citing_arxiv_id":"2604.11594","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models","primary_cat":"eess.AS","submitted_at":"2026-04-13T15:06:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"HumDial-EIBench (Ours) Human-rec. Multi ✓ ✓ ✓ ✓ Obj. / LLM / Human 2. Related Work Audio Language Models. Traditional spoken dialogue systems adopt a cascaded ASR→LLM→TTS architecture, where the intermediate text transcription inevitably discards paralinguistic cues such as intonation and emotion. End-to-end ALMs, including open-source models like Moshi [21], Qwen2.5-Omni [22], and Qwen3-Omni [23], alongside closed-source architectures such as GPT-4o [1], have emerged to address this limitation. 
By directly processing continuous audio signals, these models maintain unified representations of both semantic content and acoustic paralinguistic features, establishing the technical foundation for native emotional intelligence tasks."},{"citing_arxiv_id":"2603.17837","ref_index":5,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning","primary_cat":"eess.AS","submitted_at":"2026-03-18T15:30:29+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.09643","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings","primary_cat":"cs.ET","submitted_at":"2026-03-10T13:18:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MM-tau-p² is a new benchmark with 12 metrics that measures how well multi-modal agents adapt to user personas and maintain robustness in dual-control interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.26388","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Game-Time: Evaluating Temporal Dynamics in Spoken Language Models","primary_cat":"eess.AS","submitted_at":"2025-09-30T15:23:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized 
responses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23435","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models","primary_cat":"cs.SD","submitted_at":"2025-09-27T18:08:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AudioRole provides 1M+ character-grounded audio-text dialogues from TV series plus ARP-Eval to train and measure audio role-playing models, with ARP-Model showing 0.31 acoustic and 0.36 content personalization scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.19858","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Benchmarking Gaslighting Attacks Against Speech Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-09-24T07:57:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gaslighting attacks using Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation strategies cause a 24.3% average accuracy drop in Speech LLMs while also triggering behavioral changes like apologies and refusals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.08031","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs","primary_cat":"cs.SD","submitted_at":"2025-09-09T15:30:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AU-Harness introduces an efficient unified evaluation framework for audio LLMs featuring batch optimizations, multi-turn 
dialogue support, and standardized protocols for fair comparisons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.08128","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models","primary_cat":"cs.SD","submitted_at":"2025-07-10T19:40:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"conduct a human study of model outputs on AF-Chat-test (more details in Appendix E) and compare only with Qwen2-Audio. Each annotator is asked to rate the response of the model for every turn on a scale of 1-5 for factuality, usefulness, and depth. We report results averaged across all instances across all turns. Furthermore, we evaluate the voice-text capabilities of our AF3-Chat model on two datasets, OpenAudioBench [75] and VoiceBench [18]. These benchmarks consist of voice queries (synthetically generated speech from text queries) and assess aspects such as instruction following, question answering, trivia knowledge, and reasoning. Finally, we evaluate our speech generation module using zero-shot TTS evaluation on the English subset of the SEED benchmark [4]. 
All baseline results reported in this work are based on our own evaluations; we did not rely on results"},{"citing_arxiv_id":"2504.18425","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Kimi-Audio Technical Report","primary_cat":"eess.AS","submitted_at":"2025-04-25T15:31:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"It currently integrates and supports Kimi-Audio and a series of recent audio LLMs [11, 74, 84, 41, 28], and can be leveraged to evaluate any other audio foundation models. The toolkit provides the following features and benefits: • We implement a standardized WER calculation (based on Qwen-2-Audio [11]) and integrate GPT-4o-mini as an intelligent judge (following [8]) for tasks like audio question answering. This approach overcomes the limitations of inconsistent metrics and simplistic string matching, enabling fair comparison. • Our toolkit offers a single unified platform supporting diverse models and versions, simplifying side-by-side comparisons. 
It provides a crucial structure for defining and sharing standardized"},{"citing_arxiv_id":"2503.20215","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen2.5-Omni Technical Report","primary_cat":"cs.CL","submitted_at":"2025-03-26T04:17:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"1Alphabetical order. 14 Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q. Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. MultiPL-E: A scalable and polyglot approach to benchmarking neural code generation. IEEE Trans. Software Eng., 49(7):3675-3691, 2023. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? arXiv:2403.20330, 2024a. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen"}],"limit":50,"offset":0}