
arxiv: 2605.15104 · v2 · pith:7IK6ILWQnew · submitted 2026-05-14 · 💻 cs.CL

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Pith reviewed 2026-05-21 08:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords tool calling · voice agents · LLM evaluation · text-to-speech conversion · benchmark adaptation · multimodal models · LLM-as-judge

The pith

A conversion framework turns text tool-calling benchmarks into paired audio versions to test voice agents without new annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a dataset-agnostic method that applies text-to-speech synthesis, speaker variation, and environmental noise to existing text benchmarks for tool calling, creating matched audio instances that keep the original tool schemas and gold labels intact. This allows direct measurement of how well models perform tool use when inputs arrive as speech rather than text. Evaluation across seven omni-modal models on two converted benchmarks shows that results depend heavily on the model and the task, with consistent but varying drops from text to audio performance. Failures in the audio setting most commonly trace to errors in capturing specific argument values from spoken input. The work also validates an open-source LLM-as-judge approach that reaches high agreement with proprietary judges, enabling private evaluation pipelines.

Core claim

We present a reproducible framework that converts verified text tool-calling benchmarks into controlled audio versions through text-to-speech, speaker variation, and added noise while preserving all original annotations, enabling direct text-to-voice comparison of model tool-use performance without re-annotation.

What carries the argument

The dataset-agnostic conversion pipeline that generates paired text-audio instances from existing benchmarks by applying text-to-speech synthesis, speaker variation, and environmental noise while retaining the original tool schema and gold labels.
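
The pairing idea can be sketched as a small data structure. This is a minimal illustration, not the authors' code: the class names, field names, voice labels, and file paths are all hypothetical. The point it shows is that each audio variant only references its text source, so the tool schema and gold label are never copied or edited.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class ToolCallInstance:
    """One text benchmark item (hypothetical field names)."""
    instance_id: str
    query_text: str
    tool_schema: tuple  # treated as opaque: carried along, never modified
    gold_label: str


@dataclass(frozen=True)
class AudioVariant:
    """An audio rendering that points back at its unchanged text source."""
    source: ToolCallInstance
    voice: str
    snr_db: Optional[float]  # None means clean audio with no added noise
    audio_path: str


def expand_to_audio(instances, voices, snr_levels):
    """Cross every text instance with every (voice, SNR) condition."""
    variants = []
    for inst, voice, snr in product(instances, voices, snr_levels):
        tag = "clean" if snr is None else f"snr{snr:g}"
        path = f"audio/{inst.instance_id}_{voice}_{tag}.wav"  # placeholder path
        variants.append(AudioVariant(inst, voice, snr, path))
    return variants


items = [
    ToolCallInstance(
        "ex1",
        "Book a table for four at 7 pm",
        ("book_table",),
        'book_table(party_size=4, time="19:00")',
    )
]
variants = expand_to_audio(items, ["voice_a", "voice_b"], [None, 10.0])
print(len(variants))  # one text item x 2 voices x 2 noise conditions
```

Because every variant shares the same `source` object, a score on any audio condition can be compared directly with the text score for the same item.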

If this is right

  • Model rankings for tool calling shift between text and audio inputs and between benchmarks: Gemini-3.1-Flash-Live has the highest audio score on Confetti (70.4), whereas GPT-Realtime-1.5 leads on When2Call (71.9).
  • On Confetti, the text-to-voice gap ranges from 1.8 points (Qwen3-Omni) to 4.8 points (GPT-Realtime-1.5).
  • Degradations in the audio setting most often trace to incorrectly extracted argument values in spoken instructions.
  • Open-source Qwen3 models with at least 8 billion parameters can serve as judges that agree with proprietary judges more than 80 percent of the time.
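
The gaps in these bullets are plain differences between paired scores. The sketch below recomputes them from the three rows of the table reproduced under Figure 2 on this page, whose voice column is consistent with the Confetti numbers in the abstract. The dictionary layout and key names are illustrative choices, not the paper's.

```python
# Scores copied from the table reproduced under Figure 2 on this page.
scores = {
    "Gemini-3.1-Flash-Live": {"text": 73.0, "voice": 70.4, "cascade": 71.3},
    "GPT-Realtime-1.5": {"text": 64.0, "voice": 59.2, "cascade": 58.8},
    "Qwen3-Omni": {"text": 62.2, "voice": 60.4, "cascade": 58.9},
}


def gaps(row):
    """Return the three pairwise gaps, rounded to one decimal as in the table."""
    return {
        "delta_tv": round(row["text"] - row["voice"], 1),     # text minus direct voice
        "delta_tc": round(row["text"] - row["cascade"], 1),   # text minus ASR cascade
        "delta_cv": round(row["cascade"] - row["voice"], 1),  # cascade minus direct voice
    }


all_gaps = {model: gaps(row) for model, row in scores.items()}
for model, g in all_gaps.items():
    print(model, g)

# Rankings by modality, to see whether the order changes between text and voice.
rank_text = sorted(scores, key=lambda m: scores[m]["text"], reverse=True)
rank_voice = sorted(scores, key=lambda m: scores[m]["voice"], reverse=True)
print(rank_text)
print(rank_voice)
```

A negative `delta_cv` means the direct voice model outscored the ASR cascade for that system.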

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could let developers quickly screen voice-agent designs against many existing text benchmarks before building dedicated spoken corpora.
  • It points to a need for improved handling of precise numerical or entity values when models receive instructions through speech rather than text.
  • The same conversion method could be tested on other tool-use or agent benchmarks to check whether the observed text-to-audio gaps generalize.

Load-bearing premise

That audio created from text benchmarks by adding speech synthesis, speaker changes, and noise keeps the same semantic content and difficulty for tool calling as actual spoken interactions.

What would settle it

A direct comparison showing that human raters judge the audio instances substantially harder or easier than the original text versions in ways that correlate with model score gaps.

Figures

Figures reproduced from arXiv: 2605.15104 by Jonas Robertson, Md Tahmid Rahman Laskar, Quinten McNamara, Seyyed Saeed Sarfjoo, Shashi Bhushan TN, Xue-Yong Fu.

Figure 1. An overview of our methodology for converting text-based tool datasets into audio benchmarks for tool-calling evaluation. The pipeline uses text-to-speech (TTS) models (GPT-4o-Mini-TTS and Gemini-2.5-TTS) to generate diverse audio queries with different voices and genders, which are then processed by omni-modal LLMs and evaluated via automatic evaluation or LLM Judge.
Figure 2. Average performance of Qwen3-Omni across SNR levels, aggregated over all TTS models and voices.

Model                   Clean Text   Direct Voice   Cascade (ASR→text)   ∆TV   ∆TC   ∆CV
Gemini-3.1-Flash-Live   73.0         70.4           71.3                 2.6   1.7    0.9
GPT-Realtime-1.5        64.0         59.2           58.8                 4.8   5.2   -0.4
Qwen3-Omni              62.2         60.4           58.9                 1.8   3.3   -1.5
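
Figure 2 varies the signal-to-noise ratio (SNR) of the audio. A common way to hit a target SNR is to rescale the noise clip before adding it to the speech. The sketch below shows that arithmetic on synthetic samples. It is an assumption about how such mixing is typically done, not a description of the paper's implementation, and the sine wave and Gaussian noise are stand-ins for real TTS output and recorded noise.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power hits the target SNR, then add."""
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(speech, noise)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# One second at 16 kHz: a sine tone standing in for synthesized speech.
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
# Gaussian samples standing in for an environmental noise recording.
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

mixed = mix_at_snr(speech, noise, snr_db=10.0)

# Check: recover the added noise and measure the SNR actually achieved.
residual = [m - s for m, s in zip(mixed, speech)]
achieved = 10 * math.log10(power(speech) / power(residual))
print(round(achieved, 2))
```

Lower `snr_db` values mean louder noise relative to speech, so a sweep over several levels gives the kind of curve the figure describes.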
Figure 3. Error Analysis on the When2Call benchmark computed over 6 TTS voice variants.

… and tool-selection errors (15.5%) are more frequent than other models. Decision errors are also substantial across all three systems, ranging from 25.8% for Gemini-3.1-Flash-Live to 37.4% for GPT-Realtime-1.5. This suggests that the text-to-audio gap is not only caused by incorrect argument values; in many cases, audio input chan…
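
The error categories named here (decision, tool-selection, and argument-value errors) can be illustrated with a simple comparison of a predicted call against the gold call. This is a simplified sketch: the call representation, the tool names, and the rule order are assumptions, and the paper's exact taxonomy is not given on this page.

```python
from collections import Counter


def classify(predicted, gold):
    """Compare one predicted call with its gold call; None means 'no tool call'."""
    if (predicted is None) != (gold is None):
        return "decision_error"  # called when it should not have, or the reverse
    if predicted is None:
        return "correct"  # both sides made no call
    if predicted["name"] != gold["name"]:
        return "tool_selection_error"
    if predicted["arguments"] != gold["arguments"]:
        return "argument_value_error"
    return "correct"


# Hypothetical gold call and four hypothetical model outputs.
gold = {"name": "set_timer", "arguments": {"minutes": 15}}
cases = [
    ({"name": "set_timer", "arguments": {"minutes": 15}}, gold),
    ({"name": "set_timer", "arguments": {"minutes": 50}}, gold),  # "fifteen" heard as "fifty"
    ({"name": "set_alarm", "arguments": {"minutes": 15}}, gold),
    (None, gold),
]

tally = Counter(classify(pred, ref) for pred, ref in cases)
print(dict(tally))
```

Checking the decision first and the arguments last means each failure is attributed to the earliest stage at which the prediction diverged.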
Figure 4. LLM-as-judge evaluation on Confetti. (a) Proprietary LLM judge scores in reference-wise and reference-free settings (see Appendix A.6 for the detailed breakdown). (b) Judgment agreement between open judges (Qwen3) against proprietary reference judgments.

… references are shown, even if the output is functionally reasonable (see Appendix A.4 for an example). To verify this statistically, we apply McNemar’s te…
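
McNemar's test, mentioned in the excerpt above, compares two paired binary outcomes using only the items on which they disagree. The sketch below implements the exact (binomial) two-sided version with the standard library. The counts are invented for illustration, and which two judge settings the paper pairs is an assumption here.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: items accepted under setting A but rejected under setting B
    c: items rejected under setting A but accepted under setting B
    Under the null hypothesis each discordant item is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 15 outputs pass only without references, 5 pass only with them.
p_value = mcnemar_exact(15, 5)
print(round(p_value, 4))
```

Items on which both settings agree do not enter the statistic, which is why only `b` and `c` are needed.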
Figure 5. Text-only tool-calling analysis on Confetti. (a) AST soft accuracy across model families and sizes. (b) AST soft accuracy under ambiguous-query reformulation stress tests.

Robustness to query reformulation. Beyond the original datasets, we additionally define stress-test slices that practitioners can add when making deployment choices. These slices target customer-support failure modes that may be underrep…
Original abstract

Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.
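
The judge-agreement figure in the abstract is a fraction of matching verdicts. The sketch below computes that fraction on invented verdicts, and also Cohen's kappa, which corrects for chance agreement. Kappa is an addition for illustration only; this page does not say the paper reports it, and the verdict labels are hypothetical.

```python
def agreement(a, b):
    """Fraction of items on which two judges give the same verdict."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two judges over the same items."""
    n = len(a)
    observed = agreement(a, b)
    labels = set(a) | set(b)
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1.0:
        return 1.0  # both judges always use the same single label
    return (observed - expected) / (1 - expected)


# Invented verdicts for ten model outputs.
proprietary = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
open_judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass", "pass"]

rate = agreement(proprietary, open_judge)
kappa = cohen_kappa(proprietary, open_judge)
print(rate, round(kappa, 3))
```

When most verdicts are "pass", raw agreement can look high even for a weak judge, which is the case kappa is meant to expose.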

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a dataset-agnostic framework that converts existing text-based tool-calling benchmarks into paired audio instances via TTS synthesis, speaker variation, and environmental noise while preserving original tool schemas and gold labels. It evaluates seven omni-modal models on audio versions of the Confetti and When2Call datasets, reporting model- and task-specific scores (Gemini-3.1-Flash-Live at 70.4 on Confetti; GPT-Realtime-1.5 at 71.9 on When2Call), text-to-voice gaps of 1.8–4.8 points on Confetti, and failure modes dominated by argument-value misunderstandings. It also reports text-only results, an ambiguity reformulation stress test, a reference-free LLM-as-judge protocol validated against human preferences, and open-source Qwen3 judges (≥8B) that exceed 80% agreement with proprietary judges.

Significance. If the audio conversion preserves semantic content and task difficulty without introducing systematic biases, the framework supplies a reproducible first-stage diagnostic for voice-based tool calling that complements purpose-built audio corpora. The empirical demonstration of model- and task-dependence, together with the finding that open-source Qwen3 judges (≥8B) exceed 80% agreement with proprietary judges, would support more accessible and privacy-preserving evaluation practices in multimodal agent research.

major comments (2)
  1. [§3 and §4] §3 (Framework) and §4 (Experiments): The headline claims that text-to-voice gaps (1.8–4.8 points) and model rankings are meaningful rest on the premise that TTS+noise conversion leaves argument-value parsing difficulty unchanged. No quantitative check—such as ASR word-error rate measured specifically on argument spans or human re-annotation of a held-out sample—is reported to confirm that speaker variation and environmental noise do not disproportionately degrade slot values relative to other content.
  2. [§3.1] §3.1 (Audio Conversion Pipeline): It is unclear whether noise-addition parameters, speaker-selection criteria, and TTS settings were fixed before any model evaluation or selected after inspecting preliminary outputs. This ambiguity directly affects the reproducibility claim and the interpretation of the reported performance differences as intrinsic model properties rather than pipeline artifacts.
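
The check requested in major comment 1 could look roughly like the sketch below: align an ASR transcript to the reference text with word-level edit distance, then report the error rate separately for tokens inside argument spans. This is one possible design, not something the paper reports. The utterance, the transcript, and the span indices are invented, and insertions are counted only in the overall rate.

```python
def align(ref, hyp):
    """Levenshtein-align two token lists.

    Returns (matched, distance), where matched[i] is True if reference
    token i was aligned to an identical hypothesis token.
    """
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    matched = [False] * n
    i, j = n, m
    while i > 0 and j > 0:  # backtrace; leftover reference tokens stay unmatched
        cost = 0 if ref[i - 1] == hyp[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            matched[i - 1] = cost == 0
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return matched, dist[n][m]


def span_error_rate(matched, indices):
    """Share of the given reference tokens that were not recognized exactly."""
    indices = list(indices)
    return sum(not matched[i] for i in indices) / len(indices)


ref = "set a timer for fifteen minutes".split()
hyp = "set a timer for fifty minutes".split()  # hypothetical ASR output
arg_idx = {4}  # position of the argument value "fifteen" (hypothetical annotation)

matched, distance = align(ref, hyp)
overall_wer = distance / len(ref)
arg_err = span_error_rate(matched, arg_idx)
other_err = span_error_rate(matched, set(range(len(ref))) - arg_idx)
print(round(overall_wer, 3), arg_err, other_err)
```

In this toy case the overall error rate is low while the argument span is entirely wrong, which is the kind of asymmetry the comment asks the authors to rule out or quantify.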
minor comments (2)
  1. [Table 2] Table 2 (model scores): Adding per-model standard deviations or bootstrap confidence intervals around the reported point estimates (e.g., 70.4, 71.9) would better convey the precision of the observed gaps.
  2. [§5] §5 (LLM-as-Judge Validation): The agreement figures for open-source judges would be strengthened by reporting the exact prompt templates and the size of the human-preference validation set.
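
The intervals suggested in minor comment 1 can be obtained with a percentile bootstrap over per-item outcomes. The sketch below is a generic version under stated assumptions: the outcomes are invented (70 correct out of 100), and the resample count and seed are arbitrary choices.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-item outcomes."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n  # resample items with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented per-item correctness: 70 correct, 30 incorrect.
outcomes = [1] * 70 + [0] * 30
point = sum(outcomes) / len(outcomes)
low, high = bootstrap_ci(outcomes)
print(point, low, high)
```

For a text-to-voice gap, the same idea applies to per-item differences between the paired text and audio outcomes, so that each resample keeps the pairing intact.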

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the reproducibility and validation aspects of the framework.

Point-by-point responses
  1. Referee: [§3 and §4] §3 (Framework) and §4 (Experiments): The headline claims that text-to-voice gaps (1.8–4.8 points) and model rankings are meaningful rest on the premise that TTS+noise conversion leaves argument-value parsing difficulty unchanged. No quantitative check—such as ASR word-error rate measured specifically on argument spans or human re-annotation of a held-out sample—is reported to confirm that speaker variation and environmental noise do not disproportionately degrade slot values relative to other content.

    Authors: We agree that additional quantitative validation would further support the interpretation of the text-to-voice gaps. While the paired design and preservation of original gold labels enable direct comparisons, the original manuscript does not include a targeted ASR word-error rate analysis on argument spans or human re-annotation focused on slot values. In the revised manuscript we will add such an analysis on a held-out sample, reporting ASR performance specifically on argument-value segments and discussing any observed effects on parsing difficulty. revision: yes

  2. Referee: [§3.1] §3.1 (Audio Conversion Pipeline): It is unclear whether noise-addition parameters, speaker-selection criteria, and TTS settings were fixed before any model evaluation or selected after inspecting preliminary outputs. This ambiguity directly affects the reproducibility claim and the interpretation of the reported performance differences as intrinsic model properties rather than pipeline artifacts.

    Authors: We thank the referee for identifying this source of potential ambiguity. The pipeline hyperparameters were in fact fixed prior to the main model evaluations, with noise levels, speaker selection criteria, and TTS settings determined during pilot experiments and then locked. To improve clarity and reproducibility, we will revise Section 3.1 to explicitly describe this decision sequence and will include the precise parameter values and random seeds in the supplementary materials. revision: yes
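
One lightweight way to make a locked configuration checkable is to publish a hash of it alongside the results. The sketch below is only an illustration of that idea. The two TTS model names come from the Figure 1 caption, while the voice names, SNR levels, and seed are placeholders, not the paper's values.

```python
import hashlib
import json

PIPELINE_CONFIG = {
    "tts_models": ["GPT-4o-Mini-TTS", "Gemini-2.5-TTS"],  # named in the Figure 1 caption
    "voices": ["voice_a", "voice_b", "voice_c"],          # placeholder names
    "snr_db_levels": [0, 10, 20],                         # placeholder values
    "random_seed": 1234,                                  # placeholder value
}


def config_fingerprint(config):
    """SHA-256 of a canonical JSON encoding, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


locked = config_fingerprint(PIPELINE_CONFIG)
print(locked[:12])

# Any later change to a parameter produces a different fingerprint.
changed = dict(PIPELINE_CONFIG, random_seed=4321)
print(config_fingerprint(changed) == locked)
```

Recording the fingerprint before the main evaluation gives readers a way to confirm that the released parameters are the ones that were fixed in advance.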

Circularity Check

0 steps flagged

Empirical evaluation on converted benchmarks with external validation

full rationale

The paper describes a dataset conversion pipeline (TTS + speaker variation + noise) applied to existing text tool-calling benchmarks, followed by direct model evaluations that produce the reported scores (e.g., Gemini-3.1-Flash-Live at 70.4 on Confetti). Central results are performance measurements and failure-mode observations on the generated audio instances. The LLM-as-judge protocol is explicitly validated against human preferences, providing an external anchor independent of the paper's own fitted values or self-citations. No equations, parameter fits, or derivations are presented as predictions; the work contains no self-definitional loops, uniqueness theorems, or ansatz smuggling. The derivation chain is therefore self-contained empirical measurement rather than reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that TTS-generated audio with added noise preserves the original task semantics for tool calling. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Text-to-speech synthesis combined with speaker variation and environmental noise produces audio instances whose tool-calling difficulty matches real-world spoken queries.
    Invoked when claiming that the converted audio versions serve as valid proxies for voice-based tool calling without re-annotation.

pith-pipeline@v0.9.0 · 5833 in / 1412 out tokens · 29691 ms · 2026-05-21T08:42:11.009892+00:00 · methodology


