From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
Pith reviewed 2026-05-21 08:42 UTC · model grok-4.3
The pith
A conversion framework turns text tool-calling benchmarks into paired audio versions to test voice agents without new annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a reproducible framework that converts verified text tool-calling benchmarks into controlled audio versions through text-to-speech, speaker variation, and added noise while preserving all original annotations, enabling direct text-to-voice comparison of model tool-use performance without re-annotation.
What carries the argument
The dataset-agnostic conversion pipeline that generates paired text-audio instances from existing benchmarks by applying text-to-speech synthesis, speaker variation, and environmental noise while retaining the original tool schema and gold labels.
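The conversion step can be pictured with a minimal sketch. It assumes a mono signal held as a list of floats and mixes noise at a chosen signal-to-noise ratio. The names `PairedInstance`, `mix_at_snr`, `fake_tts`, and `convert` are hypothetical, the sine-tone "TTS" is only a placeholder for a real speech engine, and the 10 dB default is illustrative. The paper's actual mixing procedure and parameters are not given in this summary, and speaker variation is omitted here.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedInstance:
    """One benchmark item in both modalities; the gold label is passed through."""
    text: str
    audio: list        # mono samples, roughly in [-1, 1]
    gold_label: dict   # original annotation, copied without modification


def rms(samples):
    """Root-mean-square amplitude of a sample list."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech RMS / noise RMS matches `snr_db`, then add it."""
    # Loop the noise clip if it is shorter than the speech.
    looped = [noise[i % len(noise)] for i in range(len(speech))]
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    scale = target_noise_rms / rms(looped)
    return [s + scale * n for s, n in zip(speech, looped)]


def fake_tts(text, rate=16000):
    """Hypothetical stand-in for a TTS engine: returns 0.1 s of a 220 Hz tone."""
    n = rate // 10
    return [0.5 * math.sin(2 * math.pi * 220 * t / rate) for t in range(n)]


def convert(text, gold_label, noise, snr_db=10.0):
    """Build a paired text-audio instance while leaving the annotation untouched."""
    speech = fake_tts(text)
    return PairedInstance(text, mix_at_snr(speech, noise, snr_db), gold_label)


rng = random.Random(0)  # fixed seed so the synthetic noise is reproducible
noise_clip = [rng.uniform(-1, 1) for _ in range(500)]
gold = {"tool": "set_alarm", "arguments": {"time": "07:30"}}
instance = convert("Set an alarm for seven thirty", gold, noise_clip)
```

The point of the sketch is the pairing: the text, the derived audio, and the unchanged gold label travel together, so a score on the audio side can be compared directly with a score on the text side.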
If this is right
- Audio tool-calling performance is model- and task-dependent: Gemini-3.1-Flash-Live leads on Confetti (70.4), while GPT-Realtime-1.5 leads on When2Call (71.9).
- On Confetti, the text-to-voice gap ranges from 1.8 points (Qwen3-Omni) to 4.8 points (GPT-Realtime-1.5).
- Most degradations in audio arise from incorrect extraction of argument values in spoken instructions.
- Open-source Qwen3 judges with at least 8B parameters agree with proprietary judges more than 80 percent of the time.
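The judge-agreement figure in the last bullet can be illustrated with a small sketch. It assumes each judge emits one categorical verdict per item; the verdict lists are invented for illustration. Raw agreement is the quantity the bullet refers to. Cohen's kappa is included only as an optional chance-corrected companion and is not reported in the material summarized here.

```python
from collections import Counter


def agreement_rate(judge_a, judge_b):
    """Fraction of items on which two judges give the same verdict."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)


def cohen_kappa(judge_a, judge_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge_a)
    observed = agreement_rate(judge_a, judge_b)
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if expected == 1.0:  # both judges always give the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented verdicts for ten items (hypothetical data, not from the paper).
open_judge = ["pass", "pass", "fail", "pass", "fail",
              "pass", "pass", "fail", "pass", "pass"]
proprietary_judge = ["pass", "pass", "fail", "fail", "fail",
                     "pass", "pass", "fail", "pass", "pass"]

rate = agreement_rate(open_judge, proprietary_judge)
kappa = cohen_kappa(open_judge, proprietary_judge)
meets_threshold = rate > 0.80
```

Reporting kappa next to raw agreement helps when one verdict dominates, since two judges that both say "pass" almost always will agree often by chance alone.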
Where Pith is reading between the lines
- The approach could let developers quickly screen voice-agent designs against many existing text benchmarks before building dedicated spoken corpora.
- It points to a need for improved handling of precise numerical or entity values when models receive instructions through speech rather than text.
- The same conversion method could be tested on other tool-use or agent benchmarks to check whether the observed text-to-audio gaps generalize.
Load-bearing premise
That audio created from text benchmarks by adding speech synthesis, speaker changes, and noise keeps the same semantic content and difficulty for tool calling as actual spoken interactions.
What would settle it
A direct comparison showing that human raters judge the audio instances substantially harder or easier than the original text versions in ways that correlate with model score gaps.
Original abstract
Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a dataset-agnostic framework that converts existing text-based tool-calling benchmarks into paired audio instances via TTS synthesis, speaker variation, and environmental noise while preserving original tool schemas and gold labels. It evaluates seven omni-modal models on audio versions of the Confetti and When2Call datasets, reporting model- and task-specific scores (Gemini-3.1-Flash-Live at 70.4 on Confetti; GPT-Realtime-1.5 at 71.9 on When2Call), text-to-voice gaps of 1.8–4.8 points, failure modes dominated by argument-value misunderstandings, an ambiguity reformulation stress test, and validation of open-source LLM-as-judge protocols against human preferences.
Significance. If the audio conversion preserves semantic content and task difficulty without introducing systematic biases, the framework supplies a reproducible first-stage diagnostic for voice-based tool calling that complements purpose-built audio corpora. The empirical demonstration of model- and task-dependence, together with the finding that open-source Qwen3 judges (≥8B) exceed 80% agreement with proprietary judges, would support more accessible and privacy-preserving evaluation practices in multimodal agent research.
major comments (2)
- [§3 and §4] §3 (Framework) and §4 (Experiments): The headline claims that text-to-voice gaps (1.8–4.8 points) and model rankings are meaningful rest on the premise that TTS+noise conversion leaves argument-value parsing difficulty unchanged. No quantitative check—such as ASR word-error rate measured specifically on argument spans or human re-annotation of a held-out sample—is reported to confirm that speaker variation and environmental noise do not disproportionately degrade slot values relative to other content.
- [§3.1] §3.1 (Audio Conversion Pipeline): It is unclear whether noise-addition parameters, speaker-selection criteria, and TTS settings were fixed before any model evaluation or selected after inspecting preliminary outputs. This ambiguity directly affects the reproducibility claim and the interpretation of the reported performance differences as intrinsic model properties rather than pipeline artifacts.
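The check requested in the first major comment could look roughly like the following sketch. It assumes an ASR transcript of each synthesized utterance is available, along with word-index spans marking argument values in the reference text. The function names and the example utterance are hypothetical. The metric counts substitutions and deletions per reference word, so it is a reference-word error rate rather than full WER, because insertions cannot be attributed to a span.

```python
def reference_word_hits(ref, hyp):
    """Align two word lists by edit distance; flag each reference word as hit or miss."""
    n, m = len(ref), len(hyp)
    # dp[i][j] is the edit distance between ref[:i] and hyp[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match or substitution
    # Backtrace: words never matched on the way back stay flagged as misses.
    hits = [False] * n
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if ref[i - 1] == hyp[j - 1] else 1
        if dp[i][j] == dp[i - 1][j - 1] + cost:
            hits[i - 1] = cost == 0
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return hits


def span_error_rates(ref, hyp, arg_spans):
    """Return (error rate inside argument spans, error rate outside them)."""
    hits = reference_word_hits(ref, hyp)
    in_span = set()
    for start, end in arg_spans:  # half-open word-index ranges in `ref`
        in_span.update(range(start, end))
    inside = [hits[k] for k in range(len(ref)) if k in in_span]
    outside = [hits[k] for k in range(len(ref)) if k not in in_span]

    def rate(flags):
        return flags.count(False) / len(flags) if flags else 0.0

    return rate(inside), rate(outside)


# Hypothetical example: the ASR output mishears one argument word.
ref_words = "book a table for four at seven thirty".split()
asr_words = "book a table for four at eleven thirty".split()
argument_spans = [(4, 5), (6, 8)]  # "four" and "seven thirty"
arg_rate, other_rate = span_error_rates(ref_words, asr_words, argument_spans)
```

If the error rate inside argument spans is clearly higher than outside across a held-out sample, that would suggest the conversion itself makes slot values harder to recover, independent of any model's tool-calling ability.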
minor comments (2)
- [Table 2] Table 2 (model scores): Adding per-model standard deviations or bootstrap confidence intervals around the reported point estimates (e.g., 70.4, 71.9) would better convey the precision of the observed gaps.
- [§5] §5 (LLM-as-Judge Validation): The agreement figures for open-source judges would be strengthened by reporting the exact prompt templates and the size of the human-preference validation set.
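The confidence intervals suggested for Table 2 could be obtained with a paired bootstrap, sketched below. The sketch assumes per-item 0/1 correctness is available for the same items in both modalities; the benchmark's actual scoring may be more graded, and the correctness vectors here are invented. The function name and defaults are hypothetical.

```python
import random


def paired_bootstrap_gap(text_correct, audio_correct,
                         n_resamples=2000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for the text-minus-audio gap, in points."""
    if len(text_correct) != len(audio_correct) or not text_correct:
        raise ValueError("inputs must be non-empty and paired item by item")
    n = len(text_correct)
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    # Per-item differences preserve the pairing between modalities.
    diffs = [t - a for t, a in zip(text_correct, audio_correct)]
    point = 100.0 * sum(diffs) / n
    stats = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(100.0 * sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper


# Invented per-item outcomes for 100 paired items (hypothetical data).
text_outcomes = [1] * 70 + [0] * 30
audio_outcomes = [1] * 66 + [0] * 34
gap, ci_low, ci_high = paired_bootstrap_gap(text_outcomes, audio_outcomes)
```

Resampling the per-item differences, rather than each modality separately, uses the paired design the framework provides and typically gives a tighter interval for the gap.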
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the reproducibility and validation aspects of the framework.
-
Referee: [§3 and §4] §3 (Framework) and §4 (Experiments): The headline claims that text-to-voice gaps (1.8–4.8 points) and model rankings are meaningful rest on the premise that TTS+noise conversion leaves argument-value parsing difficulty unchanged. No quantitative check—such as ASR word-error rate measured specifically on argument spans or human re-annotation of a held-out sample—is reported to confirm that speaker variation and environmental noise do not disproportionately degrade slot values relative to other content.
Authors: We agree that additional quantitative validation would further support the interpretation of the text-to-voice gaps. While the paired design and preservation of original gold labels enable direct comparisons, the original manuscript does not include a targeted ASR word-error rate analysis on argument spans or human re-annotation focused on slot values. In the revised manuscript we will add such an analysis on a held-out sample, reporting ASR performance specifically on argument-value segments and discussing any observed effects on parsing difficulty.
Revision: yes
-
Referee: [§3.1] §3.1 (Audio Conversion Pipeline): It is unclear whether noise-addition parameters, speaker-selection criteria, and TTS settings were fixed before any model evaluation or selected after inspecting preliminary outputs. This ambiguity directly affects the reproducibility claim and the interpretation of the reported performance differences as intrinsic model properties rather than pipeline artifacts.
Authors: We thank the referee for identifying this source of potential ambiguity. The pipeline hyperparameters were in fact fixed prior to the main model evaluations, with noise levels, speaker selection criteria, and TTS settings determined during pilot experiments and then locked. To improve clarity and reproducibility, we will revise Section 3.1 to explicitly describe this decision sequence and will include the precise parameter values and random seeds in the supplementary materials.
Revision: yes
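One way to make the "locked before evaluation" claim checkable is to publish a fingerprint of the frozen pipeline settings, as in the sketch below. Every field name and value is hypothetical; the actual parameters are the ones the authors say they will release in the supplementary materials.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Immutable record of the conversion settings (all values are placeholders)."""
    tts_voice_ids: tuple
    snr_db_levels: tuple
    noise_source: str
    seed: int


def fingerprint(config):
    """SHA-256 over a canonical JSON dump, so any later change is detectable."""
    payload = json.dumps(asdict(config), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


locked = PipelineConfig(
    tts_voice_ids=("voice_a", "voice_b"),  # hypothetical speaker identifiers
    snr_db_levels=(5.0, 10.0, 20.0),       # hypothetical noise levels
    noise_source="example-noise-set",      # hypothetical corpus name
    seed=1234,                             # hypothetical random seed
)
locked_hash = fingerprint(locked)
```

Recording the hash alongside a timestamp before the main runs, then reporting it with the results, lets a reader confirm that the released configuration is the one that was frozen.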
Circularity Check
Empirical evaluation on converted benchmarks with external validation
The paper describes a dataset conversion pipeline (TTS + speaker variation + noise) applied to existing text tool-calling benchmarks, followed by direct model evaluations that produce the reported scores (e.g., Gemini-3.1-Flash-Live at 70.4 on Confetti). Central results are performance measurements and failure-mode observations on the generated audio instances. The LLM-as-judge protocol is explicitly validated against human preferences, providing an external anchor independent of the paper's own fitted values or self-citations. No equations, parameter fits, or derivations are presented as predictions; the work contains no self-definitional loops, uniqueness theorems, or ansatz smuggling. The derivation chain is therefore self-contained empirical measurement rather than reduction to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Text-to-speech synthesis combined with speaker variation and environmental noise produces audio instances whose tool-calling difficulty matches real-world spoken queries.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.