From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Jonas Robertson; Md Tahmid Rahman Laskar; Quinten McNamara; Seyyed Saeed Sarfjoo; Shashi Bhushan TN; Xue-Yong Fu

arxiv: 2605.15104 · v2 · pith:7IK6ILWQnew · submitted 2026-05-14 · 💻 cs.CL

Pith reviewed 2026-05-21 08:42 UTC · model grok-4.3

keywords: tool calling · voice agents · LLM evaluation · text-to-speech conversion · benchmark adaptation · multimodal models · LLM-as-judge

The pith

A conversion framework turns text tool-calling benchmarks into paired audio versions to test voice agents without new annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a dataset-agnostic method that applies text-to-speech synthesis, speaker variation, and environmental noise to existing text benchmarks for tool calling, creating matched audio instances that keep the original tool schemas and gold labels intact. This allows direct measurement of how well models perform tool use when inputs arrive as speech rather than text. Evaluation across seven omni-modal models on two converted benchmarks shows that results depend heavily on the model and the task, with consistent but varying drops from text to audio performance. Failures in the audio setting most commonly trace to errors in capturing specific argument values from spoken input. The work also validates an open-source LLM-as-judge approach that reaches high agreement with proprietary judges, enabling private evaluation pipelines.

Core claim

We present a reproducible framework that converts verified text tool-calling benchmarks into controlled audio versions through text-to-speech, speaker variation, and added noise while preserving all original annotations, enabling direct text-to-voice comparison of model tool-use performance without re-annotation.

What carries the argument

The dataset-agnostic conversion pipeline that generates paired text-audio instances from existing benchmarks by applying text-to-speech synthesis, speaker variation, and environmental noise while retaining the original tool schema and gold labels.
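
The noise step of such a pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the list-of-floats audio representation, and the `query`/`tools`/`gold` field names are assumptions made for the example, and the TTS call itself is left out. It shows the two properties the argument relies on: noise is scaled to a chosen signal-to-noise ratio, and the original annotations are copied through unchanged.

```python
import math

def mean_power(signal):
    """Average squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR in dB."""
    # Loop the noise clip until it covers the utterance, then trim to length.
    repeats = math.ceil(len(speech) / len(noise))
    noise = (list(noise) * repeats)[: len(speech)]
    # Pick the scale so that 10 * log10(P_speech / P_scaled_noise) == snr_db.
    scale = math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

def make_paired_instance(example, audio, voice, snr_db):
    """Attach audio to a copy of a text example; annotations are left untouched."""
    paired = dict(example)  # copy, so the original text instance is not modified
    paired.update({"audio": audio, "voice": voice, "snr_db": snr_db})
    return paired
```

Because the paired instance is a copy of the text example with audio fields added, the same gold label can score both the text run and the audio run.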

If this is right

Model rankings for tool calling shift between text and audio inputs, with the best model on one benchmark not necessarily leading on the other.
On Confetti, the text-to-voice gap ranges from 1.8 points (Qwen3-Omni) to 4.8 points (GPT-Realtime-1.5).
Most degradations in audio arise from incorrect extraction of argument values in spoken instructions.
Open-source models with at least 8 billion parameters can serve as judges that agree with proprietary judges more than 80 percent of the time.
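
The judge-agreement claim can be made concrete with a small sketch. The verdict labels below are hypothetical, and Cohen's kappa is added here only as a chance-corrected companion to raw agreement; the summary above does not say that the paper reports kappa.

```python
from collections import Counter

def percent_agreement(judge_a, judge_b):
    """Fraction of items on which two judges give the same verdict."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)

def cohen_kappa(judge_a, judge_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge_a)
    observed = percent_agreement(judge_a, judge_b)
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    # Chance agreement: both judges pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Hypothetical verdicts from an open judge and a proprietary judge.
open_judge = ["pass", "pass", "fail", "pass", "fail"]
proprietary_judge = ["pass", "fail", "fail", "pass", "fail"]
print(percent_agreement(open_judge, proprietary_judge))
print(cohen_kappa(open_judge, proprietary_judge))
```

Raw agreement can look high when one verdict dominates, which is why a chance-corrected figure is a useful second number to read next to it.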

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could let developers quickly screen voice-agent designs against many existing text benchmarks before building dedicated spoken corpora.
It points to a need for improved handling of precise numerical or entity values when models receive instructions through speech rather than text.
The same conversion method could be tested on other tool-use or agent benchmarks to check whether the observed text-to-audio gaps generalize.

Load-bearing premise

That audio created from text benchmarks by adding speech synthesis, speaker changes, and noise keeps the same semantic content and difficulty for tool calling as actual spoken interactions.

What would settle it

A direct comparison showing that human raters judge the audio instances substantially harder or easier than the original text versions in ways that correlate with model score gaps.

Figures

Figures reproduced from arXiv: 2605.15104 by Jonas Robertson, Md Tahmid Rahman Laskar, Quinten McNamara, Seyyed Saeed Sarfjoo, Shashi Bhushan TN, Xue-Yong Fu.

**Figure 1.** An overview of our methodology for converting text-based tool datasets into audio benchmarks for tool-calling evaluation. The pipeline uses text-to-speech (TTS) models (GPT-4o-Mini-TTS and Gemini-2.5-TTS) to generate diverse audio queries with different voices and genders, which are then processed by omni-modal LLMs and evaluated via automatic evaluation or LLM Judge.

**Figure 2.** Average performance of Qwen3-Omni across SNR levels, aggregated over all TTS models and voices.
Adjacent table text:

| Model | Clean Text | Direct Voice | Cascade (ASR→text) | ∆TV | ∆TC | ∆CV |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini-3.1-Flash-Live | 73.0 | 70.4 | 71.3 | 2.6 | 1.7 | 0.9 |
| GPT-Realtime-1.5 | 64.0 | 59.2 | 58.8 | 4.8 | 5.2 | -0.4 |
| Qwen3-Omni | 62.2 | 60.4 | 58.9 | 1.8 | 3.3 | -1.5 |

**Figure 3.** Error Analysis on the When2Call benchmark computed over 6 TTS voice variants.
Adjacent text: … and tool-selection errors (15.5%) are more frequent than other models. Decision errors are also substantial across all three systems, ranging from 25.8% for Gemini-3.1-Flash-Live to 37.4% for GPT-Realtime-1.5. This suggests that the text-to-audio gap is not only caused by incorrect argument values; in many cases, audio input chan…

**Figure 4.** LLM-as-judge evaluation on Confetti. (a) Proprietary LLM judge scores in reference-wise and reference-free settings (see Appendix A.6 for the detailed breakdown). (b) Judgment agreement between open judges (Qwen3) against proprietary reference judgments.
Adjacent text: … references are shown, even if the output is functionally reasonable (see Appendix A.4 for an example). To verify this statistically, we apply McNemar’s te…

**Figure 5.** Text-only tool-calling analysis on Confetti. (a) AST soft accuracy across model families and sizes. (b) AST soft accuracy under ambiguous-query reformulation stress tests.
Adjacent text: Robustness to query reformulation. Beyond the original datasets, we additionally define stress-test slices that practitioners can add when making deployment choices. These slices target customer-support failure modes that may be underrep…
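
The text captured with Figure 4 mentions McNemar's test, but what exactly is compared is cut off in the excerpt. The sketch below therefore only illustrates the test itself, in its exact (binomial) form, on hypothetical paired pass/fail verdicts for the same items under two settings. The function name and the data are assumptions for illustration.

```python
from math import comb

def mcnemar_exact(first_correct, second_correct):
    """Exact two-sided McNemar test on paired binary outcomes.

    Returns (b, c, p_value), where b counts items correct only under the
    first setting and c counts items correct only under the second.
    """
    if len(first_correct) != len(second_correct):
        raise ValueError("paired outcome lists must have equal length")
    pairs = list(zip(first_correct, second_correct))
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, each discordant pair falls either way with p = 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical verdicts: five items flip one way, one flips the other, four agree.
setting_a = [True] * 5 + [False] + [True] * 4
setting_b = [False] * 5 + [True] + [True] * 4
print(mcnemar_exact(setting_a, setting_b))
```

The test ignores items on which both settings agree, which suits paired designs where most items receive the same verdict.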

Original abstract

Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.
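
The gap figures quoted in the abstract can be recomputed from the per-model scores captured in the Figures section. The sketch below assumes that ∆TV, ∆TC and ∆CV are plain differences (text minus voice, text minus cascade, cascade minus voice). That reading is inferred from the arithmetic of the extracted numbers, not stated on this page.

```python
# Scores as captured in the Figures section (clean text, direct voice, ASR cascade).
scores = {
    "Gemini-3.1-Flash-Live": {"text": 73.0, "voice": 70.4, "cascade": 71.3},
    "GPT-Realtime-1.5": {"text": 64.0, "voice": 59.2, "cascade": 58.8},
    "Qwen3-Omni": {"text": 62.2, "voice": 60.4, "cascade": 58.9},
}

def gaps(row):
    """Return the three pairwise gaps, rounded to one decimal like the source."""
    return {
        "delta_tv": round(row["text"] - row["voice"], 1),     # text vs direct voice
        "delta_tc": round(row["text"] - row["cascade"], 1),   # text vs ASR cascade
        "delta_cv": round(row["cascade"] - row["voice"], 1),  # cascade vs direct voice
    }

for model, row in scores.items():
    print(model, gaps(row))
```

Under this reading, a negative ∆CV means direct voice input scored higher than the ASR cascade, which is the case for GPT-Realtime-1.5 and Qwen3-Omni in the captured numbers.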

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A practical conversion pipeline for audio tool-calling benchmarks with some open questions on audio fidelity.

This paper describes a conversion method that creates audio versions of text tool-calling benchmarks using TTS synthesis, speaker variation, and noise addition, while keeping the tool schemas and gold labels unchanged. The authors test it on Confetti and When2Call with seven models and report model-specific performance drops from text to voice.

The paper does solid work in making the process dataset-agnostic and reproducible. The evaluations show clear differences: Gemini-3.1-Flash-Live leads on Confetti and GPT-Realtime-1.5 on When2Call, with text-to-voice gaps of a few points. The failure analysis focused on argument values is helpful. The LLM-as-judge protocol is validated against human preferences, and the finding that open-source judges agree closely with proprietary ones adds a practical route to evaluation without proprietary APIs.

A softer area is the lack of direct verification that the audio conversion does not change task difficulty in targeted ways. The authors note that argument-value misunderstandings are common, but they report no check on whether noise or TTS prosody hits those specific elements harder, for example by measuring ASR accuracy on value spans. If it does, the text-to-voice gaps could be inflated by the generation process rather than by model limitations alone. This is not a deal-breaker, but it matters for interpretation.

The work is aimed at people developing voice agents who need to evaluate tool calling from speech. It gives a low-cost starting point that complements dedicated audio datasets. I would send it for peer review: the core framework is useful and the results are concrete, even with room for tighter validation of audio fidelity.

Referee Report

2 major / 2 minor

Summary. The paper presents a dataset-agnostic framework that converts existing text-based tool-calling benchmarks into paired audio instances via TTS synthesis, speaker variation, and environmental noise while preserving original tool schemas and gold labels. It evaluates seven omni-modal models on audio versions of the Confetti and When2Call datasets and reports model- and task-specific scores (Gemini-3.1-Flash-Live at 70.4 on Confetti; GPT-Realtime-1.5 at 71.9 on When2Call), text-to-voice gaps of 1.8–4.8 points on Confetti, and failure modes dominated by argument-value misunderstandings. It also reports an ambiguity reformulation stress test, a reference-free LLM-as-judge protocol validated against human preferences, and open-source Qwen3 judges (≥8B) that exceed 80% agreement with proprietary judges.

Significance. If the audio conversion preserves semantic content and task difficulty without introducing systematic biases, the framework supplies a reproducible first-stage diagnostic for voice-based tool calling that complements purpose-built audio corpora. The empirical demonstration of model- and task-dependence, together with the finding that open-source Qwen3 judges (≥8B) exceed 80% agreement with proprietary judges, would support more accessible and privacy-preserving evaluation practices in multimodal agent research.

major comments (2)

[§3 and §4] §3 (Framework) and §4 (Experiments): The headline claims that text-to-voice gaps (1.8–4.8 points) and model rankings are meaningful rest on the premise that TTS+noise conversion leaves argument-value parsing difficulty unchanged. No quantitative check—such as ASR word-error rate measured specifically on argument spans or human re-annotation of a held-out sample—is reported to confirm that speaker variation and environmental noise do not disproportionately degrade slot values relative to other content.
[§3.1] §3.1 (Audio Conversion Pipeline): It is unclear whether noise-addition parameters, speaker-selection criteria, and TTS settings were fixed before any model evaluation or selected after inspecting preliminary outputs. This ambiguity directly affects the reproducibility claim and the interpretation of the reported performance differences as intrinsic model properties rather than pipeline artifacts.
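
The check requested in the first major comment could look roughly like the sketch below. It is an assumption-laden illustration rather than the analysis the authors commit to: it aligns an ASR hypothesis to the reference transcript by word-level edit distance, then compares the share of unrecognized reference words inside an argument span with the share outside it. Spans are given as hypothetical word-index ranges, tokenization is a lowercase whitespace split, and inserted words are not counted, so this is a span-level miss rate rather than a full WER.

```python
def align_hits(ref, hyp):
    """For each reference word, report whether a minimum-edit alignment matches it."""
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitute, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    hits = [False] * n
    i, j = n, m
    while i > 0 and j > 0:  # backtrace; unmatched leading words stay False
        if dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            hits[i - 1] = ref[i - 1] == hyp[j - 1]
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1  # reference word deleted by the recognizer
        else:
            j -= 1  # extra word inserted by the recognizer
    return hits

def miss_rate(hits):
    """Share of reference words that were not matched."""
    return (len(hits) - sum(hits)) / len(hits) if hits else 0.0

def span_miss_rates(reference, hypothesis, span):
    """Return (miss rate inside the span, miss rate outside it)."""
    start, end = span  # word indices into the reference, end exclusive
    hits = align_hits(reference.lower().split(), hypothesis.lower().split())
    return miss_rate(hits[start:end]), miss_rate(hits[:start] + hits[end:])

# Hypothetical example: the argument value "fifteen" is misheard as "fifty".
print(span_miss_rates(
    "book a table for fifteen people at seven thirty",
    "book a table for fifty people at seven thirty",
    (4, 5),
))
```

A markedly higher miss rate inside argument spans than outside would suggest that the conversion degrades slot values disproportionately, which is the concern the comment raises.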

minor comments (2)

[Table 2] Table 2 (model scores): Adding per-model standard deviations or bootstrap confidence intervals around the reported point estimates (e.g., 70.4, 71.9) would better convey the precision of the observed gaps.
[§5] §5 (LLM-as-Judge Validation): The agreement figures for open-source judges would be strengthened by reporting the exact prompt templates and the size of the human-preference validation set.
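
For the first minor comment, a paired bootstrap is one way to attach an interval to a text-to-voice gap. The sketch below assumes per-item binary correctness (1 or 0) is available for the same items in both conditions; the benchmarks' actual metrics may be defined differently, and the data here are made up for illustration.

```python
import random

def paired_bootstrap_gap(text_correct, voice_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Point estimate and percentile interval for the text-minus-voice gap, in points."""
    if len(text_correct) != len(voice_correct) or not text_correct:
        raise ValueError("need two non-empty outcome lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(text_correct)
    # Resample per-item differences so the text/voice pairing is preserved.
    diffs = [t - v for t, v in zip(text_correct, voice_correct)]
    point = 100 * sum(diffs) / n
    stats = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(100 * sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper

# Hypothetical outcomes: 70 of 100 correct from text, 65 of 100 from voice.
text_outcomes = [1] * 70 + [0] * 30
voice_outcomes = [1] * 65 + [0] * 35
print(paired_bootstrap_gap(text_outcomes, voice_outcomes))
```

Resampling item-level differences, rather than each condition separately, keeps the paired design intact and usually gives a tighter interval for the gap.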

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the reproducibility and validation aspects of the framework.

Referee: [§3 and §4] §3 (Framework) and §4 (Experiments): The headline claims that text-to-voice gaps (1.8–4.8 points) and model rankings are meaningful rest on the premise that TTS+noise conversion leaves argument-value parsing difficulty unchanged. No quantitative check—such as ASR word-error rate measured specifically on argument spans or human re-annotation of a held-out sample—is reported to confirm that speaker variation and environmental noise do not disproportionately degrade slot values relative to other content.

Authors: We agree that additional quantitative validation would further support the interpretation of the text-to-voice gaps. While the paired design and preservation of original gold labels enable direct comparisons, the original manuscript does not include a targeted ASR word-error rate analysis on argument spans or human re-annotation focused on slot values. In the revised manuscript we will add such an analysis on a held-out sample, reporting ASR performance specifically on argument-value segments and discussing any observed effects on parsing difficulty. revision: yes
Referee: [§3.1] §3.1 (Audio Conversion Pipeline): It is unclear whether noise-addition parameters, speaker-selection criteria, and TTS settings were fixed before any model evaluation or selected after inspecting preliminary outputs. This ambiguity directly affects the reproducibility claim and the interpretation of the reported performance differences as intrinsic model properties rather than pipeline artifacts.

Authors: We thank the referee for identifying this source of potential ambiguity. The pipeline hyperparameters were in fact fixed prior to the main model evaluations, with noise levels, speaker selection criteria, and TTS settings determined during pilot experiments and then locked. To improve clarity and reproducibility, we will revise Section 3.1 to explicitly describe this decision sequence and will include the precise parameter values and random seeds in the supplementary materials. revision: yes

Circularity Check

0 steps flagged

Empirical evaluation on converted benchmarks with external validation

The paper describes a dataset conversion pipeline (TTS + speaker variation + noise) applied to existing text tool-calling benchmarks, followed by direct model evaluations that produce the reported scores (e.g., Gemini-3.1-Flash-Live at 70.4 on Confetti). Central results are performance measurements and failure-mode observations on the generated audio instances. The LLM-as-judge protocol is explicitly validated against human preferences, providing an external anchor independent of the paper's own fitted values or self-citations. No equations, parameter fits, or derivations are presented as predictions; the work contains no self-definitional loops, uniqueness theorems, or ansatz smuggling. The derivation chain is therefore self-contained empirical measurement rather than reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that TTS-generated audio with added noise preserves the original task semantics for tool calling. No free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Text-to-speech synthesis combined with speaker variation and environmental noise produces audio instances whose tool-calling difficulty matches real-world spoken queries.
Invoked when claiming that the converted audio versions serve as valid proxies for voice-based tool calling without re-annotation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page 2003
[63]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[64]

Transformers: State-of-the-Art Natural Language Processing , booktitle =

Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R. Transformers: State-of-the-Art Natural Language Processing , booktitle =. 2020 , url =. doi:10.18653/v1/2020.emnlp-demos.6 , timestamp =

work page doi:10.18653/v1/2020.emnlp-demos.6 2020
[65]

PyTorch: An Imperative Style, High-Performance Deep Learning Library , url =

Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu an...

work page
[66]

Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

Raphael Tang and Yao Lu and Linqing Liu and Lili Mou and Olga Vechtomova and Jimmy Lin , title =. CoRR , volume =. 2019 , url =. 1903.12136 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2019
[67]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin and Ming. CoRR , volume =. 2018 , url =. 1810.04805 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2018
[68]

Deep contextualized word representations

Matthew E. Peters and Mark Neumann and Mohit Iyyer and Matt Gardner and Christopher Clark and Kenton Lee and Luke Zettlemoyer , title =. CoRR , volume =. 2018 , url =. 1802.05365 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2018
[69]

Bidirectional LSTM-CRF Models for Sequence Tagging

Zhiheng Huang and Wei Xu and Kai Yu , title =. CoRR , volume =. 2015 , url =. 1508.01991 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2015
[70]

Contextual String Embeddings for Sequence Labeling , booktitle =

Alan Akbik and Duncan Blythe and Roland Vollgraf , editor =. Contextual String Embeddings for Sequence Labeling , booktitle =. 2018 , url =

work page 2018
[71]

Big Bird: Transformers for Longer Sequences , booktitle =

Manzil Zaheer and Guru Guruganesh and Kumar Avinava Dubey and Joshua Ainslie and Chris Alberti and Santiago Onta. Big Bird: Transformers for Longer Sequences , booktitle =. 2020 , url =

work page 2020
[72]

Longformer: The Long-Document Transformer

Iz Beltagy and Matthew E. Peters and Arman Cohan , title =. CoRR , volume =. 2020 , url =. 2004.05150 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv 2020
[73]

Rethinking Attention with Performers , booktitle =

Krzysztof Marcin Choromanski and Valerii Likhosherstov and David Dohan and Xingyou Song and Andreea Gane and Tam. Rethinking Attention with Performers , booktitle =. 2021 , url =

work page 2021
[74]

8th International Conference on Learning Representations,

Nikita Kitaev and Lukasz Kaiser and Anselm Levskaya , title =. 8th International Conference on Learning Representations,. 2020 , url =

work page 2020
[75]

CoRR , volume =

Yi Tay and Mostafa Dehghani and Dara Bahri and Donald Metzler , title =. CoRR , volume =. 2020 , url =. 2009.06732 , timestamp =

work page arXiv 2020
[76]

9th International Conference on Learning Representations,

Yi Tay and Mostafa Dehghani and Samira Abnar and Yikang Shen and Dara Bahri and Philip Pham and Jinfeng Rao and Liu Yang and Sebastian Ruder and Donald Metzler , title =. 9th International Conference on Learning Representations,. 2021 , url =

work page 2021
[77]

Tom B. Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and Sandhini Agarwal and Ariel Herbert. Language Models are Few-Shot Learners , booktitle =. 2020 , url =

work page 2020
[78]

Gomez and Lukasz Kaiser and Illia Polosukhin , editor =

Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin , editor =. Attention is All you Need , booktitle =. 2017 , url =

work page 2017
[79]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

RockNER: A Simple Method to Create Adversarial Examples for Evaluating the Robustness of Named Entity Recognition Models , author=. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2021
[81]

Proceedings of the Twelfth Language Resources and Evaluation Conference , pages=

Contextualized embeddings based transformer encoder for sentence similarity modeling in answer selection task , author=. Proceedings of the Twelfth Language Resources and Evaluation Conference , pages=

work page
[82]

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing,

Ikuya Yamada and Akari Asai and Hiroyuki Shindo and Hideaki Takeda and Yuji Matsumoto , editor =. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing,. 2020 , url =. doi:10.18653/v1/2020.emnlp-main.523 , timestamp =

work page doi:10.18653/v1/2020.emnlp-main.523 2020
