
arXiv: 2401.00870 · v5 · submitted 2023-12-30 · 💻 cs.CR · cs.AI

ConfusionPrompt: Practical Private Inference for Online Large Language Models

Pith reviewed 2026-05-24 04:56 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords private inference · large language models · prompt decomposition · privacy-utility tradeoff · black-box LLMs · text perturbation · recomposition

The pith

ConfusionPrompt protects prompts sent to black-box LLMs by splitting them into sub-prompts mixed with generated pseudo-prompts that the user later filters and recomposes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ConfusionPrompt to address privacy risks when users send detailed prompts to online LLM services. It decomposes a user prompt into smaller genuine sub-prompts, creates accompanying pseudo-prompts, sends the mixed group to the server, and lets the user recompose the returned responses into the final answer. This design works with existing closed models without requiring changes to the LLM itself. It claims a better privacy-utility balance than prior text-perturbation approaches and lower memory use than running open-source models locally. The authors also define a (λ, μ, ρ)-privacy model for prompt groups and analyze the complexity savings from decomposition.

Core claim

ConfusionPrompt achieves private inference on black-box LLMs in four steps: it decomposes the original prompt into sub-prompts, generates pseudo-prompts to form a privacy-preserving group, transmits the mixed set to the server, and lets the user filter and recompose the responses into the correct output. The paper reports higher utility than local open-source inference or perturbation methods, and lower memory use than full local models.

What carries the argument

The ConfusionPrompt framework, which decomposes prompts into genuine sub-prompts, interleaves them with pseudo-prompts, and relies on user-side recomposition of server responses.
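The client-side flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `confusion_round`, the `query_llm` and `recompose` callables, and the index-based filtering are all assumptions made for the example. In the paper, the decomposer, generator, and recomposer are trained models. The sketch assumes the client simply remembers which positions hold genuine sub-prompts, so filtering is local bookkeeping rather than inference.

```python
import random
from typing import Callable, Dict, List


def confusion_round(
    sub_prompts: List[str],
    pseudo_prompts: List[str],
    query_llm: Callable[[str], str],      # hypothetical: wraps the remote LLM call
    recompose: Callable[[List[str]], str],  # hypothetical: local recomposer
    seed: int = 0,
) -> str:
    """Send genuine and pseudo prompts in shuffled order, keep genuine replies."""
    # Positions 0..n_genuine-1 are genuine; this mapping never leaves the client.
    n_genuine = len(sub_prompts)
    prompts = list(sub_prompts) + list(pseudo_prompts)

    # Shuffle the transmission order so position carries no signal.
    order = list(range(len(prompts)))
    random.Random(seed).shuffle(order)

    replies: Dict[int, str] = {}
    for idx in order:
        replies[idx] = query_llm(prompts[idx])  # server sees only prompt text

    # Filter: discard pseudo replies, restore the original sub-prompt order.
    genuine_replies = [replies[i] for i in range(n_genuine)]
    return recompose(genuine_replies)
```

With a stub such as `query_llm=str.upper` and `recompose=" ".join`, the call `confusion_round(["a", "b"], ["x", "y"], str.upper, " ".join)` returns the recomposition of the two genuine replies only, while all four prompts are sent to the server.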

If this is right

  • Black-box LLM services can be used privately without model changes or local model hosting.
  • Prompt decomposition reduces the computational burden compared to full local open-source models.
  • The (λ, μ, ρ)-privacy model quantifies the protection level of any mixed prompt group.
  • Complexity analysis shows decomposition lowers the effective query cost for privacy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to multi-turn conversations if recomposition logic can track context across exchanges.
  • If pseudo-prompt generation can be made domain-specific, utility loss could drop further for specialized tasks.
  • Adoption would require users to run a local client for decomposition and recomposition, shifting some compute from server to client.

Load-bearing premise

The user can reliably identify which server responses come from genuine sub-prompts and recombine them into the correct final output without large accuracy loss.

What would settle it

A test set of prompts where an automated or human recomposer fails to recover the original answer at a rate comparable to direct LLM use, or where an adversary distinguishes genuine from pseudo sub-prompts above the (λ, μ, ρ) threshold.
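One way to frame the adversary half of this test is to compare an attacker's empirical identification accuracy with the uniform-guess baseline. The sketch below is an assumption-laden illustration: the name `identification_accuracy` and the trial format are hypothetical, each group is assumed to contain exactly one genuine prompt, and the chance baseline stands in for the (λ, μ, ρ) threshold, whose formal definition is not reproduced on this page.

```python
from typing import Sequence, Tuple


def identification_accuracy(
    trials: Sequence[Tuple[int, int, int]],
) -> Tuple[float, float]:
    """Return (attack accuracy, uniform-guess baseline).

    Each trial is (group_size, genuine_index, adversary_guess).
    Assumes one genuine prompt per group (a simplification).
    """
    if not trials:
        raise ValueError("need at least one trial")
    hits = sum(1 for _, genuine, guess in trials if genuine == guess)
    # A blind adversary picks the genuine prompt with probability 1/group_size.
    chance = sum(1.0 / size for size, _, _ in trials) / len(trials)
    return hits / len(trials), chance
```

An attack accuracy well above the baseline on held-out prompt groups would be evidence that pseudo-prompts are distinguishable from genuine ones.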

Figures

Figures reproduced from arXiv: 2401.00870 by Peihua Mai, Ran Yan, Rui Ye, Yan Pang, Youjia Yang.

Figure 1: Overview of ConfusionPrompt. … for the evaluation of privacy level and training of models (i.e., decomposer, generator, and recomposer). To explain the rationale of our privacy model, we follow [33] to quantify the privacy risk of the queries exposed to the server. Consider a set of prompts denoted as P = {p1, p2, ..., pn}. For any p ∈ P, let π(p) be the adversary’s prior probability that p is the genuine…
Figure 2: Example of decomposition savings in query complexity. Decomposition module reduces…
Figure 3: Prompt identification attack accuracy under various combinations of privacy parameters.
Figure 4: Attribute inference attack accuracy for ConfusionPrompt and LDP-based methods.
Figure 5: Monetary ratio of StrategyQA and MuSiQue dataset before (…
Original abstract

State-of-the-art large language models (LLMs) are typically deployed as online services, requiring users to transmit detailed prompts to cloud servers. This raises significant privacy concerns. In response, we introduce ConfusionPrompt, a novel framework for private LLM inference that protects user privacy by: (i) decomposing the original prompt into smaller sub-prompts, and (ii) generating pseudo-prompts alongside the genuine sub-prompts, which are then sent to the LLM. The server responses are later recomposed by the user to reconstruct the final output. This approach offers key advantages over previous LLM privacy protection methods: (i) it integrates seamlessly with existing black-box LLMs, and (ii) it delivers a significantly improved privacy-utility trade-off compared to existing text perturbation methods. We also develop a $(\lambda, \mu, \rho)$-privacy model to formulate the requirements for a privacy-preserving group of prompts and provide a complexity analysis to justify the role of prompt decomposition. Our empirical evaluation shows that ConfusionPrompt achieves significantly higher utility than local inference methods using open-source models and perturbation-based techniques, while also reducing memory consumption compared to open-source LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes ConfusionPrompt, a framework for private inference on black-box online LLMs. The method decomposes a user prompt into sub-prompts, mixes them with generated pseudo-prompts according to a (λ, μ, ρ)-privacy model, sends the batch to the LLM server, and relies on the user to recompose the returned responses into the final output. It claims seamless integration with existing LLMs, a significantly improved privacy-utility tradeoff versus text-perturbation baselines, lower memory use than local open-source models, and supports these claims with a complexity analysis plus empirical evaluation.

Significance. If the recomposition step can be shown to recover outputs reliably, the approach would provide a practical black-box privacy mechanism that avoids both the utility loss of perturbation methods and the memory/compute cost of local open-source LLMs. The explicit (λ, μ, ρ) privacy formulation and complexity analysis are positive elements that could be built upon.

major comments (2)
  1. [Framework description and recomposition step] The recomposition step is described only at high level in the framework overview. The central utility claim—that ConfusionPrompt delivers higher utility than perturbation baselines—rests on the unverified assumption that users can accurately isolate and recombine genuine sub-prompt responses from a batch of indistinguishable pseudo-prompt responses without substantial loss; no concrete filtering mechanism, algorithm, or experimental measurement of filtering fidelity (e.g., accuracy or semantic overlap under LLM nondeterminism) is supplied, rendering the reported gains unsupported.
  2. [Empirical evaluation] The abstract states that ConfusionPrompt “achieves significantly higher utility” than local open-source inference and perturbation techniques, yet no dataset details, baseline implementations, error bars, statistical tests, or exact recomposition procedure are referenced. Without these, the quantitative privacy-utility claims cannot be assessed and the comparison to perturbation methods remains unverifiable.
minor comments (1)
  1. [Privacy model] The (λ, μ, ρ) privacy model is introduced without an explicit equation or formal definition in the provided abstract; a numbered definition or boxed formulation would improve clarity.
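The filtering-fidelity measurement requested in major comment 1 could take a form like the sketch below. It is only an illustration under stated assumptions: token-level F1 (the overlap metric commonly used for extractive QA) stands in for "semantic overlap", the helper names `token_f1` and `mean_fidelity` are hypothetical, and repeated runs are assumed to be collected by the experimenter to expose LLM nondeterminism.

```python
from collections import Counter
from typing import Sequence


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two whitespace-tokenized, lowercased strings."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Two empty strings match perfectly; one empty string scores zero.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_fidelity(recomposed_runs: Sequence[str], reference: str) -> float:
    """Average F1 of several recomposed answers against a direct-query answer."""
    if not recomposed_runs:
        raise ValueError("need at least one run")
    return sum(token_f1(r, reference) for r in recomposed_runs) / len(recomposed_runs)
```

Reporting this average together with its spread across runs, for both ConfusionPrompt and the perturbation baselines, would make the utility comparison checkable.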

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the framework and evaluation. We address each major comment below and will revise the manuscript to provide the requested details.

Point-by-point responses
  1. Referee: [Framework description and recomposition step] The recomposition step is described only at high level in the framework overview. The central utility claim—that ConfusionPrompt delivers higher utility than perturbation baselines—rests on the unverified assumption that users can accurately isolate and recombine genuine sub-prompt responses from a batch of indistinguishable pseudo-prompt responses without substantial loss; no concrete filtering mechanism, algorithm, or experimental measurement of filtering fidelity (e.g., accuracy or semantic overlap under LLM nondeterminism) is supplied, rendering the reported gains unsupported.

    Authors: We agree that the current manuscript describes the recomposition step at a high level and does not supply a concrete algorithm or fidelity measurements. In the revised version we will add a detailed filtering and recombination algorithm, including handling of nondeterminism, together with new experiments quantifying its accuracy and semantic overlap. revision: yes

  2. Referee: [Empirical evaluation] The abstract states that ConfusionPrompt “achieves significantly higher utility” than local open-source inference and perturbation techniques, yet no dataset details, baseline implementations, error bars, statistical tests, or exact recomposition procedure are referenced. Without these, the quantitative privacy-utility claims cannot be assessed and the comparison to perturbation methods remains unverifiable.

    Authors: We acknowledge that the empirical section lacks the listed details. The revised manuscript will include dataset descriptions, baseline implementations, error bars, statistical tests, and the precise recomposition procedure used in the experiments. revision: yes

Circularity Check

0 steps flagged

No load-bearing circularity; privacy model and recomposition described independently of results

The paper defines a (λ, μ, ρ)-privacy model to formulate prompt-group requirements and provides a complexity analysis for decomposition. These are presented as design choices rather than derived predictions that reduce to fitted inputs or self-citations. The recomposition step is described at a high level without equations that loop back to the privacy claims by construction. No self-citation chains or ansatzes are invoked to justify the core privacy-utility tradeoff. The circularity score is therefore minor: the method description is self-referential only in the ordinary sense, and nothing in it forces the central result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only view limits visibility; the (λ, μ, ρ)-privacy model introduces three parameters whose selection or fitting is not detailed, and the recomposition step relies on an unstated domain assumption that mixed responses remain separable.

free parameters (1)
  • λ, μ, ρ
    Parameters defining the privacy model; their values are not specified as derived from first principles or external benchmarks in the abstract.
axioms (1)
  • domain assumption: User can accurately recompose final output from mixed sub-prompt responses
    Invoked in the description of the reconstruction process; no justification or error analysis provided in abstract.

pith-pipeline@v0.9.0 · 5735 in / 1301 out tokens · 19316 ms · 2026-05-24T04:56:28.040686+00:00 · methodology


Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 8 internal anchors

  1. [1]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  2. [2]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  3. [3]

    Instruct2act: Mapping multi-modality instructions to robotic actions with large language model

    Siyuan Huang, Zhengkai Jiang, Hao Dong, Yu Qiao, Peng Gao, and Hongsheng Li. Instruct2act: Mapping multi-modality instructions to robotic actions with large language model. arXiv preprint arXiv:2305.11176 , 2023

  4. [4]

    Large language models in medicine

    Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature medicine, 29(8):1930–1940, 2023

  5. [5]

    BloombergGPT: A Large Language Model for Finance

    Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023

  6. [6]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022

  7. [7]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 , 2023

  8. [8]

    Llms can understand encrypted prompt: Towards privacy-computing friendly transformers

    Xuanqi Liu and Zhuotao Liu. Llms can understand encrypted prompt: Towards privacy-computing friendly transformers. arXiv preprint arXiv:2305.18396, 2023

  9. [9]

    The-x: Privacy-preserving transformer inference with homomorphic encryption

    Tianyu Chen, Hangbo Bao, Shaohan Huang, Li Dong, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, and Furu Wei. The-x: Privacy-preserving transformer inference with homomorphic encryption. arXiv preprint arXiv:2206.00216 , 2022

  10. [10]

    Dp-forward: Fine-tuning and inference on language models with differential privacy in forward pass

    Minxin Du, Xiang Yue, Sherman SM Chow, Tianhao Wang, Chenyu Huang, and Huan Sun. Dp-forward: Fine-tuning and inference on language models with differential privacy in forward pass. arXiv preprint arXiv:2309.06746 , 2023

  11. [11]

    A survey on homomorphic encryption schemes: Theory and implementation

    Abbas Acar, Hidayet Aksu, A Selcuk Uluagac, and Mauro Conti. A survey on homomorphic encryption schemes: Theory and implementation. ACM Computing Surveys (Csur), 51(4):1–35, 2018

  12. [12]

    Secure multiparty computation

    Ronald Cramer, Ivan Bjerre Damgård, et al. Secure multiparty computation. Cambridge University Press, 2015

  13. [13]

    Differentially private representation for nlp: Formal guarantee and an empirical study on privacy and fairness

    Lingjuan Lyu, Xuanli He, and Yitong Li. Differentially private representation for nlp: Formal guarantee and an empirical study on privacy and fairness. In Findings of the Association for Computational Linguistics: EMNLP 2020 , pages 2355–2365, 2020

  14. [14]

    Differential privacy

    Cynthia Dwork. Differential privacy. In International colloquium on automata, languages, and programming, pages 1–12. Springer, 2006

  15. [15]

    Natural language understanding with privacy-preserving bert

    Chen Qu, Weize Kong, Liu Yang, Mingyang Zhang, Michael Bendersky, and Marc Najork. Natural language understanding with privacy-preserving bert. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1488–1497, 2021

  16. [16]

    Split-and-denoise: Protect large language model inference with local differential privacy

    Peihua Mai, Ran Yan, Zhe Huang, Youjia Yang, and Yan Pang. Split-and-denoise: Protect large language model inference with local differential privacy. In Forty-first International Conference on Machine Learning

  17. [17]

    Salted inference: Enhancing privacy while maintaining efficiency of split inference in mobile computing

    Mohammad Malekzadeh and Fahim Kawsar. Salted inference: Enhancing privacy while maintaining efficiency of split inference in mobile computing. In Proceedings of the 25th International Workshop on Mobile Computing Systems and Applications, pages 14–20, 2024

  18. [18]

    Trusted execution environment: What it is, and what it is not

    Mohamed Sabt, Mohammed Achemlal, and Abdelmadjid Bouabdallah. Trusted execution environment: What it is, and what it is not. In 2015 IEEE Trustcom/BigDataSE/ISPA, volume 1, pages 57–64. IEEE, 2015

  19. [19]

    Named entity recognition and classification in historical documents: A survey

    Maud Ehrmann, Ahmed Hamdi, Elvys Linhares Pontes, Matteo Romanello, and Antoine Doucet. Named entity recognition and classification in historical documents: A survey. ACM Computing Surveys , 56(2):1–47, 2023

  20. [20]

    Neural Architectures for Named Entity Recognition

    Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360, 2016

  21. [21]

    Protecting user privacy in remote conversational systems: A privacy-preserving framework based on text sanitization

    Zhigang Kan, Linbo Qiao, Hao Yu, Liwen Peng, Yifu Gao, and Dongsheng Li. Protecting user privacy in remote conversational systems: A privacy-preserving framework based on text sanitization. arXiv preprint arXiv:2306.08223 , 2023

  22. [22]

    Hide and seek (has): A lightweight framework for prompt privacy protection

    Yu Chen, Tingxin Li, Huiming Liu, and Yang Yu. Hide and seek (has): A lightweight framework for prompt privacy protection. arXiv preprint arXiv:2309.03057, 2023

  23. [23]

    t-plausibility: Generalizing words to desensitize text

    Balamurugan Anandan, Chris Clifton, Wei Jiang, Mummoorthy Murugesan, Pedro Pastrana-Camacho, and Luo Si. t-plausibility: Generalizing words to desensitize text. Trans. Data Priv., 5(3):505–534, 2012

  24. [24]

    Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy

    Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In International conference on machine learning, pages 201–210. PMLR, 2016

  25. [25]

    Iron: Private inference on transformers

    Meng Hao, Hongwei Li, Hanxiao Chen, Pengzhi Xing, Guowen Xu, and Tianwei Zhang. Iron: Private inference on transformers. Advances in Neural Information Processing Systems, 35:15718–15731, 2022

  26. [26]

    Differentially private language models benefit from public pre-training

    Gavin Kerrigan, Dylan Slack, and Jens Tuyls. Differentially private language models benefit from public pre-training. arXiv preprint arXiv:2009.05886 , 2020

  27. [27]

    Differentially private fine-tuning of language models

    Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, et al. Differentially private fine-tuning of language models. arXiv preprint arXiv:2110.06500, 2021

  28. [28]

    Flocks of stochastic parrots: Differentially private prompt learning for large language models

    Haonan Duan, Adam Dziedzic, Nicolas Papernot, and Franziska Boenisch. Flocks of stochastic parrots: Differentially private prompt learning for large language models, 2023

  29. [29]

    Privacy-preserving prompt tuning for large language model services

    Yansong Li, Zhixing Tan, and Yang Liu. Privacy-preserving prompt tuning for large language model services, 2023

  30. [30]

    Privacy-and utility-preserving textual analysis via calibrated multivariate perturbations

    Oluwaseyi Feyisetan, Borja Balle, Thomas Drake, and Tom Diethe. Privacy-and utility-preserving textual analysis via calibrated multivariate perturbations. In Proceedings of the 13th international conference on web search and data mining , pages 178–186, 2020

  31. [31]

    Locally differentially private document generation using zero shot prompting

    Saiteja Utpala, Sara Hooker, and Pin Yu Chen. Locally differentially private document generation using zero shot prompting. arXiv preprint arXiv:2310.16111, 2023

  32. [32]

    The limits of word level differential privacy

    Justus Mattern, Benjamin Weggenmann, and Florian Kerschbaum. The limits of word level differential privacy. In Findings of the Association for Computational Linguistics: NAACL 2022 , pages 867–881, 2022

  33. [33]

    Embellishing text search queries to protect user privacy

    Hwee Hwa Pang, Xuhua Ding, and Xiaokui Xiao. Embellishing text search queries to protect user privacy. In Proceedings of the VLDB Endowment: 36th International Conference on Very Large Data Bases: Singapore, pages 13–17, 2010

  34. [34]

    Constructing plausible innocuous pseudo queries to protect user query intention

    Zongda Wu, Jie Shi, Chenglang Lu, Enhong Chen, Guandong Xu, Guiling Li, Sihong Xie, and Philip S. Yu. Constructing plausible innocuous pseudo queries to protect user query intention. Information Sciences, 325:215–226, 2015

  35. [35]

    The McKinsey Way

    Ethan M Rasiel. The McKinsey Way . McGraw-Hill New York, 1999

  36. [36]

    Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies

    Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021

  37. [37]

    Musique: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics , 10:539–554, 2022

  38. [38]

    Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras

    Bhargav Srinivasa-Desikan. Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras. Packt Publishing Ltd, 2018

  39. [39]

    Flair: An easy-to-use framework for state-of-the-art nlp

    Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. Flair: An easy-to-use framework for state-of-the-art nlp. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (demonstrations) , pages 54–59, 2019

  40. [40]

    Gender classification using twitter text data

    Pradeep Vashisth and Kevin Meehan. Gender classification using twitter text data. In 2020 31st Irish Signals and Systems Conference (ISSC) , pages 1–6. IEEE, 2020

  41. [41]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600 , 2018

  42. [42]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  43. [43]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023

  44. [44]

    Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 7871–7880, 2020

  45. [45]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  46. [46]

    Scaling instruction-finetuned language models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024

  47. [47]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 , 2019

  48. [48]

    Unsupervised approach to evaluate sentence-level fluency: Do we really need reference?

    Gopichand Kanumolu, Lokesh Madasu, Pavan Baswani, Ananya Mukherjee, and Manish Shrivastava. Unsupervised approach to evaluate sentence-level fluency: Do we really need reference? arXiv preprint arXiv:2312.01500, 2023

  49. [49]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 , 2019

  50. [50]

    Squad: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages 2383–2392, 2016

  51. [51]

    Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs

    Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers...

  52. [52]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.

    Appendix A. Proof of Theorem 11 and 12. We begin with the proof for Theorem 11 as followed: Proof. For the prompt wi...

  53. [53]

    Appendix C.3

    and DROP [51]. Appendix C.3. Semantic Similarity Model and Discriminator. The comparison data collection for the generator involves a local similarity evaluation model and discriminator. Similarity evaluation model: We adopt a finetuned version of MiniLM-6L model [47] to extract the embedding of each private attribute. The semantic relevance between a pa...

  54. [54]

    Inarticulate/ non-fluent sentence

    Score 1: Incomprehensible. Inarticulate/ non-fluent sentence

  55. [55]

    Score 2: Low Quality. Partially fluent sentence: (a) only half of the sentence is fluent or (b) more than 1 missing words or (c) more than 1 misspelt words or (d) contains individual fluent word-groups with missing coherence between them

  56. [56]

    Sentence is predominantly fluent but contains either (a) misspelt word or (b) missing word or (c) multiple occurrence of a word

    Score 3: Moderate. Sentence is predominantly fluent but contains either (a) misspelt word or (b) missing word or (c) multiple occurrence of a word

  57. [57]

    Perfectly fluent sentence without any syntactic or grammatical error

    Score 4: Perfect. Perfectly fluent sentence without any syntactic or grammatical error. Strictly respond in the form of JSON with the following format: {"S1": the score, "S2": the score}. Sentences: {dictionary of sentences} On obtaining 4000 training and 700 validation samples, we finetune a Bert-base (110M parameters) to train a local discriminator. Ap...