Recognition: 2 theorem links
· Lean TheoremLLMs Get Lost In Multi-Turn Conversation
Pith reviewed 2026-05-14 00:52 UTC · model grok-4.3
The pith
LLMs drop 39 percent in multi-turn conversations because they make early assumptions and fail to recover from mistakes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across more than 200,000 simulated multi-turn conversations, every tested LLM exhibits significantly lower success rates than in equivalent single-turn settings. The degradation decomposes into a modest reduction in aptitude and a much larger rise in unreliability. The dominant failure mode is that models form assumptions in early turns, generate what they treat as a final solution, and then over-rely on that output even when subsequent user messages contradict it. In short, once an LLM takes a wrong turn it tends to stay lost rather than recover.
What carries the argument
Simulation-based decomposition of multi-turn performance into separate aptitude loss and unreliability growth, with the key observation that models form early assumptions and over-commit to premature final outputs.
Load-bearing premise
The simulated multi-turn conversations accurately reflect the distribution and dynamics of real user interactions with LLMs.
What would settle it
Measure whether actual human users in live multi-turn sessions observe the same 39 percent drop and lack of recovery from early wrong turns that the simulations report.
read the original abstract
Large Language Models (LLMs) are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that *when LLMs take a wrong turn in a conversation, they get lost and do not recover*.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports large-scale simulation experiments comparing LLM performance on six generation tasks in single-turn fully-specified settings versus multi-turn conversational settings. It finds an average 39% performance drop in the multi-turn case across tested models, decomposed into minor aptitude loss and substantial unreliability; the latter is attributed to models forming premature assumptions in early turns and failing to recover from incorrect paths, based on analysis of over 200,000 simulated conversations.
Significance. If the core empirical pattern holds, the work identifies a practically important limitation for conversational LLM use cases, where underspecification and iterative refinement are common. The scale of the simulation (200k+ conversations) provides statistical power to detect the unreliability component, which could motivate new training objectives or evaluation protocols focused on error recovery.
major comments (2)
- [Abstract] The central 39% drop and the unreliability decomposition rest on the fidelity of the simulated conversations, yet the abstract only summarizes the generation process as 'prompting an LLM to produce successive user turns or using fixed templates that embed underspecification' without providing the exact prompt templates, task definitions, or how correction opportunities are introduced; this detail is load-bearing because an artifactual suppression of natural user corrections would inflate the non-recovery observation.
- [Analysis of 200,000+ simulated conversations] The claim that LLMs 'make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely' requires explicit operationalization of 'wrong turn,' 'recovery,' and the metrics used to quantify each component; without these definitions and the associated evaluation code or rubrics, the decomposition into aptitude versus unreliability cannot be independently verified.
minor comments (1)
- [Abstract] The abstract states 'all the top open- and closed-weight LLMs we test' but does not list the specific models or their sizes; adding this enumeration would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the importance of methodological transparency. We address each major comment below with point-by-point responses and have revised the manuscript to provide greater clarity on simulation details and operational definitions.
read point-by-point responses
-
Referee: [Abstract] The central 39% drop and the unreliability decomposition rest on the fidelity of the simulated conversations, yet the abstract only summarizes the generation process as 'prompting an LLM to produce successive user turns or using fixed templates that embed underspecification' without providing the exact prompt templates, task definitions, or how correction opportunities are introduced; this detail is load-bearing because an artifactual suppression of natural user corrections would inflate the non-recovery observation.
Authors: We agree that the fidelity of the simulation is central to the claims. The exact prompt templates for generating successive user turns (via LLM prompting) and the fixed templates embedding underspecification are fully documented in Appendix A. The six generation tasks, including their specifications and ground-truth evaluation criteria, are defined in Section 3.1. Correction opportunities arise naturally because the user simulator is prompted to continue the conversation based on the model's prior response, permitting clarifications or corrections in later turns as described in Section 3.2. We have revised the abstract to reference Appendix A for these details while preserving its brevity. This change improves verifiability without affecting the reported 39% drop or its decomposition. revision: yes
-
Referee: [Analysis of 200,000+ simulated conversations] The claim that LLMs 'make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely' requires explicit operationalization of 'wrong turn,' 'recovery,' and the metrics used to quantify each component; without these definitions and the associated evaluation code or rubrics, the decomposition into aptitude versus unreliability cannot be independently verified.
Authors: We concur that explicit operationalization strengthens the analysis. Section 4.2 of the revised manuscript now defines a 'wrong turn' as an early-turn response that introduces an assumption unsupported by the initial task specification, resulting in a final output deviating from ground truth. 'Recovery' is operationalized as the proportion of such cases where later turns produce a correct final output. The decomposition metrics are: aptitude loss (performance drop on conversations without wrong turns) and unreliability (additional drop from non-recovered errors). These are computed automatically over the 200k+ conversations, with rubrics validated via manual review of 500 samples. The evaluation code and rubrics are now linked in the manuscript via our public repository. These additions enable independent verification of the reported decomposition. revision: yes
Circularity Check
No significant circularity; empirical measurements are direct and self-contained
full rationale
The paper conducts large-scale simulation experiments that directly measure LLM performance in single-turn versus multi-turn settings, reporting an average 39% drop and decomposing it into aptitude loss versus unreliability via analysis of 200,000+ generated conversations. No equations, fitted parameters, or derivations are presented that reduce the central claim (LLMs make early assumptions and fail to recover) to the simulation procedure itself or to any self-citation chain. The results are obtained by running the evaluated models on procedurally generated dialogues and recording outcomes, which constitutes independent empirical content rather than a self-definitional or fitted-input prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the outcome.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Simulated conversations reflect real user behavior and task underspecification patterns
Lean theorems connected to this paper
-
Cost.FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 24 Pith papers
-
Evaluating Temporal Consistency in Multi-Turn Language Models
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
-
EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents
EMSDialog is a dataset of 4,414 synthetic multi-speaker EMS dialogues generated by a multi-LLM agent pipeline grounded in ePCR reports, annotated with diagnoses, roles, and topics, and shown to improve accuracy, timel...
-
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...
-
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
-
$\delta$-mem: Efficient Online Memory for Large Language Models
δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...
-
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
-
Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow interactions.
-
Pause or Fabricate? Training Language Models for Grounded Reasoning
GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task succe...
-
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
-
AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models
AnchorMem decouples atomic fact anchors and associative event graphs for retrieval from preserved raw interaction contexts, outperforming prior memory methods on the LoCoMo benchmark.
-
RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation
RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety...
-
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.
-
AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
-
ZORO: Active Rules for Reliable Vibe Coding
ZORO integrates rules directly into AI coding workflows by enriching plans, enforcing compliance with proof requirements, and evolving rules via user feedback, resulting in better rule adherence and shifts in user behavior.
-
LLMs Corrupt Your Documents When You Delegate
LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
-
MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation
MT-OSC condenses chat history via a one-off sequential process with a few-shot Condenser and lightweight Decider to reduce tokens and preserve LLM accuracy in multi-turn settings.
-
ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?
ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
-
Context-Agent: Dynamic Discourse Trees for Non-Linear Dialogue
Context-Agent represents dialogue history as a dynamic tree to handle non-linear topic shifts and introduces the NTM benchmark for evaluating long-horizon non-linear dialogues.
-
Trace Mutation in Human-LLM Dialogue: The Transcript as Forensic and Mitigation Surface
Trace mutations are a class of context failures in LLM conversations consisting of utterance effacement and genitive dissociation that distort the shared record while resisting ordinary repair.
-
Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems
Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.
-
Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
-
MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs
MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based checking.
-
Can AI Help You Get Over Your Breakup? One Session with a Belief-Reframing Chatbot Shows Sustained Distress Reduction
A pre-registered RCT found that one session with a belief-reframing AI chatbot produced significantly greater reductions in breakup distress than a survey-only control at 7 days, with a smaller effect persisting at 1 month.
-
Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction
Bipredictability from token statistics monitors structural consistency in multi-turn LLM interactions, showing 85% alignment with structure but only 44% with semantics and 100% sensitivity to tested drifts across 4574 turns.
Reference graph
Works this paper leans on
-
[1]
M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
G. Bai, J. Liu, X. Bu, Y . He, J. Liu, Z. Zhou, Z. Lin, W. Su, T. Ge, B. Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7421–7454, 2024
work page 2024
- [3]
-
[4]
P. Brauner, A. Hick, R. Philipsen, and M. Ziefle. What does the public think about artificial intelligence?—a criticality map to understand bias in the public perception of ai. In Frontiers of Computer Science, 2023. URL https://api.semanticscholar.org/CorpusID:257598212. 13 LLMs Get Lost In Multi-Turn Conversation PREPRINT
work page 2023
-
[5]
T. Chakrabarty, P. Laban, D. Agarwal, S. Muresan, and C.-S. Wu. Art or artifice? large language models and the false promise of creativity. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–34, 2024
work page 2024
-
[6]
T. Chakrabarty, P. Laban, and C.-S. Wu. Ai-slop to ai-polish? aligning language models through edit-based writing rewards and test-time computation. arXiv preprint arXiv:2504.07532, 2025
- [7]
-
[8]
H. Chase. Langchain, October 2022. URL https://github.com/langchain-ai/langchain
work page 2022
-
[9]
A. Chaturvedi, K. Thompson, and N. Asher. Nebula: A discourse aware minecraft builder. ArXiv, abs/2406.18164,
-
[10]
URL https://api.semanticscholar.org/CorpusID:270738020
-
[11]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [12]
-
[13]
E. Choi, H. He, M. Iyyer, M. Yatskar, W.-t. Yih, Y . Choi, P. Liang, and L. Zettlemoyer. Quac: Question answering in context. arXiv preprint arXiv:1808.07036, 2018
work page Pith review arXiv 2018
-
[14]
E. Choi, J. Palomaki, M. Lamm, T. Kwiatkowski, D. Das, and M. Collins. Decontextualization: Making sentences stand-alone. Transactions of the Association for Computational Linguistics, 9:447–461, 2021
work page 2021
-
[15]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [16]
- [17]
- [18]
-
[19]
H. Duan, J. Wei, C. Wang, H. Liu, Y . Fang, S. Zhang, D. Lin, and K. Chen. Botchat: Evaluating llms’ capabilities of having multi-turn dialogues. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3184–3200, 2024
work page 2024
- [20]
-
[21]
V . S. Ferreira. Ambiguity, accessibility, and a division of labor for communicative success.Psychology of Learning and motivation, 49:209–246, 2008
work page 2008
-
[22]
S. E. Finch, J. D. Finch, and J. D. Choi. Don’t forget your abc’s: Evaluating the state-of-the-art in chat-oriented dialogue systems. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023
work page 2023
-
[23]
S. Frisson. Semantic underspecification in language processing. Lang. Linguistics Compass, 3:111–127, 2009. URL https://api.semanticscholar.org/CorpusID:13384476
work page 2009
-
[24]
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [26]
-
[27]
Troy, Dario Amodei, Jared Kaplan, Jack Clark, and Deep Ganguli
K. Handa, A. Tamkin, M. McCain, S. Huang, E. Durmus, S. Heck, J. Mueller, J. Hong, S. Ritchie, T. Belonax, et al. Which economic tasks are performed with ai? evidence from millions of claude conversations. arXiv preprint arXiv:2503.04761, 2025
-
[28]
C. Herlihy, J. Neville, T. Schnabel, and A. Swaminathan. On overcoming miscalibrated conversational priors in llm-based chatbots. arXiv preprint arXiv:2406.01633, 2024. 14 LLMs Get Lost In Multi-Turn Conversation PREPRINT
-
[29]
M. C. Horowitz, L. Kahn, J. Macdonald, and J. Schneider. Adopting ai: how familiarity breeds both trust and contempt. AI & society, 39(4):1721–1735, 2024
work page 2024
-
[30]
K.-H. Huang, P. Laban, A. R. Fabbri, P. K. Choubey, S. Joty, C. Xiong, and C.-S. Wu. Embrace divergence for richer insights: A multi-document summarization benchmark and a case study on summarizing diverse information from news articles. arXiv preprint arXiv:2309.09369, 2023
-
[31]
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Live- codebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
M. Karpinska, K. Thai, K. Lo, T. Goyal, and M. Iyyer. One thousand and one pairs: A" novel" challenge for long-context language models. arXiv preprint arXiv:2406.16264, 2024
- [34]
- [35]
- [36]
- [37]
-
[38]
W.-C. Kwan, X. Zeng, Y . Jiang, Y . Wang, L. Li, L. Shang, X. Jiang, Q. Liu, and K.-F. Wong. Mt-eval: A multi-turn capabilities evaluation benchmark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20153–20177, 2024
work page 2024
- [39]
- [40]
- [41]
-
[42]
S. Lappin. An intensional parametric semantics for vague quantifiers. Linguistics and Philosophy, 23:599–620,
-
[43]
URL https://api.semanticscholar.org/CorpusID:170154611
- [44]
-
[45]
Y . Lee, K. Son, T. S. Kim, J. Kim, J. J. Y . Chung, E. Adar, and J. Kim. One vs. many: Comprehending accurate information from multiple erroneous and inconsistent ai generations. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , 2024. URL https://api.semanticscholar.org/CorpusID: 269635304
work page 2024
- [46]
-
[47]
M. Lewis, Y . Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V . Stoyanov, and L. Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[48]
R. Li, R. Li, B. Wang, and X. Du. Iqa-eval: Automatic evaluation of human-model interactive question answering. Advances in Neural Information Processing Systems, 37:109894–109921, 2024
work page 2024
- [49]
- [50]
-
[51]
A. Liu, Z. Wu, J. Michael, A. Suhr, P. West, A. Koller, S. Swayamdipta, N. A. Smith, and Y . Choi. We’re afraid language models aren’t modeling ambiguity. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 790–807, 2023. 15 LLMs Get Lost In Multi-Turn Conversation PREPRINT
work page 2023
-
[52]
N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024
work page 2024
- [53]
- [54]
-
[55]
C. Malaviya, J. C. Chang, D. Roth, M. Iyyer, M. Yatskar, and K. Lo. Contextualized evaluations: Taking the guesswork out of language model evaluations. arXiv preprint arXiv:2411.07237, 2024
-
[56]
L. Murakhovs’ ka, P. Laban, T. Xie, C. Xiong, and C.-S. Wu. Salespeople vs salesbot: Exploring the role of educational value in conversational recommender systems. arXiv preprint arXiv:2310.17749, 2023
-
[57]
M. Mylrea and N. Robinson. Artificial intelligence (ai) trust framework and maturity model: Applying an entropy lens to improve security, privacy, and ethical ai.Entropy, 25, 2023. URL https://api.semanticscholar.org/ CorpusID:263840323
work page 2023
-
[58]
T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y . Gu, S. Huang, M. Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
OpenAI o3 and o4-mini System Card — openai.com
OpenAI. OpenAI o3 and o4-mini System Card — openai.com. https://openai.com/index/ o3-o4-mini-system-card/ , 2025. [Accessed 08-05-2025]
work page 2025
-
[60]
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages 311–318, 2002
work page 2002
- [61]
- [62]
- [63]
-
[64]
L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[65]
C. Poelitz and N. McKenna. Synthetic clarification and correction dialogues about data-centric tasks–a teacher- student approach. arXiv preprint arXiv:2503.14167, 2025
-
[66]
M. Post and M. Junczys-Dowmunt. Escaping the sentence-level paradigm in machine translation. arXiv preprint arXiv:2304.12959, 2023
-
[67]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
work page 2019
- [68]
- [69]
- [70]
- [71]
-
[72]
URL https://api.semanticscholar.org/CorpusID:277244656
-
[73]
Y . Scherrer, J. Tiedemann, and S. Loáiciga. Analysing concatenation approaches to document-level nmt in two different domains. In Proceedings of the Third Workshop on Discourse in Machine Translation, Hong-Kong, Nov
-
[74]
Association for Computational Linguistics
- [75]
-
[76]
arXiv preprint arXiv:2501.17399 , year=
V . Sirdeshmukh, K. Deshpande, J. Mols, L. Jin, E.-Y . Cardona, D. Lee, J. Kritz, W. Primack, S. Yue, and C. Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. arXiv preprint arXiv:2501.17399, 2025
-
[77]
J. Southworth, K. Migliaccio, J. Glover, J. Glover, D. Reed, C. McCarty, J. Brendemuhl, and A. Thomas. Developing a model for ai across the curriculum: Transforming the higher education landscape via innovation in ai literacy. Computers and Education: Artificial Intelligence, 4:100127, 2023
work page 2023
-
[78]
Y . Sun, C. Liu, K. Zhou, J. Huang, R. Song, W. X. Zhao, F. Zhang, D. Zhang, and K. Gai. Parrot: Enhancing multi-turn instruction following for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9729–9750, 2024
work page 2024
-
[79]
G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [80]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.