{"total":27,"items":[{"citing_arxiv_id":"2607.00276","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds","primary_cat":"cs.LG","submitted_at":"2026-06-30T23:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces an auditable four-stage diagnostic for LLM physics reasoning in novel frameworks and applies it to three parallel worlds, yielding pass rates of 6/15, 6/15, and 0/15 on frontier models with noted qualitative-quantitative asymmetry.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13165","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes","primary_cat":"cs.CL","submitted_at":"2026-05-13T08:28:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08904","ref_index":85,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04449","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking","primary_cat":"cs.CL","submitted_at":"2026-05-06T03:25:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prior SOTA methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"model (AWS Documentation 2023), represented as zi = Embed(fcombine(Dsys i−1, Duseri, Si)), where fcombine constructs a structured text representation combining the previous sys- tem response, current user utterance, and identified slots. For a given dialogue turn t, we retrieve the most relevant exam- ples by computing semantic similarity: Et =TopK(e∈ E: cos(z e,z t)> τ sim),(4) where zt represents the embedding of the current turn and τsim is a similarity threshold. These embeddings are indexed in ChromaDB for efficient similarity-based retrieval. Each retrieved example e∈E t includes its ground truth slot si, value vj pairs Ve ={(s j, vj)}me j=1, providing demonstration instances for the subsequent value generation methods."},{"citing_arxiv_id":"2605.04243","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA","primary_cat":"cs.AI","submitted_at":"2026-05-05T19:30:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2305.11738, 2023. [31] David Harel. Dynamic logic for programs.Information and Control, 1979. [32] Peter Henderson et al. On the reproducibility of neural network training.arXiv preprint arXiv:1709.06560, 2017. [33] Dan Hendrycks et al. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021. [34] Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey.arXiv preprint arXiv:2212.10403, 2022. [35] Jie Huang, Xinyun Chen, Swaroop Yu, et al. Large language models cannot self-correct reasoning yet.arXiv preprint arXiv:2310.01798, 2023. [36] Eero Hyvönen. Reasoning with interval constraints.Artificial Intelligence, 58(1-3):139-173,"},{"citing_arxiv_id":"2604.04942","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-03-13T13:01:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TDA-RC embeds topological patterns from multi-round reasoning into CoT via persistent homology and a repair agent, yielding better accuracy-efficiency trade-offs than ToT or GoT on tested datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.13262","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning","primary_cat":"cs.AI","submitted_at":"2026-01-19T17:51:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.12538","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agentic Reasoning for Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-01-18T18:58:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Goal Orientation prompt based ↔ explicit goal reactive ↔ planning This transition marks a conceptual shift: reasoning no longer scales through static capacity, but through structured interaction that enables planning, adaptation, and collaboration across time and tasks. 2.1. Positioning Our Survey While several recent surveys have examined LLM reasoning or agent architectures [51, 52, 53, 54, 55, 56, 57, 58, 59], our work focuses specifically onagentic reasoningas a unified paradigm for understanding reasoning as interaction. We position this survey at the intersection of model-centric reasoning and system- level intelligence, aiming to bridge prior discussions on reasoning mechanisms and agent architectures. Relation to LLM Reasoning Surveys."},{"citing_arxiv_id":"2511.20857","ref_index":192,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory","primary_cat":"cs.CL","submitted_at":"2025-11-25T21:08:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.04978","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI","primary_cat":"cs.AI","submitted_at":"2025-10-06T16:16:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"gies across domains, extending beyond recognizing input features toward generating and interacting with realistic physical scenarios. Scope Comparison and Contributions.As summa- rized in Table 1, existing surveys have examined individual dimensions of physical understanding in isolation, addressing perception [40], [41], [42], [43], [44], reasoning [45], [46], [47], [48], [49], mod- eling [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], and interaction [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71] as separate research areas without examining the synergistic connections between them. Our survey uniquely focuses on the evolutionary trajectory that unites these four capa- bilities into a coherent paradigm, analyzing how"},{"citing_arxiv_id":"2509.24765","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Semantic-Aware Logical Reasoning via a Semiotic Framework","primary_cat":"cs.AI","submitted_at":"2025-09-29T13:31:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.13351","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks","primary_cat":"cs.CL","submitted_at":"2025-06-16T10:43:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Direct Reasoning Optimization applies token-level Reasoning Reflection Reward (R3) focused on high-variance tokens and rubric-gating constraints to improve sample-efficient RL training of LLMs on unverifiable tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.15564","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research","primary_cat":"cs.SE","submitted_at":"2025-04-22T03:33:57+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OpenClassGen supplies 324,843 real-world Python classes with self-contained skeletons and static metrics to support LLM class generation research and evaluation.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"view related work in LLM-based code generation benchmarks, re- spectively. Finally, Section 7 concludes the paper and sugg ests di- rections for future research. 2 Dataset Curation In this section, we describe the process followed to curate the dataset. • Step 1 - Project Selection: To construct a comprehensive class- level dataset, we began by identifying all projects from theCode- SearchNet [26] dataset. CodeSearchNet consists of functions rep- resented as (comment, code) pairs extracted from real-life open- source projects on GitHub. A comment refers to a top-level func- tion docstring [1], and thecode refers to the corresponding human- written function. Since our goal is to extract classes from r eal- world software, we selected CodeSearchNet as our source of projects."},{"citing_arxiv_id":"2503.21460","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Model Agent: A Survey on Methodology, Applications and Challenges","primary_cat":"cs.CL","submitted_at":"2025-03-27T12:50:17+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Compared to traditional agent systems [2], LLM-based agents have achieved generational across multiple dimen- sions, including knowledge sources [3], generalization ca- pabilities [4], and interaction modalities [5]. Today's agents represent a qualitative leap driven by the convergence of three key developments: ❶ unprecedented reasoning capabil- ities of LLMs [6], ❷ advancements in tool manipulation and environmental interaction [7], and ❸ sophisticated memory architectures that support longitudinal experience accumu- lation [8], [9]. This convergence has transformed theoretical constructs into practical systems, increasingly blurring the boundary between assistants and collaborators. This shift fundamentally arises from LLMs' role as general-purpose task"},{"citing_arxiv_id":"2503.10615","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization","primary_cat":"cs.CV","submitted_at":"2025-03-13T17:56:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.02871","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning","primary_cat":"cs.CL","submitted_at":"2025-02-05T04:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.18279","ref_index":100,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Model-Brained GUI Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2024-11-27T12:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tions for novel tasks, demonstrating strong generalization skills [89]. This allows LLMs to effectively comprehend user requests directed at GUI agents and to follow predefined objectives accurately. 3) Long-Term Reasoning [99]: LLMs possess the ability to plan and solve complex tasks by breaking them down into manageable steps, often employing techniques like chain-of-thought (CoT) reasoning [100], [101]. This capability is essential for GUI agents, as many tasks require multiple steps and a robust planning framework. 4) Code Generation and Tool Utilization [102]: LLMs excel in generating code and utilizing various tools, such as APIs [13]. This expertise is vital, as code and tools form the essential toolkit for GUI agents to interact with"},{"citing_arxiv_id":"2410.04047","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis","primary_cat":"cs.LG","submitted_at":"2024-10-05T06:04:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TS-Reasoner is a domain-oriented agent using LLMs, computational tools, and error feedback for multi-step time series inference, showing better performance than general LLMs on understanding and reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.10038","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On the Diagram of Thought","primary_cat":"cs.CL","submitted_at":"2024-09-16T07:01:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Diagram of Thought (DoT) is a controller-light framework in which an LLM builds typed reasoning diagrams validated online and interpreted as diagrams in a slice topos whose synthesis is a finite limit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.00515","ref_index":107,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Large Language Models for Code Generation","primary_cat":"cs.CL","submitted_at":"2024-06-01T17:48:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"CommitPackFT [187], Code Alpaca[43], OA-Leet[63], OSS-Instruct[278], Evol-instruction[225] Self-OSS-Instruct-SC2-Exec-Filter[304] Benchmarks General HumanEval[48], HumanEval+[162], HumanEvalPack[187], MBPP[17] MBPP+[162], CoNaLa[297], Spider[300], CONCODE[113], ODEX[273] CoderEval[299], ReCode[263], StudentEval[19] Competitions APPS[95], CodeContests[151] Data Science DSP[41], DS-1000[136], ExeDS[107] Multilingual MBXP[16], Multilingual HumanEval[16], HumanEval-X[321], MultiPL-E[39] xCodeEval[128] Reasoning MathQA-X[16], MathQA-Python[17], GSM8K[58], GSM-HARD[79] Repository RepoEval[309], Stack-Repo[239], Repobench[167], EvoCodeBench[144] SWE-bench[123], CrossCodeEval[68], SketchEval[308] Recent Advances Data Synthesis (Sec. 5.2) Self-Instruct [268], Evol-Instruct [289], Phi-1[84], Code Alpaca[43], WizardCoder[173]"},{"citing_arxiv_id":"2405.16755","ref_index":77,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CHESS: Contextual Harnessing for Efficient SQL Synthesis","primary_cat":"cs.LG","submitted_at":"2024-05-27T01:54:16+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.06196","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2024-02-09T05:37:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"can follow human instructions of complex new tasks per- forming multi-step reasoning when needed. LLMs are thus becoming the basic building block for the development of general-purpose AI agents or artificial general intelligence (AGI). As the field of LLMs is moving fast, with new findings, models and techniques being published in a matter of months or weeks [7], [8], [9], [10], [11], AI researchers and practi- tioners often find it challenging to figure out the best recipes to build LLM-powered AI systems for their tasks. This paper gives a timely survey of the recent advances on LLMs. We hope this survey will prove a valuable and accessible resource for students, researchers and developers. LLMs are large-scale, pre-trained, statistical language mod-"},{"citing_arxiv_id":"2308.11432","ref_index":178,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Large Language Model based Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2023-08-22T13:30:37+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Aligning LLMs with human intelligence is an active area of research to address concerns such as biases and illusions. [177] have compiled existing techniques for human alignment, including data collection and model training methodologies. Reasoning is a crucial aspect of intelligence, in- fluencing decision-making, problem-solving, and other cognitive abilities. [178] presents the cur- rent state of research on LLMs' reasoning abili- ties, exploring approaches to improve and evaluate their reasoning skills. [179] propose that language models can be enhanced with reasoning capabilities and the ability to utilize tools, termed Augmented Language Models (ALMs). They conduct a com- prehensive review of the latest advancements in"},{"citing_arxiv_id":"2307.06435","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Comprehensive Overview of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-07-12T20:01:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"methods [16, 97] train them on reasoning datasets. We discuss various prompting techniques for reasoning below. Chain-of-Thought (CoT): A special case of prompting where demonstrations contain reasoning information aggregated with inputs and outputs so that the model generates outcomes with step-by-step reasoning. More details on CoT prompts are avail- able in [55, 103, 101]. Self-Consistency: Improves CoT performance by generat- ing multiple responses and selecting the most frequent an- swer [104]. Tree-of-Thought (ToT): Explores multiple reasoning paths with possibilities to look ahead and backtrack for problem- solving [105]. Single-Turn Instructions: In this prompting setup, LLMs are queried only once with all the relevant information in the"},{"citing_arxiv_id":"2305.14992","ref_index":120,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reasoning with Language Model is Planning with World Model","primary_cat":"cs.CL","submitted_at":"2023-05-24T10:28:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.18223","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-03-31T17:28:46+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2302.00923","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Multimodal Chain-of-Thought Reasoning in Language Models","primary_cat":"cs.CL","submitted_at":"2023-02-02T07:51:19+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}