Model-adaptive tool necessity shows 26-54% mismatch with actual tool calls across LLMs, driven by nearly orthogonal hidden-state signals for cognition versus action.
hub
Talm: Tool augmente d language models
19 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
BitTokens represent numbers as single tokens via IEEE 754 binary format, allowing small language models to learn basic arithmetic algorithms nearly perfectly.
ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.
RooAgent provides an LLM agent interface that translates natural-language prompts into calls to PyROOT analysis functions for high energy physics tasks, with support for multiple AI backends and tested on ZH simulations and ATLAS open data.
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
NaviAgent decouples task planning from tool execution via a Tool World Navigation Model graph to improve scalability and success rates in LLM agents handling large tool ecosystems.
Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
LLM agents exhibit temporal blindness, achieving no better than 65% normalized alignment with human preferences on tool-use decisions across time-sensitive scenarios in the new TicToc dataset.
LLMs function as accurate semantic processors for conditionals but do not replicate the pragmatic inferences that define human reasoning.
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
citing papers explorer
-
Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
Model-adaptive tool necessity shows 26-54% mismatch with actual tool calls across LLMs, driven by nearly orthogonal hidden-state signals for cognition versus action.
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
-
Efficient numeracy in language models through single-token number embeddings
BitTokens represent numbers as single tokens via IEEE 754 binary format, allowing small language models to learn basic arithmetic algorithms nearly perfectly.
-
ViperGPT: Visual Inference via Python Execution for Reasoning
ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.
-
RooAgent: An LLM Agent for Root-Based High Energy Physics Analysis
RooAgent provides an LLM agent interface that translates natural-language prompts into calls to PyROOT analysis functions for high energy physics tasks, with support for multiple AI backends and tested on ZH simulations and ATLAS open data.
-
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
-
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.
-
CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
-
NaviAgent: Bilevel Planning on Tool Navigation Graph for Large-Scale Orchestration
NaviAgent decouples task planning from tool execution via a Tool World Navigation Model graph to improve scalability and success rates in LLM agents handling large tool ecosystems.
-
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.
-
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
-
Cognitive Architectures for Language Agents
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
-
ART: Automatic multi-step reasoning and tool-use for large language models
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
-
Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception
LLM agents exhibit temporal blindness, achieving no better than 65% normalized alignment with human preferences on tool-use decisions across time-sensitive scenarios in the new TicToc dataset.
-
Tracing the ongoing emergence of human-like reasoning in Large Language Models
LLMs function as accurate semantic processors for conditionals but do not replicate the pragmatic inferences that define human reasoning.
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
-
Bridging Language Models and Financial Analysis
A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.