hub

Advances in Neural Information Processing Systems , volume=

Toolformer: Language models can teach themselves to use tools , author=

14 Pith papers cite this work. Polarity classification is still indexing.

14 Pith papers citing it

browse 14 citing papers

hub tools

JSON dossier citing papers JSON

representative citing papers

Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.

Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.

GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

cs.RO · 2026-04-19 · unverdicted · novelty 7.0

GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.

GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration

cs.AI · 2026-03-08 · unverdicted · novelty 7.0

GraphBit is a DAG-based engine-orchestrated framework for agentic LLMs that achieves 67.6% accuracy with zero hallucinations on GAIA benchmarks.

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

Controlled experiments show structured reasoning traces and higher-density math-domain samples improve mathematical reasoning more than pure executable code, with internal routing patterns reflecting these data effects.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.

Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.

How Adversarial Environments Mislead Agentic AI?

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

cs.CL · 2025-09-17 · unverdicted · novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.

Capabilities of Gemini Models in Medicine

cs.AI · 2024-04-29 · unverdicted · novelty 6.0

Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.

TIDE: Every Layer Knows the Token Beneath the Context

cs.CL · 2026-05-07 · unverdicted · novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

cs.CR · 2026-05-07 · unverdicted · novelty 5.0

SafeHarbor introduces a hierarchical memory-augmented guardrail with adversarial rule extraction and entropy-driven self-evolution to balance safety and utility in LLM agents.

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

cs.CL · 2026-05-19

citing papers explorer

Showing 14 of 14 citing papers.

Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain cs.CL · 2026-05-09 · unverdicted · none · ref 42
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms cs.CL · 2026-04-21 · unverdicted · none · ref 17
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming cs.CL · 2026-04-21 · unverdicted · none · ref 18
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.
GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning cs.RO · 2026-04-19 · unverdicted · none · ref 55
GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration cs.AI · 2026-03-08 · unverdicted · none · ref 16
GraphBit is a DAG-based engine-orchestrated framework for agentic LLMs that achieves 67.6% accuracy with zero hallucinations on GAIA benchmarks.
What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code cs.AI · 2026-05-19 · unverdicted · none · ref 6
Controlled experiments show structured reasoning traces and higher-density math-domain samples improve mathematical reasoning more than pure executable code, with internal routing patterns reflecting these data effects.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents cs.CL · 2026-05-14 · unverdicted · none · ref 299
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation cs.RO · 2026-05-11 · unverdicted · none · ref 47
SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.
How Adversarial Environments Mislead Agentic AI? cs.AI · 2026-04-20 · unverdicted · none · ref 1
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution cs.CL · 2025-09-17 · unverdicted · none · ref 58
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
Capabilities of Gemini Models in Medicine cs.AI · 2024-04-29 · unverdicted · none · ref 230
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
TIDE: Every Layer Knows the Token Beneath the Context cs.CL · 2026-05-07 · unverdicted · none · ref 89
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety cs.CR · 2026-05-07 · unverdicted · none · ref 2
SafeHarbor introduces a hierarchical memory-augmented guardrail with adversarial rule extraction and entropy-driven self-evolution to balance safety and utility in LLM agents.
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning cs.CL · 2026-05-19 · unreviewed · ref 90

Advances in Neural Information Processing Systems , volume=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer