AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
hub
Internet-augmented dialogue generation
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.
Framework acquires descriptive text for entities via web and LLMs to train classifiers from names and labels alone, achieving 82.3% and 72.9% macro F1 on SIC code and healthcare taxonomy classification tasks.
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
citing papers explorer
-
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
-
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Improving language models by retrieving from trillions of tokens
RETRO matches GPT-3 and Jurassic-1 performance on the Pile benchmark using 25 times fewer parameters by conditioning on retrieved chunks from a 2-trillion-token database.
-
Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks
Framework acquires descriptive text for entities via web and LLMs to train classifiers from names and labels alone, achieving 82.3% and 72.9% macro F1 on SIC code and healthcare taxonomy classification tasks.
-
Gorilla: Large Language Model Connected with Massive APIs
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
-
LaMDA: Language Models for Dialog Applications
LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.