Title resolution pending
12 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 12
roles: other 1
polarities: unclear 1
citing papers explorer
- The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits
  Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.
- LOFT: Low-Rank Orthogonal Fine-Tuning via Task-Aware Support Selection
  LOFT unifies orthogonal PEFT by treating adaptation as low-rank subspace rotation and adds task-aware support selection that improves efficiency under fixed budgets.
- ProactBench: Beyond What The User Asked For
  ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
- Design and Report Benchmarks for Knowledge Work
  Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
- An Efficient Streaming Video Understanding Framework with Agentic Control
  R3-Streaming uses cascaded control, age-aware memory forgetting, and TB-GRPO reinforcement learning to reach SOTA scores on streaming video benchmarks while cutting visual token usage by 95-96%.
- To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model
  MMGuard generates unlearnable multimodal examples via perturbations that exploit LVLM optimization shortcuts and disrupt cross-modal bindings, providing robust protection against unauthorized fine-tuning across threat models.
- Principles and Guidelines for Randomized Controlled Trials in AI Evaluation
  The authors adapt established RCT validity principles from other fields into a standardized framework with 33 guidelines tailored to AI evaluation contexts.
- torchtune: PyTorch native post-training library
  torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.
- LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation
  LLARS is a new integrated platform that combines collaborative prompt authoring, cost-controlled batch generation, and hybrid evaluation to help domain experts and developers jointly build and assess LLM systems.
- Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
  FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
- Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation
  RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
  Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.