Generalized Slow Roll for Tensors
The recent BICEP2 detection of degree-scale CMB B-mode polarization, coupled with a deficit of observed power in large-angle temperature anisotropy, suggests that the slow-roll parameter $\epsilon_H$, the fractional variation in the Hubble rate per efold, is both relatively large and may evolve from an even larger value on scales greater than the horizon at recombination. The relatively large tensor contribution implied also requires finite matching features in the tensor power spectrum for any scalar power spectrum feature proposed to explain anomalies in the temperature data. We extend the generalized slow-roll approach for computing power spectra, appropriate for such models where the slow-roll parameters vary, to tensor features where scalar features are large. This approach also generalizes the tensor-scalar consistency relation to be between the ratio of tensor and scalar sources and features in the two power spectra. Features in the tensor spectrum are generically suppressed by $\epsilon_H$ relative to those in the scalar spectrum and by the smoothness of the Hubble rate, which must obey covariant conservation of energy, versus its derivatives. Their detection in near-future CMB data would indicate a fast-roll period of inflation where $\epsilon_H$ approaches order unity, allowed but not required by inflationary explanations of temperature anomalies.
Forward citations
Cited by 32 Pith papers
-
AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
AsyncSparse presents BCSR and WCSR kernels that use TMA and warp specialization to accelerate SpMM, outperforming prior libraries by 1.47-6.24x on SuiteSparse and achieving 2.66x end-to-end speedup on Qwen2.5-7B at 90...
-
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments
HetRL delivers up to 9.17x higher throughput for LLM RL training on heterogeneous GPUs by using hybrid and ILP-based schedulers to solve a joint optimization problem over computation and data dependencies.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
-
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
-
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.
-
COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC
COMPASS formalizes HPC configuration questions as ML tasks on traces, quantifies recommendation trustworthiness, and delivers 65.93% lower average job turnaround time plus 80.93% lower node usage versus prior methods ...
-
PRAXIS: Integrating Program Analysis with Observability for Root-Cause Analysis
PRAXIS combines LLM-driven structured traversal of service dependency graphs and hammock-block program dependence graphs to improve root-cause analysis accuracy by up to 6.3x while cutting token consumption by 5.3x on...
-
Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.
-
PICO: Performance Insights for Collective Operations
PICO is a benchmarking framework for collective operations that decouples portable setup from platform execution, supplies reference MPI implementations, and shows default choices can be up to 5x slower with up to 44%...
-
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better performance on math and code benchmarks.
-
Tensor-Parallel Emulation of Quantum Circuits with Block-Cyclic Distributed Matrix Product States
Presents a tensor-parallel distributed MPS method with block-cyclic partitioning and pivoted QR that emulates Google's RCS benchmark at bond dimension 16384 on 32 nodes, claiming three orders of magnitude better accur...
-
Muon is Scalable for LLM Training
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
-
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention hea...
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.
-
torchtune: PyTorch native post-training library
torchtune is a modular PyTorch library for LLM post-training that delivers competitive performance and memory efficiency while supporting rapid research iteration through hackable components.
-
Entanglement-informed distributed wavefunction approach to scalable quantum many-body systems
Entanglement structure provides a natural distributed representation for quantum wavefunctions that reduces Hamiltonian applications to local contractions and enables near-linear scaling in simulations.
-
Selecting optimal unrestricted Hartree-Fock trial wavefunctions for phaseless auxiliary-field quantum Monte Carlo: Accuracy and limitations in modeling three iron-sulfur clusters
Chemical properties and symmetries, not variational energy, should guide UHF trial selection for ph-AFQMC on iron-sulfur clusters, yielding accurate energies despite suboptimal sampling and bias compensation.
-
Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics
A heterogeneous HPC diagnostics framework achieves 314x GPU speedup for 100k execution traces and identifies 32.28% potential speedup for GAMESS on Frontier via a tri-dimensional performance model.
-
Enhancing Performance Insight at Scale: A Heterogeneous Framework for Exascale Diagnostics
An accelerated hpcanalysis framework ingests performance data from 100,000 MPI ranks in 9.69 seconds, delivers up to 314x GPU speedup, maps network congestion on Aurora, and uses a new tri-dimensional model to identif...
-
Practical Formal Verification for MLIR Programs
A hybrid concrete-symbolic verifier checks MLIR program equivalence in linear time for a supported subset and is applied to AMD MLIR-AIR, MLIR-AIE, and mlir-opt on hundreds of benchmarks.
-
LASER: Learning Active Sensing for Continuum Field Reconstruction
LASER trains a reinforcement learning policy inside a latent dynamics model to choose sensor placements that improve reconstruction of continuum fields under sparsity.
-
NOMAD: Generating Embeddings for Massive Distributed Graphs
NOMAD delivers an MPI-based distributed implementation of graph embedding models achieving 10-100x median speedups over multi-threaded baselines and 35-76x over prior distributed systems on large clusters.
-
Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning
High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.
-
FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.
-
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
-
Remember what you did so you know what to do next
GPT-J with full action history achieves 3.5x improvement over RL in ScienceWorld and matches a two-stage system using 29x larger models.
-
Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips
On Grace Hopper superchips, energy efficiency during multimodal training is governed by data movement and overlap rather than compute utilization, and runtime-optimal configurations are not always energy-optimal.
-
AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control
AdaFRUGAL automates FRUGAL's static hyperparameters with linear decay on subspace ratio and loss-aware update frequency, delivering competitive accuracy with lower memory and faster training on C4, VietVault, and GLUE.
-
A New Broadcast Model for Several Network Topologies
BBS is a broadcast algorithm that maximizes node utilization through balanced saturation cycles, outperforming standard methods in simulations across multiple network topologies.
-
AI-Powered Surrogate Modelling for Multiscale Combustion: A Critical Review and Opportunities
A critical review of AI surrogate models for multiscale combustion that compares supervised, unsupervised, and physics-guided methods, identifies transferability and consistency challenges, and outlines future opportunities.