In autoregressive learning, the joint KL divergence yields horizon-free approximation error, but estimation error faces an information-theoretic lower bound of order Omega(H), which is matched by computationally efficient upper bounds.
Canonical reference. 71% of citing Pith papers cite this work as background.
representative citing papers
WildChat releases a dataset of 1 million ChatGPT conversations with timestamps, demographics, and headers, claimed to be the most diverse and multilingual such resource available.
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
Proposes pointwise Riemannian Dimension from feature eigenvalues to derive tighter, representation-aware generalization bounds for deep networks in the nonlinear regime.
Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.
M³C replaces the hard hyperparameter optimization with a sequence of simpler problems using a majorant for the log-determinant approximated via Monte Carlo, with proven high-probability convergence to a critical point under assumptions.
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.
ASAP amortizes Sinkhorn-based doubly-stochastic attention by learning a parametric map from 1D potentials to the Sinkhorn dual and reconstructing the plan via two-sided entropic c-transform, delivering 5.3x faster inference at matched accuracy.
Port-Hamiltonian neural networks extended to PDEs recover the Hamiltonian and dissipation of nonlinear string dynamics from data and outperform non-physics-informed baselines.
CAWI replaces standard random initialization of input-to-hidden weights in randomized neural networks with samples drawn from a data-fitted copula that preserves observed feature dependencies, yielding consistent accuracy gains on 83 classification benchmarks.
SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships and achieving SOTA results in most benchmarks without relying on augmentations.
Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
TRACE creates valid conformal prediction sets for complex generative models by scoring outputs via averaged denoising or velocity errors along stochastic transport paths instead of likelihoods.
ABGD parametrizes piecewise linear functions as difference of max-affine functions and converges linearly to an epsilon-accurate solution with O(d max(sigma/epsilon,1)^2) samples under sub-Gaussian noise, which is minimax optimal up to logs.
K-DSM uses per-feature kurtosis to set noise scales in DSM, enabling effective single-scale anomaly detection on tabular benchmarks in both semi-supervised and unsupervised settings.
LAPRAS uses predictions to answer likely queries with the offline Matrix Mechanism and paces residual budget for unpredicted queries via unbiased stopping-time estimation from the first few unexpected arrivals, achieving near-offline utility when overlap is high.
FieryGS integrates LLM-based material reasoning, volumetric combustion simulation, and a unified renderer with 3D Gaussian Splatting to generate physically plausible and user-controllable fire in in-the-wild scenes.
Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
FaceParts performs unsupervised segmentation of facial features in Gaussian Splatting avatars and supports precise editing and cross-avatar part transfer using feature disentanglement, density clustering, and FLAME anchoring.
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
citing papers explorer
- SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data
SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships and achieving SOTA results in most benchmarks without relying on augmentations.
- FieryGS: In-the-Wild Fire Synthesis with Physics-Integrated Gaussian Splatting
FieryGS integrates LLM-based material reasoning, volumetric combustion simulation, and a unified renderer with 3D Gaussian Splatting to generate physically plausible and user-controllable fire in in-the-wild scenes.
- LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
- DreamFusion: Text-to-3D using 2D Diffusion
Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.
- Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics
SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.
- HORST: Composing Optimizer Geometries for Sparse Transformer Training
HORST uses non-commutative operator composition and a hyperbolic mirror map to combine stability from adaptive optimizers with L1 sparsity bias, outperforming AdamW across sparsity levels on vision and language tasks.
- TabKDE: Simple and Scalable Tabular Data Generation with Kernel Density Estimates
TabKDE generates synthetic tabular data using copula transformations followed by kernel density estimation, matching prior accuracy with negligible training time and reduced storage via coresets.
- From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper
Sparse interleaved multi-view inputs with DenseWarper outperform traditional dense simultaneous multi-view methods for 3D human pose estimation on Human3.6M and MPI-INF-3DHP datasets.
- Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm
Theoretical analysis of continual factual knowledge acquisition shows data replay stabilizes pretrained knowledge by shifting convergence dynamics while regularization only slows forgetting, leading to the STOC method for attention-based replay selection.
- Learning the Preferences of a Learning Agent
Formalizes preference learning from a no-regret or Boltzmann-converging learner with theoretical guarantees or impossibility results for IRL algorithms.
- Kinematics-Driven Gaussian Shape Deformation for Blurry Monocular Dynamic Scenes
Kinematics-GS reparameterizes Gaussian shapes along motion trajectories with a kinematic prior to reconstruct dynamic 3D scenes from blurry monocular videos by separating dynamic and static components and using coarse-to-fine optimization.
- CONTRA: Conformal Prediction Region via Normalizing Flow Transformation
CONTRA generates sharp multi-dimensional conformal prediction regions by defining nonconformity scores as distances from the center in the latent space of a normalizing flow.
- Local Intrinsic Dimension Unveils Hallucinations in Diffusion Models
Hallucinations in diffusion models are driven by local intrinsic dimension instabilities on the manifold, which Intrinsic Quenching corrects by deflating it.
- Distributional Off-Policy Evaluation with Deep Quantile Process Regression
DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
- Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
- Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
- Is Conditional Generative Modeling all you need for Decision-Making?
Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
- Information theoretic underpinning of self-supervised learning by clustering
SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
- MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification
MaskTab is a masked pretraining method for industrial tabular data that delivers measurable gains in classification AUC and KS metrics while enabling effective distillation to smaller models.
- Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari
Transformer world models on Atari exhibit game-specific scaling regimes, but joint training on 26 environments produces consistent monotonic gains that improve downstream control policies to a median normalized score of 0.770.
- Evaluating Federated Learning approaches for mammography under breast density heterogeneity
FedAvg matches centralized training accuracy on mammography data split by breast density heterogeneity, showing standard FL can handle this clinical variation without special fixes.
- Beyond Toy Benchmarks: A Systematic Evaluation of OOD Detection Methods For Plant Pathology Classification
Energy-based fine-tuning outperforms other OOD detection methods on the real-world Plant Pathology 2021 dataset, improving detection over softmax while maintaining in-distribution accuracy.
- From Holo Pockets to Electron Density: GPT-style Drug Design with Density