Joint KL yields horizon-free approximation but an information-theoretic lower bound of order Omega(H) for estimation error in autoregressive learning, with matching computationally efficient upper bounds.
super hub Mixed citations
Advances in neural information processing systems , volume=
Mixed citation behavior. Most common role is background (64%).
hub tools
citation-role summary
citation-polarity summary
claims ledger
- background phenomenon might further improve the performance of the co-scientist as a tool for scientific discovery. Improved multimodal reasoning and tool-use capabilities.Some of the most interesting data in scientific publications is not written in text but may be encoded visually in figures and charts. However, even state-of-the-art frontier models may not comprehensively utilize such data with optimal reasoning [89] and the AI co-scientist system is unlikely to be an exception. Stronger benchmarks and
- other Bernoulli random variables with α(q) success probability. According to the Hoeffding's inequal- ity Hoeffding (1963) we have P(|m−α(q)N| ≥t)≤2 exp −2t2 N ,(8) wheret≥0. It implies P(|m−α(q)N| ≤t)≥1−2 exp −2t2 N ⇔P(−t≤m−α(q)N≤t)≥1−2 exp −2t2 N ⇒P(−t≤m−α(q)N)≥1−2 exp −2t2 N . By settingt= q N 2 log 2 ϵ from anyϵ >0, one may check, P m≥α(q)N− r N 2 log 1 ϵ ! ≥1−ϵ.(9) In order to ensurem≥N/2, we need to have α(q)N− r N 2 log 1 ϵ ≥ 1 2 N ⇒α(q)≥ 1 2 + r 1 2N log 1 ϵ (10) B Discussion T
- background performance would not be significantly degraded if projBx were replaced with a nonzero constant value more representative of the training distribution. To correct for this, in our causality experiments we ablate directions by replacing them with theirmeanvalues computed across a dataset, instead of zeroing them out. Specifically, to ablate a directionu, we use the formula: x′ =x+P u(x−x)(21) whereP u is the projection matrix foruand xis the mean representation. E. Static interpretability analysi
- background an All-Reduce operation so that they can use the identical gradient to update the model parameters. The All-Reduce operation accumulates distributed gradients (say Xi at i worker) from all workers (say P workers) using a reduction operation (typically sum or mean in training), which can be formally represented X=AllReduce(X 1, X2, ..., XP ) = PX i=1 Xi.(5) The gradients have the same dimensionality as the model weights, which means additional memory is required to store them for communication an
authors
co-cited works
representative citing papers
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
Model collapse occurs in structured interactive learning if and only if the directed interaction graph satisfies a specific topological condition, with finite-sample guarantees for linear regression and asymptotic results for M-estimators.
BrepForge factorizes B-rep synthesis into face-aware autoregressive wireframe composition followed by boundary-conditioned surface instantiation using learning-free geometric priors.
Proposes pointwise Riemannian Dimension from feature eigenvalues to derive tighter, representation-aware generalization bounds for deep networks in the nonlinear regime.
Two steps of gradient descent on first-layer weights in linear-width two-layer networks produce a spiked random matrix with floor(alpha2/(1/2-alpha1)) outliers, each a learned direction, and batch reuse allows capturing directions with information exponent exceeding one.
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.
Language-Induced Priors from LLMs guide source selection in cold-start domain adaptation through an EM algorithm, matching oracle MSE under a correct prior and remaining asymptotically consistent.
BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.
Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.
LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
SeBA is a joint-embedding framework that separates tabular data into two complementary views and aligns one view's representations to the nearest-neighbor structure of the other, improving feature-label relationships and achieving SOTA results in most benchmarks without relying on augmentations.
LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
Graphlets mined as structural tokens improve zero-shot inductive and transductive link prediction in knowledge graph foundation models across 51 diverse graphs.
LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
TCRTransBench provides a new benchmark with bidirectional TCR-peptide generation tasks, a large validated dataset, and metrics to evaluate neural models for immunological sequence modeling.
TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.
citing papers explorer
-
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.