The Curious Case of Neural Text Degeneration
Canonical reference. 83% of citing Pith papers cite this work as background.
abstract
Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though the use of likelihood as a training objective leads to high quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive. In this paper, we reveal surprising distributional differences between human text and machine text. In addition, we find that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. Our findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better demonstrates the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
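The truncation step described in the abstract can be illustrated with a minimal sketch. It assumes the usual top-p formulation: keep the smallest set of highest-probability tokens whose cumulative mass reaches a threshold, renormalize, and sample from that set. The function names `nucleus_filter` and `nucleus_sample`, the `top_p` parameter name, and the toy distribution are illustrative assumptions, not code from the paper.

```python
import random


def nucleus_filter(probs, top_p):
    """Return {token_index: renormalized_prob} for the nucleus.

    The nucleus is the smallest set of highest-probability tokens whose
    cumulative probability mass is at least top_p (an assumed formulation).
    """
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Rank token indices from most to least probable.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, mass = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        mass += p
        if mass >= top_p:
            break  # everything after this point is the truncated tail
    # Renormalize so the kept probabilities sum to 1.
    return {index: p / mass for index, p in kept}


def nucleus_sample(probs, top_p, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy next-token distribution over a 5-token vocabulary (hypothetical values).
probs = [0.5, 0.3, 0.1, 0.05, 0.05]
nucleus = nucleus_filter(probs, top_p=0.85)
token = nucleus_sample(probs, top_p=0.85)
```

Because the cutoff depends on the shape of the distribution, the nucleus shrinks when the model is confident and grows when the distribution is flat, which is the sense in which the abstract calls it dynamic.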
citing papers explorer
- Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
  Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
- Large Language Diffusion Models
  LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
- TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
  Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
- PAL: Program-aided Language Models
  PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
- Language Models are Few-Shot Learners
  GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- HellaSwag: Can a Machine Really Finish Your Sentence?
  HellaSwag dataset shows state-of-the-art models fail commonsense inference tasks that humans solve easily, built via adversarial filtering of distractors.
- Probabilistic Attribution For Large Language Models
  Develops a model-agnostic attribution score as the log-ratio of conditional response probabilities with and without a marginalized prompt token, derived via Bayes inversion of next-token distributions, and relates it to conditional entropies.
- Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding
  Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.
- BOOKMARKS: Efficient Active Storyline Memory for Role-playing
  BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.
- StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
  StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
- Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
  Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
- PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
  PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token count by 55% on TIMIT.
- Ex Ante Evaluation of AI-Induced Idea Diversity Collapse
  Frontier LLMs generate creative ideas with excess population-level crowding below human-relative parity across tasks, but targeted generation protocols can reduce it.
- CP-SynC: Multi-Agent Zero-Shot Constraint Modeling in MiniZinc with Synthesized Checkers
  CP-SynC uses coordinated LLM agents to generate, validate via synthesized checkers, and select MiniZinc models from natural language, substantially outperforming baselines on a 100-problem benchmark.
- Toward a Principled Framework for Agent Safety Measurement
  BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
- Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
  Distinct Leaf Enumeration (DLE) replaces stochastic self-consistency sampling with deterministic traversal of a truncated decoding tree to enumerate distinct leaves, increasing coverage and reducing redundant computation while improving performance on math, coding, and reasoning benchmarks.
- Post-Selection Distributional Model Evaluation
  PS-DME is a new framework that controls post-selection false coverage rate for distributional KPI estimates via e-values and is provably more sample-efficient than data splitting under explicit conditions.
- Improving Sampling for Masked Diffusion Models via Information Gain
  Info-Gain Sampler improves MDM decoding by using bidirectional information gain to reduce cumulative uncertainty, outperforming greedy samplers on reasoning accuracy and creative writing tasks.
- Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models
  Top-W applies Wasserstein-regularized truncation on token-embedding geometry to create a closed-form optimal crop for LLM sampling that outperforms prior methods by up to 33.7% on GSM8K, GPQA, AlpacaEval, and MT-Bench.
- On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency
  Parallel thinking in LLMs suffers from overscaling where fixed global budgets waste samples; LanBo predicts per-sample budgets from latent states to raise utilization without hurting accuracy.
- OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
  OPT-Engine shows pure-text chain-of-thought reasoning in LLMs loses robustness as optimization complexity grows, external tools fix only local arithmetic, and solver-integrated methods are bottlenecked by automated constraint formulation.
- Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts
  The first HTR pipeline for Old Nepali manuscripts achieves 4.9% character error rate with released training code and scripts for low-resource historical scripts.
- BEAVER: An Efficient Deterministic LLM Verifier
  BEAVER is the first practical deterministic verifier that maintains sound probability bounds on LLM safety properties using token tries and frontier data structures, finding 2-3x more violations than sampling at 1/10 the compute.
- TSVer: A Benchmark for Fact Verification Against Time-Series Evidence
  TSVer is a new benchmark dataset for fact verification against time-series evidence, with 304 annotated real-world claims, 400 time series, verdicts, and justifications, plus baseline results showing current models struggle.
- Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
  Top-H decoding is a computationally efficient greedy algorithm for an entropy-constrained mass maximization problem that improves the creativity-coherence trade-off over min-p sampling in LLM text generation.
- Constrained Decoding for Safe Robot Navigation Foundation Models
  SafeDec uses constrained decoding to ensure autoregressive robot navigation foundation models generate actions that provably satisfy STL safety specifications under assumed dynamics.
- Moshi: a speech-text foundation model for real-time dialogue
  Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
- Chronos: Learning the Language of Time Series
  Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
- Self-Rewarding Language Models
  Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
- Learning to summarize from human feedback
  Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
- Convex Optimization for Alignment and Preference Learning on a Single GPU
  COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.
- STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
  STiTch refines LLM captions via embedding transition and uses set-to-set bidirectional transportation alignment to improve training-free zero-shot composed image retrieval.
- Staging by the Book: Automatic Sleep Stage Classification Using Scoring Rules
  A rule-based sleep staging method operationalizing AASM scoring rules achieves 60.5% agreement with human majority-vote consensus on 50 PSG recordings while providing epoch-level explanations.
- Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
  Combines LTL formal methods with LLMs for auditing, predictive monitoring, and runtime intervention on temporally extended behavioral constraints, outperforming LLM baselines and reducing violations.
- Asking Back: Interaction-Layer Antidistillation Watermarks
  Interaction-layer antidistillation watermarks use system-prompt-induced behavioral markers like explicit follow-up questions that transfer to distilled student models at 45-89% relative fidelity and can be audited via black-box LLM-as-judge queries.
- Why are language models less surprised than humans? Testing the Parse Multiplicity Mismatch Hypothesis
  Varying the number of simultaneous parses in RNNGs increases predicted garden-path effects but does not fully reconcile LM surprisal with human reading times.
- TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection
  TextSeal provides a localized, distortion-free LLM watermark that outperforms baselines in detection strength, remains effective in mixed human-AI text, preserves model performance, and transfers through distillation for provenance tracking.
- SOMA: Efficient Multi-turn LLM Serving via Small Language Model
  SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
- Adversarial SQL Injection Generation with LLM-Based Architectures
  RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more against AI/ML-based firewalls than rule-based ones.
- Annotations Mitigate Post-Training Mode Collapse
  Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
- Structured Recurrent Mixers for Massively Parallelized Sequence Generation
  Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.
- Continuous Latent Diffusion Language Model
  Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model
- Diversity in Large Language Models under Supervised Fine-Tuning
  TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
- Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective
  Greedy decoding is optimal for VQA under derived calibration conditions and outperforms stochastic sampling on benchmarks.
- On the Importance and Evaluation of Narrativity in Natural Language AI Explanations
  XAI explanations should be narratives with continuous structure, cause-effect, fluency and diversity, and new metrics are needed to evaluate this better than standard NLP scores.
- Learning to Control Summaries with Score Ranking
  A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.
- Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models
  Reward-weighted classifier-free guidance approximates Q-function policy improvement in autoregressive models, enabling test-time reward optimization and faster RL convergence via distillation.
- LoopGuard: Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention
  LoopGuard detects attention collapse loops during LLM decoding and prunes repetitive KV cache tail spans under fixed budget, cutting loop incidence by over 90 percentage points on the new LoopBench benchmark.
- A cross-species neural foundation model for end-to-end speech decoding
  A cross-species pretrained neural encoder combined with end-to-end training and audio LLMs reduces word error rate in neural speech decoding from 24.69% to 10.22% while aligning attempted and imagined speech.
- From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails
  Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.