MMMU-Pro is a stricter multimodal benchmark that removes questions solvable from text alone, augments the answer options, and requires reading text from images, yielding model scores 16.8% to 26.9% lower than on MMMU.
arXiv preprint arXiv:2406.16838
9 Pith papers cite this work.
citation-role summary: background 2
citation-polarity summary: background 2
citing papers explorer
- MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
  MMMU-Pro is a stricter multimodal benchmark that removes questions solvable from text alone, augments the answer options, and requires reading text from images, yielding model scores 16.8% to 26.9% lower than on MMMU.
- The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives
  The choice of closeness measure in diffusion reward alignment determines the computational primitives and tractable reward classes, with linear exponential tilts sufficing for KL with convex rewards and proximal oracles for Wasserstein with concave or low-dimensional Lipschitz rewards.
- Beyond Static Best-of-N: Bayesian List-wise Alignment for LLM-based Recommendation
  BLADE uses Bayesian list-wise alignment with dynamic estimation to create a self-evolving target that overcomes limitations of static references in LLM-based recommendation, yielding sustained gains in ranking and complex metrics.
- POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
  POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
- Learning and Enforcing Context-Sensitive Control for LLMs
  A framework learns context-sensitive constraints automatically from LLM outputs to enforce perfect adherence during generation without manual specification.
- Constrained Decoding for Safe Robot Navigation Foundation Models
  SafeDec uses constrained decoding to ensure autoregressive robot navigation foundation models generate actions that provably satisfy STL safety specifications under assumed dynamics.
- The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
  Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.
- Your Model Diversity, Not Method, Determines Reasoning Strategy
  The optimal reasoning strategy for LLMs depends on the model's diversity profile rather than the exploration method itself.
- Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space
  Inclusion-of-Thoughts purifies multiple-choice questions by keeping only plausible options, stabilizing LLM preferences and improving chain-of-thought results on reasoning benchmarks.