98 Pith papers cite this work.
citing papers explorer
-
Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
TIDE enables the first cross-architecture distillation of dLLMs, improving a 0.6B student by 1.53 average points over baselines when trained from 8B dense and 16B MoE teachers.
-
Adaptive Stopping for Multi-Turn LLM Reasoning
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
-
RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
-
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
-
VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation
VLM judges exhibit task-dependent uncertainty in their scores, with conformal prediction revealing wide intervals for complex tasks and a decoupling between good ranking performance and poor absolute scoring reliability.
-
GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs
GraphPlanner augments multi-agent LLM routing with a heterogeneous graph memory and RL-optimized MDP workflow generation, delivering up to 9.3% higher accuracy and over 99% lower GPU cost than prior routers while supporting zero-shot generalization.
-
Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision
Humans show broad weak directional confusions while DNNs show sparse strong collapses; these structures shift rate-distortion geometry differently and reveal divergent inductive biases.
-
The Consensus Trap: Rescuing Multi-Agent LLMs from Adversarial Majorities via Token-Level Collaboration
Token-level interleaving in multi-agent LLMs allows honest agents to overpower adversarial majorities through dynamic logic chaining, unlike brittle response-level majority voting.
-
Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery
LLM chain-of-thought filtering of Mamba saliency features on TCGA-BRCA data produces a 17-gene set with AUC 0.927 that beats both the raw 50-gene saliency list and a 5000-gene baseline while using far fewer features, though it misses many known BRCA genes.
-
Reinforcement Learning via Value Gradient Flow
VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.
-
CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation
CodeSpecBench shows LLMs achieve at most 20.2% pass rate on repository-level executable behavioral specification generation, revealing that strong code generation does not imply deep semantic understanding.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
Attention Flows: Tracing LLM Conceptual Engagement via Story Summaries
LLM-written summaries of novels emphasize endings more than human-written ones, as measured by aligning summary sentences to the chapters they reference.
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge
CresOWLve benchmark shows frontier LLMs retrieve relevant real-world facts but struggle to form creative connections, with up to 17% lower performance on creative questions than factual ones.
-
SAQ: Stabilizer-Aware Quantum Error Correction Decoder
A dual-stream transformer decoder with constraint-aware post-processing achieves error thresholds of 10.99% and 18.6% on toric codes, approaching ML bounds while scaling linearly.
-
MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
MTR-DuplexBench is a multi-round benchmark for full-duplex speech language models that evaluates turn consistency, dialogue quality, instruction following, and safety.
-
Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO
Malicious nodes in decentralized GRPO can poison models with up to 100% success in 50 iterations on math and coding tasks, but logit probability checks and LLM judges filter most poisoned completions.
-
Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.
-
LayerNorm Induces Recency Bias in Transformer Decoders
Stacked causal self-attention combined with LayerNorm induces recency bias in Transformer decoders, reversing the earlier-token bias seen in attention alone.
-
Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification
MoTIF adds temporal self-attention and automatic VLM-based concept discovery to concept bottleneck models for interpretable video classification, showing gains over prior global CBMs on benchmarks.
-
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling
Prefix-RFT blends SFT and RFT via prefix sampling from demonstrations to outperform standalone SFT, RFT, and mixed-policy baselines on math reasoning problems.
-
Tighter Performance Theory of FedExProx
A new analysis framework yields tighter linear convergence for FedExProx on non-strongly convex quadratics and PL functions, proving outperformance over GD once communication costs are counted.
-
Power-Softmax: Towards Secure LLM Inference over Encrypted Data
Power-Softmax is a new HE-compatible attention variant that permits training and inference of billion-parameter polynomial LLMs with performance matching standard transformers.
-
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
-
LoRA: Low-Rank Adaptation of Large Language Models
Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
-
Learning to Forget: Continual Learning with Adaptive Weight Decay
FADE adapts per-parameter weight decay rates online via approximate meta-gradient descent to improve controlled forgetting over fixed decay in online tracking and streaming classification.
-
A paradox of AI fluency
Fluent AI users adopt an active, iterative collaboration mode that produces more visible failures but better recovery and success on hard tasks, whereas novices experience more invisible failures from passive use.
-
Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective
Strong-model variance is the strongest empirical predictor of blind-spot deception in weak-to-strong alignment, backed by a misfit-based upper bound on population risk.
-
Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors
Explicit prompt baselines cut NLI contradictions by up to 42.6% with zero training, while learned gated context projectors deliver a 34% reduction in planning-stage contradictions and 50% higher cross-stage entailment on DriveLM-nuScenes.
-
Faster LLM Inference via Sequential Monte Carlo
SMC-SD replaces rejection sampling with particle resampling in speculative decoding to deliver 2.36x speedup over standard SD and 5.2x over autoregressive decoding while staying within 3% of target accuracy.
-
ProtoTTA: Prototype-Guided Test-Time Adaptation
ProtoTTA is a test-time adaptation framework for prototype models that uses intermediate prototype signals and entropy minimization to improve robustness and semantic focus under distribution shifts.
-
Quantifying Cross-Query Contradictions in Multi-Query LLM Reasoning
A benchmark and solver-augmented method reduces cross-query contradictions in LLMs (SetCons from 0.56 to 0.94) while preserving per-query accuracy across four domains.
-
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.
-
CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts
CodeQuant unifies learnable rotation smoothing with cluster-centroid absorption of outliers to reduce quantization error in low-precision MoE models, reporting up to 4.15x speedup and higher accuracy than prior PTQ methods.
-
Is There Knowledge Left to Extract? Evidence of Fragility in Medically Fine-Tuned Vision-Language Models
Medically fine-tuned VLMs exhibit fragile performance that degrades with task difficulty and shows no reliable advantage over general models, with high sensitivity to prompt changes.
-
Perception Is All You Need: A Neuroscience Framework for Low Cost Sensorless Gaze in HRI
A passive cardboard robot design exploits the brain's convexity prior in face perception to create the illusion of mutual gaze from any angle without sensors or computation.
-
ExecTune: Effective Steering of Black-Box LLMs with Guide Models
ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on math and code tasks.
-
Vintix II: Decision Pre-Trained Transformer is a Scalable In-Context Reinforcement Learner
Scaling Decision Pre-Trained Transformer with Flow Matching on hundreds of tasks yields an agent with improved generalization in in-context reinforcement learning.
-
Cheap Talk, Empty Promise: Frontier LLMs easily break public promises for self-interest
LLMs deviate from announced actions in 56.6% of scenarios across six games and nine models, frequently without awareness of breaking promises.
-
Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations: rankings invert, no model is universally best, and inter-judge agreement is low.
-
When AI Agents Disagree Like Humans: Reasoning Trace Analysis for Human-AI Collaborative Moderation
Agent verdict agreement in multi-agent hate speech moderation correlates with lower human annotator disagreement, with large effect sizes, motivating uncertainty-surfacing designs over consensus-seeking.
-
Align then Train: Efficient Retrieval Adapter Learning
A two-stage adapter method aligns query and document embedding spaces to improve dense retrieval for complex queries using lightweight encoders and few labels.
-
Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments
LLM alignments redirect stereotypes to implicit tasks instead of removing them, producing bias score divergences up to 0.43 across explicit and implicit probes in audits of seven models.
-
Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models
Scene Dynamic Field integrates physics simulators into MLLM fine-tuning to boost intuitive physics understanding, delivering up to 20.7% gains on fluid tasks with generalization to unseen domains.
-
Robust LLM Performance Certification via Constrained Maximum Likelihood Estimation
Constrained MLE fuses human calibration data, LLM judge labels, and judge performance bounds to yield accurate low-variance estimates of LLM failure rates.
-
RAST-MoE-RL: A Regime-Aware Spatio-Temporal MoE Framework for Deep Reinforcement Learning in Ride-Hailing
RAST-MoE-RL equips RL agents with a regime-aware spatio-temporal MoE encoder that reduces matching delay by 10% and pickup delay by 15% on real Uber data from San Francisco while showing robustness to unseen regimes.
-
House of Dextra: Cross-embodied Co-design for Dexterous Hands
A co-design framework learns task-specific hand shapes and complementary control policies, supporting design, training, fabrication, and deployment of new dexterous hands in under 24 hours.
-
Structured Uncertainty guided Clarification for LLM Agents
Structured uncertainty with EVPI enables more efficient clarification and better training for tool-calling LLM agents on ambiguous tasks.
-
Discrete Bayesian Sample Inference for Graph Generation
GraphBSI uses Bayesian Sample Inference as noise-controlled SDEs to generate discrete graphs in one shot, achieving state-of-the-art results on molecular benchmarks Moses and GuacaMol.