GPT-4o System Card

2024 · cs.CL · arXiv 2410.21276

Mixed citation behavior. Most common role is background (54%).

993 Pith papers citing it

abstract

GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.

citation-role summary

background 98 · baseline 51 · method 23 · dataset 3

citation-polarity summary

background 94 · baseline 51 · use method 22 · unclear 4 · use dataset 3 · support 1

representative citing papers

UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL

cs.AI · 2026-06-06 · unverdicted · novelty 8.0

UniQL is a human-verified benchmark providing aligned natural language questions and dialect-specific SQL queries for 16 SQL systems to evaluate cross-dialect generalization.

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

cs.CV · 2026-05-31 · accept · novelty 8.0

Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

cs.CV · 2026-05-28 · unverdicted · novelty 8.0

VideoFDB is a new benchmark and LM-as-judge framework for evaluating full-duplex audio-visual-to-audio-visual conversational agents on nonverbal dynamics from real video calls.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

cs.SE · 2026-04-30 · unverdicted · novelty 8.0

MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusing blank images at >92%.

CHASM: Unveiling Covert Advertisements on Chinese Social Media

cs.LG · 2026-04-22 · unverdicted · novelty 8.0

CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

ReConText3D: Replay-based Continual Text-to-3D Generation

cs.CV · 2026-04-15 · conditional · novelty 8.0

ReConText3D is the first replay-memory framework for continual text-to-3D generation that prevents catastrophic forgetting on new textual categories while preserving quality on previously seen classes.

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues

cs.AI · 2026-04-09 · unverdicted · novelty 8.0

DialBGM is a new benchmark dataset revealing that existing AI models fall far short of human performance when recommending fitting background music for open-domain conversations.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

cs.CV · 2026-02-04 · unverdicted · novelty 8.0

VLRS-Bench is the first benchmark dedicated to complex vision-language reasoning in remote sensing, with 2000 QA pairs across 14 tasks in cognition, decision, and prediction dimensions.

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

cs.CL · 2025-12-08 · accept · novelty 8.0

SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

cs.AI · 2025-09-30 · unverdicted · novelty 8.0

CritPt benchmark shows state-of-the-art LLMs reach only 5.7% average accuracy on full-scale unpublished physics research tasks, rising to about 10% with coding tools.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

LongEgoRefer is a new benchmark of 1,498 referring expressions in 45-minute average egocentric videos that exposes the failure of existing Video REC models on sparse long-form spatio-temporal grounding.

A Cost-Aware, Paired Protocol for Auditing Dynamic Tool Synthesis in Agentic Video Question Answering

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

Introduces a cost-aware paired protocol with six outcome groups and applies it to Dynamic-SAGE versus SAGE, reporting 7.5-point accuracy gain, 28% fewer tool calls, but 34% higher token use.

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.

EgoGapBench: Benchmarking Egocentric Action Selection in Multi-Agent Scenes

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

EgoGapBench shows humans reliably select egocentric actions in multi-agent scenes while MLLMs systematically choose other agents' actions, and standard egocentric training data fails to close the gap.

(A)I Sees What You Don't: Exploiting New Attack Surfaces in Third-Party Mobile Agents

cs.CR · 2026-07-01 · unverdicted · novelty 7.0

Identifies Screen Perception and Misused Channel attack surfaces in VLM-powered mobile agents and demonstrates seven attacks enabling arbitrary command execution on five frameworks without privileges.

citing papers explorer

Showing 50 of 70 citing papers after filters.

CHASM: Unveiling Covert Advertisements on Chinese Social Media cs.LG · 2026-04-22 · unverdicted · none · ref 20 · internal anchor
CHASM is a new benchmark dataset showing that existing multimodal large language models fail to reliably detect covert advertisements on Chinese social media even after fine-tuning.
A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents cs.LG · 2026-06-29 · unverdicted · none · ref 2 · internal anchor
A diagnostic framework called EPC reveals that proprietary LLM evaluators can exhibit large preference shifts between versions, as evidenced by a GPT-4o May-to-June drift that inverted study conclusions, rendering single-snapshot evaluations unreliable.
Alignment Defends LLMs from Property Inference Attacks cs.LG · 2026-06-08 · unverdicted · none · ref 10 · internal anchor
Alignment defenses adapted from DPO and GRPO mitigate property inference attacks on LLMs while preserving utility.
Differentiable Efficient Operator Search cs.LG · 2026-06-03 · unverdicted · none · ref 6 · internal anchor
Introduces Efficient Operator Search, a differentiable framework that jointly optimizes token reduction locations, retention budgets, and operator behaviors in multimodal models under cost constraints, recovering manual baselines and finding hybrid operators with competitive efficiency.
Algorithmic Recourse of In-Context Learning for Tabular Data cs.LG · 2026-05-29 · unverdicted · none · ref 21 · internal anchor
The paper delivers the first theoretical analysis and practical zeroth-order framework for algorithmic recourse under in-context learning for tabular prediction.
K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance cs.LG · 2026-05-28 · unverdicted · none · ref 23 · internal anchor
K-FinHallu is the first multi-turn Korean financial RAG hallucination benchmark; frontier LLMs struggle especially on justified abstention while an 8B fine-tuned model reaches competitive performance.
Honest Lying: Understanding Memory Confabulation in Reflexive Agents cs.LG · 2026-05-28 · unverdicted · none · ref 4 · internal anchor
Reflexive agents confabulate incorrect task interpretations in memory, detected via Reflection Repetition Rate metric, with a programmatic mitigation raising correct object mentions from 0% to 86% in frozen ALFWorld cases.
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation cs.LG · 2026-05-14 · unverdicted · none · ref 18 · internal anchor
RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
Test-Time Learning with an Evolving Library cs.LG · 2026-05-14 · unverdicted · none · ref 45 · internal anchor
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization cs.LG · 2026-05-12 · unverdicted · none · ref 1 · internal anchor
DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
FORGE: Fragment-Oriented Ranking and Generation for Context-Aware Molecular Optimization cs.LG · 2026-05-11 · unverdicted · none · ref 33 · internal anchor
FORGE reformulates molecular optimization as context-aware fragment ranking and replacement using mined low-to-high edit pairs, outperforming larger language models and graph methods on standard benchmarks.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective cs.LG · 2026-05-08 · unverdicted · none · ref 3 · internal anchor
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework cs.LG · 2026-05-01 · unverdicted · none · ref 12 · internal anchor
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public dataset.
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning cs.LG · 2026-05-01 · unverdicted · none · ref 30 · 2 links · internal anchor
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control cs.LG · 2026-04-22 · unverdicted · none · ref 18 · internal anchor
ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.
Dynamic Tool Dependency Retrieval for Lightweight Function Calling cs.LG · 2025-12-18 · unverdicted · none · ref 13 · internal anchor
DTDR dynamically retrieves relevant tools by modeling dependencies from demonstrations and conditioning on the evolving agent plan, improving function calling success rates by 23-104% over static retrievers across benchmarks.
TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis cs.LG · 2024-10-05 · unverdicted · none · ref 53 · internal anchor
TS-Reasoner is a domain-oriented agent using LLMs, computational tools, and error feedback for multi-step time series inference, showing better performance than general LLMs on understanding and reasoning benchmarks.
Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents cs.LG · 2026-06-15 · unverdicted · none · ref 2 · 2 links · internal anchor
Multimodal self-evaluation amplifies preference collapse and introduces cross-modal coupling that transfers evaluator biases between text and visual tasks, with self-evaluation showing near-complete immunity.
APPO: Agentic Procedural Policy Optimization cs.LG · 2026-06-10 · unverdicted · none · ref 27 · internal anchor
APPO refines branching and credit assignment in agentic RL via a Branching Score and procedure-level scaling, improving baselines by nearly 4 points on 13 benchmarks.
K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling cs.LG · 2026-06-09 · unverdicted · none · ref 61 · internal anchor
K-Forcing introduces progressive self-forcing distillation to train a conditional push-forward model that jointly decodes k future tokens per forward pass, yielding 2.4-3.5x speedup at k=4 with modest quality loss on LM1B and OpenWebText.
CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction cs.LG · 2026-06-04 · unverdicted · none · ref 88 · internal anchor
CaliDist calibrates LLMs by scaling confidence according to how much predictions change under semantic distractors, cutting average ECE from 23% to 7% on seven NLU benchmarks across six models.
Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR cs.LG · 2026-06-02 · unverdicted · none · ref 84 · internal anchor
RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.
STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning cs.LG · 2026-05-13 · unverdicted · none · ref 55 · internal anchor
STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
Understanding and Accelerating the Training of Masked Diffusion Language Models cs.LG · 2026-05-13 · conditional · none · ref 27 · internal anchor
Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
Overtrained, Not Misaligned cs.LG · 2026-05-12 · unverdicted · none · ref 22 · internal anchor
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
Leveraging RAG for Training-Free Alignment of LLMs cs.LG · 2026-05-11 · unverdicted · none · ref 29 · internal anchor
RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with offline methods across five LLMs.
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration cs.LG · 2026-05-08 · unverdicted · none · ref 9 · internal anchor
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents cs.LG · 2026-05-08 · unverdicted · none · ref 13 · 2 links · internal anchor
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
SHARP: A Self-Evolving Human-Auditable Rubric Policy for Financial Trading Agents cs.LG · 2026-05-07 · unverdicted · none · ref 1 · internal anchor
SHARP is a neuro-symbolic method that evolves bounded, auditable rule rubrics for LLM trading agents via cross-sample attribution and walk-forward validation, raising compact-model performance by 10-20 percentage points across equity sectors.
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe cs.LG · 2026-05-05 · unverdicted · none · ref 15 · internal anchor
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment cs.LG · 2026-04-22 · unverdicted · none · ref 86 · internal anchor
MGDA-Decoupled applies geometry-based multi-objective optimization within the DPO framework to find shared descent directions that account for each objective's convergence dynamics, yielding higher win rates on UltraFeedback.
Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing cs.LG · 2026-04-17 · unverdicted · none · ref 47 · internal anchor
PRJA achieves 83.6% average success injecting harmful content into LRM reasoning chains on five QA datasets without altering final answers.
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space cs.LG · 2026-04-15 · unverdicted · none · ref 25 · internal anchor
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs? cs.LG · 2026-02-20 · conditional · none · ref 36 · 2 links · internal anchor
MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
LLaDA2.0: Scaling Up Diffusion Language Models to 100B cs.LG · 2025-12-10 · conditional · none · ref 15 · internal anchor
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision cs.LG · 2025-09-17 · unverdicted · none · ref 9 · internal anchor
Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.
PiERN: Token-Level Routing for Integrating High-Precision Computation and Reasoning cs.LG · 2025-09-17 · unverdicted · none · ref 4 · internal anchor
PiERN proposes token-level routing of physically-isolated experts to embed high-precision computation directly into LLMs, reporting higher accuracy and lower latency, token count, and energy use than fine-tuning or multi-agent baselines.
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains cs.LG · 2025-07-23 · unverdicted · none · ref 14 · internal anchor
RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.
Vidar: Embodied Video Diffusion Model for Generalist Manipulation cs.LG · 2025-07-17 · unverdicted · none · ref 29 · internal anchor
Vidar shows that a video diffusion prior continuously pre-trained on 750K multi-view robot trajectories plus a label-free masked inverse dynamics adapter can generalize manipulation to new robot embodiments with 1% of typical demonstration data.
Exploring the Secondary Risks of Large Language Models cs.LG · 2025-06-14 · unverdicted · none · ref 17 · internal anchor
Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.
Tokenizing Single-Channel EEG with Time-Frequency Motif Learning cs.LG · 2025-02-22 · unverdicted · none · ref 7 · internal anchor
TFM-Tokenizer learns a vocabulary of time-frequency motifs from single-channel EEG via a dual-path masked architecture and encodes signals into discrete tokens, reporting up to 11% Cohen's Kappa gains on benchmarks and 14% on ear-EEG sleep staging.
EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning cs.LG · 2026-06-16 · unverdicted · none · ref 25 · internal anchor
EnvRL incorporates environment dynamics learning via state prediction and inverse dynamics auxiliary objectives into agentic RL, reporting higher success rates than RL-only baselines on ALFWorld and WebShop.
From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning cs.LG · 2026-06-08 · unverdicted · none · ref 69 · internal anchor
Thinking-RFT improves Theory of Mind accuracy by 6% over SFT on shortcut-free datasets, with 10% gains on higher-order reasoning and better generalization to new domains.
When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff cs.LG · 2026-06-07 · unverdicted · none · ref 55 · internal anchor
Excessive SFT reduces LLM plasticity for RL; Rejuvenation restores it via base-anchored fusion and targeted neuron resets, yielding better RL performance and OOD generalization.
From Sampled Outcomes to Capability Distributions: Rethinking Supervision for LLM Routing cs.LG · 2026-06-05 · unverdicted · none · ref 166 · internal anchor
DARS replaces single-shot response labels with distribution-aware supervision derived from input and output uncertainty to produce more reliable LLM routing policies.
LMT: A Bayesian Framework for Causal Discovery from Textual Alarm Records in Manufacturing Systems cs.LG · 2026-06-03 · unverdicted · none · ref 78 · internal anchor
LMT is a Bayesian method that fuses LLM-derived textual priors with temporal Poisson likelihoods to discover causal graphs from alarm event records.
SE-GA: Memory-Augmented Self-Evolution for GUI Agents cs.LG · 2026-05-16 · unverdicted · none · ref 17 · internal anchor
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
Composable Crystals: Controllable Materials Discovery via Concept Learning cs.LG · 2026-05-14 · unverdicted · none · ref 12 · internal anchor
VQ-VAE concept learning enables controllable recombination of crystal motifs to generate structures with reported gains in validity-stability-uniqueness-novelty metrics on MP-20 and Alex-MP-20.
Curriculum Learning-Guided Progressive Distillation in Large Language Models cs.LG · 2026-05-11 · unverdicted · none · ref 12 · internal anchor
CLPD improves LLM distillation for reasoning by combining explicit data curriculum with progressive teacher scheduling of increasing capacity.
FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning cs.LG · 2026-05-10 · unverdicted · none · ref 23 · internal anchor
FLAME is an MoE architecture using modality-specific routers and low-rank compression of expert knowledge to support efficient continual multimodal multi-task learning while reducing catastrophic forgetting.
