Hilbert’s sixth problem: derivation of fluid equations via Boltzmann’s kinetic theory
15 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.AI (4), cs.SE (2), astro-ph.CO (1), cond-mat.mes-hall (1), cs.CL (1), cs.CV (1), cs.LG (1), cs.RO (1), eess.IV (1), quant-ph (1)

roles
background (2)

polarities
background (2)
citing papers explorer
- FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
  FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.
- SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning
  The SeePhys Pro benchmark reveals that multimodal models degrade on physics reasoning as information transfers from text to images, with blind-training improvements often stemming from textual cues rather than visual evidence.
- TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On
  A new large-scale triplet dataset and diffusion transformer model using coarse human masks deliver improved video virtual try-on quality and generalization in challenging real-world conditions.
- Evaluating LLM Agents on Automated Software Analysis Tasks
  A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
- Robust Mutation Analysis of Quantum Programs Under Noise
  Noise from quantum hardware simulators significantly alters mutant detection distances, making equivalent mutants harder to separate from faults, with output-distribution metrics reaching 73.03% accuracy and 74.89% F1-score under device-specific thresholds.
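The "output-distribution metrics" in this entry suggest a simple shape: compare the measurement statistics of the original and mutated quantum programs against a device-specific threshold. A minimal sketch, assuming total-variation distance as the distance metric (the paper's exact metric, threshold calibration, and function names are not given here, so these are illustrative choices):

```python
def tv_distance(counts_a, counts_b):
    """Total-variation distance between two measurement-outcome
    distributions, each given as a {bitstring: count} dict."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    outcomes = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(k, 0) / total_a - counts_b.get(k, 0) / total_b)
        for k in outcomes
    )

def mutant_detected(original_counts, mutant_counts, threshold):
    """Flag a mutant as killed when its output distribution deviates
    from the original program's by more than a device-specific
    noise threshold (hypothetical decision rule)."""
    return tv_distance(original_counts, mutant_counts) > threshold
```

Under hardware noise, the original program's own distribution drifts, which is why a per-device threshold is needed before a deviation can be attributed to the mutation rather than the device.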
- Feature Starvation as Geometric Instability in Sparse Autoencoders
  Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.
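The elastic-net combination this entry describes (an ℓ2 term plus a per-feature reweighted ℓ1 term) can be sketched as a penalty on a batch of SAE codes. Everything below is an assumed example, not the paper's formulation: the function name, the default weights, and in particular the reweighting rule (here, low-activity features get a smaller ℓ1 weight so they are not driven to zero) are hypothetical.

```python
import numpy as np

def aen_penalty(z, l1=1e-3, l2=1e-4, eps=1e-6):
    """Illustrative adaptive elastic-net penalty on SAE codes z with
    shape (batch, features): an l2 term for structural stability plus
    an l1 term reweighted per feature by its average activity, so
    rarely-firing features are penalized less (an assumed rule)."""
    mean_mag = np.abs(z).mean(axis=0)              # per-feature activity
    w = mean_mag / (mean_mag.mean() + eps)         # low activity -> low l1 weight
    l1_term = l1 * np.mean(np.sum(w * np.abs(z), axis=1))
    l2_term = l2 * np.mean(np.sum(z ** 2, axis=1))
    return l1_term + l2_term
```

The ℓ2 component keeps the loss strongly convex in the codes, which is the kind of structure that yields a Lipschitz-continuous coding map; the adaptive ℓ1 weights are what would let starved features re-enter the support.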
- From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
  The SSL (Scheduling-Structural-Logical) representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill-discovery MRR@50 from 0.649 to 0.729 and risk-assessment macro F1 from 0.409 to 0.509 over text baselines.
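The MRR@50 figures quoted in this entry use a standard retrieval metric: mean reciprocal rank truncated at the top 50 predictions. A self-contained sketch (the paper's evaluation code is not shown here, so the function name and argument shapes are assumptions):

```python
def mrr_at_k(ranked_lists, relevant_sets, k=50):
    """Mean reciprocal rank at k: for each query, the reciprocal rank
    of the first relevant item among the top-k predictions (0 if none
    appears), averaged over all queries."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

For example, if the first query's relevant skill appears at rank 2 and the second query's relevant skill is outside the top 50, the score is (1/2 + 0) / 2 = 0.25.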
- MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models
  Fine-tuned multimodal LLMs predict mouse social dominance from raw tube test videos with high agreement to traditional rankings.
- SPT-3G D1: CMB temperature and polarization power spectra and cosmology from 2019 and 2020 observations of the SPT-3G Main field
  SPT-3G delivers the most precise CMB EE and TE spectra at high multipoles to date, giving ΛCDM parameters with H₀ = 66.66 ± 0.60 km/s/Mpc from ground-based data alone and reaching Planck-level constraints when combined with ACT.
- VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
  VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.
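One way to read "spectral decomposition into generalized and specialized experts" is an SVD split of a backbone weight matrix. The sketch below is a hypothetical illustration under that reading: the top-r/residual split, the expert names, and the frozen/trainable assignment are assumptions, not VLA-GSE's published construction.

```python
import numpy as np

def split_experts(W, r):
    """Hypothetical spectral expert split: SVD a backbone weight
    matrix W, keep the top-r singular directions as a frozen
    'generalized expert', and treat the residual directions as the
    small 'specialized expert' to fine-tune per task."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    generalized = (U[:, :r] * S[:r]) @ Vt[:r]   # dominant, task-general structure
    specialized = (U[:, r:] * S[r:]) @ Vt[r:]   # residual, adapted per task
    return generalized, specialized
```

By construction the two parts sum back to W, and updating only the residual part is what would keep the trainable fraction small, in the spirit of the 2.51% figure.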
- Heterogeneous Scientific Foundation Model Collaboration
  Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
- Entanglement is Half the Story: Post-Selection vs. Partial Traces
  A hybrid tensor network framework interpolates between classical and quantum models via controllable post-selection, with a trainable hyperparameter that complements bond dimension to enhance quantum machine learning.
- New frontiers in quantum science and technology using van der Waals Josephson junctions
  A synthesis of van der Waals Josephson junction research showing how 2D material diversity and symmetry control open routes to novel quantum devices and sensors.
- Hilbert's Sixth Problem and Soft Logic
  Soft probabilities serve as an infinitesimal refinement of classical probability to support the axiomatization of physics, with a Möbius strip constructed from soft numbers offering insight into Hilbert's sixth problem.
- A Brief Overview: Agentic Reinforcement Learning In Large Language Models
  The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-reflection into LLM-based agents.