WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
Title resolution pending
41 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 41representative citing papers
Proposes pointwise Riemannian Dimension from feature eigenvalues to derive tighter, representation-aware generalization bounds for deep networks in the nonlinear regime.
Rule2DRC is a benchmark for LLM agents synthesizing DRC scripts from natural language rules, paired with SplitTester that improves Best-of-N selection via execution-guided discriminative test generation.
A solvable hierarchical model with power-law feature strengths yields explicit power-law scaling of prediction error through sequential recovery of latent directions by a layer-wise spectral algorithm.
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
RSAT uses SFT on verified traces followed by GRPO with NLI faithfulness rewards to make 1-8B models produce verifiable table reasoning with cell citations, raising faithfulness 3.7x to 0.826.
TF-LLMER resolves optimization barriers in LLM-enhanced recommenders through embedding normalization and Rec-PCA that aligns semantic representations with collaborative co-occurrence graphs.
Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.
MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.
SCA applies the Information Bottleneck principle via NIBS and GIBS methods to identify erroneous steps in black-box LLM reasoning and boosts self-correction success by up to 13.5%.
AnyAct generates editable human reenactments from character videos via conditional motion generation from transferable sparse local 2D articulated cues, with designs for human-only supervision and global-local decoupling.
EgoForce recovers absolute camera-space 3D hand pose from monocular egocentric images using forearm guidance, a unified arm-hand transformer, and a closed-form ray-space solver that handles fisheye, perspective, and wide-FOV cameras.
DeconDTN-Toolkit simulates provenance shifts to expose ERM vulnerabilities and provides tools plus a robust OOD indicator for mitigating confounding by data provenance.
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
SURGE proposes a dual-path gradient compensator and adaptive gradient scaler to mitigate gradient mismatch in binary neural network training via auxiliary backpropagation.
S4 models exhibit stable time-continuity unlike sensitive S6 models, with task continuity predicting performance and enabling temporal subsampling for better efficiency.
Scaling pretrained representations improves label-free OOD detection on frozen backbones, causing performance gaps between global and local detectors to vanish across vision and language tasks.
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prior SOTA methods.
citing papers explorer
-
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
-
Pointwise Generalization in Deep Neural Networks
Proposes pointwise Riemannian Dimension from feature eigenvalues to derive tighter, representation-aware generalization bounds for deep networks in the nonlinear regime.
-
Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation
Rule2DRC is a benchmark for LLM agents synthesizing DRC scripts from natural language rules, paired with SplitTester that improves Best-of-N selection via execution-guided discriminative test generation.
-
Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model
A solvable hierarchical model with power-law feature strengths yields explicit power-law scaling of prediction error through sequential recovery of latent directions by a layer-wise spectral algorithm.
-
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
-
Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization
LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.
-
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
-
Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
-
RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners
RSAT uses SFT on verified traces followed by GRPO with NLI faithfulness rewards to make 1-8B models produce verifiable table reasoning with cell citations, raising faithfulness 3.7x to 0.826.
-
Break the Optimization Barrier of LLM-Enhanced Recommenders: A Theoretical Analysis and Practical Framework
TF-LLMER resolves optimization barriers in LLM-enhanced recommenders through embedding normalization and Rec-PCA that aligns semantic representations with collaborative co-occurrence graphs.
-
Uncovering the Latent Potential of Deep Intermediate Representations
Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.
-
A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.
-
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution
SCA applies the Information Bottleneck principle via NIBS and GIBS methods to identify erroneous steps in black-box LLM reasoning and boosts self-correction success by up to 13.5%.
-
AnyAct: Towards Human Reenactment of Character Motion From Video
AnyAct generates editable human reenactments from character videos via conditional motion generation from transferable sparse local 2D articulated cues, with designs for human-only supervision and global-local decoupling.
-
EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera
EgoForce recovers absolute camera-space 3D hand pose from monocular egocentric images using forearm guidance, a unified arm-hand transformer, and a closed-form ray-space solver that handles fisheye, perspective, and wide-FOV cameras.
-
DeconDTN-Toolkit: A Library for Evaluation and Enhancement of Robustness to Provenance Shift
DeconDTN-Toolkit simulates provenance shifts to expose ERM vulnerabilities and provides tools plus a robust OOD indicator for mitigating confounding by data provenance.
-
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.
-
SURGE: Surrogate Gradient Adaptation in Binary Neural Networks
SURGE proposes a dual-path gradient compensator and adaptive gradient scaler to mitigate gradient mismatch in binary neural network training via auxiliary backpropagation.
-
Continuity Laws for Sequential Models
S4 models exhibit stable time-continuity unlike sensitive S6 models, with task continuity predicting performance and enabling temporal subsampling for better efficiency.
-
Scaling Pretrained Representations Enables Label-Free Out-of-Distribution Detection Without Fine-Tuning
Scaling pretrained representations improves label-free OOD detection on frozen backbones, causing performance gaps between global and local detectors to vanish across vision and language tasks.
-
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
-
GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking
GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prior SOTA methods.
-
PHALAR: Phasors for Learned Musical Audio Representations
PHALAR achieves up to 70% relative accuracy gain in stem retrieval over prior art using under half the parameters and 7x faster training by enforcing musical equivariances via spectral pooling and complex heads.
-
FACTOR: Counterfactual Training-Free Test-Time Adaptation for Open-Vocabulary Object Detection
FACTOR uses counterfactual image perturbations to quantify and suppress attribute-dependent predictions in open-vocabulary object detection, improving robustness on corrupted datasets without any training.
-
Mapping Text to Multiplex Graph: Prompt Compression as L\'evy Walk-Guided Graph Pruning
RAGP models prompt compression as redundancy-aware pruning on a multiplex graph using Lévy walks, achieving 49.3 average on LongBench at 4x compression versus 48.8 for LongLLMLingua at 3x.
-
Polaris: Coupled Orbital Polar Embeddings for Hierarchical Concept Learning
Polaris separates semantic meaning from hierarchical structure in embeddings via angular geometry and radius on a hypersphere, yielding up to 19-point gains in taxonomy expansion retrieval over baselines.
-
Finding Meaning in Embeddings: Concept Separation Curves
Concept Separation Curves provide a classifier-independent method to visualize and quantify how sentence embeddings distinguish conceptual meaning from syntactic variations across languages and domains.
-
ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs
ReaGeo is an end-to-end LLM framework for geocoding that uses geohash text generation, Chain-of-Thought spatial reasoning, and distance-based RL to accurately predict points and regions from explicit and vague queries.
-
MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment
MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.
-
Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest
LLMs show mixed results on authorship verification, post generation, and attribute inference from Twitter data, with new frameworks and user studies establishing benchmarks for these analytics tasks.
-
DuConTE: Dual-Granularity Text Encoder with Topology-Constrained Attention for Text-attributed Graphs
DuConTE is a dual-granularity text encoder that incorporates graph topology into language model attention for improved node representations in text-attributed graphs.
-
MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification
MaskTab is a masked pretraining method for industrial tabular data that delivers measurable gains in classification AUC and KS metrics while enabling effective distillation to smaller models.
-
AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments
AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.
-
Calibrating Model-Based Evaluation Metrics for Summarization
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
-
IntentTune: Using user demand and personalization to resolve "unknown" query intents for e-commerce search
User-specific behavioral signals, especially prior search queries, outperform population-level demand patterns and static profiles for inferring gender, age, category, and size from underspecified e-commerce queries.
-
ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks
ClaHF converts instance labels into preference signals via candidate predictions and a reward model, then applies RL optimization to improve text classification accuracy and calibration.
-
Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan
Introduces an interpretable DL framework to analyze grammatical gender restructuring from Latin to Occitan, with a new tokenizer and quantification of morphological and POS contributions to gender prediction.
-
PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal Forecasting
PnP-Corrector decouples pre-trained physics engines from a correction agent to mitigate reciprocal error amplification in coupled spatiotemporal forecasting, cutting error by 28% on a 300-day ocean-atmosphere task.
-
Evaluation of Automatic Speech Recognition Using Generative Large Language Models
Decoder-based LLMs achieve 92-94% agreement with human annotators for ASR hypothesis selection on HATS, substantially outperforming WER (63%) and embedding-based semantic metrics.