The first SoK on LLM-as-a-Judge security organizes attacks targeting judges, attacks using judges, defenses leveraging judges, and security-domain applications while flagging vulnerabilities.
Canonical reference. 73% of citing Pith papers cite this work as background.
citing papers explorer
-
Security in LLM-as-a-Judge: A Comprehensive SoK
The first SoK on LLM-as-a-Judge security organizes attacks targeting judges, attacks using judges, defenses leveraging judges, and security-domain applications while flagging vulnerabilities.
-
AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture
AgroTools is a new benchmark for tool-augmented multimodal agents in agriculture featuring 539 QA pairs, 1,097 images, five task families, and 14 tools, with evaluations showing major limitations in current models' tool planning and execution.
-
CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning
CAREBench provides the first benchmark with full inferential chain annotations for appraisal reasoning and emotion understanding in LLMs, showing that stronger models still fall short on reasoning steps and capturing subjective human differences.
-
Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals
Geometric consistency in embedding space predicts LLM-human disagreement on ordinal difficulty ratings better than probability baselines in CEFR sentence assessment.
-
XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity
XL-SafetyBench is a new cross-cultural benchmark showing that frontier LLMs decouple jailbreak robustness from cultural sensitivity, while local models trade off attack success against neutral-safe rates in a near-linear pattern that indicates generation failure rather than alignment.
-
Dependency-Aware Privacy for Multi-turn Agents
RootGuard delivers turn-invariant privacy for multi-turn agents by noising root private attributes once and applying deterministic post-processing to all derived releases.
-
TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data
TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.
-
Mixture of Experts Framework in Machine Learning Interatomic Potentials for Atomistic Simulations
A co-trained multifidelity mixture-of-experts MLIP partitions simulations into high- and low-capacity regions, maintains exact energy conservation and bulk modulus alignment, and runs more than twice as fast as a single high-fidelity model on a Pt+CO system.
-
Training-Free Semantic Multi-Object Tracking with Vision-Language Models
TF-SMOT composes pretrained vision-language models into a training-free pipeline that reaches state-of-the-art tracking and improved summary quality on the BenSMOT benchmark.
-
A-MBER: Affective Memory Benchmark for Emotion Recognition
A-MBER is a new benchmark for evaluating AI models on using interaction history to recognize and explain a user's present affective state across judgment, retrieval, and explanation tasks.
-
Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models
GraphSSR introduces an adaptive SSR pipeline with SSR-SFT data synthesis and SSR-RL (Authenticity-Reinforced and Denoising-Reinforced stages) to overcome one-size-fits-all subgraph noise in zero-shot LLM graph reasoning.
-
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in an SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
-
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
-
Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models
Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.
-
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.
-
PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents
PAI-2 improves factual correctness in LLM answers by 4% on average across benchmarks using adaptive graph traversal and planning, with 6% gains from traversal algorithms and 18% from enabling planning.
-
DiM3: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
DiM3 is a direction- and magnitude-aware merging method that composes heterogeneous multilingual and multimodal updates in LLM backbones, outperforming baselines on 57-language benchmarks while retaining multimodal performance.
-
When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs
Layer-wise Laplacian energy of visual attention reveals hallucination emergence in MLLMs and enables LaSCD, a closed-form logit remapping strategy that mitigates hallucinations while preserving general performance.
-
Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution
Pre-trained MoE models exhibit up to 90% intra-expert activation sparsity that enables up to 2.5x faster MoE layer execution when exploited in the vLLM inference system.
-
On Privacy Leakage in Tabular Diffusion Models: Influential Factors, Attacker Knowledge, and Metrics
Tabular diffusion models leak membership information via attacks even with partial attacker knowledge, and common heuristic privacy metrics like distance-to-closest-record are unreliable.
-
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
-
DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
DTap is a new red-teaming platform for AI agents that uses autonomous exploration across realistic simulations to discover vulnerabilities and creates a verifiable benchmark dataset.
-
UAV as Urban Construction Change Monitor: A New Benchmark and Change Captioning Model
PTNet is a prototype-guided task-adaptive model that jointly performs change detection and captioning on bi-temporal UAV imagery by modeling structured change semantics, outperforming prior methods on the new UCCD urban construction benchmark and WHU-CDC.
-
Generating Statistical Charts with Validation-Driven LLM Workflows
A validation-driven LLM workflow generates 1,500 charts from 74 UCI datasets with 30,003 aligned QA pairs, revealing that current multimodal models handle chart syntax well but struggle with value extraction and reasoning.
-
Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains
Benign fine-tuning of foundation models induces large, heterogeneous, and often contradictory changes in safety metrics across general and domain-specific benchmarks.
-
Agentic Discovery with Active Hypothesis Exploration for Visual Recognition
HypoExplore uses LLMs for hypothesis-driven evolutionary search with a Trajectory Tree and Hypothesis Memory Bank to discover lightweight vision architectures, reaching 94.11% accuracy on CIFAR-10 from an 18.91% baseline and generalizing to other datasets including state-of-the-art on MedMNIST.
-
Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy
AI-generated text detectors achieve high benchmark accuracy by exploiting unstable dataset-specific linguistic features, as evidenced by cross-domain degradation and differing SHAP explanations across corpora.
-
How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning
Reinforcement learning with three causal constraints enables multimodal models to internalize diagram-reasoning links in geometry, unlike SFT which only mimics surface format and harms performance.
-
AgentGuard: A Multi-Agent Framework for Robust Package Confusion Detection via Hybrid Search and Metadata-Content Fusion
AgentGuard detects package confusion attacks via multi-agent hybrid name search plus fused metadata-content ML analysis, raising precision 12-49% and cutting false positives 11-35% versus baselines on ConfuDB and NeupaneDB.
-
Cortex AISQL: A Production SQL Engine for Unstructured Data
Snowflake's Cortex AISQL adds native semantic operations to SQL via AI-aware optimization, adaptive model cascades, and semantic join rewriting, delivering 2-70x speedups in production workloads.
-
Graph Concept Bottleneck Models
GraphCBMs extend concept bottleneck models by building latent concept graphs to model correlations between concepts, yielding better image classification accuracy, more informative structure for interpretability, and stronger intervention results.
-
LLMs Get Lost In Multi-Turn Conversation
LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.
-
Long Context Transfer from Language to Vision
Extending language model context length enables LMMs to process over 200K visual tokens from long videos without video training, achieving SOTA on Video-MME via dense frame sampling.
-
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.
-
Cubit: Token Mixer with Kernel Ridge Regression
Cubit replaces the Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
-
BioResearcher: Scenario-Guided Multi-Agent for Translational Medicine
BioResearcher is a new multi-agent system that leads baselines on single-step biomedical tests, BixBench, BaisBench, and a 30-query clinical discovery benchmark, with a 74.7% positive hit rate.
-
Proactive Dialogue Model with Intent Prediction
A Temporal Bayesian Network derived from MultiWOZ intent annotations predicts user intent transitions and guides proactive dialogue generation, raising Coverage AUC from 0.742 to 0.856 while cutting the number of turns needed to reach 75% coverage from 3.95 to 2.73.
-
Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study
Transformer models detect applicant gender in de-gendered academic recommendation letters via implicit linguistic patterns, such as associations with words like 'emotional' and 'humanitarian'; removing these cues lowers prediction accuracy, but it remains above chance.
-
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and reasoning tasks.
-
From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
-
Uncertainty Estimation for the Open-Set Text Classification systems
Adapting HolUE to open-set text classification yields 40-365% gains in Prediction Rejection Ratio over baselines on authorship, intent, and topic datasets.
-
TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots
TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.
-
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
GIFT matches the optimal policy of GRPO using an endogenous prompt-dependent KL coefficient derived via z-score standardization of implicit rewards.
-
MERIT: Modular Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning
MERIT achieves 81.65% F1 on MMFakeBench for multimodal misinformation detection via a four-module framework, outperforming zero-shot baselines like GPT-4V with MMD-Agent at 74.0% F1, with gains attributed to architectural design.
-
SLIP: Soft Label Mechanism and Key-Extraction-Guided CoT-based Defense Against Instruction Backdoor in APIs
SLIP combines a soft label mechanism with key-extraction-guided CoT to reduce instruction backdoor attack success rate to 25.13% and raise clean accuracy to 87.15% in LLM agents.
-
TableMaster: A Recipe to Advance Table Understanding with Language Models
TableMaster improves LM table understanding by verbalizing tables with enriched semantics and using adaptive textual-symbolic reasoning, reaching 78.13% accuracy on WikiTQ with GPT-4o-mini.
-
Causal Fine-Tuning under Latent Confounded Shift
Causal Fine-Tuning decomposes BERT representations into causal and spurious parts via SCM inductive bias to improve robustness under latent confounded shifts in text classification.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.
-
A Multi-Dimensional Audit of Politically Aligned Large Language Models
A multi-dimensional audit framework for politically aligned LLMs finds consistent trade-offs: larger models are more effective and truthful but less fair and more biased, while fine-tuned models reduce bias but hallucinate more and reason less well, and all tested models show deficiencies.
-
Detecting Alarming Student Verbal Responses using Text and Audio Classifier
A hybrid text-plus-audio classifier framework is introduced to identify potentially troubling student responses by analyzing both what is said and how it is said.