The information bottleneck method

Fernando C. Pereira (ATT Shannon Laboratory); Naftali Tishby (Hebrew University; NEC Research Institute); William Bialek (NEC Research Institute)

arXiv: physics/0004057 · v1 · submitted 2000-04-24 · physics.data-an · cond-mat.dis-nn · cs.LG · nlin.AO

Pith reviewed 2026-05-11 11:12 UTC · model grok-4.3

classification physics.data-an · cond-mat.dis-nn · cs.LG · nlin.AO

keywords information bottleneck · mutual information · rate distortion theory · data compression · relevant information · feature extraction · signal processing · learning theory

The pith

Compressing a signal X through a limited set of codewords can preserve as much as possible of the information it provides about another signal Y.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to formalize the extraction of relevant information from one signal about another as an optimization task. Relevant information means the part of X that helps predict Y, such as speech sounds helping identify spoken words. The authors propose squeezing X into a compressed representation T so that T carries as much information about Y as possible while using as little information from X as possible. This matters because it supplies a concrete mathematical procedure to decide which features of a signal are worth keeping for a given prediction task. The approach treats the tradeoff as a generalization of rate-distortion ideas in which the cost of error arises automatically from the observed relationship between X and Y.

Core claim

We define the relevant information in a signal x as the information it provides about y. We formalize the task of finding a short code for x that preserves the maximum information about y as squeezing that information through a bottleneck formed by a limited set of codewords t. This constrained optimization can be seen as a generalization of rate distortion theory in which the distortion measure emerges from the joint statistics of x and y. The variational principle yields an exact set of self-consistent equations for the coding rules from x to t and from t to y, which can be solved by a convergent re-estimation method that generalizes the Blahut-Arimoto algorithm.
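For reference, the variational principle and the self-consistent equations summarized above can be written out as follows. This uses the page's symbol T for the bottleneck variable (the paper writes it as \tilde{X}); β is the Lagrange multiplier and Z(x, β) a normalization factor. Treat it as a compact restatement rather than a substitute for the paper's derivation.

```latex
% Functional minimized over the encoder p(t|x), with T - X - Y a Markov chain
\mathcal{L}\big[p(t|x)\big] = I(X;T) - \beta\, I(T;Y)

% Self-consistent equations satisfied at a stationary point
p(t|x) = \frac{p(t)}{Z(x,\beta)}
         \exp\!\Big(-\beta\, D_{KL}\big[\,p(y|x)\,\big\|\,p(y|t)\,\big]\Big)

p(t)   = \sum_{x} p(x)\, p(t|x)

p(y|t) = \frac{1}{p(t)} \sum_{x} p(y|x)\, p(t|x)\, p(x)
```

The Kullback-Leibler term in the exponent is the effective distortion between x and t; it is not chosen in advance but follows from p(x,y).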

What carries the argument

The bottleneck variable T, the compressed representation of X that is found by optimizing the tradeoff between the information lost in compression and the information retained about Y.

If this is right

The optimal coding rules X to T and T to Y are given by the fixed points of the self-consistent equations.
These equations are solved by an iterative re-estimation algorithm that converges to a solution of them, that is, to a fixed point of the iteration.
The effective distortion measure in the equivalent rate-distortion problem is determined directly by the joint statistics p(x,y).
The same variational principle supplies a framework for analyzing problems in signal processing and learning.
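The consequences listed above can be made concrete with a small sketch of the re-estimation loop for finite alphabets. This is an illustrative implementation under stated assumptions, not the authors' code: the function names (`ib_iterate`, `decoder_from_encoder`, `info_plane`), the random initialization, the fixed iteration count, and the numerical floor `1e-300` are all choices made here, and every x is assumed to have nonzero probability. Information is measured in nats.

```python
import math
import random


def kl(p, q, tiny=1e-300):
    """D_KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / max(qi, tiny)) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(A;B) in nats for a joint probability table joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(v * math.log(v / (pa[a] * pb[b]))
               for a, row in enumerate(joint)
               for b, v in enumerate(row) if v > 0)


def decoder_from_encoder(pxy, pt_x):
    """Return p(t) and p(y|t) implied by the encoder p(t|x) and the chain T - X - Y."""
    n_x, n_y, n_t = len(pxy), len(pxy[0]), len(pt_x[0])
    pty = [[sum(pxy[x][y] * pt_x[x][t] for x in range(n_x)) for y in range(n_y)]
           for t in range(n_t)]
    pt = [sum(row) for row in pty]
    py_t = [[v / max(pt[t], 1e-300) for v in pty[t]] for t in range(n_t)]
    return pt, py_t


def ib_iterate(pxy, n_t, beta, n_iter=200, seed=0):
    """Alternate the self-consistent updates at fixed beta; returns p(t|x)."""
    rng = random.Random(seed)
    px = [sum(row) for row in pxy]
    py_x = [[v / px_i for v in row] for row, px_i in zip(pxy, px)]
    # Random soft initial encoder, one normalized row per x (an assumption).
    pt_x = []
    for _ in pxy:
        w = [rng.random() + 1e-3 for _ in range(n_t)]
        pt_x.append([wi / sum(w) for wi in w])
    for _ in range(n_iter):
        pt, py_t = decoder_from_encoder(pxy, pt_x)
        new_rows = []
        for cond in py_x:
            # log of p(t) * exp(-beta * KL[p(y|x) || p(y|t)]), shifted for stability
            logw = [math.log(max(pt[t], 1e-300)) - beta * kl(cond, py_t[t])
                    for t in range(n_t)]
            m = max(logw)
            w = [math.exp(v - m) for v in logw]
            z = sum(w)
            new_rows.append([wi / z for wi in w])
        pt_x = new_rows
    return pt_x


def info_plane(pxy, pt_x):
    """Return (I(X;T), I(T;Y)) for a given encoder."""
    px = [sum(row) for row in pxy]
    pxt = [[px[x] * p for p in pt_x[x]] for x in range(len(pxy))]
    pt, py_t = decoder_from_encoder(pxy, pt_x)
    pty = [[pt[t] * v for v in py_t[t]] for t in range(len(pt))]
    return mutual_information(pxt), mutual_information(pty)
```

Calling `ib_iterate(pxy, n_t=2, beta=10.0)` on a small joint table and passing the result to `info_plane` gives one point in the (I(X;T), I(T;Y)) plane. The result can depend on the initialization, so in practice one would likely try several seeds.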

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

When the joint distribution must be estimated from finite samples, the method may need additional regularization to remain stable.
Choosing different target signals Y could turn the same optimization into a tool for supervised or semi-supervised feature extraction.
The framework suggests that clustering or dimensionality reduction can be performed by treating class labels or future observations as the Y variable.
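To illustrate the finite-sample concern in the first extension above, here is a minimal sketch of estimating p(x,y) from observed pairs with additive (pseudocount) smoothing and computing the plug-in mutual information. Additive smoothing is only one possible regularizer and is an assumption of this example; neither the paper nor this page prescribes it. The function names and the default pseudocount of 0.5 are likewise hypothetical.

```python
import math
from collections import Counter


def estimate_joint(pairs, x_values, y_values, pseudocount=0.5):
    """Smoothed estimate of p(x, y) from a list of observed (x, y) pairs."""
    counts = Counter(pairs)
    total = len(pairs) + pseudocount * len(x_values) * len(y_values)
    return [[(counts[(x, y)] + pseudocount) / total for y in y_values]
            for x in x_values]


def plugin_mutual_information(joint):
    """Plug-in estimate of I(X;Y) in nats from a joint probability table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(v * math.log(v / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, v in enumerate(row) if v > 0)
```

With few samples the unsmoothed table tends to overstate the dependence between X and Y; smoothing pulls the estimate toward independence, at the cost of some bias.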

Load-bearing premise

The joint distribution p(x,y) is known or can be estimated reliably from data so that the mutual information quantities can be computed exactly.

What would settle it

Running the re-estimation procedure on a dataset whose joint distribution p(x,y) is known exactly and finding that the resulting coding rules fail to satisfy the self-consistent equations or achieve the predicted levels of information preservation about Y.
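One way to run such a check is to apply the self-consistent update once to a candidate encoder and measure how much it changes. The sketch below does this for finite alphabets; `encoder_update` and `fixed_point_residual` are hypothetical helper names, the numerical floor `tiny` is an implementation choice made here, and every x is assumed to have nonzero probability.

```python
import math


def encoder_update(pxy, pt_x, beta, tiny=1e-300):
    """Apply the three self-consistent equations once to an encoder p(t|x)."""
    n_x, n_y, n_t = len(pxy), len(pxy[0]), len(pt_x[0])
    px = [sum(row) for row in pxy]
    # Joint p(t, y) under the Markov chain T - X - Y, then p(t) and p(y|t).
    pty = [[sum(pxy[x][y] * pt_x[x][t] for x in range(n_x)) for y in range(n_y)]
           for t in range(n_t)]
    pt = [sum(row) for row in pty]
    py_t = [[v / max(pt[t], tiny) for v in pty[t]] for t in range(n_t)]
    updated = []
    for x in range(n_x):
        logw = []
        for t in range(n_t):
            # KL[p(y|x) || p(y|t)] plays the role of the distortion d(x, t).
            d = sum((pxy[x][y] / px[x])
                    * math.log((pxy[x][y] / px[x]) / max(py_t[t][y], tiny))
                    for y in range(n_y) if pxy[x][y] > 0)
            logw.append(math.log(max(pt[t], tiny)) - beta * d)
        m = max(logw)
        w = [math.exp(v - m) for v in logw]
        z = sum(w)
        updated.append([wi / z for wi in w])
    return updated


def fixed_point_residual(pxy, pt_x, beta):
    """Largest absolute change in p(t|x) after one update; near 0 at a fixed point."""
    new = encoder_update(pxy, pt_x, beta)
    return max(abs(a - b)
               for r_new, r_old in zip(new, pt_x)
               for a, b in zip(r_new, r_old))
```

A residual near zero only shows that the encoder is a fixed point. An encoder that ignores x entirely is a fixed point at every β, so a meaningful test would also compare the achieved I(T;Y) against the value expected at the chosen β.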

read the original abstract

We define the relevant information in a signal $x \in X$ as being the information that this signal provides about another signal $y \in Y$. Examples include the information that face images provide about the names of the people portrayed, or the information that speech sounds provide about the words spoken. Understanding the signal $x$ requires more than just predicting $y$, it also requires specifying which features of $X$ play a role in the prediction. We formalize this problem as that of finding a short code for $X$ that preserves the maximum information about $Y$. That is, we squeeze the information that $X$ provides about $Y$ through a `bottleneck' formed by a limited set of codewords $\tilde{X}$. This constrained optimization problem can be seen as a generalization of rate distortion theory in which the distortion measure $d(x,\tilde{x})$ emerges from the joint statistics of $X$ and $Y$. This approach yields an exact set of self consistent equations for the coding rules $X \to \tilde{X}$ and $\tilde{X} \to Y$. Solutions to these equations can be found by a convergent re-estimation method that generalizes the Blahut-Arimoto algorithm. Our variational principle provides a surprisingly rich framework for discussing a variety of problems in signal processing and learning, as will be described in detail elsewhere.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is the original paper that introduced the information bottleneck as a variational generalization of rate-distortion theory, and the derivation is clean and direct.

read the letter

The main thing here is that this paper sets out the information bottleneck for the first time. It frames the task as finding a compressed code T for X that keeps as much mutual information with Y as possible, and it turns that into a Lagrangian whose stationary points give explicit self-consistent equations for the mappings p(t|x) and p(y|t). The iteration they describe is a straightforward extension of the Blahut-Arimoto procedure and they show it decreases the objective monotonically for finite alphabets. That part is new relative to the rate-distortion work they cite and follows immediately from the usual definitions of mutual information and the Markov chain X-T-Y. The math is transparent and does not hide any circular steps or extra assumptions in the derivation itself. They also note that the effective distortion measure arises naturally from the joint statistics rather than being imposed by hand, which is a useful conceptual move.

The practical limitation is that the whole construction assumes p(x,y) is known or can be estimated reliably. The paper treats this as given and does not discuss sampling, high-dimensional estimation, or what happens when the joint is only approximate. That is a genuine gap for anyone who wants to apply the method to real data, though it does not undermine the theoretical contribution.

This is the sort of paper that is useful to people working on information-theoretic approaches in machine learning, signal processing, or theoretical neuroscience. A reader who wants a principled way to combine compression and relevance will find the core idea worth their time. The central argument is solid enough that it should go to peer review rather than being desk-rejected.

Referee Report

0 major / 3 minor

Summary. The paper defines the relevant information in a signal X about another signal Y as the information preserved through a compressed bottleneck representation T. It formalizes this as a constrained optimization problem maximizing I(T;Y) subject to a bound on I(X;T), shows that this is a generalization of rate-distortion theory in which the distortion measure emerges from the joint p(x,y), derives the exact self-consistent equations for the optimal mappings p(t|x) and p(y|t), and presents a convergent iterative re-estimation algorithm that generalizes the Blahut-Arimoto procedure.

Significance. If the central derivation holds, the work supplies a principled, parameter-light variational framework for relevance-preserving compression with direct applicability to signal processing and learning tasks. Its strengths include the clean derivation of the fixed-point equations from standard mutual-information identities and the Markov chain X–T–Y, the explicit generalization of rate-distortion theory, and the guarantee of monotonic improvement and convergence for finite alphabets.

minor comments (3)

The abstract states that applications 'will be described in detail elsewhere'; a brief forward reference or one-sentence outline of the intended follow-up would improve self-contained readability.
Notation for the bottleneck variable alternates between T and tX in the abstract; consistent use of a single symbol (e.g., T) throughout the manuscript would reduce minor confusion.
The weakest assumption—that p(x,y) is known or reliably estimated—is stated clearly but could be highlighted with a short remark on practical estimation procedures in the main text.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our manuscript, the recognition of its strengths, and the recommendation to accept. The referee's description accurately captures the central contributions of the work.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from mutual information definitions and variational calculus

full rationale

The paper's central derivation starts from the definitions of mutual information I(X;T) and I(T;Y) under the Markov chain X–T–Y, formulates the bottleneck as a constrained optimization problem, introduces a Lagrange multiplier β that weights the I(T;Y) term against I(X;T), and obtains the fixed-point equations via functional derivatives. These steps rely only on standard information-theoretic identities and calculus of variations; no parameters are fitted and then relabeled as predictions, no self-citations carry load-bearing uniqueness claims, and the generalization of rate-distortion theory is presented as an interpretive analogy rather than a renaming that substitutes for derivation. The iterative re-estimation procedure is shown to be a valid alternating optimization that monotonically decreases the functional, but this is a consequence of the variational setup rather than a circular reduction. The joint p(x,y) is an external input, matching the stated weakest assumption.
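The link to rate-distortion theory mentioned in this rationale rests on one standard identity, sketched here in the page's notation (T for the bottleneck variable). It assumes only the Markov chain, that is, p(y|x,t) = p(y|x); it is a reconstruction of the step, not a quotation from the paper.

```latex
% Expected KL "distortion" equals the information about Y lost by passing through T
\sum_{x,t} p(x)\,p(t|x)\, D_{KL}\big[\,p(y|x)\,\big\|\,p(y|t)\,\big]
  = I(X;Y \mid T)
  = I(X;Y) - I(T;Y)

% Hence, with d(x,t) = D_{KL}[p(y|x) \| p(y|t)] and I(X;Y) fixed by p(x,y),
I(X;T) - \beta\, I(T;Y)
  = I(X;T) + \beta\,\langle d(x,t) \rangle_{p(x,t)} - \beta\, I(X;Y)
```

Because I(X;Y) does not depend on the encoder, minimizing the bottleneck functional is the same as minimizing a rate-distortion functional whose distortion is the KL divergence between p(y|x) and p(y|t).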

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The framework rests on standard information-theoretic definitions plus one free trade-off parameter and the introduction of the bottleneck variable T.

free parameters (1)

beta
Lagrange multiplier that trades off compression rate against preserved mutual information with Y; its value is chosen according to the desired operating point.

axioms (2)

standard math: Mutual information I(X;Y) = H(X) - H(X|Y) is the measure of relevance.
Used to quantify both the compression cost and the preserved relevance.
domain assumption: The mapping from X to T is a stochastic kernel p(t|x) that can be optimized independently of the downstream mapping from T to Y.
Enables the Markov chain structure X–T–Y required for the bottleneck.

invented entities (1)

bottleneck variable T (no independent evidence)
purpose: Compressed representation of X that retains maximal information about Y
New random variable introduced to enforce the rate constraint; no external falsifiable prediction is supplied for T itself.

pith-pipeline@v0.9.0 · 5565 in / 1495 out tokens · 58142 ms · 2026-05-11T11:12:37.629353+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel (echoes)
"This constrained optimization problem can be seen as a generalization of rate distortion theory in which the distortion measure d(x, x̃) emerges from the joint statistics of X and Y. This approach yields an exact set of self consistent equations for the coding rules X → X̃ and X̃ → Y."

IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced (echoes)
"Our variational principle provides a surprisingly rich framework for discussing a variety of problems in signal processing and learning"

IndisputableMonolith.Foundation.LawOfExistence defect_zero_iff_one (echoes)
"the information that this signal provides about another signal y∈Y"

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Textual Belief States for World Models: Identifiable Representation Learning Under Strict Mediation
cs.LG 2026-06 unverdicted novelty 8.0

Introduces textual belief states and factorized GRPO to enforce strict latent state mediation in text-based world models, yielding preserved prediction accuracy with large gains in representation quality and rollout p...
Gradient-Based Program Synthesis with Neurally Interpreted Languages
cs.LG 2026-04 unverdicted novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...
The Query Channel: Information-Theoretic Limits of Masking-Based Explanations
cs.AI 2026-04 unverdicted novelty 8.0

Masking-based explanations are governed by the information capacity of the query channel, with reliable recovery achievable below capacity via sparse maximum-likelihood decoding but impossible above it.
Learning 1-Bit LiDAR-based Localization with Auxiliary Objective
cs.CV 2026-06 unverdicted novelty 7.0

BiLoc is the first binary neural network framework for 6-DoF LiDAR pose estimation that uses an auxiliary objective to adaptively regulate information retention and achieve SOTA among BNNs on large outdoor datasets.
S-JEPA : Soft Clustering Anchors for Self-Supervised Speech Representation Learning
cs.SD 2026-06 unverdicted novelty 7.0

S-JEPA uses soft GMM posteriors in a JEPA framework for self-supervised speech learning, achieving lowest WER below 90M parameters without offline re-clustering.
In Defense of Information Leakage in Concept-based Models
cs.LG 2026-06 conditional novelty 7.0

Concept-based models can use controlled 'benign' information leakage to remain accurate and intervenable under real-world concept incompleteness by reframing their training objective.
P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization
cs.CV 2026-06 unverdicted novelty 7.0

P²-DPO generates on-policy preference pairs targeting focus-and-enhance perception and visual robustness, combined with a calibration loss, to reduce hallucinations in LVLMs more effectively than human-feedback baselines.
A Fiber Criterion for Representation Identifiability in Supervised Learning
cs.LG 2026-05 conditional novelty 7.0

A representation property is identifiable from the induced predictor iff it is constant on the fibers of the map from admissible (representation, head) pairs to the composite predictor.
Quantum Subliminal Learning
quant-ph 2026-05 unverdicted novelty 7.0

QNNs retain most hidden-task signals through public-task interfaces while classical networks transmit little, with transmission governed by teacher drift magnitude and the visible fraction of hidden drift in a unified...
MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning
cs.AI 2026-05 unverdicted novelty 7.0

MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.
AffectVerse: Emotional World Models for Multimodal Affective Computing
cs.CV 2026-05 unverdicted novelty 7.0

AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and...
Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning
cs.SD 2026-05 unverdicted novelty 7.0

ClariCodec achieves 3.55% WER on LibriSpeech test-clean at 300 bps by RL fine-tuning the encoder for intelligibility, yielding a 23% relative WER reduction while preserving perceptual quality.
Entropy Across the Bridge: Conditional-Marginal Discretization for Flow and Schr\"odinger Samplers
cs.LG 2026-05 unverdicted novelty 7.0

Derives a conditional-marginal entropy-rate objective for bridge-aware discretization that yields U-shaped schedules and improves low-NFE sample quality on 2D, CIFAR-10, and protein tasks.
Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models
cs.LG 2026-05 conditional novelty 7.0

DyGFM introduces decoupled pre-training and divergence-conditioned prompts to create the first multi-domain dynamic graph foundation model that outperforms baselines on node classification and link prediction.
On the Generalization of Knowledge Distillation: An Information-Theoretic View
cs.IT 2026-05 unverdicted novelty 7.0

Knowledge distillation generalization bounds are derived via a new distillation divergence measuring teacher-student kernel difference, with tighter bounds from teacher loss flatness.
On the Generalization of Knowledge Distillation: An Information-Theoretic View
cs.IT 2026-05 unverdicted novelty 7.0

Derives upper and lower generalization bounds for the student relative to the teacher using a new distillation divergence, plus a loss-sharpness-aware bound and a bias-variance-rank decomposition in the linear Gaussian case.
Lost and Found in Translation: Variational Diagnostics for Neural Codebook Channels
cs.LG 2026-05 unverdicted novelty 7.0

Defines the neural codebook channel K_{e→d}(j|i) and proves a Bernoulli-KL bound on encoder-decoder mismatch in VAEs that cannot be recovered from marginal histograms or mutual information.
JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?
cs.AI 2026-05 unverdicted novelty 7.0

Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.
Neural Information Causality
quant-ph 2026-05 unverdicted novelty 7.0

Neural-IC separates embedding inequalities from capacity bounds in query-separated computations, with one-bit RAC benchmarks and CHSH-layer stability selecting the Tsirelson threshold for quantum enhancements.
Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection
cs.CV 2026-05 unverdicted novelty 7.0

A new orthogonal projection module for video anomaly detection suppresses facial attributes via weak face-presence signals and cosine alignment while preserving anomaly-relevant features like pose and motion.
Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck
cs.LG 2026-05 unverdicted novelty 7.0

CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.
Task Relevance Is Not Local Replaceability: A Two-Axis View of Channel Information
cs.CV 2026-05 unverdicted novelty 7.0

Channel importance splits into task relevance and local replaceability; local-axis metrics predict safe removal under pruning better than target-axis metrics across multiple CNNs and datasets.
Mixed-Precision Information Bottlenecks for On-Device Trait-State Disentanglement in Bipolar Agitation Detection
cs.LG 2026-05 unverdicted novelty 7.0

MP-IB uses an 8x information asymmetry via FP16 trait heads and INT4 state heads to disentangle speaker identity from agitation in voice biomarkers, outperforming larger models on edge devices with low latency and sup...
Latent State Design for World Models under Sufficiency Constraints
cs.AI 2026-05 unverdicted novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
cs.LG 2026-04 unverdicted novelty 7.0

KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior...
Modeling Higher-Order Brain Interactions via a Multi-View Information Bottleneck Framework for fMRI-based Psychiatric Diagnosis
cs.LG 2026-04 unverdicted novelty 7.0

A tri-view information-bottleneck model that fuses pairwise, triadic and tetradic O-information outperforms eleven baselines on four fMRI psychiatric datasets while revealing region-level synergy-redundancy patterns.
Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
cs.CV 2026-04 unverdicted novelty 7.0

ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
Factorization Regret mediates compositional generalization in latent space
cs.LG 2026-03 unverdicted novelty 7.0

Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.
PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
cs.AI 2026-03 conditional novelty 7.0

PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
Semantic Level of Detail for Knowledge Graphs: Discovering Abstraction Boundaries via Spectral Heat Diffusion
cs.LG 2026-03 unverdicted novelty 7.0

SLoD detects emergent scale boundaries in knowledge graphs by applying spectral heat diffusion to Poincare embeddings, recovering planted hierarchies in synthetic data and aligning with taxonomic depths in WordNet wit...
From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
cs.CV 2026-03 unverdicted novelty 7.0

MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.
Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
cs.CL 2026-02 unverdicted novelty 7.0

Prune-then-Merge combines adaptive pruning of low-signal patches with hierarchical merging to achieve higher compression rates and better performance than prior single-stage methods in visual document retrieval.
Perfect Privacy and Strong Stationary Times for Markovian Sources
cs.IT 2026-01 unverdicted novelty 7.0

For Markov sources, redaction up to strong stationary times achieves perfect privacy with optimal utility using constant average redactions independent of length.
Semantic Identity Compression: Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity
cs.IT 2026-01 accept novelty 7.0 full

Collision fiber sizes determine precise zero-error compression bounds and rate-distortion laws for semantic identity, establishing symbolic mechanisms as necessary complements to non-injective neural representations.
Bandwidth-constrained Variational Message Encoding for Cooperative Multi-agent Reinforcement Learning
cs.LG 2025-12 unverdicted novelty 7.0

BVME uses variational Gaussian message encoding with KL regularization to maintain or improve multi-agent coordination performance while using 67-83% fewer message dimensions than naive compression on SMAC and MPE benchmarks.
A Markov Categorical Framework for Language Modeling
cs.LG 2025-07 unverdicted novelty 7.0

A Markov category framework for language models provides an information-theoretic rationale for speculative decoding and shows that a quadratic surrogate to negative log-likelihood induces generalized CCA alignment in...
Nonasymptotic Oblivious Relaying and Variable-Length Noisy Lossy Source Coding
cs.IT 2025-01 unverdicted novelty 7.0

Establishes nonasymptotic achievability for the information bottleneck channel with fixed- and variable-length relaying and introduces a novel variable-length noisy lossy source coding bound.
Dream to Control: Learning Behaviors by Latent Imagination
cs.LG 2019-12 accept novelty 7.0

Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
Explaining Temporal Graph Neural Networks via Feature-induced Information Flow
cs.LG 2026-06 unverdicted novelty 6.0

A new attribution method for ETGNNs extends NRM with modular decomposition to analyze entire event-induced information flow and outperforms prior explainers on synthetic epidemic/social and real political event data.
Semi-Supervised Vision-Language-Action Model
cs.CV 2026-06 unverdicted novelty 6.0

SemiVLA improves VLA adaptation under 10% labeled trajectories via self-distilled pseudo-actions, reaching 89% success on LIBERO with OpenVLA backbone.
Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory
cs.RO 2026-06 unverdicted novelty 6.0

Tri-Info uses three information theory signals on action diversity, temporal consistency, and state coupling to predict VLA model failures with cross-domain generalization to 83% real-world accuracy.
LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck
cs.CV 2026-06 unverdicted novelty 6.0

LaME performs latent multimodal embedding reasoning with K learnable reason tokens in a weakly supervised information bottleneck, matching some explicit CoT models while running 60x faster.
CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control
cs.RO 2026-06 unverdicted novelty 6.0

CT-VAM is a 68M-parameter cerebello-thalamic-inspired model that achieves competitive LIBERO success rates with lower inference latency than larger VLA models by using a stream-separated attention decoder called TARS.
Beyond Homophily: Towards Generalized Graph Reconstruction Attack and Defense
cs.LG 2026-06 unverdicted novelty 6.0

Proposes MC-GRA attack and MC-GPB defense for graph reconstruction from GNNs via Markov chain approximation of topology-dependent representations, showing improved attack fidelity and reduced leakage with minor accuracy cost.
Beyond Soft Masks: Hard-Perturbation Mixup Explainer for Robust GNN Explainability
cs.LG 2026-06 unverdicted novelty 6.0

HPME proposes hard-perturbation mixup explainer grounded in generalized Graph Information Bottleneck to extract discrete subgraphs and generate in-distribution explanations that outperform soft-mask approaches on synt...
Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers
cs.CV 2026-06 unverdicted novelty 6.0

MaskAQ generates synthetic samples for ViT quantization by identifying and selectively aligning sparse informative regions in self-attention maps, combined with periodic sample refreshing.
SA-DTS: Semantic-Aware Digital Twin Synchronization over 6G Networks
cs.ET 2026-06 unverdicted novelty 6.0

SA-DTS achieves up to 94% bandwidth savings and 87% lower latency in digital twin synchronization by transmitting semantic features and reconstructing states with a partitioned knowledge graph.
The Security Budget of Code-LLM Prompt Hardening: Provable Limits Under Pass-Only Acceptance
cs.CR 2026-06 unverdicted novelty 6.0

Any deterministic prompt filter for code LLMs has a provable mutual-information lower bound of at least 0.84 nats on HumanEval and 1.20 nats on MBPP under pass-only acceptance, with no tested filter achieving zero pro...
A Geometric Lens on Physics-Aligned Data Compression
cs.LG 2026-06 unverdicted novelty 6.0

Develops a local tangent-space rate-distortion theory and eigenspace-overlap diagnostic showing when physics-aligned compression necessarily degrades standard fidelity due to misaligned sensitivity directions.
Posterior Collapse as Automatic Spectral Pruning
cs.LG 2026-05 unverdicted novelty 6.0

Posterior collapse in β-VAEs is derived as automatic spectral pruning via Landau stability analysis, with collapse thresholds matching normalized PCA spectra in the linear Gaussian case and tested on WorldClim data.
Entropy-Based Characterisation of the Polarised Regime in Latent Variable Models
cs.LG 2026-05 unverdicted novelty 6.0

An entropy criterion on mean representations characterises the polarised regime in VAEs and related models, with theoretical links to KL minimisation and empirical tests across several architectures.
Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.
The Diffusion Encoder
cs.LG 2026-05 unverdicted novelty 6.0

A diffusion model serves as the encoder in an autoencoder when trained alternately with the decoder to resolve opposing update directions while retaining the standard diffusion training objective.
MLGIB: Multi-Label Graph Information Bottleneck for Expressive and Robust Message Passing
cs.LG 2026-05 unverdicted novelty 6.0

MLGIB formulates multi-label graph message passing as constrained information transmission using variational bounds that maximize mutual information with target labels while limiting redundant source information.
MLGIB: Multi-Label Graph Information Bottleneck for Expressive and Robust Message Passing
cs.LG 2026-05 unverdicted novelty 6.0

MLGIB derives variational bounds for multi-label message passing to maximize predictive information while constraining redundant noise from irrelevant labels.
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
cs.AI 2026-05 unverdicted novelty 6.0

SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
DeconDTN-Toolkit: A Library for Evaluation and Enhancement of Robustness to Provenance Shift
cs.LG 2026-05 unverdicted novelty 6.0

DeconDTN-Toolkit simulates provenance shifts to expose ERM vulnerabilities and provides tools plus a robust OOD indicator for mitigating confounding by data provenance.
HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series
cs.LG 2026-05 unverdicted novelty 6.0

HEPA combines JEPA self-supervised pretraining with horizon-conditioned fine-tuning to predict rare events in multivariate time series as a monotonic survival distribution, outperforming PatchTST, iTransformer, MAE, a...
HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series
cs.LG 2026-05 unverdicted novelty 6.0

HEPA combines self-supervised JEPA pretraining on time series representations with horizon-conditioned finetuning to predict rare events via survival CDFs, outperforming PatchTST, iTransformer, MAE, and Chronos-2 on a...

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · cited by 134 Pith papers

[1] W. Bialek and N. Tishby, “Extracting relevant information,” in preparation.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley, New York, 1991).
[3] I. Csiszár and G. Tusnády, “Information geometry and alternating minimization procedures,” Statistics and Decisions Suppl. 1, 205–237 (1984).
[4] R. E. Blahut, “Computation of channel capacity and rate distortion function,” IEEE Trans. Inform. Theory IT-18, 460–473 (1972).
[5] N. Slonim and N. Tishby, “Agglomerative information bottleneck,” to appear in Advances in Neural Information Processing Systems (NIPS-12), 1999.
[6] F. C. Pereira, N. Tishby, and L. Lee, “Distributional clustering of English words,” in 30th Annual Mtg. of the Association for Computational Linguistics, pp. 183–190 (1993).
