SynIB: Informational Bottleneck for Maximizing Synergy in Multimodal Learning

Christos Chatzichristos; Konstantinos Kontras; Maarten De Vos; Matthew Blaschko; Paul Pu Liang; Teodora Gagaleska; Thomas Strypsteen

arXiv: 2606.09853 · v1 · pith:IVQIFILAnew · submitted 2026-05-12 · 💻 cs.LG · cs.IT · math.IT


Konstantinos Kontras, Teodora Gagaleska, Thomas Strypsteen, Christos Chatzichristos, Matthew Blaschko, Maarten De Vos, Paul Pu Liang

Pith reviewed 2026-06-30 21:53 UTC · model grok-4.3

classification 💻 cs.LG · cs.IT · math.IT

keywords multimodal learning · information bottleneck · synergy · cross-modal reasoning · training objective · modality masking · affective computing · hateful memes


The pith

SynIB is a training objective that maximizes synergistic information by penalizing confident predictions from any single modality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to capture synergy, the task-relevant information that appears only when multiple modalities are used together and cannot be recovered from any one alone. Instead of changing model architecture, it modifies the training objective to prioritize this joint information. SynIB adds a penalty term that runs forward passes with modalities masked one at a time and discourages the model from staying confident, which would signal reliance on unimodal cues. On synthetic XOR tasks where synergy is known by construction, the method recovers the joint signal while standard training does not. The same objective yields measurable gains on real multimodal benchmarks that contain synergy-dependent examples.

Core claim

The Synergistic Information Bottleneck (SynIB) formalizes multimodal synergy in information-theoretic terms and augments the standard task loss with a term that penalizes remaining predictive confidence after any one modality is masked; this forces the model to extract information available only from the combination of modalities, recovering ground-truth synergy on constructed XOR tasks and raising accuracy on synergy-dependent examples in five real-world benchmarks.

What carries the argument

The Synergistic Information Bottleneck (SynIB) objective, which adds a penalty for confident predictions after masking individual modalities to isolate synergistic information.
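A minimal sketch of how such an objective could be assembled, in plain Python for a single two-modality example. It is an illustration under stated assumptions, not the authors' implementation: the confidence measure (KL divergence from the uniform distribution), the zeroing mask, the weight `lam`, and the two toy models are all hypothetical stand-ins chosen here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Standard task loss for one example."""
    return -math.log(probs[label])


def confidence_penalty(probs):
    """KL(probs || uniform); zero only when the prediction is uniform.

    Assumption: the paper's exact confidence measure is not given in the
    text above, so this is one plausible choice.
    """
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


def zero_mask(z):
    """Hypothetical mask: replace the modality feature with zero."""
    return 0.0


def synib_style_loss(model, z1, z2, label, lam=1.0):
    """Task loss plus a confidence penalty for each single-modality mask."""
    task = cross_entropy(softmax(model(z1, z2)), label)
    penalty = confidence_penalty(softmax(model(zero_mask(z1), z2)))
    penalty += confidence_penalty(softmax(model(z1, zero_mask(z2))))
    return task + lam * penalty


def shortcut_model(z1, z2):
    """Toy model that reads only modality 1 (a unimodal shortcut)."""
    return [-3.0 * z1, 3.0 * z1]


def joint_model(z1, z2):
    """Toy model that reads only the interaction z1 * z2 (XOR on signs)."""
    s = z1 * z2
    return [3.0 * s, -3.0 * s]


# Inputs are +1/-1 features; label 1 means the two signs differ.
z1, z2, label = 1.0, -1.0, 1
loss_shortcut = synib_style_loss(shortcut_model, z1, z2, label)
loss_joint = synib_style_loss(joint_model, z1, z2, label)
print(f"shortcut model: {loss_shortcut:.4f}")
print(f"joint model:    {loss_joint:.4f}")
```

Both toy models classify this example equally well, so the task loss alone cannot tell them apart. Only the shortcut model stays confident when modality 2 is masked, so only it pays the penalty.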

If this is right

Standard training leaves synergy-dependent examples under-served; SynIB closes that gap without altering the fusion architecture.
On tasks with known ground-truth synergy such as the XOR constructions, SynIB recovers the joint signal that unimodal or redundant paths cannot provide.
Accuracy on synergy-dependent subsets rises by up to 7.8 percent and overall accuracy by up to 3.8 percent across the tested benchmarks.
The objective remains compatible with existing backbones and can be added to any multimodal pipeline that supports modality masking.
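The XOR construction mentioned above can be checked numerically. The sketch below computes mutual information in bits for a two-bit XOR with uniformly distributed inputs; the helper `mutual_information` and the variable names are introduced here for illustration and do not come from the paper.

```python
import math
from collections import Counter
from itertools import product


def mutual_information(pairs):
    """I(A;B) in bits from a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in joint.items()
    )


# All four equally likely input combinations, with label y = x1 XOR x2.
samples = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]

i_x1 = mutual_information([(x1, y) for x1, _, y in samples])
i_x2 = mutual_information([(x2, y) for _, x2, y in samples])
i_joint = mutual_information([((x1, x2), y) for x1, x2, y in samples])

print(i_x1, i_x2, i_joint)  # prints: 0.0 0.0 1.0
```

Each modality alone carries zero bits about the label, while the pair carries the full bit. That is why the XOR task has known ground-truth synergy.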

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same masking-penalty logic could be tested in non-modal multi-source settings such as multi-sensor time series or multi-view geometry.
SynIB might be combined with architectural fusion improvements to produce additive rather than overlapping gains.
By reducing dependence on single-modality shortcuts, the method could improve robustness when one modality is noisy or missing at test time.

Load-bearing premise

Penalizing remaining confidence after masking one modality specifically isolates and maximizes synergistic information rather than producing unrelated regularization or optimization changes.

What would settle it

Two tests would settle it: whether SynIB recovers the known synergistic label on the synthetic XOR tasks where standard training fails, and whether the real-world accuracy gains disappear when the penalty term is replaced by an equivalent amount of random noise in the loss.

Figures

Figures reproduced from arXiv: 2606.09853 by Christos Chatzichristos, Konstantinos Kontras, Maarten De Vos, Matthew Blaschko, Paul Pu Liang, Teodora Gagaleska, Thomas Strypsteen.

**Figure 1.** Gradient geometry across PID sources under vanilla fusion. A model is trained on examples drawn from the three PID sources, U1, R, and S, with U2 = 0 by construction (details in Sec. 4.2). Left: Per-group learning signal strength is substantial for all sources, meaning that the examples of that source create gradient capable of changing the parameters, with synergistic examples producing the largest λg. Ce…

**Figure 2.** SynIB overview. Standard multimodal fusion (black) trains a model to predict Y from (Z1, Z2), leaving optimization free to settle on unimodal or redundant cues. SynIB (blue) adds counterfactual passes in which one modality is replaced with a feature-masked version Z̃_i that removes its task-relevant content, and penalizes the model when its predictions remain confident under this corruption. Confidence und…

**Figure 3.** Two strategies for constructing the counterfactual mask

**Figure 4.** F1 scores on the CREMA-D irony recognition task under varying irony rates

**Figure 5.** Comparison across real-world multimodal benchmarks.

**Figure 7.** Performance across PID-controlled data regimes on the synthetic XOR task. Each triangle is a probability simplex over information types (pU1, pRed, pSyn); top row shows total accuracy, bottom row accuracy on synergistic examples only. Vanilla fusion (a) degrades sharply as synergy dominates. SynIB with oracle masking (b) resolves the task uniformly. Random masking (c) provides inconsistent gains, while lea…

**Figure 6.** Test accuracy vs. spurious correlation strength β on bimodal XOR. Vanilla fusion collapses to chance as the shortcut strengthens. SynIB with oracle masking stays at ∼100%, while learned and random masks degrade gracefully, with the learned mask consistently closer to the oracle. Robustness to spurious shortcuts. One modality contains a spurious feature linearly correlated with the label at training streng…

**Figure 8.** PID-XOR training dynamics across all four methods. Per-source training (solid) and validation (dashed) accuracy across 30 epochs, mean ± standard error over three seeds. Sources are colored by type: unique-to-modality-1 (U1, blue), redundant (R, green), and synergistic (S, red). Final synergy test accuracies (annotated per panel): 0.50 vanilla, 0.91 oracle, 0.87 random, 0.89 learned. PID mixture (pU1, pU2…

**Figure 9.** Gradient geometry across PID sources under SynIB. Training augments the vanilla setup with the SynIB learned-mask inner loop (λ=10); data, architecture, and optimiser are otherwise identical to…

**Figure 10.** Validation accuracy over training on both synthetic benchmarks. Mean ± standard error across three seeds. Left: Spurious XOR (β = 1.0). All four methods reach ≈ 1.0 on the in-distribution validation set, fitting the training distribution equally well. The gray dashed curve shows training accuracy for vanilla fusion, which saturates within the first epoch as the model locks onto the shortcut. The annotated…

**Figure 11.** Masking-based baselines vs SynIB across PID compositions. Columns: (a) no regulariza…

**Figure 12.** Construction of the synthetic irony class. Each ironic sample is built by pairing a video…

Original abstract

A central objective in multimodal learning is to capture synergy: task-relevant information that arises only from the joint use of multiple modalities, and is not available from any single modality alone. While most approaches operate at the architectural level through larger or more complex fusion models, we propose a complementary axis: shaping the training objective itself. Standard training often emphasizes unimodal or redundant information, falling short on examples that require cross-modal reasoning. We formalize multimodal synergy through information theory and introduce the Synergistic Information Bottleneck (SynIB), a scalable objective that targets synergy directly. To prioritize learning synergy, SynIB motivates the model to predict accurately from all modalities while penalizing confidence when information from any modality is withheld. Alongside the standard task loss, the model runs forward passes with one modality masked at a time and is penalized for remaining confident, which would indicate reliance on unimodal cues rather than cross-modal interactions. We validate SynIB in two regimes. On synthetic XOR tasks where the ground-truth synergy is known by construction, standard training fails to recover it while SynIB does. On five real-world benchmarks, including three MultiBench affective tasks, Hateful Memes with CLIP-ViT and DeBERTa backbones, and a controllable irony extension of CREMA-D we introduce, SynIB improves accuracy on synergy-dependent examples by up to 7.8% and overall accuracy by up to 3.8%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SynIB adds a masked-modality confidence penalty to the task loss to push models toward cross-modal synergy, with clear recovery on synthetic XOR but weaker isolation on real benchmarks.


The core contribution is a training objective that runs extra forward passes with one modality masked and penalizes remaining confidence, on top of the standard loss. This is a direct attempt to shape learning toward joint information rather than unimodal or redundant cues.

It works on the synthetic XOR case, where ground-truth synergy is known and standard training fails while SynIB recovers it. That part is clean and falsifiable. The real-data results on MultiBench tasks, Hateful Memes, and the CREMA-D irony set show accuracy lifts on the selected synergy-dependent examples (up to 7.8%) and smaller overall gains (up to 3.8%). Reporting both numbers is useful.

The soft spot is the gap between the penalty and actual synergistic mutual information. Penalizing post-mask confidence discourages single-modality shortcuts, but it can also function as modality dropout or general regularization; nothing in the reported controls separates those effects. On real benchmarks the synergy-dependent examples are identified post-hoc without ground truth, so the accuracy deltas could reflect changed optimization dynamics rather than maximized synergy. No error bars or ablation on the penalty strength appear in the abstract.
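A control that separates the two effects is at least conceivable, because the two terms respond differently to the same masked forward pass. The sketch below assumes modality dropout is implemented as ordinary cross-entropy against the true label on the masked input, and the SynIB-style term as KL divergence from the uniform distribution. Both are illustrative assumptions made here, and the probability vectors are invented examples rather than results from the paper.

```python
import math


def cross_entropy(probs, label):
    """Modality-dropout term: usual label loss on the masked forward pass."""
    return -math.log(probs[label])


def kl_to_uniform(probs):
    """Assumed SynIB-style term: confidence of the masked prediction."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


# Hypothetical predictions after one modality has been masked; true label is 0.
label = 0
masked_predictions = {
    "confident_correct": [0.9, 0.1],
    "uniform": [0.5, 0.5],
}

terms = {}
for name, probs in masked_predictions.items():
    terms[name] = (cross_entropy(probs, label), kl_to_uniform(probs))
    dropout_term, penalty_term = terms[name]
    print(f"{name}: dropout={dropout_term:.3f} penalty={penalty_term:.3f}")
```

Under these assumptions the two terms rank the predictions in opposite order: dropout rewards a confident correct answer from the remaining modality, while the confidence penalty charges for it. An ablation that swaps one term for the other would therefore test different behavior, not a rescaled version of the same regularizer.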

This is for multimodal researchers already using information-theoretic losses or working on fusion objectives. The synthetic result and the objective formulation are solid enough to justify sending it out for review, though the authors will need to address whether the penalty truly isolates synergy versus other regularization.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Synergistic Information Bottleneck (SynIB) objective, which augments the standard task loss with a penalty on model confidence in forward passes where one modality is masked. This is intended to discourage unimodal reliance and prioritize synergistic cross-modal information. Validation includes recovery of known synergy on synthetic XOR tasks and reported accuracy gains (up to 7.8% on synergy-dependent examples, 3.8% overall) on five real benchmarks including MultiBench tasks, Hateful Memes, and a CREMA-D irony extension.

Significance. If the objective can be shown to specifically maximize synergistic mutual information rather than generic regularization effects, the approach would provide a scalable, architecture-agnostic method for improving multimodal performance on tasks requiring joint reasoning. The synthetic XOR validation, where ground-truth synergy is known by construction, is a clear strength that grounds the method.

major comments (2)

[Abstract] (paragraph describing the objective): the penalty on remaining confidence after masking one modality is asserted to isolate and maximize synergy, but no derivation is supplied showing that this term equals or bounds a formal synergy measure such as I(X;Y;Z) minus marginal terms. Without this, the 7.8% gains on real benchmarks cannot be attributed specifically to synergy maximization versus altered optimization or implicit dropout.
[Real-world benchmarks] (five tasks section): synergy-dependent examples lack ground-truth labels, so measured improvements rest on the untested assumption that the penalty targets the synergistic component; ablations or controls that vary only the penalty while holding other factors fixed are required to rule out complementary-cue or regularization explanations.
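For readers who want the "minus marginal terms" quantity spelled out, the standard partial information decomposition (PID) bookkeeping gives the identities below. The notation is assumed here rather than taken from the paper: X_1 and X_2 are the two modalities, Y is the label, and R, U_1, U_2, S are the nonnegative redundant, unique, and synergistic components named in the figure captions.

```latex
\begin{align}
I(X_1, X_2; Y) &= R + U_1 + U_2 + S, \\
I(X_1; Y) &= R + U_1, \qquad I(X_2; Y) = R + U_2, \\
I(X_1, X_2; Y) - I(X_1; Y) - I(X_2; Y) &= S - R.
\end{align}
```

Mutual-information differences alone identify only S - R, so a derivation that ties the penalty to S also has to control R. On two-bit XOR with uniform inputs the marginal terms are both zero, which forces R = U_1 = U_2 = 0 and leaves S = 1 bit.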

minor comments (2)

Error bars, multiple random seeds, and statistical significance tests for the reported accuracy deltas are absent and should be added.
Full methods, hyperparameter ranges, and data exclusion rules for the real benchmarks are referenced only at high level and should be expanded for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below with clarifications and commit to specific revisions that strengthen the attribution of results to synergy maximization.


Referee: [Abstract] the penalty on remaining confidence after masking one modality is asserted to isolate and maximize synergy, but no derivation is supplied showing that this term equals or bounds a formal synergy measure such as I(X;Y;Z) minus marginal terms. Without this, the 7.8% gains on real benchmarks cannot be attributed specifically to synergy maximization versus altered optimization or implicit dropout.

Authors: We acknowledge that the abstract presents the objective at a high level. The main text motivates the penalty via information theory as discouraging unimodal mutual information to favor joint representations, but an explicit derivation bounding the term against interaction information I(X;Y;Z) or similar is not supplied. In the revision we will add a dedicated subsection deriving that the expected penalty provides a variational upper bound on the reduction of single-modality mutual information while preserving task-relevant joint information, thereby targeting synergy. This will support clearer attribution of the reported gains. revision: yes

Referee: [Real-world benchmarks] synergy-dependent examples lack ground-truth labels, so measured improvements rest on the untested assumption that the penalty targets the synergistic component; ablations or controls that vary only the penalty while holding other factors fixed are required to rule out complementary-cue or regularization explanations.

Authors: We agree that real-world synergy-dependent examples are identified indirectly via unimodal vs. multimodal performance gaps rather than ground-truth labels, and that this leaves room for alternative explanations. The synthetic XOR experiments remain the primary controlled validation. For the real benchmarks we will add ablations in the revision that hold all other factors fixed while varying only the SynIB penalty (including comparisons to equivalent dropout or generic confidence penalties) and report results on the same example subsets to isolate the contribution of the modality-masking structure. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The provided abstract and context describe SynIB as a new training objective (standard loss plus explicit penalty on post-masking confidence) motivated by an information-theoretic view of synergy. No equations, self-definitional loops, or fitted parameters renamed as predictions appear in the given text. The synthetic XOR validation uses externally known ground-truth synergy (not derived from the method itself), and real-benchmark gains are measured outcomes rather than forced by construction. No load-bearing self-citations or uniqueness theorems imported from prior author work are referenced. The central claim therefore remains an independent proposal whose correctness can be evaluated against external benchmarks without reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed beyond reliance on standard information-theoretic definitions of synergy.

axioms (1)

domain assumption: Standard definitions of synergistic information from information theory.
The formalization of multimodal synergy is invoked without derivation in the abstract.

pith-pipeline@v0.9.1-grok · 5816 in / 1134 out tokens · 22928 ms · 2026-06-30T21:53:07.149645+00:00 · methodology



2015
[79]

something something

The" something something" video database for learning and evaluating visual common sense , author=. Proceedings of the IEEE international conference on computer vision , pages=
[80]

The information bottleneck method

The information bottleneck method , author=. arXiv preprint physics/0004057 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[81]

2015 ieee information theory workshop (itw) , pages=

Deep learning and the information bottleneck principle , author=. 2015 ieee information theory workshop (itw) , pages=. 2015 , organization=

2015

