Quantifying Multimodal Capabilities: Formal Generalization Guarantees in Pairwise Metric Learning

Liyuan Liu; Richeng Zhou; Xuelin Zhang

arxiv: 2605.01424 · v1 · submitted 2026-05-02 · 💻 cs.LG · cs.AI

Quantifying Multimodal Capabilities: Formal Generalization Guarantees in Pairwise Metric Learning

Richeng Zhou , Xuelin Zhang , Liyuan Liu This is my paper

Pith reviewed 2026-05-09 15:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords multimodal learningmetric learninggeneralization boundsmodality complementaritypairwise learninghypothesis space

0 comments

The pith

Fine-grained features from multiple modalities reduce the hypothesis space complexity in pairwise metric learning by strengthening complementarity between them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that in multimodal metric learning, using more detailed features drawn from each data type shrinks the set of possible models the algorithm must consider, because the modalities begin to fill in each other's gaps more effectively. This matters for real-world settings where some modalities may be missing or overlapping, since tighter generalization bounds can translate into more reliable performance without needing extra data. By building hierarchies among the function classes tied to different modality combinations and measuring how far the learned mappings stray from the true ones, the analysis produces both upper and lower error bounds that track the combined effects of how many modalities are present and how finely they are described.

Core claim

We establish hierarchical relationships between function classes corresponding to different modality subsets and quantify the discrepancy between learned mappings and ground truth. Through rigorous analysis of pairwise complexity within the multimodal learning framework, we derive novel generalization error bounds that reveal the joint impact of modality quantity and granularity on model performance. Our theoretical findings on both upper and lower bounds demonstrate that incorporating fine-grained modality features reduces the complexity of the hypothesis space by enhancing modality complementarity.

What carries the argument

Hierarchical relationships between function classes for different modality subsets, which quantify how adding finer modality details lowers the discrepancy to ground truth and thereby tightens the overall hypothesis space.

If this is right

Both the number of modalities and the level of detail within each modality jointly control generalization error through their effect on hypothesis complexity.
Tighter bounds on error follow directly when modality complementarity increases.
Convergence rates and final accuracy in multimodal systems improve when fine-grained features are used instead of coarse ones.
The bounds supply a concrete way to compare different modality-selection strategies before training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of multimodal pipelines could prioritize extracting finer details from existing modalities rather than simply collecting additional modalities.
The same hierarchical-function-class approach might be testable in non-metric settings such as multimodal classification or generation tasks.
If the bounds hold under mild dependence between modalities, practitioners could use them to decide when to drop a redundant modality without harming performance.

Load-bearing premise

The analysis depends on being able to order the function classes for different modality subsets into a clear hierarchy without specifying exact statistical assumptions on how the modalities relate to one another.

What would settle it

Train the same multimodal metric learner on a fixed dataset once with coarse modality summaries and once with fine-grained features, then check whether the measured generalization gap on held-out pairs is smaller in the fine-grained case by the amount the derived bounds predict.

read the original abstract

Multimodal learning leverages the integration of diverse data modalities to enhance performance in complex tasks. Yet, it frequently encounters incomplete or redundant modality data in real-world scenarios. This paper presents a fine-grained theoretical analysis of the generalization properties of multimodal metric learning models, addressing critical gaps in understanding the relationship between modality selection and algorithmic performance. We establish hierarchical relationships between function classes corresponding to different modality subsets and quantify the discrepancy between learned mappings and ground truth. Through rigorous analysis of pairwise complexity within the multimodal learning framework, we derive novel generalization error bounds that reveal the joint impact of modality quantity and granularity on model performance. Our theoretical findings on both upper and lower bounds demonstrate that incorporating fine-grained modality features reduces the complexity of the hypothesis space by enhancing modality complementarity. This work offers both theoretical foundations and practical implications for improving convergence rates and accuracy in multimodal learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

read the letter

The paper sets up hierarchical function classes for modality subsets and derives upper and lower generalization bounds for pairwise multimodal metric learning that tighten with finer granularity, but the step from inclusion to smaller complexity depends on unstated conditions on the joint distribution. That is the main thing to know up front. It does lay out a clean framework: define F_S for each modality subset S, note the inclusion when S2 is finer than S1, then bound the Rademacher complexity of the pairwise loss and the discrepancy between the learned map and ground truth. The attempt to make the benefit of granularity explicit in terms of hypothesis-space size is useful for a field that mostly runs experiments. The bounds are presented as novel in the multimodal metric setting, and the lower bound direction is a nice touch that many theoretical papers skip. The soft spot is exactly where the stress-test note flags it. The inclusion F_S1 ⊃ F_S2 does not automatically shrink the complexity term unless the joint law of the modalities makes the discrepancy contract in a controlled way. If the modalities are arbitrarily dependent or the ground-truth metric is not Lipschitz in the finer features, the claimed reduction may not follow. The abstract and high-level description give no explicit measurability, independence, or moment conditions, so the full proofs need to supply them or the result stays conditional. The paper is aimed at theorists working on generalization for metric learning and multimodal models. A reader who wants formal statements about modality selection will find the setup worth reading, even if the bounds turn out to be incremental rather than transformative. I would send it to peer review. The formal structure is there, the claims are falsifiable, and referees can check whether the distributional assumptions are stated and sufficient.

Referee Report

2 major / 1 minor

Summary. The manuscript develops a theoretical analysis of generalization in pairwise multimodal metric learning. It defines function classes F_S over modality subsets S, establishes hierarchical inclusions between these classes for coarser vs. finer modality sets, quantifies the discrepancy between learned mappings and ground truth, and derives upper and lower generalization bounds on the pairwise loss that are claimed to improve when fine-grained modalities are incorporated because of reduced hypothesis-space complexity and enhanced modality complementarity.

Significance. If the central derivations hold under appropriate conditions, the work would supply explicit generalization guarantees linking modality granularity to hypothesis complexity in metric learning, offering a formal basis for modality-selection decisions and potential improvements in convergence rates.

major comments (2)

[§3.2] §3.2 (hierarchical function-class inclusions): The argument that F_{S_fine} ⊂ F_{S_coarse} yields strictly smaller Rademacher complexity or covering numbers for the pairwise metric-learning loss requires the discrepancy term (learned mapping vs. ground truth) to contract in a controlled way; no measurability, independence, or moment conditions on the joint law of the modalities are stated, so the inclusion does not automatically imply the claimed complexity reduction.
[§4.1, Theorem 2] §4.1, Theorem 2 (upper bound): The upper generalization bound is asserted to decrease with modality granularity via 'enhanced complementarity,' yet the proof sketch relies on the finer class producing a smaller discrepancy without an explicit Lipschitz or boundedness assumption on the ground-truth metric with respect to the added fine-grained features; this step is load-bearing for the main claim.

minor comments (1)

[§2] Notation for the pairwise loss and the discrepancy functional is introduced without a dedicated preliminary subsection, making it difficult to track which terms depend on which modality subset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive comments on our manuscript. We address each of the major comments in detail below, providing clarifications and indicating the revisions we will make to strengthen the theoretical rigor of our results.

read point-by-point responses

Referee: [§3.2] §3.2 (hierarchical function-class inclusions): The argument that F_{S_fine} ⊂ F_{S_coarse} yields strictly smaller Rademacher complexity or covering numbers for the pairwise metric-learning loss requires the discrepancy term (learned mapping vs. ground truth) to contract in a controlled way; no measurability, independence, or moment conditions on the joint law of the modalities are stated, so the inclusion does not automatically imply the claimed complexity reduction.

Authors: We agree that the hierarchical inclusion F_{S_fine} ⊂ F_{S_coarse} alone does not suffice without additional regularity conditions. In the original manuscript, we implicitly relied on standard assumptions from statistical learning theory, such as the measurability of the function classes and independence of samples. To make this explicit, we will revise §3.2 to include the following assumptions: (i) the joint distribution of modalities satisfies a conditional independence property given the target, and (ii) the discrepancy term is bounded by a Lipschitz constant with respect to the modality features. Under these conditions, the Rademacher complexity of the finer class is strictly smaller, supporting the claimed reduction. We have added a new lemma formalizing this contraction. revision: yes
Referee: [§4.1, Theorem 2] §4.1, Theorem 2 (upper bound): The upper generalization bound is asserted to decrease with modality granularity via 'enhanced complementarity,' yet the proof sketch relies on the finer class producing a smaller discrepancy without an explicit Lipschitz or boundedness assumption on the ground-truth metric with respect to the added fine-grained features; this step is load-bearing for the main claim.

Authors: Thank you for highlighting this critical point in the proof of Theorem 2. Upon review, we acknowledge that the contraction of the discrepancy term with finer modalities requires an explicit assumption on the ground-truth metric. We will revise the statement of Theorem 2 and its proof to include the assumption that the ground-truth metric is L-Lipschitz continuous with respect to the Euclidean norm on the concatenated feature space. This ensures that adding fine-grained features reduces the discrepancy by a factor proportional to the complementarity measure we define. We have also expanded the discussion on 'enhanced complementarity' to provide a formal definition in terms of mutual information or correlation between modalities. These changes will be incorporated in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: bounds derived from standard hierarchical function-class analysis

full rationale

The paper defines function classes F_S over modality subsets S, establishes inclusions based on granularity, and applies standard Rademacher or covering-number arguments to the pairwise metric-learning loss to obtain upper and lower generalization bounds. These steps are self-contained within statistical learning theory; the claimed reduction in effective complexity is presented as a consequence of the derived discrepancy terms rather than presupposed by definition or by a self-citation chain. No equations reduce the final bound to a fitted parameter or to an unverified prior result of the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on any free parameters, axioms, or invented entities used in the derivations.

pith-pipeline@v0.9.0 · 5446 in / 1095 out tokens · 78465 ms · 2026-05-09T15:14:23.386868+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

77 extracted references · 2 canonical work pages

[1]

Expert Systems with Applications , volume=

An overview of deep learning methods for multimodal medical data mining , author=. Expert Systems with Applications , volume=. 2022 , publisher=

2022
[2]

Nature medicine , volume=

Multimodal biomedical AI , author=. Nature medicine , volume=. 2022 , publisher=

2022
[3]

Expert Systems with Applications , volume=

A decision-making approach under uncertainty based on ensemble learning model with multimodal data and its application in medical diagnosis , author=. Expert Systems with Applications , volume=. 2025 , publisher=

2025
[4]

Structure and Interpretation of Computer Programs

Harold Abelson and Gerald Jay Sussman and Julie Sussman. Structure and Interpretation of Computer Programs. 1985

1985
[5]

arXiv preprint arXiv:2507.15765 , year=

Learning from heterogeneity: Generalizing dynamic facial expression recognition via distributionally robust optimization , author=. arXiv preprint arXiv:2507.15765 , year=

work page arXiv
[6]

Advances in Neural Information Processing Systems , volume=

Global non-convex optimization with discretized diffusions , author=. Advances in Neural Information Processing Systems , volume=
[7]

International Conference on Artificial Intelligence and Statistics , pages=

Understanding multimodal contrastive learning and incorporating unpaired data , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2023 , organization=

2023
[8]

Machine Learning , volume=

Generalization bounds for metric and similarity learning , author=. Machine Learning , volume=. 2016 , publisher=

2016
[9]

Algorithmic learning theory , pages=

Contrastive learning, multi-view redundancy, and linear models , author=. Algorithmic learning theory , pages=. 2021 , organization=

2021
[10]

Advances in neural information processing systems , volume=

Regularized distance metric learning: Theory and algorithm , author=. Advances in neural information processing systems , volume=
[11]

International Conference on Artificial Intelligence and Statistics , pages=

Multi-view metric learning in vector-valued kernel spaces , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=

2018
[12]

Neurocomputing , volume=

Robustness and generalization for metric learning , author=. Neurocomputing , volume=. 2015 , publisher=

2015
[13]

Advances in Neural Information Processing Systems , volume=

What makes multi-modal learning better than single (provably) , author=. Advances in Neural Information Processing Systems , volume=
[14]

The Annals of Statistics , volume=

Ranking and empirical minimization of U-statistics , author=. The Annals of Statistics , volume=
[15]

2018 , publisher=

Foundations of machine learning , author=. 2018 , publisher=

2018
[16]

Journal of Machine Learning Research , volume=

Rademacher and Gaussian complexities: Risk bounds and structural results , author=. Journal of Machine Learning Research , volume=
[17]

Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies , volume=

Multimodal deep learning for activity and context recognition , author=. Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies , volume=. 2018 , publisher=

2018
[18]

Proceedings of the SIGCHI conference on Human factors in computing systems , pages=

A generic platform for addressing the multimodal challenge , author=. Proceedings of the SIGCHI conference on Human factors in computing systems , pages=
[19]

Nature Machine Intelligence , volume=

Multimodal learning with graphs , author=. Nature Machine Intelligence , volume=. 2023 , publisher=

2023
[20]

Visual Information Extraction with Lixto

Robert Baumgartner and Georg Gottlob and Sergio Flesca. Visual Information Extraction with Lixto. Proceedings of the 27th International Conference on Very Large Databases. 2001

2001
[21]

Resonance , volume=

Mahalanobis distance , author=. Resonance , volume=
[22]

Brachman and James G

Ronald J. Brachman and James G. Schmolze. An overview of the KL-ONE knowledge representation system. Cognitive Science. 1985

1985
[23]

Complexity results for nonmonotonic logics

Georg Gottlob. Complexity results for nonmonotonic logics. Journal of Logic and Computation. 1992

1992
[24]

Hypertree Decompositions and Tractable Queries

Georg Gottlob and Nicola Leone and Francesco Scarcello. Hypertree Decompositions and Tractable Queries. Journal of Computer and System Sciences. 2002

2002
[25]

Levesque

Hector J. Levesque. Foundations of a functional approach to knowledge representation. Artificial Intelligence. 1984

1984
[26]

Levesque

Hector J. Levesque. A logic of implicit and explicit belief. Proceedings of the Fourth National Conference on Artificial Intelligence. 1984

1984
[27]

On the compilability and expressive power of propositional planning formalisms

Bernhard Nebel. On the compilability and expressive power of propositional planning formalisms. Journal of Artificial Intelligence Research. 2000

2000
[28]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Deep supervised cross-modal retrieval , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[29]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Cross modal retrieval with querybank normalisation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[30]

Progress in Biomedical Engineering , volume=

Deep multimodal fusion of image and non-image data in disease diagnosis and prognosis: a review , author=. Progress in Biomedical Engineering , volume=. 2023 , publisher=

2023
[31]

Informatics in Medicine Unlocked , volume=

An intelligent multimodal medical diagnosis system based on patients’ medical questions and structured symptoms for telemedicine , author=. Informatics in Medicine Unlocked , volume=. 2021 , publisher=

2021
[32]

Artificial Intelligence in Medicine , volume=

Multiple representations and multi-modal reasoning in medical diagnostic systems , author=. Artificial Intelligence in Medicine , volume=. 2001 , publisher=

2001
[33]

2019 international conference on robotics and automation (icra) , pages=

Multimodal trajectory predictions for autonomous driving using deep convolutional networks , author=. 2019 international conference on robotics and automation (icra) , pages=. 2019 , organization=

2019
[34]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

A survey on multimodal large language models for autonomous driving , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=
[35]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Hierarchical multimodal metric learning for multimodal classification , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[36]

IEEE transactions on cybernetics , volume=

Deep multimodal distance metric learning using click constraints for image ranking , author=. IEEE transactions on cybernetics , volume=. 2016 , publisher=

2016
[37]

Pattern Recognition , volume=

Graph-based multimodal fusion with metric learning for multimodal classification , author=. Pattern Recognition , volume=. 2019 , publisher=

2019
[38]

Self-supervised learning from a multi-view perspective

Self-supervised learning from a multi-view perspective , author=. arXiv preprint arXiv:2006.05576 , year=

work page arXiv 2006
[39]

International Conference on Machine Learning , pages=

Hyperbolic image-text representations , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[40]

Proceedings of the AAAI conference on artificial intelligence , volume=

Smil: Multimodal learning with severely missing modality , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
[41]

Advances in neural information processing systems , volume=

Similarity-based learning via data driven embeddings , author=. Advances in neural information processing systems , volume=
[42]

Advances in neural information processing systems , volume=

Distance metric learning for large margin nearest neighbor classification , author=. Advances in neural information processing systems , volume=
[43]

Advances in Neural Information Processing Systems , volume=

Predicting what you already know helps: Provable self-supervised learning , author=. Advances in Neural Information Processing Systems , volume=
[44]

2009 , publisher=

Generalization bounds for learning the kernel , author=. 2009 , publisher=

2009
[45]

International Conference on Machine Learning , pages=

A theoretical analysis of metric hypothesis transfer learning , author=. International Conference on Machine Learning , pages=. 2015 , organization=

2015
[46]

International conference on machine learning , pages=

A theoretical analysis of contrastive unsupervised representation learning , author=. International conference on machine learning , pages=. 2019 , organization=

2019
[47]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Polos: Multimodal metric learning from human feedback for image captioning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[48]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Crossclr: Cross-modal contrastive learning for multi-modal video representations , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[49]

Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

Relaxing contrastiveness in multimodal representation learning , author=. Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=
[50]

Chinese Conference on Pattern Recognition and Computer Vision (PRCV) , pages=

Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment , author=. Chinese Conference on Pattern Recognition and Computer Vision (PRCV) , pages=. 2024 , organization=

2024
[51]

, author=

Cross-Modal Sentiment Analysis Based on CLIP Image-Text Attention Interaction. , author=. International Journal of Advanced Computer Science & Applications , volume=
[52]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[53]

Analysis and Applications , volume=

Generalization analysis of multi-modal metric learning , author=. Analysis and Applications , volume=. 2016 , publisher=

2016
[54]

Advances in Neural Information Processing Systems , volume=

A theory of multimodal learning , author=. Advances in Neural Information Processing Systems , volume=
[55]

Signal Processing , volume=

Modeling intra-and inter-pair correlation via heterogeneous high-order preserving for cross-modal retrieval , author=. Signal Processing , volume=. 2017 , publisher=

2017
[56]

World Wide Web , volume=

High-order nonlocal hashing for unsupervised cross-modal retrieval , author=. World Wide Web , volume=. 2021 , publisher=

2021
[57]

Sensors , volume=

A fine-grained semantic alignment method specific to aggregate multi-scale information for cross-modal remote sensing image retrieval , author=. Sensors , volume=. 2023 , publisher=

2023
[58]

IEEE Transactions on Image Processing , volume=

Coarse-to-fine semantic alignment for cross-modal moment localization , author=. IEEE Transactions on Image Processing , volume=. 2021 , publisher=

2021
[59]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Mukea: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
[60]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[61]

Proceedings of the IEEE , volume=

Multimodal classification of remote sensing images: A review and future directions , author=. Proceedings of the IEEE , volume=. 2015 , publisher=

2015
[62]

Science China Information Sciences , volume=

From single-to multi-modal remote sensing imagery interpretation: A survey and taxonomy , author=. Science China Information Sciences , volume=. 2023 , publisher=

2023
[63]

2022 IEEE international symposium on hardware oriented security and trust (HOST) , pages=

Hardware trojan detection at lut: Where structural features meet behavioral characteristics , author=. 2022 IEEE international symposium on hardware oriented security and trust (HOST) , pages=. 2022 , organization=

2022
[64]

2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) , pages=

Automated hardware trojan detection at LUT using explainable graph neural networks , author=. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) , pages=. 2023 , organization=

2023
[65]

2024 IEEE International Test Conference in Asia (ITC-Asia) , pages=

Pinpointing hardware trojans through semantic feature extraction and natural language processing , author=. 2024 IEEE International Test Conference in Asia (ITC-Asia) , pages=. 2024 , organization=

2024
[66]

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , year=

Towards precise and explainable hardware Trojan localization at LUT level , author=. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , year=
[67]

2022 7th International Conference on Computer and Communication Systems (ICCCS) , pages=

Robustness of classifier to adversarial examples under imbalanced data , author=. 2022 7th International Conference on Computer and Communication Systems (ICCCS) , pages=. 2022 , organization=

2022
[68]

Applied Intelligence , volume=

Robust variable structure discovery based on tilted empirical risk minimization , author=. Applied Intelligence , volume=. 2023 , publisher=

2023
[69]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Stepdown SLOPE for controlled feature selection , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[70]

Frontiers of Computer Science , volume=

Neural partially linear additive model , author=. Frontiers of Computer Science , volume=. 2024 , publisher=

2024
[71]

2024 International Joint Conference on Neural Networks (IJCNN) , pages=

Improved Concentration Bound for CVaR , author=. 2024 International Joint Conference on Neural Networks (IJCNN) , pages=. 2024 , organization=

2024
[72]

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence , pages=

Fine-grained analysis of stability and generalization for stochastic bilevel optimization , author=. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence , pages=
[73]

Expert Systems with Applications , volume=

Error Density-dependent Empirical Risk Minimization , author=. Expert Systems with Applications , volume=. 2024 , publisher=

2024
[74]

2024 IEEE International Conference on Data Mining (ICDM) , pages=

Generalized Sparse Additive Model with Unknown Link Function , author=. 2024 IEEE International Conference on Data Mining (ICDM) , pages=. 2024 , organization=

2024
[75]

Forty-second International Conference on Machine Learning , year=

On the Generalization Ability of Next-Token-Prediction Pretraining , author=. Forty-second International Conference on Machine Learning , year=
[76]

Journal of Numerical Simulations in Physics and Mathematics , volume=

On the Convergence of Nonconcave-Nonconvex Max-Min Optimization Problem , author=. Journal of Numerical Simulations in Physics and Mathematics , volume=. 2025 , publisher=

2025
[77]

Proceedings of the 34th ACM International Conference on Information and Knowledge Management , pages=

Interpretable Meta-weighting Sparse Neural Additive Networks for Datasets with Label Noise and Class Imbalance , author=. Proceedings of the 34th ACM International Conference on Information and Knowledge Management , pages=

[1] [1]

Expert Systems with Applications , volume=

An overview of deep learning methods for multimodal medical data mining , author=. Expert Systems with Applications , volume=. 2022 , publisher=

2022

[2] [2]

Nature medicine , volume=

Multimodal biomedical AI , author=. Nature medicine , volume=. 2022 , publisher=

2022

[3] [3]

Expert Systems with Applications , volume=

A decision-making approach under uncertainty based on ensemble learning model with multimodal data and its application in medical diagnosis , author=. Expert Systems with Applications , volume=. 2025 , publisher=

2025

[4] [4]

Structure and Interpretation of Computer Programs

Harold Abelson and Gerald Jay Sussman and Julie Sussman. Structure and Interpretation of Computer Programs. 1985

1985

[5] [5]

arXiv preprint arXiv:2507.15765 , year=

Learning from heterogeneity: Generalizing dynamic facial expression recognition via distributionally robust optimization , author=. arXiv preprint arXiv:2507.15765 , year=

work page arXiv

[6] [6]

Advances in Neural Information Processing Systems , volume=

Global non-convex optimization with discretized diffusions , author=. Advances in Neural Information Processing Systems , volume=

[7] [7]

International Conference on Artificial Intelligence and Statistics , pages=

Understanding multimodal contrastive learning and incorporating unpaired data , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2023 , organization=

2023

[8] [8]

Machine Learning , volume=

Generalization bounds for metric and similarity learning , author=. Machine Learning , volume=. 2016 , publisher=

2016

[9] [9]

Algorithmic learning theory , pages=

Contrastive learning, multi-view redundancy, and linear models , author=. Algorithmic learning theory , pages=. 2021 , organization=

2021

[10] [10]

Advances in neural information processing systems , volume=

Regularized distance metric learning: Theory and algorithm , author=. Advances in neural information processing systems , volume=

[11] [11]

International Conference on Artificial Intelligence and Statistics , pages=

Multi-view metric learning in vector-valued kernel spaces , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=

2018

[12] [12]

Neurocomputing , volume=

Robustness and generalization for metric learning , author=. Neurocomputing , volume=. 2015 , publisher=

2015

[13] [13]

Advances in Neural Information Processing Systems , volume=

What makes multi-modal learning better than single (provably) , author=. Advances in Neural Information Processing Systems , volume=

[14] [14]

The Annals of Statistics , volume=

Ranking and empirical minimization of U-statistics , author=. The Annals of Statistics , volume=

[15] [15]

2018 , publisher=

Foundations of machine learning , author=. 2018 , publisher=

2018

[16] [16]

Journal of Machine Learning Research , volume=

Rademacher and Gaussian complexities: Risk bounds and structural results , author=. Journal of Machine Learning Research , volume=

[17] [17]

Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies , volume=

Multimodal deep learning for activity and context recognition , author=. Proceedings of the ACM on interactive, mobile, wearable and ubiquitous technologies , volume=. 2018 , publisher=

2018

[18] [18]

Proceedings of the SIGCHI conference on Human factors in computing systems , pages=

A generic platform for addressing the multimodal challenge , author=. Proceedings of the SIGCHI conference on Human factors in computing systems , pages=

[19] [19]

Nature Machine Intelligence , volume=

Multimodal learning with graphs , author=. Nature Machine Intelligence , volume=. 2023 , publisher=

2023

[20] [20]

Visual Information Extraction with Lixto

Robert Baumgartner and Georg Gottlob and Sergio Flesca. Visual Information Extraction with Lixto. Proceedings of the 27th International Conference on Very Large Databases. 2001

2001

[21] [21]

Resonance , volume=

Mahalanobis distance , author=. Resonance , volume=

[22] [22]

Brachman and James G

Ronald J. Brachman and James G. Schmolze. An overview of the KL-ONE knowledge representation system. Cognitive Science. 1985

1985

[23] [23]

Complexity results for nonmonotonic logics

Georg Gottlob. Complexity results for nonmonotonic logics. Journal of Logic and Computation. 1992

1992

[24] [24]

Hypertree Decompositions and Tractable Queries

Georg Gottlob and Nicola Leone and Francesco Scarcello. Hypertree Decompositions and Tractable Queries. Journal of Computer and System Sciences. 2002

2002

[25] [25]

Levesque

Hector J. Levesque. Foundations of a functional approach to knowledge representation. Artificial Intelligence. 1984

1984

[26] [26]

Levesque

Hector J. Levesque. A logic of implicit and explicit belief. Proceedings of the Fourth National Conference on Artificial Intelligence. 1984

1984

[27] [27]

On the compilability and expressive power of propositional planning formalisms

Bernhard Nebel. On the compilability and expressive power of propositional planning formalisms. Journal of Artificial Intelligence Research. 2000

2000

[28] [28]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Deep supervised cross-modal retrieval , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[29] [29]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Cross modal retrieval with querybank normalisation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[30] [30]

Progress in Biomedical Engineering , volume=

Deep multimodal fusion of image and non-image data in disease diagnosis and prognosis: a review , author=. Progress in Biomedical Engineering , volume=. 2023 , publisher=

2023

[31] [31]

Informatics in Medicine Unlocked , volume=

An intelligent multimodal medical diagnosis system based on patients’ medical questions and structured symptoms for telemedicine , author=. Informatics in Medicine Unlocked , volume=. 2021 , publisher=

2021

[32] [32]

Artificial Intelligence in Medicine , volume=

Multiple representations and multi-modal reasoning in medical diagnostic systems , author=. Artificial Intelligence in Medicine , volume=. 2001 , publisher=

2001

[33] [33]

2019 international conference on robotics and automation (icra) , pages=

Multimodal trajectory predictions for autonomous driving using deep convolutional networks , author=. 2019 international conference on robotics and automation (icra) , pages=. 2019 , organization=

2019

[34] [34]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

A survey on multimodal large language models for autonomous driving , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

[35] [35]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Hierarchical multimodal metric learning for multimodal classification , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

[36] [36]

IEEE transactions on cybernetics , volume=

Deep multimodal distance metric learning using click constraints for image ranking , author=. IEEE transactions on cybernetics , volume=. 2016 , publisher=

2016

[37] [37]

Pattern Recognition , volume=

Graph-based multimodal fusion with metric learning for multimodal classification , author=. Pattern Recognition , volume=. 2019 , publisher=

2019

[38] [38]

Self-supervised learning from a multi-view perspective

Self-supervised learning from a multi-view perspective , author=. arXiv preprint arXiv:2006.05576 , year=

work page arXiv 2006

[39] [39]

International Conference on Machine Learning , pages=

Hyperbolic image-text representations , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[40] [40]

Proceedings of the AAAI conference on artificial intelligence , volume=

Smil: Multimodal learning with severely missing modality , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

[41] [41]

Advances in neural information processing systems , volume=

Similarity-based learning via data driven embeddings , author=. Advances in neural information processing systems , volume=

[42] [42]

Advances in neural information processing systems , volume=

Distance metric learning for large margin nearest neighbor classification , author=. Advances in neural information processing systems , volume=

[43] [43]

Advances in Neural Information Processing Systems , volume=

Predicting what you already know helps: Provable self-supervised learning , author=. Advances in Neural Information Processing Systems , volume=

[44] [44]

2009 , publisher=

Generalization bounds for learning the kernel , author=. 2009 , publisher=

2009

[45] [45]

International Conference on Machine Learning , pages=

A theoretical analysis of metric hypothesis transfer learning , author=. International Conference on Machine Learning , pages=. 2015 , organization=

2015

[46] [46]

International conference on machine learning , pages=

A theoretical analysis of contrastive unsupervised representation learning , author=. International conference on machine learning , pages=. 2019 , organization=

2019

[47] [47]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Polos: Multimodal metric learning from human feedback for image captioning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[48] [48]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Crossclr: Cross-modal contrastive learning for multi-modal video representations , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

[49] [49]

Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

Relaxing contrastiveness in multimodal representation learning , author=. Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

[50] [50]

Chinese Conference on Pattern Recognition and Computer Vision (PRCV) , pages=

Enhancing Multi-modal Contrastive Learning via Optimal Transport-Based Consistent Modality Alignment , author=. Chinese Conference on Pattern Recognition and Computer Vision (PRCV) , pages=. 2024 , organization=

2024

[51] [51]

, author=

Cross-Modal Sentiment Analysis Based on CLIP Image-Text Attention Interaction. , author=. International Journal of Advanced Computer Science & Applications , volume=

[52] [52]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[53] [53]

Analysis and Applications , volume=

Generalization analysis of multi-modal metric learning , author=. Analysis and Applications , volume=. 2016 , publisher=

2016

[54] [54]

Advances in Neural Information Processing Systems , volume=

A theory of multimodal learning , author=. Advances in Neural Information Processing Systems , volume=

[55] [55]

Signal Processing , volume=

Modeling intra-and inter-pair correlation via heterogeneous high-order preserving for cross-modal retrieval , author=. Signal Processing , volume=. 2017 , publisher=

2017

[56] [56]

World Wide Web , volume=

High-order nonlocal hashing for unsupervised cross-modal retrieval , author=. World Wide Web , volume=. 2021 , publisher=

2021

[57] [57]

Sensors , volume=

A fine-grained semantic alignment method specific to aggregate multi-scale information for cross-modal remote sensing image retrieval , author=. Sensors , volume=. 2023 , publisher=

2023

[58] [58]

IEEE Transactions on Image Processing , volume=

Coarse-to-fine semantic alignment for cross-modal moment localization , author=. IEEE Transactions on Image Processing , volume=. 2021 , publisher=

2021

[59] [59]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Mukea: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[60] [60]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[61] [61]

Proceedings of the IEEE , volume=

Multimodal classification of remote sensing images: A review and future directions , author=. Proceedings of the IEEE , volume=. 2015 , publisher=

2015

[62] [62]

Science China Information Sciences , volume=

From single-to multi-modal remote sensing imagery interpretation: A survey and taxonomy , author=. Science China Information Sciences , volume=. 2023 , publisher=

2023

[63] [63]

2022 IEEE international symposium on hardware oriented security and trust (HOST) , pages=

Hardware trojan detection at lut: Where structural features meet behavioral characteristics , author=. 2022 IEEE international symposium on hardware oriented security and trust (HOST) , pages=. 2022 , organization=

2022

[64] [64]

2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) , pages=

Automated hardware trojan detection at LUT using explainable graph neural networks , author=. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) , pages=. 2023 , organization=

2023

[65] [65]

2024 IEEE International Test Conference in Asia (ITC-Asia) , pages=

Pinpointing hardware trojans through semantic feature extraction and natural language processing , author=. 2024 IEEE International Test Conference in Asia (ITC-Asia) , pages=. 2024 , organization=

2024

[66] [66]

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , year=

Towards precise and explainable hardware Trojan localization at LUT level , author=. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , year=

[67] [67]

2022 7th International Conference on Computer and Communication Systems (ICCCS) , pages=

Robustness of classifier to adversarial examples under imbalanced data , author=. 2022 7th International Conference on Computer and Communication Systems (ICCCS) , pages=. 2022 , organization=

2022

[68] [68]

Applied Intelligence , volume=

Robust variable structure discovery based on tilted empirical risk minimization , author=. Applied Intelligence , volume=. 2023 , publisher=

2023

[69] [69]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Stepdown SLOPE for controlled feature selection , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[70] [70]

Frontiers of Computer Science , volume=

Neural partially linear additive model , author=. Frontiers of Computer Science , volume=. 2024 , publisher=

2024

[71] [71]

2024 International Joint Conference on Neural Networks (IJCNN) , pages=

Improved Concentration Bound for CVaR , author=. 2024 International Joint Conference on Neural Networks (IJCNN) , pages=. 2024 , organization=

2024

[72] [72]

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence , pages=

Fine-grained analysis of stability and generalization for stochastic bilevel optimization , author=. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence , pages=

[73] [73]

Expert Systems with Applications , volume=

Error Density-dependent Empirical Risk Minimization , author=. Expert Systems with Applications , volume=. 2024 , publisher=

2024

[74] [74]

2024 IEEE International Conference on Data Mining (ICDM) , pages=

Generalized Sparse Additive Model with Unknown Link Function , author=. 2024 IEEE International Conference on Data Mining (ICDM) , pages=. 2024 , organization=

2024

[75] [75]

Forty-second International Conference on Machine Learning , year=

On the Generalization Ability of Next-Token-Prediction Pretraining , author=. Forty-second International Conference on Machine Learning , year=

[76] [76]

Journal of Numerical Simulations in Physics and Mathematics , volume=

On the Convergence of Nonconcave-Nonconvex Max-Min Optimization Problem , author=. Journal of Numerical Simulations in Physics and Mathematics , volume=. 2025 , publisher=

2025

[77] [77]

Proceedings of the 34th ACM International Conference on Information and Knowledge Management , pages=

Interpretable Meta-weighting Sparse Neural Additive Networks for Datasets with Label Noise and Class Imbalance , author=. Proceedings of the 34th ACM International Conference on Information and Knowledge Management , pages=