SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction

Bokun Wang; Daren Zha; Jun Xiao; Miaobo Hu; Rui Chen; Shuhao Hu; Xiaobo Guo; Xin Wang

arXiv: 2605.20713 · v1 · pith:4XYEQUSLnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI · cs.LG

Miaobo Hu, Shuhao Hu, Bokun Wang, Rui Chen, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

Pith reviewed 2026-05-21 05:22 UTC · model grok-4.3

keywords: multimodal information extraction · named entity recognition · relation extraction · selective vision · conformal prediction · social media analysis · efficient multimodal models

The pith

SAVER uses a conformal gate to activate vision only for groundable spans and pairs, then selects a compact image subset for multimodal extraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAVER to handle multimodal information extraction from social media posts that often have multiple weakly related or misleading images. It decides for each span or entity pair whether to consult vision at all using a Conformal Groundability Gate that calibrates the threshold to control risk. When activated, a submodular selector picks a small diverse and relevant set of images, which a Set Transformer aggregates before a joint scoring head makes the final prediction. This selective approach avoids unnecessary computation and spurious correlations from always-on fusion. If the method works as claimed, it would enable more reliable and efficient extraction of entities and relations from noisy multimodal social media content.

Core claim

SAVER is a selective as-needed vision evidence framework for multimodal named entity recognition (MNER) and multimodal relation extraction (MRE). Its Conformal Groundability Gate estimates visual groundability at the span level for MNER and at the pair level for MRE, and calibrates the activation threshold on a held-out split using a conformal-style procedure with Clopper-Pearson upper bounds. When the gate activates, a submodular relevance-diversity selector picks a compact evidence subset, a Set Transformer aggregates it, and an energy-inspired joint scoring head combines the signals. Experiments report consistent F1 gains over text-only and always-on multimodal baselines, alongside lower AURC, higher coverage at a fixed risk level, and lower FLOPs and P90 latency.
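To make the calibration step concrete, here is a minimal sketch of choosing a gate threshold from a held-out split with a one-sided Clopper-Pearson upper bound. It is an illustration under stated assumptions, not the paper's procedure: the function names, the `delta` confidence parameter, and the definition of an "error" (an activated instance whose prediction is wrong) are all hypothetical, and the sketch does not correct for scanning several candidate thresholds.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson_upper(k, n, delta=0.05):
    """One-sided (1 - delta) upper bound on an error rate after observing
    k errors in n trials, found by bisection on the binomial CDF."""
    if n == 0 or k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        # The CDF decreases in p, so a CDF above delta means mid is too small.
        if binom_cdf(k, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

def calibrate_threshold(scores, errors, alpha=0.10, delta=0.05):
    """Return the lowest threshold tau such that the instances with
    score >= tau have a Clopper-Pearson upper bound of at most alpha.
    scores: gate scores on the calibration split (hypothetical input)
    errors: 1 if the activated prediction was wrong, else 0
    Returns None when no threshold satisfies the bound."""
    best = None
    for tau in sorted(set(scores), reverse=True):
        active = [e for s, e in zip(scores, errors) if s >= tau]
        if clopper_pearson_upper(sum(active), len(active), delta) <= alpha:
            best = tau  # a lower tau activates vision more often
    return best
```

A lower returned threshold means vision is consulted for more spans or pairs while the bounded error rate on the calibration split stays at or below `alpha`.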

What carries the argument

The Conformal Groundability Gate (CGG), which estimates whether visual evidence is trustworthy for a given span or marked entity pair and calibrates the decision threshold to bound risk.

If this is right

SAVER improves F1 scores compared to strong text-only baselines and always-on multimodal models.
It reduces the area under the risk-coverage curve (AURC).
It increases the fraction of instances covered at a fixed risk level.
It lowers computational cost measured in FLOPs and reduces P90 latency.
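The two selective-prediction metrics in this list can be sketched as follows. This uses the common textbook definition of a risk-coverage curve (accept instances in order of decreasing confidence and track the error rate among those accepted); the paper's "activation coverage" may be defined differently, and all names here are hypothetical.

```python
def risk_coverage_curve(confidences, errors):
    """Return (coverage, selective_risk) points, one per accepted prefix,
    after sorting instances from most to least confident."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    points, wrong = [], 0
    for rank, i in enumerate(order, start=1):
        wrong += errors[i]
        points.append((rank / len(order), wrong / rank))
    return points

def aurc(points):
    """Area under the risk-coverage curve, approximated as the mean
    selective risk over all coverage levels (lower is better)."""
    return sum(risk for _, risk in points) / len(points)

def coverage_at_risk(points, max_risk):
    """Largest coverage whose selective risk is at most max_risk."""
    ok = [cov for cov, risk in points if risk <= max_risk]
    return max(ok) if ok else 0.0

# Toy input for illustration only, not data from the paper.
points = risk_coverage_curve([0.9, 0.8, 0.7, 0.6], [0, 0, 1, 1])
print(aurc(points), coverage_at_risk(points, 0.10))
```

A model that ranks its mistakes last gets a lower AURC and a higher coverage at any fixed risk level, which is the direction of improvement claimed here.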

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This selective mechanism could extend to other multimodal tasks with redundant or noisy visual inputs, such as visual question answering in social contexts.
By avoiding full fusion, the framework might scale better to posts with many attached images without proportional compute increase.
Adapting the conformal calibration to streaming data or cross-domain shifts could further improve robustness without retraining.

Load-bearing premise

The conformal calibration of the groundability threshold on a held-out split produces a reliable risk bound that transfers to the test distribution and new social-media domains without retraining or retuning.

What would settle it

Observing that the empirical risk on a new test set from a different social media domain exceeds the calibrated upper bound, or finding that F1 improvements vanish when the selective gate is replaced by a fixed or non-conformal threshold.
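The first check described above could be run roughly as follows: group test instances by domain, compute the empirical risk among instances where the gate fired, and flag any domain whose risk exceeds the calibrated level. The record layout `(domain, gate_score, is_error)` and the function names are assumptions made for this sketch.

```python
from collections import defaultdict

def risk_by_domain(records, threshold):
    """records: iterable of (domain, gate_score, is_error) tuples.
    Returns {domain: (n_activated, empirical_risk)} computed only over
    instances whose gate score reaches the threshold."""
    counts = defaultdict(lambda: [0, 0])  # domain -> [activated, errors]
    for domain, score, is_error in records:
        if score >= threshold:
            counts[domain][0] += 1
            counts[domain][1] += int(is_error)
    return {d: (n, k / n) for d, (n, k) in counts.items()}

def violations(per_domain, alpha):
    """Domains whose empirical risk exceeds the target risk level alpha."""
    return sorted(d for d, (_, risk) in per_domain.items() if risk > alpha)
```

With few activated instances per domain the empirical risk is noisy, so a flagged domain is a prompt for a closer look rather than proof that the bound failed.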

Figures

Figures reproduced from arXiv: 2605.20713 by Bokun Wang, Daren Zha, Jun Xiao, Miaobo Hu, Rui Chen, Shuhao Hu, Xiaobo Guo, Xin Wang.

**Figure 1.** SAVER overview: text and vision are encoded in parallel; CGG decides whether to activate vision; when activated, SIS/RES together with a Set Transformer build a compact multi-image evidence set that is fused with text before energy-inspired joint scoring for MNER or MRE. For each image $I_i$, the vision encoder outputs a global vector $z_i$ and region vectors $\{z_{i,m}\}$, projected as $v_i = P_v(z_i)$, $v_{i,m} = P_v(z_{i,m})$, (…

**Figure 2.** Risk–activation-coverage curves on MRE-MI. SAVER uses split-calibrated CGG with α = 0.10; baselines use confidence thresholding. AURC is shown in the legend.

Original abstract

Multimodal IE in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this setting, always-on multimodal fusion wastes computation and can amplify spurious visual cues. The core challenge is to decide, for each candidate span or marked entity pair, whether vision should be consulted at all and, if so, which small subset of images provides trustworthy evidence. We propose SAVER, a selective vision-as-needed framework for multimodal named entity recognition and multimodal relation extraction. SAVER uses a Conformal Groundability Gate (CGG) to estimate span-level visual groundability in MNER, derive pair-level activation in MRE from the two marked entities, and calibrate the activation threshold on a held-out split via a conformal-style procedure with Clopper-Pearson upper bounds. When activated, a submodular relevance-diversity selector chooses a compact evidence subset across images, which is then aggregated by a Set Transformer. An energy-inspired joint scoring head combines text, optional visual evidence, text-image consistency, and sparse routing for entity typing or relation classification. Experiments show that SAVER consistently improves F1 over strong text-only and always-on multimodal baselines, while reducing AURC, increasing activation coverage at a fixed risk level, and lowering FLOPs and P90 latency.
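The relevance-diversity selection described in the abstract can be illustrated with a greedy routine over a facility-location style objective. This page does not give the paper's exact objective, so the function below is a stand-in chosen because it is monotone submodular when relevance and similarity values are non-negative; the names `relevance`, `similarity`, `lam`, and `k` are hypothetical.

```python
def greedy_select(relevance, similarity, k, lam=0.5):
    """Greedily maximise, subject to |S| <= k,
        F(S) = sum_{i in S} relevance[i]
             + lam * sum_j max_{i in S} similarity[j][i].
    The second term rewards covering images not yet represented in S,
    so a near-duplicate of an already chosen image adds little.
    Assumes non-negative relevance and similarity values."""
    n = len(relevance)
    selected = []
    best_cover = [0.0] * n  # best similarity of each image to the chosen set
    for _ in range(min(k, n)):
        best_gain, best_i = 0.0, None
        for i in range(n):
            if i in selected:
                continue
            cover_gain = sum(max(similarity[j][i] - best_cover[j], 0.0)
                             for j in range(n))
            gain = relevance[i] + lam * cover_gain
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:  # no remaining image adds value
            break
        selected.append(best_i)
        best_cover = [max(best_cover[j], similarity[j][best_i])
                      for j in range(n)]
    return selected

# Toy example: images 0 and 1 are near-duplicates, image 2 is distinct.
rel = [0.9, 0.85, 0.5]
sim = [[1.0, 0.95, 0.1],
       [0.95, 1.0, 0.1],
       [0.1, 0.1, 1.0]]
print(greedy_select(rel, sim, k=2, lam=1.0))  # -> [0, 2]
print(greedy_select(rel, sim, k=2, lam=0.0))  # -> [0, 1]
```

With the diversity weight switched off the selector simply takes the two most relevant images, duplicates included; with it on, the second pick moves to the distinct image.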

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SAVER combines conformal gating with submodular image selection to cut unnecessary vision in social-media multimodal IE, but the reported gains rest on thin evidence and the calibration may not transfer cleanly.

The main point is that SAVER adds a conformal groundability gate to decide per span or entity pair whether to pull in any images at all, then uses a submodular selector to pick a small relevant subset when it does activate. This is aimed squarely at the common case of multiple weakly related or noisy images attached to a social media post. The energy-inspired scoring head that folds in text-image consistency is a reasonable way to handle the optional visual input without forcing fusion every time. The pipeline is new in how it wires conformal calibration, pair-level activation for relations, and the selector together for MNER and MRE.

The paper does a clean job laying out the efficiency motivation and claims lower FLOPs, lower P90 latency, and better F1 than both text-only and always-on multimodal baselines. The use of Clopper-Pearson bounds on a held-out split for the activation threshold is a standard, non-circular choice.

That said, the abstract gives no concrete numbers, no error bars, and no ablation that isolates the gate from the selector or the scoring head, so it is difficult to judge how much of the improvement comes from the selective mechanism versus other modeling choices. The stress-test concern about distribution shift is fair: if visual-text alignment statistics differ on the test set or in a new domain, the fixed threshold can either over-activate and lose the efficiency gain or under-activate and drop useful evidence.

The paper would be most useful to people working on noisy multimodal extraction from social media or similar user-generated content where compute and spurious visual cues are real issues. A reader already familiar with conformal prediction or submodular selection would see the integration quickly. It is coherent enough on its own terms to deserve a serious referee, mainly to check whether the empirical controls and calibration transfer hold up under closer inspection. I would send it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SAVER, a selective as-needed vision evidence framework for multimodal information extraction tasks such as named entity recognition and relation extraction in social media. It introduces a Conformal Groundability Gate (CGG) that uses a conformal-style calibration with Clopper-Pearson upper bounds on a held-out split to decide when to activate vision for spans or entity pairs. When activated, a submodular relevance-diversity selector picks a compact set of images, which are aggregated using a Set Transformer. An energy-inspired joint scoring head combines text and optional visual evidence for the final prediction. Experiments demonstrate consistent F1 improvements over text-only and always-on multimodal baselines, along with reductions in AURC, higher activation coverage at fixed risk, and lower computational costs.

Significance. If the selective mechanism reliably maintains performance while improving efficiency and the conformal calibration provides transferable risk control, this work could have significant impact on deploying multimodal models in resource-constrained or noisy environments like social media analysis. The combination of conformal prediction for selective activation and submodular selection for evidence is a strength, offering a principled way to avoid unnecessary computation and spurious visual cues. The paper ships a clear description of the calibration procedure which aids reproducibility.

major comments (2)

[Abstract and §3, Conformal Groundability Gate] The central claim of improved F1 with controlled risk and efficiency gains depends on the CGG producing trustworthy activation decisions that transfer beyond the calibration split. The description indicates a conformal-style procedure with Clopper-Pearson upper bounds on a held-out split to set the groundability threshold, but provides no empirical verification that the resulting risk bound holds on the test distribution or under domain shifts common in social-media data. This is load-bearing for the selective advantage over always-on multimodal baselines.
[Experiments] The abstract reports consistent F1 gains, reduced AURC, increased coverage at fixed risk, and lower FLOPs/P90 latency, yet the provided description contains no numerical values, error bars, run counts, or ablation on the conformal threshold versus selector hyperparameters. Without these controls, it is unclear whether the gains survive proper statistical testing or depend on the same data used to fit the gate.

minor comments (2)

[Method] The derivation of pair-level activation in MRE from the two marked entities is described at a high level; an explicit equation or algorithm box would improve clarity.
[Figures/Tables] Figure captions and tables should explicitly state the number of runs and whether error bars represent standard deviation or confidence intervals.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We appreciate the recognition of the potential significance of the selective mechanism and conformal calibration. Below we respond point-by-point to the major comments. We have revised the manuscript to strengthen the empirical validation and reporting of results.

Referee: [Abstract and §3, Conformal Groundability Gate] The central claim of improved F1 with controlled risk and efficiency gains depends on the CGG producing trustworthy activation decisions that transfer beyond the calibration split. The description indicates a conformal-style procedure with Clopper-Pearson upper bounds on a held-out split to set the groundability threshold, but provides no empirical verification that the resulting risk bound holds on the test distribution or under domain shifts common in social-media data. This is load-bearing for the selective advantage over always-on multimodal baselines.

Authors: We thank the referee for underscoring the importance of verifying transfer of the risk bounds. The original manuscript describes the conformal calibration procedure on a held-out split using Clopper-Pearson bounds but does not sufficiently emphasize post-calibration empirical checks. In the revised version we have added Section 4.4, which reports the empirical miscoverage rate on the held-out test set (confirming it remains below the target risk level) and includes controlled domain-shift experiments that partition the data by platform and temporal periods. These results show that the observed risk stays within the calibrated bounds with only modest degradation, which we discuss explicitly. We have also clarified the exchangeability assumptions underlying the conformal guarantee in §3.

Revision: yes

Referee: [Experiments] The abstract reports consistent F1 gains, reduced AURC, increased coverage at fixed risk, and lower FLOPs/P90 latency, yet the provided description contains no numerical values, error bars, run counts, or ablation on the conformal threshold versus selector hyperparameters. Without these controls, it is unclear whether the gains survive proper statistical testing or depend on the same data used to fit the gate.

Authors: We agree that explicit numerical reporting, variability measures, and ablations are necessary to substantiate the claims. Although the full experiments section contains tables, we have substantially expanded them in the revision. The updated tables now report mean F1 scores together with standard deviations computed over five independent runs using distinct random seeds, include error bars on all plots, and provide ablation results that vary the conformal risk level α and the submodular diversity weight λ. We also report paired t-test p-values confirming statistical significance of the improvements over baselines. All experiments use a calibration split that is strictly disjoint from both the training and test sets; this is now stated explicitly.

Revision: yes
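For readers who want to see what the reporting promised in this simulated response involves, here is a small sketch of a mean, standard deviation, and paired t statistic over per-seed scores. The numbers are made-up placeholders, not results from the paper, and the helper name is hypothetical; it uses only the Python standard library and therefore stops at the t statistic instead of computing a p-value.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (same seeds, same order).
    Degrees of freedom are len(a) - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Placeholder per-seed scores for two systems (NOT from the paper).
system_a = [1.0, 2.0, 3.0, 4.0, 5.0]
system_b = [0.5, 1.0, 2.5, 3.0, 4.5]

t = paired_t_statistic(system_a, system_b)
print(f"mean diff = {mean(x - y for x, y in zip(system_a, system_b)):.2f}")
print(f"t = {t:.3f} with {len(system_a) - 1} degrees of freedom")
# For five runs (4 degrees of freedom) the two-sided 5% critical value
# of the t distribution is about 2.776.
```

Pairing by seed matters: the test works on per-seed differences, so variation shared by both systems across seeds cancels out.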

Circularity Check

0 steps flagged

No significant circularity detected; method uses standard held-out calibration

full rationale

The paper describes a Conformal Groundability Gate that calibrates an activation threshold on a held-out split using a conformal-style procedure with Clopper-Pearson bounds, then applies the selector and joint scoring head for empirical evaluation. This is a standard non-circular use of validation data for threshold setting rather than fitting a parameter and relabeling it as a prediction. No self-definitional equations, fitted inputs called predictions, load-bearing self-citations, or ansatz smuggling appear in the provided derivation chain. Experimental claims of F1 improvement, lower AURC, and efficiency gains are presented as direct comparisons against baselines on test data, remaining self-contained against external benchmarks without reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the assumption that visual groundability can be estimated from text and image features alone and that submodular optimization yields a compact yet sufficient evidence set; no new physical entities are postulated.

free parameters (1)

activation threshold
Calibrated on held-out split via conformal procedure; directly controls when vision is consulted.

axioms (1)

standard math: Submodular set functions admit efficient greedy approximation for relevance-diversity trade-off
Invoked when the selector chooses the compact evidence subset.
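For reference, the standard result behind this axiom can be written out as below. It is the classical greedy guarantee for monotone submodular maximization under a cardinality constraint; whether SAVER's selector objective meets the monotonicity and non-negativity conditions is not stated on this page, so applying the bound to it is an assumption.

```latex
% Submodularity (diminishing returns): for all A \subseteq B \subseteq V and x \notin B,
F(A \cup \{x\}) - F(A) \;\ge\; F(B \cup \{x\}) - F(B).

% Greedy guarantee: if F is monotone, submodular, and F(\emptyset) = 0,
% then the set S_k built by k greedy steps satisfies
F(S_k) \;\ge\; \left(1 - \frac{1}{e}\right) \max_{|S| \le k} F(S).
```

The constant is roughly 0.63, so under these conditions the greedy evidence set is guaranteed to reach at least about 63% of the best achievable objective value for the same budget.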

pith-pipeline@v0.9.0 · 5794 in / 1416 out tokens · 29766 ms · 2026-05-21T05:22:23.193758+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: "SAVER uses a Conformal Groundability Gate (CGG) to estimate span-level visual groundability... calibrate the activation threshold on a held-out split via a conformal-style procedure with Clopper–Pearson upper bounds."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "When activated, a submodular relevance–diversity selector chooses a compact evidence subset... aggregated by a Set Transformer."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

