
arxiv: 2605.20273 · v1 · pith:JZ6AALIHnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Modality-Decoupled Online Recursive Editing

Pith reviewed 2026-05-21 08:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords online model editing · multimodal large language models · modality decoupling · recursive editing · Sherman-Morrison recursion · lifelong adaptation · MLLM editing benchmarks

The pith

M-ORE separates text and visual updates in online MLLM editing to cut cross-modal conflicts and long-term interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces M-ORE, a method for editing multimodal large language models as new corrections arrive one by one. Standard editors from text-only models fail here because visual signals overpower the statistics that decide how to change the model, and edits start to interfere with each other over time. M-ORE keeps separate locality statistics for the text parts and the visual projector, then applies all changes inside one fixed low-rank space using a recursive formula that keeps cost constant per edit. If the approach works, models can absorb a stream of fixes without retraining or growing memory use, while preserving accuracy on prior tasks and new inputs alike.

Core claim

M-ORE is derived from a unified proximal-projection formulation and admits a closed-form update with a Sherman-Morrison recursion, yielding constant per-edit overhead. It maintains module-wise locality statistics for the text stack and the visual projector to avoid visually dominated update shaping and performs continual updates in a fixed orthogonal low-rank edit subspace via a Sherman-Morrison recursion to mitigate long-horizon interference.
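The constant per-edit overhead follows from how a Sherman-Morrison recursion works in general. The sketch below is a minimal illustration, not the authors' implementation. It assumes the editor tracks the inverse of a regularized second-moment statistic, lam * I plus a sum of z z^T terms, over r-dimensional projected features; the names `sherman_morrison_update`, `r`, and `lam` are hypothetical.

```python
import numpy as np

def sherman_morrison_update(P, z):
    """Given symmetric P = C^{-1}, return (C + z z^T)^{-1} without a matrix inverse."""
    Pz = P @ z
    denom = 1.0 + z @ Pz
    # P is symmetric, so (P z)(z^T P) equals outer(Pz, Pz).
    return P - np.outer(Pz, Pz) / denom

rng = np.random.default_rng(0)
r, lam = 8, 0.1            # hypothetical subspace rank and ridge weight
P = np.eye(r) / lam        # inverse of the initial statistic lam * I
C = lam * np.eye(r)        # explicit statistic, kept only to check the recursion

for _ in range(50):        # 50 simulated edits with synthetic features
    z = rng.standard_normal(r)
    P = sherman_morrison_update(P, z)   # O(r^2) work, independent of the edit count
    C += np.outer(z, z)

# Largest entrywise gap between the recursive inverse and a direct inversion.
max_err = np.max(np.abs(P - np.linalg.inv(C)))
```

Each step touches only an r × r matrix, so the cost per edit does not grow with the number of earlier edits. Whether M-ORE tracks exactly this quantity is not stated in the text above.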

What carries the argument

Modality-decoupled recursive update that keeps separate locality statistics for text and visual modules and recurses inside one fixed orthogonal low-rank edit subspace.

If this is right

  • Reliability, generality, and locality all rise over strong baselines on several MLLM backbones.
  • Quality-efficiency trade-off improves because each edit costs constant compute and memory.
  • Cross-modal conflicts shrink by isolating visual projector statistics from text-stack statistics.
  • Long-horizon interference drops because every update stays inside the same orthogonal low-rank space.
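One way to picture the last bullet is a weight change that is always expressed in a fixed orthonormal basis. The following is a sketch under stated assumptions, not M-ORE's actual update rule: the basis `U`, the layer sizes, and the random stand-in edit directions are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 16, 4    # hypothetical layer sizes and edit rank

# Fixed orthonormal basis for the write space: drawn once, never changed.
U, _ = np.linalg.qr(rng.standard_normal((d_out, r)))

W0 = rng.standard_normal((d_out, d_in))   # stand-in for the pretrained weight
A = np.zeros((r, d_in))                   # edit coefficients: the only growing-free state

for _ in range(20):                       # 20 simulated sequential edits
    raw = 0.01 * rng.standard_normal((d_out, d_in))  # stand-in for a raw edit direction
    A += U.T @ raw                        # keep only the coordinates inside span(U)

delta = U @ A                             # total weight change after all edits
W = W0 + delta
outside = delta - U @ (U.T @ delta)       # part of the change outside span(U)
```

In this toy setting the accumulated change has rank at most `r` and no component outside the span of `U`, and the editor state stays at `r * d_in` numbers however many edits arrive.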

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of statistics could be tested on other multimodal architectures that combine language with images or video.
  • If the fixed subspace remains effective after hundreds of edits, the method might support continuous deployment without periodic full resets.
  • The recursive form might let practitioners swap in different locality measures without changing the overall update rule.

Load-bearing premise

Keeping module-wise statistics separate for text and vision is enough to stop visual signals from dominating the update direction, and confining all edits to one fixed low-rank subspace prevents interference from building up across many corrections.
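The first half of this premise can be illustrated with synthetic numbers. The sketch below assumes visual features have ten times the scale of text features (an arbitrary choice made only for illustration, not a figure from the paper) and compares a pooled second-moment statistic with per-module ones. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200    # hypothetical feature width and sample count

text_feats = 1.0 * rng.standard_normal((n, d))     # low-energy text activations (synthetic)
visual_feats = 10.0 * rng.standard_normal((n, d))  # high-energy visual activations (synthetic)

def second_moment(x):
    """Uncentered second-moment matrix of the rows of x."""
    return x.T @ x / x.shape[0]

pooled = second_moment(np.vstack([text_feats, visual_feats]))
text_only = second_moment(text_feats)
visual_only = second_moment(visual_feats)

# Fraction of the pooled statistic's total energy (trace) contributed by visual features.
visual_share = np.trace(visual_only) / (np.trace(text_only) + np.trace(visual_only))
```

Under this assumed scale gap the visual share comes out close to 0.99, so any update shaped by the pooled statistic is governed almost entirely by visual features. Keeping `text_only` and `visual_only` as separate statistics removes that coupling by construction; whether this is sufficient in a real MLLM is exactly what the premise leaves open.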

What would settle it

A controlled run on an MLLM where a long sequence of edits is applied and locality scores drop or interference rises at the same rate as in baseline editors despite using the decoupled statistics and fixed subspace.

Figures

Figures reproduced from arXiv: 2605.20273 by Fangming Liu, Jing Li, Siyuan Li, Youyuan Zhang.

Figure 1
Figure 1: Overview of M-ORE and its contrast to existing online MLLM editing paradigms. M-ORE decouples vision/text updates and performs continual writes in a fixed orthogonal low-rank space.
… 2024; Zhu et al., 2026; Liang et al., 2026; Shi et al., 2026; Lu et al., 2026). However, their knowledge is encoded in parameters and thus remains static after training, reducing reliability as real-world facts evolve (Jiang …
Figure 2
Figure 2: Cross-modal conflict caused by modality mismatch (BLIP2-OPT). Left: log-variance density indicates higher-energy visual features than text activations. Right: attribution analysis shows visually dominated contributions, implying that shared global statistics bias updates toward the visual subspace and weaken textual preservation. Results for other MLLMs are provided in the Appendix B.1.
Current solutions l…
Figure 3
Figure 3: Inter-edit interference induced by MEND on BLIP2-OPT over 10 online sequential edits on the E-IC test set.
Energy dominance and propagation. We further measure layer-wise output energy averaged over the same edits
Figure 4
Figure 4: Sequential visualization of MEND-induced hidden-state shifts on BLIP2-OPT across 100 consecutive locality samples.
Motivated by these observations, sequential updates should be geometrically de-entangled. We thus restrict edits to a fixed orthogonal low-rank write space and apply a recursive rule with an implicit orthogonalization bias, discouraging repeated reuse of heavily-occupied coordinates and mitig…
Figure 5
Figure 5: Inter-edit interference statistics of M-ORE on BLIP2-OPT over 10 online sequential edits.
…tures and scales: BLIP2-OPT (2.7B) (Li et al., 2023) and LLaVA-v1.5 (7B) (Liu et al., 2024). Since dedicated online MLLM editors remain limited, we follow MMEdit-style evaluation (Cheng et al., 2023) and adapt widely-used LLM editors for comparison. We group baselines into parameter-modifying methods (FT-L/FT-M (Che…
Figure 6
Figure 6: Accuracy of the post-edited LLaVA-v1.5 (7B) on six tasks used for MLLM general capability testing.
… constraint and allows the low-rank write space to change over the edit stream. • w/o pooling: This variant replaces the rank-one sketch in Eq. (10) with the full projected statistic $\sum_{x \in B^{(l)}_{t-1}} z^{(l)}_{t-1}(x)\, z^{(l)}_{t-1}(x)^\top$, and then updates $P^{(l)}_t$
Figure 8
Figure 8: Sequential visualization of hidden-state shifts on LLaVA across 500 consecutive locality samples.
… monotonic growth, indicating bounded editor state under a constant-size buffer. 5.5. Representation Shift Analysis (RQ4). We evaluate overfitting by measuring the hidden-state distribution shift on locality samples after sequential edits. Since these samples are unrelated to the edited requests, an effective …
Figure 7
Figure 7: Efficiency metrics of editors on LLaVA-v1.5 over the online edit sequence t. Top: Edit latency (raw and rolling mean). Bottom: Incremental peak memory (∆ peak mem).
…dation observed in other baselines. In contrast, parameter-modifying baselines suffer clear catastrophic forgetting under long horizons: MEND drops to near-zero accuracy on Existence/Count after 100 edits, and FT-M exhibits a steady decline a…
Figure 9
Figure 9: Cross-modal conflict caused by modality mismatch (LLaVA-v1.5). Left: log-variance density indicates higher-energy visual features than text activations. Right: attribution analysis shows visually dominated contributions, implying that shared global statistics bias updates toward the visual subspace and weaken textual preservation.
Figure 10
Figure 11
Figure 11: Sensitivity analysis of the LoRA rank
Figure 12
Figure 12: Sensitivity analysis of the regularization weight λ.
Figure 13
Figure 13: E-IC case studies (challenging visual-understanding edits). Top two panels: Group 1; bottom two panels: Group 2. In each group, the upper panel is the edit sample and the lower panel is the corresponding locality sample.
Figure 14
Figure 14: E-IC case studies (challenging visual-understanding edits). Top two panels: Group 1; bottom two panels: Group 2. In each group, the upper panel is the edit sample and the lower panel is the corresponding locality sample.
Figure 15
Figure 15: E-VQA case studies. Top two panels: Group 1; bottom two panels: Group 2. In each group, the upper panel is the edit sample and the lower panel is the corresponding locality sample.
Figure 16
Original abstract

Online model editing for multimodal large language models (MLLMs) requires assimilating a stream of corrections under tight compute and memory budgets. Yet editors developed for text-only LLMs often degrade on MLLMs: visually dominant activations skew the statistics that shape updates, causing cross-modal conflict, while sequential writes become entangled in a shared edit space and amplify long-horizon interference, causing inter-edit interference. To address these, we propose M-ORE, a modality-decoupled online recursive editor for lifelong MLLM adaptation. M-ORE is derived from a unified proximal-projection formulation and admits a closed-form update with a Sherman-Morrison recursion, yielding constant per-edit overhead. It maintains module-wise locality statistics for the text stack and the visual projector to avoid visually dominated update shaping and performs continual updates in a fixed orthogonal low-rank edit subspace via a Sherman-Morrison recursion to mitigate long-horizon interference. Experiments on multiple MLLM backbones and online editing benchmarks show that our M-ORE method consistently improves reliability, generality, and locality over strong baselines, while achieving favorable quality-efficiency scaling. Our code is publicly available at https://github.com/lab-klc/M-ORE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces M-ORE, a modality-decoupled online recursive editor for lifelong adaptation of multimodal large language models. Derived from a proximal-projection formulation, it provides a closed-form update via Sherman-Morrison recursion with constant per-edit cost. The method maintains separate locality statistics for the text stack and visual projector to counteract visually dominated shaping and performs continual updates inside a fixed orthogonal low-rank edit subspace to reduce long-horizon inter-edit interference. Experiments across multiple MLLM backbones and online editing benchmarks report consistent gains in reliability, generality, and locality together with favorable quality-efficiency scaling.

Significance. If the central claims are substantiated, the work supplies an efficient, constant-overhead mechanism for online editing of MLLMs that explicitly separates modalities and controls subspace drift. The closed-form recursive update, public code release, and multi-backbone evaluation constitute concrete strengths that would support reproducibility and practical deployment.

major comments (2)
  1. [§3] §3 (proximal-projection derivation and Sherman-Morrison recursion): the argument that a fixed orthogonal low-rank subspace suffices to eliminate long-horizon interference rests on the unstated assumption that the initial basis vectors remain linearly independent from all subsequent edit directions across both text and visual modules. No re-orthogonalization step or rank-adaptation mechanism is described; when visual activations dominate the covariance, even modest drift can re-introduce the cross-modal conflict the method claims to avoid.
  2. [Experimental section] Experimental section (locality metrics and benchmark protocol): the reported improvements in locality are presented without accompanying statistical significance tests or explicit description of the data-exclusion rules used to compute the metrics. Because the central claim of reduced inter-edit interference depends on these quantities, the absence of verifiable protocol details prevents independent confirmation of the gains.
minor comments (2)
  1. [Abstract] The abstract states that the method achieves 'favorable quality-efficiency scaling' yet does not define the concrete quality and efficiency metrics plotted in the scaling figures.
  2. [§3.1] Notation for the module-wise locality statistics (text stack versus visual projector) is introduced without an explicit equation linking the two separate covariance estimates to the final update direction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating planned revisions to strengthen the work.

Point-by-point responses
  1. Referee: [§3] §3 (proximal-projection derivation and Sherman-Morrison recursion): the argument that a fixed orthogonal low-rank subspace suffices to eliminate long-horizon interference rests on the unstated assumption that the initial basis vectors remain linearly independent from all subsequent edit directions across both text and visual modules. No re-orthogonalization step or rank-adaptation mechanism is described; when visual activations dominate the covariance, even modest drift can re-introduce the cross-modal conflict the method claims to avoid.

    Authors: We appreciate the referee highlighting this implicit assumption in the proximal-projection derivation. The fixed orthogonal low-rank subspace is initialized once from the initial edit directions and all subsequent updates are performed strictly inside it via the Sherman-Morrison recursion, which by design keeps new edits orthogonal to prior ones within the subspace. Modality decoupling maintains separate statistics for text and visual modules precisely to limit visual dominance from propagating into the shared subspace. We agree that sustained linear independence across modules is assumed without explicit re-orthogonalization (to preserve constant overhead and the fixed-subspace property). In the revision we will expand §3 to state this assumption explicitly, discuss its validity under the low-rank constraint, and note the theoretical possibility of drift in extreme cases. revision: yes

  2. Referee: [Experimental section] Experimental section (locality metrics and benchmark protocol): the reported improvements in locality are presented without accompanying statistical significance tests or explicit description of the data-exclusion rules used to compute the metrics. Because the central claim of reduced inter-edit interference depends on these quantities, the absence of verifiable protocol details prevents independent confirmation of the gains.

    Authors: We agree that statistical significance testing and transparent protocol details are necessary to substantiate the locality claims. In the revised manuscript we will add appropriate statistical tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) to the locality results across benchmarks. We will also provide a clear description of the data-exclusion criteria, exact metric computation formulas, and full benchmark protocol in the experimental section or a new appendix subsection to enable independent verification. revision: yes

Circularity Check

0 steps flagged

Derivation relies on standard proximal projection and Sherman-Morrison identity; no reduction to inputs

full rationale

The paper states that M-ORE is derived from a unified proximal-projection formulation and admits a closed-form update with Sherman-Morrison recursion. Sherman-Morrison is an external, well-known linear algebra identity independent of the paper's data or fitted parameters. Module-wise locality statistics and the fixed orthogonal low-rank subspace are explicit design choices motivated by the problem of cross-modal interference, not quantities defined in terms of the target predictions or fitted to the evaluation benchmarks. No self-citations, ansatzes, or uniqueness theorems from prior author work are invoked as load-bearing justifications in the provided text. The central claims rest on the proposed mechanisms plus experimental results on external benchmarks, keeping the derivation self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard matrix identities and the assumption that separate per-module statistics plus orthogonal subspace constraints suffice to control interference.

free parameters (1)
  • edit subspace rank
    Dimensionality of the fixed orthogonal low-rank edit space; chosen to trade off locality against interference mitigation.
axioms (1)
  • [standard math] Sherman-Morrison formula yields exact rank-one update recursion
    Invoked to obtain closed-form constant-overhead updates from the proximal-projection objective.
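
For reference, the identity this axiom relies on can be written out. This is the standard textbook statement, not a formula quoted from the paper, and the symbols $A$, $u$, $v$, $P$, and $z$ are generic placeholders rather than the paper's notation. For an invertible matrix $A$ and vectors $u$, $v$ with $1 + v^\top A^{-1} u \neq 0$:

```latex
\[
  (A + u v^\top)^{-1}
  = A^{-1} - \frac{A^{-1} u \, v^\top A^{-1}}{1 + v^\top A^{-1} u}.
\]
% Symmetric special case with u = v = z and P = A^{-1}:
\[
  P_{\text{new}} = P - \frac{P z \, z^\top P}{1 + z^\top P z}.
\]
```

The update is exact, so no approximation error accumulates from the recursion itself; for a symmetric positive-definite $A$ and $u = v = z$, the denominator is always at least 1.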

pith-pipeline@v0.9.0 · 5737 in / 1196 out tokens · 33783 ms · 2026-05-21T08:41:58.108686+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 10 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631,

  2. [2]

    Lifelong knowledge editing for LLMs with retrieval-augmented continuous prompt learning

    Chen, Q., Zhang, T., He, X., Li, D., Wang, C., Huang, L., and Xue, H. Lifelong knowledge editing for LLMs with retrieval-augmented continuous prompt learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 13565–13580,

  3. [3]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Chen, Q., Wang, C., Wang, D., Zhang, T., Li, W., and He, X. Lifelong knowledge editing for vision language models with low-rank mixture-of-experts. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 9455–9466, 2025a. Chen, Q., Zhang, T., Wang, C., He, X., Wang, D., and Liu, T. Attribution analysis meets model editing: Ad- ...

  4. [4]

    Can we edit multimodal large language models?

    Cheng, S., Tian, B., Liu, Q., Chen, X., Wang, Y., Chen, H., and Zhang, N. Can we edit multimodal large language models? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 13877–13888,

  5. [5]

    Editing factual knowledge in language models

    De Cao, N., Aziz, W., and Titov, I. Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP),

  6. [6]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 4171–4186,

  7. [7]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., Ji, R., Shan, C., and He, R. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394,

  8. [8]

    Transformer feed-forward layers are key-value memories

    Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5484–5495,

  9. [9]

    Model editing harms general abilities of large language models: Regularization to the rescue

    Gu, J.-C., Xu, H.-X., Ma, J.-Y., Lu, P., Ling, Z.-H., Chang, K.-W., and Peng, N. Model editing harms general abilities of large language models: Regularization to the rescue. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 16801–16819,

  10. [10]

    Llms meet multimodal generation and editing: A survey

    He, Y., Liu, Z., Chen, J., Tian, Z., Liu, H., Chi, X., Liu, R., Yuan, R., Xing, Y., Wang, W., Dai, J., Zhang, Y., Xue, W., Liu, Q., Guo, Y., and Chen, Q. Llms meet multimodal generation and editing: A survey. arXiv preprint arXiv:2405.19334,

  11. [11]

    Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

    Liang, G., Wang, Z., Hu, J., Zhou, H., Xue, Z., Zhang, J., Xu, D., and Yu, Q. Render-in-the-loop: Vector graphics generation via visual self-feedback. arXiv preprint arXiv:2604.20730,

  12. [12]

    Video-llava: Learning united visual representation by alignment before projection

    Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., and Yuan, L. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5971–5984,

  13. [13]

    Chordedit: One-step low-energy transport for image editing

    Lu, L., Chen, X., Guo, M., Li, S., Wang, J., and Shi, Y. Chordedit: One-step low-energy transport for image editing. arXiv preprint arXiv:2602.19083,

  14. [14]

    Mitchell, E., Lin, C., Bosselut, A., Finn, C., and Manning, C. D. Fast model editing at scale. In Proceedings of the International Conference on Learning Representations (ICLR), 2022a. Mitchell, E., Lin, C., Bosselut, A., Manning, C. D., and Finn, C. Memory-based model editing at scale. In Proceedings of the International Conference on Machine Learning (ICM...

  15. [15]

    MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models

    Shi, Y., Xie, Y., Guo, M., Lu, L., Huang, M., Wang, J., Zhu, Z., Xu, B., and Huang, Z. Mmerror: A benchmark for erroneous reasoning in vision-language models. arXiv preprint arXiv:2601.03331,

  16. [16]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  17. [17]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024a. Wang, P., Li, Z., Zhang, N., Xu, Z., Yao, Y., Jiang, Y., Xie, P., Huang, F., and Chen, H. WISE: rethinking the knowledge memory...

  18. [18]

    A Comprehensive Study of Knowledge Editing for Large Language Models

    Zhang, N., Yao, Y., Tian, B., Wang, P., Deng, S., Wang, M., Xi, Z., Mao, S., Zhang, J., Ni, Y., Cheng, S., Xu, Z., Xu, X., Gu, J.-C., Jiang, Y., Xie, P., Huang, F., Liang, L., Zhang, Z., Zhu, X., Zhou, J., and Chen, H. A comprehensive study of knowledge editing for large language models. arXiv preprint arXiv:2401.01286,

  19. [19]

    OPT: Open Pre-trained Transformer Language Models

    Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068,

  20. [20]

    A Survey of Large Language Models

    Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223,

  21. [21]

    Can we edit factual knowledge by in-context learning?

    Zheng, C., Li, L., Dong, Q., Fan, Y., Wu, Z., Xu, J., and Chang, B. Can we edit factual knowledge by in-context learning? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4862–4876,

  22. [22]

    Appendix A. Experimental Setup. In this section, we provide detailed descriptions of the experimental setup, including an introduction to the datasets, an explanation of the evaluation metrics and editing objective, a discussion of the baseline methods, and implementation details. A.1. Datasets • E-VQA (Cheng et al., 2023): Designed fo...

  23. [23]

    Visual generality is assessed using reinterpreted images generated by Stable Diffusion 2.1 (Rombach et al., 2022)

    for E-VQA and manually written prompt templates for E-IC. Visual generality is assessed using reinterpreted images generated by Stable Diffusion 2.1 (Rombach et al., 2022). – Locality: Textual locality is evaluated using NQ (Kwiatkowski et al., 2019), and multimodal locality is evaluated using OK-VQA (Marino et al., 2019), measuring whether unrelated knowl...

  24. [24]

    IKE. In-Context Knowledge Editing (Zheng et al.,

    and the counterfactual model with OPT-125M (Zhang et al., 2022). IKE. In-Context Knowledge Editing (Zheng et al.,

  25. [25]

    Given a target fact pair (x∗, y∗), IKE retrieves k demonstrations C= {c1,

    performs editing by retrieval-augmented in-context prompting, without directly updating model parameters. Given a target fact pair (x∗, y∗), IKE retrieves k demonstrations C= {c1, . . . , ck} from a training set using an unsupervised retriever (e.g., cosine similarity), and concatenates them as in-context examples to guide generation. The demonstrations ar...

  26. [26]

    AlphaEdit configuration and multimodal K0. For AlphaEdit, we adopt the hyperparameters recommended in the original paper (Fang et al., 2025)

    to ensure a fair comparison under identical edit scopes. AlphaEdit configuration and multimodal K0. For AlphaEdit, we adopt the hyperparameters recommended in the original paper (Fang et al., 2025). To estimate the retain key set K0 for MLLMs, we build K0 using samples from E-VQA and E-IC (Cheng et al., 2023), so that both visual and textual knowledge are ...

  27. [27]

    contains 14 evaluation categories. The main paper reports six representative tasks for brevity, while we provide the remaining eight categories here to complete the benchmark: • Artwork: evaluates understanding of artistic images and stylized visual content. • Celebrity: tests recognition and reasoning about well-known public figures in images. • Color: meas...

  28. [28]

    As shown, flipped cases tend to have smaller pre-edit top1–top2 margins, indicating that they lie near fragile decoding boundaries and can be affected by small distributional shifts. For multimodal locality, flipped cases also show higher image similarity to the edited sample, suggesting that imperfect visual irrelevance and residual visual coupling may con...

  29. [29]

    Rows shaded in light purple indicate parameter-modifying methods

    denotes the number of online edits performed. Rows shaded in light purple indicate parameter-modifying methods. Model Methods E-VQA E-IC Rel. T-Gen. M-Gen. T-Loc. M-Loc. Avg. Rel. T-Gen. M-Gen. T-Loc. M-Loc. Avg. BLIP2-OPT FT-L1 100.00 100.00 60.00 94.74 100.00 90.95 96.77 95.02 90.72 90.05 68.27 88.16 FT-M1 100.00 96.67 63.33 100.00 73.33 86.67 100.00 100....

  30. [30]

    In each group, the upper panel is the edit sample and the lower panel is the corresponding locality sample. Orig. Base L1 L16 L31 Target: Goodfellas. Prompt: What Hollywood movie is one of the food dishes named after? Answer (Before Editing): One of the food dishes named after a Hollywood movie is the Pizza H...