Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

Runyuan Cai; Xiaodong Zeng; Yiming Wang; Yu Lin

arxiv: 2605.20309 · v1 · pith:KI63SAQN · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-21 07:42 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords concept personalization · trigger-based memory · text-to-image generation · diffusion models · modular adaptation · visual identity · video generation · n-gram indexing

The pith

Tiny-Engram stores new visual concepts in small tables that activate only when a registered word phrase appears in the prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Tiny-Engram as a compact way to add specific visual identities to already-trained image and video generators. Each concept lives in a small table of memory entries tied to registered n-gram triggers. When a matching phrase occurs, the table adjusts only the text-encoder hidden states inside that region, while every other part of the conditioning stays exactly as in the frozen model. The intent is to preserve how the rest of the prompt composes with the new concept. The paper reports reliable binding for still images and only partial success in video: the trigger changes the generated subject, but fine-grained identity persistence across held-out video prompts remains limited.

Core claim

Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches. These entries modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support the conditioning pathway remains identical to that of the frozen base model. The formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt, and the same table structure is tested in text-conditioned video generation.
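
The mechanism in this claim can be illustrated with a minimal sketch. It is not the authors' implementation: the names `ConceptTable`, `find_trigger_spans`, and `apply_concept_table`, the use of token-id tuples as keys, and the additive form of the modulation are all assumptions made for illustration. The sketch only shows the stated property that hidden states change inside a matched n-gram span and stay identical everywhere else.

```python
from typing import Dict, List, Tuple

Vector = List[float]

# Hypothetical registry: each key is an n-gram of token ids, and each value is
# the learned memory vector owned by that key (names are illustrative only).
ConceptTable = Dict[Tuple[int, ...], Vector]


def find_trigger_spans(
    token_ids: List[int], table: ConceptTable
) -> List[Tuple[int, int, Tuple[int, ...]]]:
    """Return (start, end, key) for every exact n-gram match; end is exclusive."""
    spans = []
    for key in table:
        n = len(key)
        for start in range(len(token_ids) - n + 1):
            if tuple(token_ids[start:start + n]) == key:
                spans.append((start, start + n, key))
    return spans


def apply_concept_table(
    token_ids: List[int],
    hidden: List[Vector],
    table: ConceptTable,
    scale: float = 1.0,
) -> List[Vector]:
    """Add the matched memory vector to hidden states inside trigger spans only.

    Additive modulation is an assumption of this sketch. Positions outside
    every matched span are returned with their original values.
    """
    edited = [list(h) for h in hidden]  # copy so the frozen states stay untouched
    for start, end, key in find_trigger_spans(token_ids, table):
        memory = table[key]
        for pos in range(start, end):
            edited[pos] = [h + scale * m for h, m in zip(edited[pos], memory)]
    return edited


# Toy prompt of five tokens with hidden size 3; the bigram (7, 8) is the trigger.
tokens = [1, 7, 8, 2, 3]
states = [[0.0, 0.0, 0.0] for _ in tokens]
table = {(7, 8): [0.5, -0.5, 1.0]}

with_trigger = apply_concept_table(tokens, states, table)
# Same tokens, but 7 and 8 are no longer adjacent, so nothing should fire.
without_trigger = apply_concept_table([1, 7, 2, 3, 8], states, table)
```

In this toy setup only positions 1 and 2 of `with_trigger` differ from `states`, and `without_trigger` equals `states` because the bigram never appears contiguously. Whether a partial or reordered match should fire is a registration policy that the summary above does not specify.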

What carries the argument

The trigger-indexed concept table, a small collection of memory entries that activates solely on matching short word sequences to alter only the relevant text-encoder hidden states.

Load-bearing premise

Adjusting the text-encoder states only inside the trigger phrase region is enough to attach the new visual concept without disturbing how the base model handles every other part of the prompt.

What would settle it

A prompt that includes the registered trigger phrase produces images or video frames whose subject matches the trained identity, while the identical prompt without the trigger phrase produces the base model's original output for the same subject description.

Figures

Figures reproduced from arXiv: 2605.20309 by Runyuan Cai, Xiaodong Zeng, Yiming Wang, Yu Lin.

**Figure 2.** SD1.5 qualitative comparison. Each panel pairs the frozen base output on the left with the Engram-wrapped … [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]

**Figure 3.** SD3.5 qualitative comparison. Each panel pairs the frozen base output on the left with the Engram-wrapped … [PITH_FULL_IMAGE:figures/full_fig_p007_3.png]

**Figure 4.** Video training clips generated by frozen Wan2.2 TI2V. Each row is a frame contact sheet from one short clip … [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

**Figure 5.** Wan2.2 train-prompt comparison at step 28000. The top row shows frozen-base Wan2.2, and the bottom … [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

**Figure 6.** Wan2.2 held-out prompt comparisons at step 28000. In each two-row strip, the top row shows frozen-base … [PITH_FULL_IMAGE:figures/full_fig_p009_6.png]

**Figure 7.** Conditioning-path comparison for the three vision backbones. Color key: Engram edit (blue), visual state … [PITH_FULL_IMAGE:figures/full_fig_p010_7.png]

Original abstract

Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether and when a concept should be retrieved. In this work, we introduce Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary inside frozen image and video generators. Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model. Across both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones, this formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt. We further evaluate the same table-based memory in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out video prompts remains limited. Taken together, these results suggest that small, explicitly addressed concept tables are a practical route to modular visual personalization, with strongest evidence in image generation. For video diffusion, the remaining gap points to a broader requirement: temporally stable identity likely depends on tighter coupling between text-side memory and the evolving visual state, motivating future work on memory injection beyond the text-conditioning interface.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Tiny-Engram adds explicit trigger-indexed tables to frozen generators for modular concept control, but the claims rest on description alone with no supporting numbers.

The main takeaway is that this paper gives frozen image and video generators a small table of memory entries tied to n-gram triggers. The table modulates text-encoder states only inside the matched region and leaves the rest of the conditioning path unchanged from the base model. This setup aims to bind a rare phrase to a target identity while keeping compositional control from the surrounding prompt intact. The approach is presented as distinct from continuous adapters or weight updates because it supplies an explicit lexical address and activation boundary.

The authors test it on both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones and note that the same table works for subject alteration in video but delivers only limited fine-grained identity persistence across held-out prompts. The design is straightforward and the paper is direct about the video shortfall, which it links to the need for tighter coupling between text memory and evolving visual states.

The clearest limitation is the total lack of quantitative results. No metrics, baselines, error bars, or exclusion criteria appear, so it is impossible to judge how reliably the local modulation actually isolates the concept or preserves prompt control. The stress-test concern about global cross-attention mixing signals across the full token sequence looks plausible here, since the reported video weakness is consistent with identity information leaking or failing to propagate.

This paper is aimed at researchers who want lightweight, addressable ways to customize generative models without retraining the whole network. A reader interested in modular memory mechanisms could extract the core idea even if the current evidence is thin. It deserves peer review so the mechanism can be tested properly and the localization assumption can be checked against actual measurements.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Tiny-Engram, a compact trigger-indexed concept table for modular personalization in frozen generative vision models. Each concept is represented as a small set of memory entries indexed by registered n-gram matches; these entries modulate text-encoder hidden states exclusively inside the matched trigger region while leaving the conditioning pathway outside this lexical support identical to the frozen base model. The formulation is evaluated on single-encoder latent diffusion and multi-encoder diffusion-transformer backbones for image generation, where it is claimed to bind a rare trigger phrase to a target identity while preserving compositional control. The same table-based memory is tested in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out prompts remains limited. The authors conclude that small, explicitly addressed concept tables constitute a practical route to modular visual personalization, with strongest evidence in image generation.

Significance. If the central claims are substantiated, the work offers a lightweight, explicitly addressable mechanism for concept retrieval that avoids continuous adapters or weight updates, potentially improving modularity and control in personalization pipelines. The explicit lexical boundary and preservation of the frozen conditioning pathway outside the trigger region represent a distinct formulation from existing methods. The video results usefully surface a concrete limitation—temporally stable identity likely requires tighter coupling to the evolving visual state—thereby motivating targeted follow-up research.

major comments (2)

[Abstract] Abstract and method description: The claim that modulating text-encoder hidden states exclusively within the matched trigger region binds the target identity while the remainder of the conditioning pathway remains identical to the frozen model is load-bearing for the central contribution. Because text-encoder outputs feed cross-attention layers that operate over the full token sequence, identity information localized to trigger positions may fail to propagate reliably to visual tokens or may leak into non-trigger regions; the reported limitation in fine-grained identity persistence for video is consistent with this risk and requires explicit analysis or mitigation.
[Evaluation] Evaluation sections: The manuscript provides no quantitative metrics, baselines, error bars, or detailed exclusion criteria for the image or video experiments. Without these, it is impossible to assess the strength of support for the claims of reliable subject alteration and preserved compositional control, which are central to the practical-route conclusion.
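
The first major comment turns on a general property of cross-attention, which a toy example can make concrete. The sketch below is an illustration under strong simplifications, not the paper's architecture: it uses a single head, identity projections, and text states that act as both keys and values. All names and numbers are hypothetical. It edits one text position and measures how much each visual query's output moves.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, text_states: Matrix) -> Matrix:
    """Single-head cross-attention with identity projections (a simplification).

    The text states serve as both keys and values.
    """
    dim = len(text_states[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in text_states
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, text_states)) for d in range(dim)]
        )
    return outputs


# Three text tokens; only position 1 (the hypothetical trigger) is edited.
base_text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edited_text = [row[:] for row in base_text]
edited_text[1] = [0.0, 3.0]

# Two visual queries standing in for different image regions.
visual_queries = [[1.0, 0.0], [0.5, 0.5]]

base_out = cross_attention(visual_queries, base_text)
edited_out = cross_attention(visual_queries, edited_text)

# Per-query L1 change caused by the single-token edit.
shift = [
    sum(abs(a - b) for a, b in zip(o1, o2))
    for o1, o2 in zip(base_out, edited_out)
]
```

Both entries of `shift` are non-zero in this toy case, because the softmax runs over every text position. A text-side edit that is strictly local can therefore still move every visual token's conditioning. That is the route by which the identity signal is meant to propagate, and also the route by which it could leak, which is why the comment asks for explicit analysis.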

minor comments (2)

[Method] Clarify the precise definition and registration procedure for n-gram triggers, including any edge cases for partial matches or prompt variations.
[Implementation] Add a brief discussion of computational overhead introduced by the concept table lookup and modulation step relative to the base model.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the work.

Referee: [Abstract] Abstract and method description: The claim that modulating text-encoder hidden states exclusively within the matched trigger region binds the target identity while the remainder of the conditioning pathway remains identical to the frozen model is load-bearing for the central contribution. Because text-encoder outputs feed cross-attention layers that operate over the full token sequence, identity information localized to trigger positions may fail to propagate reliably to visual tokens or may leak into non-trigger regions; the reported limitation in fine-grained identity persistence for video is consistent with this risk and requires explicit analysis or mitigation.

Authors: We agree that cross-attention operates over the full sequence and that propagation of trigger-localized information merits explicit examination. Our formulation keeps non-trigger hidden states identical to the frozen model precisely to preserve compositional control, while the modulated trigger tokens supply the identity signal that influences downstream visual tokens via attention. The video limitation we already report is consistent with the need for tighter visual-state coupling, which we discuss as future work. In the revision we will add attention-map visualizations and a targeted ablation demonstrating the extent of propagation versus leakage.
revision: partial

Referee: [Evaluation] Evaluation sections: The manuscript provides no quantitative metrics, baselines, error bars, or detailed exclusion criteria for the image or video experiments. Without these, it is impossible to assess the strength of support for the claims of reliable subject alteration and preserved compositional control, which are central to the practical-route conclusion.

Authors: The referee correctly notes the absence of quantitative evaluation. Although the present manuscript emphasizes qualitative demonstrations of modularity and control, we will add quantitative results to the revised version. These will include CLIP similarity scores for subject fidelity and prompt adherence, comparisons to standard personalization baselines, error bars computed over multiple random seeds, and explicit prompt-selection and exclusion criteria.
revision: yes
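
As a rough sketch of what the promised reporting could look like, the snippet below computes a cosine-similarity score per seed and summarizes it as a mean with a sample standard deviation. The helper names and the placeholder vectors are hypothetical. In practice the embeddings would come from an image encoder such as CLIP, and none of the numbers here are results from the paper.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def summarize_over_seeds(
    reference: Sequence[float], per_seed_embeddings: List[Sequence[float]]
) -> Tuple[float, float]:
    """Return (mean, sample std) of similarity to the reference across seeds."""
    scores = [cosine_similarity(reference, e) for e in per_seed_embeddings]
    mean = statistics.mean(scores)
    # stdev needs at least two samples; report 0.0 for a single seed.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Placeholder vectors standing in for image embeddings from three seeds.
reference_embedding = [1.0, 0.0, 0.0]
generated_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]

mean_score, std_score = summarize_over_seeds(reference_embedding, generated_embeddings)
```

The same summary could be computed once against reference images of the target identity (subject fidelity) and once against the prompt text embedding (prompt adherence), which are the two axes named in the response.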

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces Tiny-Engram as an explicit new formulation: a compact trigger-indexed concept table that parameterizes concepts via n-gram-matched memory entries modulating text-encoder hidden states only inside the matched region, with the pathway outside that region identical to the frozen base model. No equations, derivations, or fitted parameters are presented that reduce the central claims to self-definitions, renamed inputs, or self-citation chains by construction. The method is described as a design choice evaluated empirically across standard latent diffusion and diffusion-transformer backbones, with results reported directly from those evaluations rather than tautological reductions. This keeps the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The abstract supplies limited technical detail; the central claim rests on the untested assumption that localized text-encoder modulation suffices for binding without side effects.

free parameters (1)

memory entry count per concept
Each concept is parameterized as a small set of memory entries; exact count or size is not specified.

axioms (1)

domain assumption: Modulating text-encoder hidden states only inside the matched trigger region preserves full compositional control from the surrounding prompt.
Invoked when the abstract states that outside the lexical support the conditioning pathway is identical to the frozen base model.

invented entities (1)

Tiny-Engram trigger-indexed concept table (no independent evidence)
purpose: To give visual memories an explicit lexical address and activation boundary.
New structure introduced to achieve modular personalization.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: The registry entry is a key-token pair $k \mapsto z^{m}_{i:i+n-1}$, and each key owns a learned memory vector $e^{m}_{k}$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 7 internal anchors

[1]

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George B. van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, ...

work page 2022
[2]

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. VideoCrafter1: Open diffusion models for high-quality video generation.arXiv preprint arXiv:2310.19512, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, and Wenfeng Liang. Conditional memory via scalable lookup: A new axis of sparsity for large language models.arXiv preprint arXiv:2601.07372, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. InInternational Conference on Learning Representations, 2024

work page 2024
[7]

REALM: Retrieval-augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. InProceedings of the International Conference on Machine Learning, 2020

work page 2020
[8]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

work page 2021
[9]

Parameter-efficient transfer learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. InProceedings of the International Conference on Machine Learning, 2019

work page 2019
[10]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models.International Conference on Learning Representa- tions, 2022

work page 2022
[11]

OpenCLIP.https://github.com/mlfoundations/open_clip, 2021

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP.https://github.com/mlfoundations/open_clip, 2021

work page 2021
[12]

Generalization through memorization: Nearest neighbor language models

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. InInternational Conference on Learning Representations, 2020

work page 2020
[13]

Multi-concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

work page 2023
[14]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2021

work page 2021
[15]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems, 2020

work page 2020
[16]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. InProceedings of the Annual Meeting of the Association for Computational Linguistics, 2021

work page 2021
[17]

TruthfulQA: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2022. 11 Tiny-EngramA PREPRINT

work page 2022
[18]

Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. InAdvances in Neural Information Processing Systems, 2022

work page 2022
[19]

Learning transferable visual models from natural language supervision.International Conference on Machine Learning, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision.International Conference on Machine Learning, 2021

work page 2021
[20]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

work page 2020
[21]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

work page 2023
[22]

GPT-5.3 Chat Model

OpenAI. GPT-5.3 Chat Model. https://developers.openai.com/api/docs/models/gpt-5. 3-chat-latest, 2026

work page 2026
[23]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

work page 2022
[24]

U-Net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Intervention, 2015

work page 2015
[25]

DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

work page 2023
[26]

ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation

Yuxiang Wei, Zhe Lin, Honghui Shi, and others. ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

work page 2023
[27]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, and others. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2023

work page 2023
[29]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and others. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models.arXiv preprint arXiv:2308.06721, 2023. 12

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George B. van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, ...

work page 2022

[2] [2]

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. VideoCrafter1: Open diffusion models for high-quality video generation.arXiv preprint arXiv:2310.19512, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, and Wenfeng Liang. Conditional memory via scalable lookup: A new axis of sparsity for large language models.arXiv preprint arXiv:2601.07372, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. InInternational Conference on Learning Representations, 2024

work page 2024

[7] [7]

REALM: Retrieval-augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. InProceedings of the International Conference on Machine Learning, 2020

work page 2020

[8] [8]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

work page 2021

[9] [9]

Parameter-efficient transfer learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. InProceedings of the International Conference on Machine Learning, 2019

work page 2019

[10] [10]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models.International Conference on Learning Representa- tions, 2022

work page 2022

[11] [11]

OpenCLIP.https://github.com/mlfoundations/open_clip, 2021

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP.https://github.com/mlfoundations/open_clip, 2021

work page 2021

[12] [12]

Generalization through memorization: Nearest neighbor language models

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization through memorization: Nearest neighbor language models. InInternational Conference on Learning Representations, 2020

work page 2020

[13] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

[14] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2021.

[15] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, 2020.

[16] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2021.

[17] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2022.

[18] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In Advances in Neural Information Processing Systems, 2022.

[19] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021.

[20] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

[21] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.

[22] OpenAI. GPT-5.3 Chat Model. https://developers.openai.com/api/docs/models/gpt-5.3-chat-latest, 2026.

[23] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

[24] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, 2015.

[25] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

[26] Yuxiang Wei, Zhe Lin, Honghui Shi, and others. ELITE: Encoding visual concepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.

[27] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, and others. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[28] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-A-Video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.

[29] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and others. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[30] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.