SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass

Haggai Maron; Muhan Zhang; Xiyuan Wang; Yansheng Mao; Yewei Liu; Yoav Gelbery

arxiv: 2602.06358 · v2 · pith:AQ26H47Cnew · submitted 2026-02-06 · 💻 cs.CL · cs.AI

Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang

Pith reviewed 2026-05-21 14:40 UTC · model grok-4.3

keywords: hypernetwork · LoRA · LLM adaptation · in-context learning · parameter-efficient fine-tuning · single-pass generation · context to parameters

The pith

SHINE maps any context to a high-quality LoRA adapter for an LLM in one forward pass by reusing the model's frozen parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SHINE, a hypernetwork that converts diverse contexts into effective LoRA adapters for large language models. It reuses the frozen LLM's own parameters inside its architecture and adds specific design choices to reach strong expressive power with relatively few additional parameters. After a pretraining stage followed by instruction fine-tuning, the hypernetwork produces adapters in a single pass that let the frozen LLM answer complex questions about the context without storing or re-accessing that context. This turns temporary in-context knowledge into permanent in-parameter knowledge and cuts the time, computation, and memory costs that normally come with supervised fine-tuning of LLMs.

Core claim

SHINE is a scalable hypernetwork that, after pretraining and instruction fine-tuning, takes a meaningful context and outputs a LoRA adapter in one forward pass. The adapter is then applied to the frozen base LLM so that the model can perform complex tasks tied to that context without any further gradient updates or direct access to the original context. The design reuses the target LLM's frozen weights within the hypernetwork itself and introduces architectural changes that overcome earlier hypernetwork limitations, delivering high-quality adapters with a modest parameter budget.

What carries the argument

The SHINE hypernetwork, which reuses the frozen LLM's parameters inside its in-context architecture to map input contexts directly to LoRA adapters in a single forward pass.
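To make the adapter step concrete, here is a minimal sketch of how a generated low-rank pair could be applied to one frozen weight matrix. It assumes the convention W ∈ R^(I×O) with A ∈ R^(I×r) and B ∈ R^(r×O); the helper names `matmul` and `apply_lora`, the tiny dimensions, and the absence of any scaling factor are illustrative assumptions, not the authors' implementation.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def apply_lora(w, a, b):
    """Return W + A @ B as a new matrix; the frozen weight W is not modified."""
    delta = matmul(a, b)
    return [
        [w_ij + d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical sizes: I = 3 input features, O = 2 output features, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # frozen base weight, I x O
A = [[1.0], [0.0], [2.0]]                 # generated factor, I x r
B = [[0.5, -1.0]]                         # generated factor, r x O

W_adapted = apply_lora(W, A, B)
```

Because the update is the product of two thin matrices, the hypernetwork only has to emit I·r + r·O numbers per adapted weight instead of I·O.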

If this is right

An LLM can answer complex questions about a supplied context immediately after one pass through the hypernetwork, without gradient updates or context storage.
Adaptation cost drops sharply because no fine-tuning step and no repeated context access are required at inference time.
The same hypernetwork can generate adapters for many different contexts, supporting repeated use of the same base model across varied inputs.
The approach shows scaling potential, suggesting that larger hypernetworks or longer contexts could be handled with proportional but still modest extra cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could support on-device or edge deployment by converting a one-time context into a small, reusable set of adapter weights that stay with the model.
Replacing retrieval steps in retrieval-augmented generation with generated parameter updates might reduce latency for repeated queries on the same material.
The single-pass design invites testing whether similar hypernetworks can produce other forms of parameter updates beyond LoRA.

Load-bearing premise

The described pretraining and instruction fine-tuning pipeline, together with reuse of the LLM's frozen parameters, is enough to produce LoRA adapters that reliably encode context knowledge for use on new tasks.

What would settle it

Measure whether the LoRA adapter produced by SHINE from a held-out context improves accuracy on context-specific questions by a clear margin over the base LLM using standard in-context prompting; if no improvement appears across multiple contexts and tasks, the central claim is falsified.
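The comparison described above could be scored along the following lines. This is a sketch under stated assumptions: it uses a whitespace-tokenized, lowercased token-overlap F1 (the paper reports F1, but its exact tokenization and normalization are not given here), and the names `token_f1` and `mean_gain` are hypothetical.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (whitespace tokens, lowercased)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_gain(adapter_answers, baseline_answers, references):
    """Average F1 difference (adapter minus baseline) over paired answers."""
    gains = [
        token_f1(a, r) - token_f1(b, r)
        for a, b, r in zip(adapter_answers, baseline_answers, references)
    ]
    return sum(gains) / len(gains)
```

A positive `mean_gain` across many held-out contexts would favor the adapter; a value near zero or below would count against the claim. How large a margin counts as "clear" is a judgment the sketch leaves open.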

Figures

Figures reproduced from arXiv: 2602.06358 by Haggai Maron, Muhan Zhang, Xiyuan Wang, Yansheng Mao, Yewei Liu, Yoav Gelbery.

**Figure 1.** An Example of SHINE: It maps context to LoRA in a single pass without any fine-tuning. The LoRA can be used for downstream conversation without accessing the context.

Adjacent text: …head, sensitivity to hyperparameters, and the storage burden of maintaining distinct model parameters for each task. The advances in hypernetworks offer a promising third alternative (Ha et al., 2017; Chauhan et al., 2024). A hypernetwork i…

**Figure 2.** Overall Architecture. The process consists of two passes: (1) Memory Extraction, where the LLM (augmented with Meta LoRA) processes context to produce memory states, and (2) Parameter Generation, where a hypernetwork converts these states into task-specific LoRA adapters for the final inference.

Adjacent text: Step 2: Sparse Attention Transformer. We employ a lightweight Transformer to process $\tilde{M}$. Flattening $\tilde{M}$ into a s…
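The lightweight Transformer over the memory tensor is described elsewhere on this page as alternating attention along the layer and token axes. The sketch below illustrates that general idea (often called axial attention) on a nested list of shape [layers][tokens][hidden]. It is a simplification under stated assumptions: single head, identity query/key/value projections, no residuals or normalization, and a token-then-layer order that is a guess. None of the function names come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head self-attention over a list of vectors (identity projections)."""
    dim = len(vectors[0])
    out = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)])
    return out


def attend_tokens(memory):
    """Attention along the token axis, independently for each layer."""
    return [self_attention(layer) for layer in memory]


def attend_layers(memory):
    """Attention along the layer axis, independently for each token position."""
    num_layers, num_tokens = len(memory), len(memory[0])
    columns = [
        self_attention([memory[l][n] for l in range(num_layers)])
        for n in range(num_tokens)
    ]
    return [[columns[n][l] for n in range(num_tokens)] for l in range(num_layers)]


def alternating_block(memory):
    """One token-axis pass followed by one layer-axis pass; shape is preserved."""
    return attend_layers(attend_tokens(memory))
```

With L layers and N tokens, the two axis-wise passes compute on the order of L·N² + N·L² attention scores, whereas attention over the fully flattened sequence would compute (L·N)²; this is the usual motivation for attending along one axis at a time.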

**Figure 3.** Hypernetwork Architecture. The model uses alternating attention along layer and token axes to efficiently process the memory tensor before projecting it into weights.

Adjacent text: …slice and reshape $v$ into different LoRAs. Suppose the first $t$ elements in $v$ have been used. To generate LoRA for $W \in \mathbb{R}^{I \times O}$, where $I$ is the input feature dimension and $O$ is the output feature dimension, we calculate: $A = \mathrm{Reshape}(v[t : t + I \cdot \dots$
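The slicing step in the excerpt above can be sketched as a running offset over one flat vector. The source text is truncated after the start of the A slice, so the B slice here (the next r·O entries) and the row-major `reshape` are assumptions, as are the names `reshape` and `take_lora` and the toy sizes.

```python
def reshape(flat, rows, cols):
    """Row-major reshape of a flat list into a rows x cols matrix."""
    assert len(flat) == rows * cols
    return [flat[i * cols:(i + 1) * cols] for i in range(rows)]


def take_lora(v, t, in_dim, out_dim, rank):
    """Consume in_dim*rank + rank*out_dim entries of v starting at offset t.

    Returns A (in_dim x rank), B (rank x out_dim), and the updated offset.
    """
    a_end = t + in_dim * rank
    b_end = a_end + rank * out_dim
    a = reshape(v[t:a_end], in_dim, rank)
    b = reshape(v[a_end:b_end], rank, out_dim)
    return a, b, b_end


# A stand-in for the hypernetwork's flat output vector.
v = list(range(20))

# First adapted weight: I = 3, O = 2, r = 1, so 3 + 2 = 5 entries are used.
a1, b1, t = take_lora(v, 0, in_dim=3, out_dim=2, rank=1)

# Second adapted weight: I = 2, O = 4, r = 2, so 4 + 8 = 12 entries are used.
a2, b2, t = take_lora(v, t, in_dim=2, out_dim=4, rank=2)
```

Carrying the offset forward lets one output vector supply adapters for many weight matrices of different shapes without overlap.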

**Figure 4.** Reconstruction Task: The hypernetwork encodes the full context into a LoRA. The LLM is then prompted to reconstruct the original text.

Adjacent text: 4.1. Pretraining. Pretraining establishes the hypernetwork's ability to compress and reconstruct context. We train the model in a self-supervised manner on a large corpus using two complementary objectives: Reconstruction and Completion. Reconstruction. The reconstruction …

**Figure 5.** Pretraining Results: Reconstruction and completion loss/perplexity across varying context lengths. P10/P90 denote the 10%/90% quantiles.

**Figure 6.** Multi-Turn Conversation F1-Score: Evaluate answer F1 scores using the MS MARCO MQA dataset, which contains 15 QA pairs per context. Plot: F1 score (0.0 to 1.0) vs. conversation turn (1 to 15) for In-Context, Naive SFT, and SHINE.

Adjacent text: …details are provided in Appendix B.2. 5.2.1. Instruction Fine-Tuning: MQA. We first fine-tune SHINE on a collection of MQA datasets, primarily comprising MS MARCO MQA (76%) alongside open-source alternatives (see Appendix B.3). Training proceeds for 2 epochs with a peak learni…

**Figure 7.** Computation (FLOPs) and memory usage (see Appendix B.5 and B.6 for derivations). SHINE significantly reduces fine-tuning computation compared to SFT and lowers memory/compute costs during generation compared to In-Context methods.

Adjacent text: 5.2.2. Instruction Fine-Tuning: 1QA. Following MQA tuning, we fine-tune on 1QA datasets (listed in Appendix B.4) for 1 epoch at a learning rate of 1e-5. We evaluate on…

**Figure 8.** Scaling with Backbone LLM: Pretraining performance comparison across different sizes of backbone LLM.

**Figure 9.** Reshape to LoRA: Different ways of transpose. Define $A^\top$ as the transpose of $A$. We have four kinds of transpose in the LoRA generating process:

rl: $A = \mathrm{Reshape}(T[t : t + I \cdot r]) \in \mathbb{R}^{I \times r}$, $B = \mathrm{Reshape}(T[t + I \cdot r : t + I \cdot r + r \cdot O]) \in \mathbb{R}^{r \times O}$ (15)

rr: $A = \mathrm{Reshape}(T[t : t + I \cdot r]) \in \mathbb{R}^{I \times r}$, $B^\top = \mathrm{Reshape}(T[t + I \cdot r : t + I \cdot r + r \cdot O]) \in \mathbb{R}^{O \times r}$ (16)

lr: $A^\top = \mathrm{Reshape}(T[t : t + I \cdot r]) \in \mathbb{R}^{r \times I}$, $B^\top = \mathrm{Reshape}(T[t + \dots$
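The transpose variants in Figure 9 differ only in which shape the flat slice is first reshaped into. The sketch below reads the two-letter codes as "the rank dimension sits on the right (`r`) or left (`l`) of the reshaped matrix", first letter for A and second for B. That reading is inferred from Eqs. 15 and 16 and the visible part of the `lr` case; the `ll` case is extrapolated because the source is truncated. The slice lengths I·r and r·O, the row-major `reshape`, and the name `build_lora` are likewise assumptions.

```python
def reshape(flat, rows, cols):
    """Row-major reshape of a flat list into a rows x cols matrix."""
    assert len(flat) == rows * cols
    return [flat[i * cols:(i + 1) * cols] for i in range(rows)]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def build_lora(flat, in_dim, out_dim, rank, variant):
    """Build A (in_dim x rank) and B (rank x out_dim) from one flat slice.

    variant is one of "rl", "rr", "lr", "ll" (the last one is extrapolated).
    """
    a_flat = flat[:in_dim * rank]
    b_flat = flat[in_dim * rank:in_dim * rank + rank * out_dim]
    a_side, b_side = variant
    if a_side == "r":
        a = reshape(a_flat, in_dim, rank)             # A filled directly
    else:
        a = transpose(reshape(a_flat, rank, in_dim))  # A^T filled, then flipped
    if b_side == "l":
        b = reshape(b_flat, rank, out_dim)            # B filled directly
    else:
        b = transpose(reshape(b_flat, out_dim, rank)) # B^T filled, then flipped
    return a, b


# Toy slice for I = 2, O = 3, r = 2: 4 entries for A, 6 entries for B.
flat = list(range(1, 11))
a_rl, b_rl = build_lora(flat, 2, 3, 2, "rl")
a_rr, b_rr = build_lora(flat, 2, 3, 2, "rr")
a_lr, b_lr = build_lora(flat, 2, 3, 2, "lr")
```

Every variant yields A and B with the same final shapes; what changes is which flat entries land in which row and column, and therefore how neighboring hypernetwork outputs are grouped within each rank component.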

**Figure 10.** Pretraining Validation PPL that Reflects the Bitter Lesson.

Adjacent text: As the Bitter Lesson suggests, “general methods that rely on computation and data ultimately outperform methods that rely on human-designed priors or domain-specific knowledge, even if the latter show faster early progress”. Although the design of our hypernetwork does not explicitly incorporate many available priors, we believe it is already suff…

**Figure 11.** Completion Task: The hypernetwork encodes a truncated context into a LoRA. The LLM must utilize this LoRA to recover original context and complete the missing part.

Diagram labels: LLM ❄️, Meta LoRA 🔥, Initial Memory Embeddings 🔥, Middle Memory Hidden States, Generated LoRA; token sequence c(2) … c(N), <USE>, q(1) … q(S), <ASS>, a(1) … a(T).

**Figure 12.** Instruction Fine-Tuning: The hypernetwork encodes context into LoRA. The model is then prompted to answer questions based on the context.

Original abstract

We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLMs). By reusing the frozen LLM's own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/MuLabPKU/SHINE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SHINE reuses the target LLM's parameters inside an in-context hypernetwork to generate LoRA adapters from context in one forward pass, which is a distinct design move, but the abstract gives no numbers to show whether the adapters actually carry useful knowledge.

SHINE's main move is to build a hypernetwork that takes a context and spits out LoRA weights for a frozen LLM in a single pass. It does this by folding the LLM's own parameters into the hypernetwork and training the whole thing with a pretraining stage followed by instruction fine-tuning. The hope is that the resulting adapter embeds enough of the context so the LLM can answer related questions without the context being fed in again. That is a concrete difference from earlier hypernetwork attempts that treated the generator as a separate, smaller model.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SHINE, a scalable in-context hypernetwork that maps diverse contexts to high-quality LoRA adapters for frozen LLMs in a single forward pass. By reusing the LLM's own parameters and introducing architectural innovations, combined with a pretraining and instruction fine-tuning pipeline, the method claims to transform in-context knowledge into in-parameter knowledge, enabling complex QA tasks without direct context access while achieving strong expressive power with few parameters and substantial savings in time, computation, and memory relative to SFT-based adaptation.

Significance. If the empirical claims hold under rigorous evaluation, the work could meaningfully advance efficient, dynamic LLM adaptation by demonstrating that a single-pass hypernetwork can reliably compress context into low-rank updates that support downstream reasoning, offering a scalable alternative to per-task fine-tuning with potential benefits for memory-constrained or real-time applications.

major comments (2)

[Abstract and §4] Abstract and §4 (Results): The abstract asserts 'outstanding results on various tasks' and 'greatly saves time, computation and memory costs compared to SFT-based LLM adaptation' with 'strong expressive power,' yet the provided description supplies no quantitative metrics, specific baselines (e.g., standard LoRA fine-tuning, other hypernetworks), ablation studies, or error analysis. This absence directly weakens the central claim that the architectural reuse plus training pipeline produces high-quality adapters.
[§3] §3 (Method) and skeptic concern on single-pass encoding: The claim that the generated LoRA adapters enable complex QA 'without directly accessing the context' requires explicit validation that the updates encode deep context understanding rather than surface patterns. Experiments comparing performance on context-dependent QA with vs. without the original context in the final prompt (and vs. full-context baselines) are load-bearing for the 'in-parameter knowledge' transformation and are not addressed in the current description.

minor comments (2)

[§3] Notation for the hypernetwork input/output dimensions and the precise reuse of LLM layers should be clarified with a diagram or explicit equations to avoid ambiguity in how parameters are shared.
[Appendix or §5] The GitHub link is provided but no details on reproducibility (e.g., exact hyperparameters, dataset splits for pretraining) are summarized in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We have revised the manuscript to directly address the concerns about quantitative support and validation of the in-parameter knowledge claim, adding the requested metrics, baselines, ablations, and targeted experiments.

Referee: [Abstract and §4] Abstract and §4 (Results): The abstract asserts 'outstanding results on various tasks' and 'greatly saves time, computation and memory costs compared to SFT-based LLM adaptation' with 'strong expressive power,' yet the provided description supplies no quantitative metrics, specific baselines (e.g., standard LoRA fine-tuning, other hypernetworks), ablation studies, or error analysis. This absence directly weakens the central claim that the architectural reuse plus training pipeline produces high-quality adapters.

Authors: We agree that the abstract and results section would be strengthened by explicit quantitative support. In the revised manuscript we have expanded §4 with concrete performance tables reporting accuracy/F1 scores across tasks, direct comparisons to standard LoRA fine-tuning and prior hypernetwork baselines, wall-clock time and memory measurements showing the claimed savings, ablation studies isolating the effect of parameter reuse and architectural innovations, and a brief error analysis of remaining failure modes. These additions are now referenced from the abstract.

Revision: yes

Referee: [§3] §3 (Method) and skeptic concern on single-pass encoding: The claim that the generated LoRA adapters enable complex QA 'without directly accessing the context' requires explicit validation that the updates encode deep context understanding rather than surface patterns. Experiments comparing performance on context-dependent QA with vs. without the original context in the final prompt (and vs. full-context baselines) are load-bearing for the 'in-parameter knowledge' transformation and are not addressed in the current description.

Authors: We acknowledge the need for explicit validation that the adapters capture deep rather than superficial context information. We have added a new set of controlled experiments in the revised §4 that evaluate context-dependent QA under three conditions: (i) SHINE-generated LoRA with no context provided at inference, (ii) the same questions with the original context but no adapter, and (iii) full-context baselines. The results show that performance with the adapter alone remains competitive with full-context prompting and substantially exceeds the no-adapter baseline, supporting the transformation of in-context knowledge into in-parameter updates. We also include qualitative analysis of attention patterns to further address the surface-pattern concern.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical architecture and training pipeline evaluated on external benchmarks

The paper introduces SHINE as a new hypernetwork design that reuses frozen LLM parameters with architectural innovations, followed by a described pretraining and instruction fine-tuning pipeline to generate LoRA adapters from contexts in a single pass. Claims of strong expressive power, task performance, and efficiency gains are supported by experimental results on various tasks compared to SFT baselines. No equations, derivations, or self-citations are shown that reduce the central claims to fitted parameters or prior results by construction. The method is trained on data and tested on held-out tasks, rendering outcomes independent rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit list of free parameters, axioms, or invented entities; the design implicitly assumes that the hypernetwork training pipeline can learn a general mapping from context to effective LoRA weights.

pith-pipeline@v0.9.0 · 5729 in / 1156 out tokens · 48617 ms · 2026-05-21T14:40:39.221019+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose SHINE ... map diverse meaningful contexts into high-quality LoRA adapters ... in a single forward pass ... transforming in-context knowledge to in-parameter knowledge

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: M2P Transformer ... alternates between column attention and row attention ... bidirectional information flow

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Training Transformers for KV Cache Compressibility
cs.LG · 2026-05 · unverdicted · novelty 6.0
KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...

Training Transformers for KV Cache Compressibility
cs.LG · 2026-05 · unverdicted · novelty 6.0
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.

The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
cs.LG · 2026-04 · conditional · novelty 6.0
Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...

The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
cs.LG · 2026-04 · unverdicted · novelty 5.0
Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.


[1] [2]

Brown, T

URL https://proceedings.mlr.press/ v205/beck23a.html. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-V oss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Ches...

work page 2020

[2] [3]

Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W., Choi, Y ., Liang, P., and Zettlemoyer, L

URL https://openreview.net/forum? id=bc3sUsS6ck. Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W., Choi, Y ., Liang, P., and Zettlemoyer, L. Quac: Question answering in context. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.),Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, Octob...

work page 2018

[3] [4]

URL https:// doi.org/10.18653/v1/d18-1241

doi: 10.18653/V1/D18-1241. URL https:// doi.org/10.18653/v1/d18-1241. Delétang, G., Ruoss, A., Duquenne, P., Catt, E., Genewein, T., Mattern, C., Grau-Moya, J., Wenliang, L. K., Aitchi- son, M., Orseau, L., Hutter, M., and Veness, J. Lan- guage modeling is compression. InThe Twelfth Inter- national Conference on Learning Representations, ICLR 2024, Vienna...

work page doi:10.18653/v1/d18-1241 2024

[4] [5]

Dua, D., Wang, Y ., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M

URL https://openreview.net/forum? id=jznbgiynus. Dua, D., Wang, Y ., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Burstein, J., Doran, C., and Solorio, T. (eds.),Pro- ceedings of the 2019 Conference of the North Ameri- can Chapter of the Association for Compu...

work page doi:10.18653/v1/n19-1246 2019

[5] [6]

arXiv preprint arXiv:2502.13595 , year=

doi: 10.48550/arXiv.2502.13595. URL https: //arxiv.org/abs/2502.13595. Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta- learning for fast adaptation of deep networks. In Precup, D. and Teh, Y . W. (eds.),Proceedings of the 34th Inter- national Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Procee...

[6] [7]

PMLR, 2017. URL http://proceedings.mlr.press/v70/finn17a.html. Ge, T., Hu, J., Wang, L., Wang, X., Chen, S., and Wei, F. In-context autoencoder for context compression in a large language model. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net...


[7] [8]

Ellie Pavlick and Tom Kwiatkowski

URL https://openreview.net/forum?id=rkpACe1lx. Ho, X., Nguyen, A. D., Sugawara, S., and Aizawa, A. Constructing A multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Scott, D., Bel, N., and Zong, C. (eds.), Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), Dece...

[8] [9]

doi: 10.1109/TPAMI.2021.3079209. URL https://doi.org/10.1109/TPAMI.2021.3079209. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net,

[9] [10]

URL https://openreview.net/forum?id=nZeVKeeFYf9. Jukic, J., Tutek, M., and Snajder, J. Context parametrization with compositional adapters. CoRR, abs/2509.22158,


[10] [12]

doi: 10.18653/v1/D17-1082

URL https://openreview.net/forum?id=oO6FsMyDBt. Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. H. RACE: large-scale reading comprehension dataset from examinations. In Palmer, M., Hwa, R., and Riedel, S. (eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 20...


[11] [13]

wb ≡1 recovers the uniform variant

URL https://aclanthology.org/2025.coling-main.89/. Lim, D., Maron, H., Law, M. T., Lorraine, J., and Lucas, J. Graph metanetworks for processing diverse neural architectures. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=ijK5hyxs0n...


[12] [15]

MTEB: Massive Text Embedding Benchmark

URL https://openreview.net/forum?id=0DcZxeWfOPt. Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. Mteb: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022. doi: 10.48550/ARXIV.2210.07316. URL https://arxiv.org/abs/2210.07316. Munkhdalai, T. and Yu, H. Meta networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th ...

[13] [16]

PMLR, 2017. URL http://proceedings.mlr.press/v70/munkhdalai17a.html. Navon, A., Shamsian, A., Achituve, I., Fetaya, E., Chechik, G., and Maron, H. Equivariant architectures for learning in deep weight spaces. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 202...


[14] [17]

A ConvNet for the 2020s

URL https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., and Lowe, R. Training language models to follow instruction...

[15] [18]

doi: 10.18653/V1/P18-1156. URL https://aclanthology.org/P18-1156/. Sarafian, E., Keynan, S., and Kraus, S. Recomposing the reinforcement learning building blocks with hypernetworks. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings ...


[16] [20]

2023/120

URL https://doi.org/10.24963/ijcai.2025/683. Tan, C., Zhang, G., and Fu, J. Massive editing for large language models via meta learning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,


[17] [21]

Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation

URL https://openreview.net/forum?id=L6L1CJQ2PE. Tang, P., Wang, Y., and Zhang, M. Knowledge is not enough: Injecting rl skills for continual adaptation, 2026. URL https://arxiv.org/abs/2601.11258. Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., and Suleman, K. Newsqa: A machine comprehension dataset. In Blunsom, P., Bordes, A....


[18] [22]

Dickerson

URL https://openreview.net/forum?id=rkgW0oA9FX. Zhou, A., Yang, K., Burns, K., Cardace, A., Jiang, Y., Sokota, S., Kolter, J. Z., and Finn, C. Permutation equivariant neural functionals. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neur...


[19] [23]

Fully grounded in the context -- meaning the answer is either:
- An exact substring of the context, OR
- A minor, fluent paraphrase that does not add, remove, or distort any factual detail (e.g., changing 'was founded in 1976' to 'founded in 1976' is OK; saying 'started in the 70s' is NOT OK).


[20] [24]

Factually consistent with the context


[21] [25]


Paired with a clear, relevant question that can be answered from the context.

If ANY answer fails these criteria, respond with:
{{"valid": false, "reason": "Brief reason"}}

If ALL are valid, respond with:
{{"valid": true}}

Context:
{context}

QA Pairs:
{qa_list_str}
"""

Any data point that fails either the format or validation check needs to be regenerated...
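The reply format requested by the validation prompt above could be consumed as in the following minimal sketch. It assumes the doubled braces are Python format-string escapes, so the validator's actual reply is plain JSON with single braces. The function name `parse_validation_reply` is hypothetical and does not come from the paper, and treating a malformed reply as invalid is a design assumption made here so that such data points are also regenerated.

```python
import json
from typing import Tuple


def parse_validation_reply(reply: str) -> Tuple[bool, str]:
    """Return (valid, reason) from a validator reply.

    Hypothetical helper: a reply that is not well-formed JSON, or that lacks
    a boolean "valid" field, is treated as invalid (an assumption, not a
    rule stated in the source).
    """
    try:
        verdict = json.loads(reply)
    except json.JSONDecodeError:
        return False, "reply is not valid JSON"
    if not isinstance(verdict, dict) or not isinstance(verdict.get("valid"), bool):
        return False, "reply lacks a boolean 'valid' field"
    # "reason" is only expected when valid is false; default to an empty string.
    return verdict["valid"], str(verdict.get("reason", ""))
```

A caller would regenerate the data point whenever the first element of the returned pair is `False`.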


[22] [26]

the input hidden states of size $N \times H$, and


[23] [27]

the MLP intermediate activations of size $N \times 3H$, along with additional $O(NH)$ buffers for attention outputs, residual connections, and layer normalization. Consequently, the peak extra memory across all layers scales as

$$\mathrm{Mem}^{(\text{no KV})}_{\text{peak}} \approx c\,L\,N\,H, \tag{40}$$

where $c$ is a modest architecture-dependent constant (typically in the range 4–6 in practice). If we retain only the d...
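As a rough illustration of Eq. (40), the sketch below evaluates $c\,L\,N\,H$ for one setting. It rests on several assumptions that are not confirmed by the text: $L$ is read as the number of layers, $N$ as the sequence length, $H$ as the hidden size, the estimate is counted in tensor elements rather than bytes, and the conversion uses 2 bytes per element (a 16-bit dtype). The function names and the numbers are hypothetical and are not taken from the paper.

```python
def peak_extra_activation_elements(c: float, num_layers: int, seq_len: int, hidden: int) -> float:
    """Evaluate c * L * N * H from Eq. (40), in tensor elements (assumed unit)."""
    return c * num_layers * seq_len * hidden


def elements_to_gib(num_elements: float, bytes_per_element: int = 2) -> float:
    """Convert an element count to GiB; 2 bytes per element assumes a 16-bit dtype."""
    return num_elements * bytes_per_element / 2**30


# Illustrative numbers only; they are not taken from the paper.
elements = peak_extra_activation_elements(c=4, num_layers=32, seq_len=1024, hidden=4096)
gib = elements_to_gib(elements)
```

Because the estimate is linear in each factor, doubling the sequence length or the hidden size doubles it under these assumptions.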
