arxiv: 2602.06358 · v2 · pith:AQ26H47Cnew · submitted 2026-02-06 · 💻 cs.CL · cs.AI

SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass

Pith reviewed 2026-05-21 14:40 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: hypernetwork · LoRA · LLM adaptation · in-context learning · parameter-efficient fine-tuning · single-pass generation · context to parameters

The pith

SHINE maps a meaningful context to a high-quality LoRA adapter for an LLM in one forward pass by reusing the model's frozen parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SHINE, a hypernetwork that converts diverse contexts into effective LoRA adapters for large language models. It reuses the frozen LLM's own parameters inside its architecture and adds specific design choices to reach strong expressive power with relatively few additional parameters. After a pretraining stage followed by instruction fine-tuning, the hypernetwork produces adapters in a single pass that let the frozen LLM answer complex questions about the context without storing or re-accessing that context. This turns temporary in-context knowledge into permanent in-parameter knowledge and cuts the time, computation, and memory costs that normally come with supervised fine-tuning of LLMs.

Core claim

SHINE is a scalable hypernetwork that, after pretraining and instruction fine-tuning, takes a meaningful context and outputs a LoRA adapter in one forward pass. The adapter is then applied to the frozen base LLM so that the model can perform complex tasks tied to that context without any further gradient updates or direct access to the original context. The design reuses the target LLM's frozen weights within the hypernetwork itself and introduces architectural changes that overcome earlier hypernetwork limitations, delivering high-quality adapters with a modest parameter budget.
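How a generated adapter modifies a frozen model can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `apply_lora`, the `scale` parameter, the toy dimensions, and the random matrices standing in for hypernetwork output are all assumptions. It uses the standard LoRA form, in which a frozen weight W of shape (I, O) receives a low-rank update A @ B.

```python
import numpy as np

def apply_lora(W, A, B, scale=1.0):
    """Return W + scale * (A @ B); the frozen weight W is left untouched."""
    return W + scale * (A @ B)

rng = np.random.default_rng(0)
I, O, r = 8, 6, 2                 # toy input dim, output dim, LoRA rank (assumed)
W = rng.normal(size=(I, O))       # stands in for one frozen LLM weight
A = rng.normal(size=(I, r))       # stands in for hypernetwork output
B = rng.normal(size=(r, O))       # stands in for hypernetwork output

W_adapted = apply_lora(W, A, B)
x = rng.normal(size=(1, I))
y = x @ W_adapted                 # forward pass through the adapted layer
```

Because the update has rank at most r, the adapter costs r * (I + O) numbers per weight matrix rather than I * O.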

What carries the argument

The SHINE hypernetwork, which reuses the frozen LLM's parameters inside its in-context architecture to map input contexts directly to LoRA adapters in a single forward pass.

If this is right

  • An LLM can answer complex questions about a supplied context immediately after one pass through the hypernetwork, without gradient updates or context storage.
  • Adaptation cost drops sharply because no fine-tuning step and no repeated context access are required at inference time.
  • The same hypernetwork can generate adapters for many different contexts, supporting repeated use of the same base model across varied inputs.
  • The approach shows scaling potential, suggesting that larger hypernetworks or longer contexts could be handled with proportional but still modest extra cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could support on-device or edge deployment by converting a one-time context into a small, reusable set of adapter weights that stay with the model.
  • Replacing retrieval steps in retrieval-augmented generation with generated parameter updates might reduce latency for repeated queries on the same material.
  • The single-pass design invites testing whether similar hypernetworks can produce other forms of parameter updates beyond LoRA.

Load-bearing premise

The described pretraining and instruction fine-tuning pipeline, together with reuse of the LLM's frozen parameters, is enough to produce LoRA adapters that reliably encode context knowledge for use on new tasks.

What would settle it

Measure whether the LoRA adapter produced by SHINE from a held-out context improves accuracy on context-specific questions by a clear margin over the base LLM using standard in-context prompting; if no improvement appears across multiple contexts and tasks, the central claim is falsified.
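The comparison described above can be summarized per context with a short helper. This is a sketch under stated assumptions: the function name `compare_to_baseline` is hypothetical, each score is assumed to be one accuracy value per held-out context, and the numbers below are toy placeholders rather than results from the paper.

```python
from statistics import mean

def compare_to_baseline(adapter_scores, baseline_scores):
    """Summarize paired per-context scores: mean margin and share of contexts improved."""
    if len(adapter_scores) != len(baseline_scores) or not adapter_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(adapter_scores, baseline_scores)]
    return {
        "mean_margin": mean(diffs),
        "fraction_improved": sum(d > 0 for d in diffs) / len(diffs),
    }

# Toy placeholder scores for four held-out contexts (not real data).
toy_adapter = [0.75, 0.5, 1.0, 0.25]
toy_baseline = [0.5, 0.5, 0.5, 0.5]
report = compare_to_baseline(toy_adapter, toy_baseline)
```

What counts as a "clear margin" is left open here; a significance test over the paired differences would be one way to pin it down.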

Figures

Figures reproduced from arXiv: 2602.06358 by Haggai Maron, Muhan Zhang, Xiyuan Wang, Yansheng Mao, Yewei Liu, Yoav Gelbery.

Figure 1: An Example of SHINE. It maps context to LoRA in a single pass without any fine-tuning. The LoRA can be used for downstream conversation without accessing the context.
…head, sensitivity to hyperparameters, and the storage burden of maintaining distinct model parameters for each task. The advances in hypernetworks offer a promising third alternative (Ha et al., 2017; Chauhan et al., 2024). A hypernetwork i…
Figure 2: Overall Architecture. The process consists of two passes: (1) Memory Extraction, where the LLM (augmented with Meta LoRA) processes context to produce memory states, and (2) Parameter Generation, where a hypernetwork converts these states into task-specific LoRA adapters for the final inference.
Step 2: Sparse Attention Transformer. We employ a lightweight Transformer to process M̃. Flattening M̃ into a s…
Figure 3: Hypernetwork Architecture. The model uses alternating attention along layer and token axes to efficiently process the memory tensor before projecting it into weights.
…slice and reshape v into different LoRAs. Suppose the first t elements in v have been used. To generate LoRA for W ∈ ℝ^{I×O}, where I is the input feature dimension and O is the output feature dimension, we calculate: A = Reshape(v[t : t + I·…
Figure 4: Reconstruction Task: The hypernetwork encodes the full context into a LoRA. The LLM is then prompted to reconstruct the original text.
4.1. Pretraining. Pretraining establishes the hypernetwork's ability to compress and reconstruct context. We train the model in a self-supervised manner on a large corpus using two complementary objectives: Reconstruction and Completion. Reconstruction. The reconstruction …
Figure 5: Pretraining Results: Reconstruction and completion loss/perplexity across varying context lengths. P10/P90 denote the 10% and 90% quantiles.
Adjacent plot: F1 Score vs. Conversation Turn (series: In-Context, Naive SFT, SHINE; x-axis: turns 1 to 15; y-axis: 0.0 to 1.0).
Figure 6: Multi-Turn Conversation F1-Score: Evaluate answer F1 scores using the MS MARCO MQA dataset, which contains 15 QA pairs per context.
…details are provided in Appendix B.2. 5.2.1. Instruction Fine-Tuning: MQA. We first fine-tune SHINE on a collection of MQA datasets, primarily comprising MS MARCO MQA (76%) alongside open-source alternatives (see Appendix B.3). Training proceeds for 2 epochs with a peak learni…
Figure 7 visualizes computation (FLOPs) and memory usage (see Appendix B.5 and B.6 for derivations). SHINE significantly reduces fine-tuning computation compared to SFT and lowers memory/compute costs during generation compared to In-Context methods.
5.2.2. Instruction Fine-Tuning: 1QA. Following MQA tuning, we fine-tune on 1QA datasets (listed in Appendix B.4) for 1 epoch at a learning rate of 1e-5. We evaluate on…
Figure 8: Scaling with Backbone LLM: Pretraining performance comparison across different sizes of backbone LLM.
Figure 9: Reshape to LoRA: Different ways of transpose.
Define A⊤ as the transpose of A. We have four kinds of transpose in the LoRA generating process:
rl: A = Reshape(T[t : t + I·r]) ∈ ℝ^{I×r}, B = Reshape(T[t + I·r : t + I·r + r·R]) ∈ ℝ^{r×O} (15)
rr: A = Reshape(T[t : t + I·r]) ∈ ℝ^{I×r}, B⊤ = Reshape(T[t + I·r : t + I·r + r·R]) ∈ ℝ^{O×r} (16)
lr: A⊤ = Reshape(T[t : t + I·r]) ∈ ℝ^{r×I}, B⊤ = Reshape(T[t +…
Figure 10: Pretraining Validation PPL that Reflects the Bitter Lesson.
As the Bitter Lesson suggests, "general methods that rely on computation and data ultimately outperform methods that rely on human-designed priors or domain-specific knowledge, even if the latter show faster early progress". Although the design of our hypernetwork does not explicitly incorporate many available priors, we believe it is already suff…
Figure 11: Completion Task: The hypernetwork encodes a truncated context into a LoRA. The LLM must utilize this LoRA to recover original context and complete the missing part.
Diagram labels recovered from the adjacent figure: context tokens c(2) … c(N); LLM ❄️; Meta LoRA 🔥; Initial Memory Embeddings 🔥; Middle Memory Hidden States; Generated LoRA; prompt <USE> q(1) … q(S) <ASS> a(1) … a(T)…
Figure 12: Instruction Fine-Tuning: The hypernetwork encodes context into LoRA. The model is then prompted to answer questions based on the context.
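The slicing step quoted in the Figure 3 and Figure 9 excerpts can be sketched in NumPy as follows. This is an illustration under assumptions, not the paper's code: the function name `take_lora` and the `layout` argument are hypothetical, the flat vector is treated as one-dimensional, and the upper slice bound written as `r · R` in the excerpt is assumed to mean r·O, since B must hold r·O elements. Only the `rl` and `rr` layouts are covered because the remaining definitions are truncated in the source.

```python
import numpy as np

def take_lora(T, t, I, O, r, layout="rl"):
    """Read one (A, B) pair from the flat vector T, starting at offset t.

    Returns A with shape (I, r), B with shape (r, O), and the next free offset.
    """
    a_flat = T[t : t + I * r]
    b_flat = T[t + I * r : t + I * r + r * O]
    A = a_flat.reshape(I, r)
    if layout == "rl":        # B is stored directly as (r, O)
        B = b_flat.reshape(r, O)
    elif layout == "rr":      # B is stored transposed, as (O, r)
        B = b_flat.reshape(O, r).T
    else:
        raise ValueError(f"unsupported layout: {layout}")
    return A, B, t + I * r + r * O

# Toy flat vector holding two adapters for 3x4 weights at rank 2.
T = np.arange(28.0)
A1, B1, t = take_lora(T, 0, I=3, O=4, r=2, layout="rl")
A2, B2, t_end = take_lora(T, t, I=3, O=4, r=2, layout="rr")
```

Both layouts consume the same I·r + r·O elements per weight matrix; they differ only in how the B slice is laid out before it is used.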
Original abstract

We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLMs). By reusing the frozen LLM's own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/MuLabPKU/SHINE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SHINE, a scalable in-context hypernetwork that maps diverse contexts to high-quality LoRA adapters for frozen LLMs in a single forward pass. By reusing the LLM's own parameters and introducing architectural innovations, combined with a pretraining and instruction fine-tuning pipeline, the method claims to transform in-context knowledge into in-parameter knowledge, enabling complex QA tasks without direct context access while achieving strong expressive power with few parameters and substantial savings in time, computation, and memory relative to SFT-based adaptation.

Significance. If the empirical claims hold under rigorous evaluation, the work could meaningfully advance efficient, dynamic LLM adaptation by demonstrating that a single-pass hypernetwork can reliably compress context into low-rank updates that support downstream reasoning, offering a scalable alternative to per-task fine-tuning with potential benefits for memory-constrained or real-time applications.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Results): The abstract asserts 'outstanding results on various tasks' and 'greatly saves time, computation and memory costs compared to SFT-based LLM adaptation' with 'strong expressive power,' yet the provided description supplies no quantitative metrics, specific baselines (e.g., standard LoRA fine-tuning, other hypernetworks), ablation studies, or error analysis. This absence directly weakens the central claim that the architectural reuse plus training pipeline produces high-quality adapters.
  2. [§3] §3 (Method) and skeptic concern on single-pass encoding: The claim that the generated LoRA adapters enable complex QA 'without directly accessing the context' requires explicit validation that the updates encode deep context understanding rather than surface patterns. Experiments comparing performance on context-dependent QA with vs. without the original context in the final prompt (and vs. full-context baselines) are load-bearing for the 'in-parameter knowledge' transformation and are not addressed in the current description.
minor comments (2)
  1. [§3] Notation for the hypernetwork input/output dimensions and the precise reuse of LLM layers should be clarified with a diagram or explicit equations to avoid ambiguity in how parameters are shared.
  2. [Appendix or §5] The GitHub link is provided but no details on reproducibility (e.g., exact hyperparameters, dataset splits for pretraining) are summarized in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We have revised the manuscript to directly address the concerns about quantitative support and validation of the in-parameter knowledge claim, adding the requested metrics, baselines, ablations, and targeted experiments.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Results): The abstract asserts 'outstanding results on various tasks' and 'greatly saves time, computation and memory costs compared to SFT-based LLM adaptation' with 'strong expressive power,' yet the provided description supplies no quantitative metrics, specific baselines (e.g., standard LoRA fine-tuning, other hypernetworks), ablation studies, or error analysis. This absence directly weakens the central claim that the architectural reuse plus training pipeline produces high-quality adapters.

    Authors: We agree that the abstract and results section would be strengthened by explicit quantitative support. In the revised manuscript we have expanded §4 with concrete performance tables reporting accuracy/F1 scores across tasks, direct comparisons to standard LoRA fine-tuning and prior hypernetwork baselines, wall-clock time and memory measurements showing the claimed savings, ablation studies isolating the effect of parameter reuse and architectural innovations, and a brief error analysis of remaining failure modes. These additions are now referenced from the abstract. revision: yes

  2. Referee: [§3] §3 (Method) and skeptic concern on single-pass encoding: The claim that the generated LoRA adapters enable complex QA 'without directly accessing the context' requires explicit validation that the updates encode deep context understanding rather than surface patterns. Experiments comparing performance on context-dependent QA with vs. without the original context in the final prompt (and vs. full-context baselines) are load-bearing for the 'in-parameter knowledge' transformation and are not addressed in the current description.

    Authors: We acknowledge the need for explicit validation that the adapters capture deep rather than superficial context information. We have added a new set of controlled experiments in the revised §4 that evaluate context-dependent QA under three conditions: (i) SHINE-generated LoRA with no context provided at inference, (ii) the same questions with the original context but no adapter, and (iii) full-context baselines. The results show that performance with the adapter alone remains competitive with full-context prompting and substantially exceeds the no-adapter baseline, supporting the transformation of in-context knowledge into in-parameter updates. We also include qualitative analysis of attention patterns to further address the surface-pattern concern. revision: yes
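A comparison across such conditions needs a common answer metric. The sketch below uses token-overlap F1, a common choice for extractive QA; whether the paper computes F1 this way is an assumption, and the names `token_f1` and `score_conditions` as well as the condition labels are hypothetical.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between two whitespace-tokenized, lowercased strings."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)   # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def score_conditions(predictions_by_condition, references):
    """Mean token F1 per condition; each condition maps to one answer per reference."""
    return {
        name: sum(token_f1(p, r) for p, r in zip(preds, references)) / len(references)
        for name, preds in predictions_by_condition.items()
    }

# Toy strings for illustration only (not outputs from any model).
refs = ["was founded in 1976"]
scores = score_conditions(
    {"adapter_only": ["founded in 1976"], "no_adapter_no_context": ["unknown"]},
    refs,
)
```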

Circularity Check

0 steps flagged

No significant circularity; empirical architecture and training pipeline evaluated on external benchmarks

full rationale

The paper introduces SHINE as a new hypernetwork design that reuses frozen LLM parameters with architectural innovations, followed by a described pretraining and instruction fine-tuning pipeline to generate LoRA adapters from contexts in a single pass. Claims of strong expressive power, task performance, and efficiency gains are supported by experimental results on various tasks compared to SFT baselines. No equations, derivations, or self-citations are shown that reduce the central claims to fitted parameters or prior results by construction. The method is trained on data and tested on held-out tasks, rendering outcomes independent rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit list of free parameters, axioms, or invented entities; the design implicitly assumes that the hypernetwork training pipeline can learn a general mapping from context to effective LoRA weights.

pith-pipeline@v0.9.0 · 5729 in / 1156 out tokens · 48617 ms · 2026-05-21T14:40:39.221019+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What the tags mean:

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Training Transformers for KV Cache Compressibility

    cs.LG 2026-05 unverdicted novelty 6.0

    KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...

  2. Training Transformers for KV Cache Compressibility

    cs.LG 2026-05 unverdicted novelty 6.0

    Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.

  3. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

    cs.LG 2026-04 conditional novelty 6.0

    Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...

  4. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

    cs.LG 2026-04 unverdicted novelty 5.0

    Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 2 Pith papers · 3 internal anchors

  1. [2]

    Brown, T

    URL https://proceedings.mlr.press/v205/beck23a.html. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Ches...

  2. [3]

    Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W., Choi, Y., Liang, P., and Zettlemoyer, L

    URL https://openreview.net/forum?id=bc3sUsS6ck. Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W., Choi, Y., Liang, P., and Zettlemoyer, L. Quac: Question answering in context. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, Octob...

  3. [4]

    URL https://doi.org/10.18653/v1/d18-1241

    doi: 10.18653/V1/D18-1241. URL https://doi.org/10.18653/v1/d18-1241. Delétang, G., Ruoss, A., Duquenne, P., Catt, E., Genewein, T., Mattern, C., Grau-Moya, J., Wenliang, L. K., Aitchison, M., Orseau, L., Hutter, M., and Veness, J. Language modeling is compression. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna...

  4. [5]

    Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M

    URL https://openreview.net/forum?id=jznbgiynus. Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Compu...

  5. [6]

    arXiv preprint arXiv:2502.13595

    doi: 10.48550/arXiv.2502.13595. URL https://arxiv.org/abs/2502.13595. Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Procee...

  6. [7]

    URL http://proceedings

    PMLR, 2017. URL http://proceedings.mlr.press/v70/finn17a.html. Ge, T., Hu, J., Wang, L., Wang, X., Chen, S., and Wei, F. In-context autoencoder for context compression in a large language model. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net...

  7. [8]

    Ellie Pavlick and Tom Kwiatkowski

    URL https://openreview.net/forum?id=rkpACe1lx. Ho, X., Nguyen, A. D., Sugawara, S., and Aizawa, A. Constructing A multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Scott, D., Bel, N., and Zong, C. (eds.), Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), Dece...

  8. [9]

    URL https://doi.org/10.1109/TPAMI.2021.3079209

    doi: 10.1109/TPAMI.2021.3079209. URL https://doi.org/10.1109/TPAMI.2021.3079209. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net,

  9. [10]

    Jukic, J., Tutek, M., and Snajder, J

    URL https://openreview.net/forum?id=nZeVKeeFYf9. Jukic, J., Tutek, M., and Snajder, J. Context parametrization with compositional adapters. CoRR, abs/2509.22158,

  10. [12]

    doi: 10.18653/v1/D17-1082

    URL https://openreview.net/forum?id=oO6FsMyDBt. Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. H. RACE: large-scale reading comprehension dataset from examinations. In Palmer, M., Hwa, R., and Riedel, S. (eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 20...

  11. [13]

    wb ≡1 recovers the uniform variant

    URL https://aclanthology.org/2025.coling-main.89/. Lim, D., Maron, H., Law, M. T., Lorraine, J., and Lucas, J. Graph metanetworks for processing diverse neural architectures. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=ijK5hyxs0n...

  12. [15]

    MTEB: Massive Text Embedding Benchmark

    URL https://openreview.net/forum?id=0DcZxeWfOPt. Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. Mteb: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022. doi: 10.48550/ARXIV.2210.07316. URL https://arxiv.org/abs/2210.07316. Munkhdalai, T. and Yu, H. Meta networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th ...

  13. [16]

    URL http://proceedings

    PMLR, 2017. URL http://proceedings.mlr.press/v70/munkhdalai17a.html. Navon, A., Shamsian, A., Achituve, I., Fetaya, E., Chechik, G., and Maron, H. Equivariant architectures for learning in deep weight spaces. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), International Conference on Machine Learning, ICML 202...

  14. [17]

    A ConvNet for the 2020s

    URL https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., and Lowe, R. Training language models to follow instruction...

  15. [18]

    URL https://aclanthology.org/P18-1156/

    doi: 10.18653/V1/P18-1156. URL https://aclanthology.org/P18-1156/. Sarafian, E., Keynan, S., and Kraus, S. Recomposing the reinforcement learning building blocks with hypernetworks. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings ...

  16. [20]

    2023/120

    URL https://doi.org/10.24963/ijcai.2025/683. Tan, C., Zhang, G., and Fu, J. Massive editing for large language models via meta learning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

  17. [21]

    Knowledge is Not Enough: Injecting RL Skills for Continual Adaptation

    URL https://openreview.net/forum?id=L6L1CJQ2PE. Tang, P., Wang, Y., and Zhang, M. Knowledge is not enough: Injecting rl skills for continual adaptation, 2026. URL https://arxiv.org/abs/2601.11258. Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., and Suleman, K. Newsqa: A machine comprehension dataset. In Blunsom, P., Bordes, A....

  18. [22]

    LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud-based Deep Networks

    URL https://openreview.net/forum?id=rkgW0oA9FX. Zhou, A., Yang, K., Burns, K., Cardace, A., Jiang, Y., Sokota, S., Kolter, J. Z., and Finn, C. Permutation equivariant neural functionals. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neur...

  19. [23]

    Fully grounded in the context -- meaning the answer is either: - An exact substring of the context, OR - A minor, fluent paraphrase that does not add, remove, or distort any factual detail (e.g., changing ’was founded in 1976’ to ’founded in 1976’ is OK; saying ’started in the 70s’ is NOT OK).

  20. [24]

    Factually consistent with the context

  21. [25]

    valid": false,

    Paired with a clear, relevant question that can be answered from the context. If ANY answer fails these criteria, respond with: {{"valid": false, "reason": "Brief reason"}} If ALL are valid, respond with: {{"valid": true}} Context: {context} QA Pairs: {qa_list_str} """ Any data point that fails either the format or validation check needs to be regenerated...

  22. [26]

    the input hidden states of size N×H, and

  23. [27]

    the MLP intermediate activations of size N×3H, along with additional O(NH) buffers for attention outputs, residual connections, and layer normalization. Consequently, the peak extra memory across all layers scales as Mem_peak^(no KV) ≈ c·L·N·H (40), where c is a modest architecture-dependent constant (typically in the range 4–6 in practice). If we retain only the d...