Here, LoRA rank 64 shows greater forgetting than rank 256, mainly because rank 256 diverges at lr=5e-4, and a smaller learning rate leads to less forgetting.

with our Llama results, as presented in Figure 8, where we can observe in the left figure that LoRA learns less (lower math score), forgets less (higher general score) if we all choose the learning rates leading to the best math performa · 2024

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

citing papers explorer

Showing 1 of 1 citing paper.

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

cs.LG · 2026-05-07 · unverdicted · none · ref 49

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
