Gemma 2: Improving Open Language Models at a Practical Size
Mixed citation behavior. Most common role is background (64%).
abstract
In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.
representative citing papers
Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this protocol SafeLoRA fails the full-card pass on Gemma-2-2B-it.
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
NLL-guided layer selection identifies 1/4 of layers for full attention in hybrid models, matching periodic 1/2-FA baseline accuracy on LongMemEval with Qwen3-4B while halving the full-attention compute budget.
VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.
Introduces applicability condition extraction for therapeutic drug-disease relations, creates the first annotated dataset of 1,119 pairs, and proposes an enhanced LoRA method that outperforms baselines.
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
A deferral mechanism using forward-looking simulations reduces false positives in derailment forecasting by selectively waiting when recovery paths appear plausible.
MentalMap benchmark identifies a universal L3 reasoning cliff in LLMs' text-based spatial reasoning that persists across languages, scales, and prompting, and is replicated in human evaluations.
ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.
Representational convergence across 16 LLMs on 800 reasoning problems is stronger for failed tasks and pre-decision stages but shows minimal causal influence on predictions, pointing to shared processing constraints over shared reasoning.
A test-time zeroth-order optimization of prompt embeddings using a bounded self-supervised proxy from demonstration log-probabilities improves ICL accuracy and correlates with gains across tasks.
GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.
Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.
In 1-3B instruction-tuned LMs on GSM8K, arithmetic CoT readout is dominated by positional copying of the trailing number before the answer delimiter, accounting for 54-92 percentage points of accuracy.
Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.
A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
Symmetries in next-token prediction targets induce corresponding geometric symmetries such as circulant matrices and equiangular tight frames in the optimal weights and embeddings of a layer-peeled LLM surrogate model.
Large language models achieve macro F1 scores above 0.85 on binary nominal-versus-danger classification from CTAF radio transcripts and METAR weather data using a new synthetic dataset with a 12-category hazard taxonomy.
Develops a causal framework unifying generative AI fairness with standard ML, with new decompositions, identification conditions, and estimators demonstrated on LLM race and gender bias.
Behavioral directions from one LLM family transfer to others via projection into a shared anchor coordinate space, yielding 0.83 ten-way detection accuracy and steering effects up to 0.46% on held-out models.
PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.
citing papers explorer
- Masked Generative Transformer Is What You Need for Image Editing
EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.