Gemma 2: Improving Open Language Models at a Practical Size

Cassidy Hardin, Gemma Team: Morgane Riviere, Pier Giuseppe Sessa, Shreya Pathak, Surya Bhupatiraju · 2024 · cs.CL · arXiv 2408.00118

Mixed citation behavior. Most common role is background (64%).

333 Pith papers citing it

abstract

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.
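The abstract contrasts knowledge distillation with next-token prediction. As a rough illustration of that distinction only, the sketch below compares the two per-token losses in plain Python. It is not the Gemma 2 training code: the 4-token vocabulary, the logit values, and the function names are all hypothetical, and the distillation loss is written in its common textbook form (cross-entropy of the student against the teacher's full distribution).

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(student_logits, target_index):
    """Cross-entropy against a one-hot target: standard next-token prediction."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's full distribution."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


# Hypothetical logits over a 4-token vocabulary at a single position.
student_logits = [2.0, 0.5, 0.1, -1.0]
teacher_logits = [1.5, 1.4, 0.0, -2.0]

hard = next_token_loss(student_logits, target_index=0)
soft = distillation_loss(student_logits, teacher_logits)
print(f"next-token loss:   {hard:.4f}")
print(f"distillation loss: {soft:.4f}")
```

The one-hot loss only rewards probability placed on the observed token, whereas the distillation loss uses every entry of the teacher's distribution. When the teacher is sharply peaked on a single token, the two losses coincide.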

citation-role summary

background 23 · method 6 · baseline 2 · dataset 1 · other 1

citation-polarity summary

background 21 · use method 6 · unclear 3 · baseline 2 · use dataset 1

authors

Cassidy Hardin, Gemma Team: Morgane Riviere, Léonard Hussenot, Pier Giuseppe Sessa, Shreya Pathak, Surya Bhupatiraju

representative citing papers

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness

cs.LG · 2026-06-14 · unverdicted · novelty 8.0

Fine-tuning updates frequently render activation monitors for language model safety stale, while quantization does not; the degradation is predictable and repairable via label-free realignment.

Masked Generative Transformer Is What You Need for Image Editing

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EditMGT applies masked generative transformers with attention consolidation and region-hold sampling to deliver state-of-the-art localized image editing at 6x the speed of diffusion methods.

Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

Acceptance Cards is a new four-diagnostic standard for safe fine-tuning defense claims that requires statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer; under this protocol SafeLoRA fails the full-card pass on Gemma-2-2B-it.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0 · 2 refs

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

DRIFTLENS: Measuring Memory-Induced Reasoning Drift in Personalized Language Models

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

DRIFTLENS quantifies memory-induced reasoning drift in personalized LLMs, finding medium-to-large effects across four models and ten user attributes that post-training only partly reduces.

Conditional Co-Ablation: Recovering Self-Repair Backups in Transformer Circuits

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

Conditional Co-Ablation recovers self-repair backup heads in transformers by scoring conditional ablation growth, raising ROC-AUC from 0.33 to 0.91 on the IOI circuit and transferring to induction across models.

FRAME: Learning the Adaptation Domain with a Mixture of Fractional-Fourier Experts

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

FRAME adds a learnable fractional-Fourier order per expert in a MoE-LoRA setup so that low-rank updates are placed in the domain where they are most compact, yielding gains over fixed-domain baselines on LLaMA-3.1-8B and Qwen2.5-7B.

Probing Memorization of Tabular In-Context Learning

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

A new probing framework detects moderate parametric memorization signals in tabular in-context learning models under single-task fine-tuning, strongest on low-cardinality tasks, but signals largely disappear under realistic training.

Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Fixed-clock optimizer memory turns equal-multiset data shuffle order into an O(η) source of fine-tuning noise, larger than the O(η²) effect in memoryless cases, with a fit-free sizing method derived.

NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

NLL-guided layer selection identifies 1/4 of layers for full attention in hybrid models, matching periodic 1/2-FA baseline accuracy on LongMemEval with Qwen3-4B while halving the full-attention compute budget.

Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?

cs.RO · 2026-06-26 · accept · novelty 7.0

VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.

FinRED: An Expert-Guided Benchmark Generation and Evaluation Framework for Financial LLM Red-Teaming

cs.CR · 2026-06-18 · unverdicted · novelty 7.0

FinRED creates an expert-validated benchmark and rubric for financial LLM safety that maps regulatory standards to specific threats and reduces critical false negatives in evaluation from 28 to 12.

Comparing Linear Probes with Mahalanobis Cosine Similarity

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.

Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

The normalized inverse-scale direction of LayerNorm's affine parameters is an exact algebraic kernel of the post-final-norm centred activation covariance for any input distribution in LayerNorm transformers.

MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

MuseVLA adds on-demand sensor selection via tokens and converts readings into grounded sensor images for multimodal fusion, reporting 80.6% average success on real-robot dexterous tasks that need non-visual sensing.

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

cs.LG · 2026-06-14 · unverdicted · novelty 7.0

KV caches function as notebooks of prefilled conclusions, enabling field-level edits that recover decisions (especially with CoT) and position-portable skill composition with near-identical outputs at O(L) cost.

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

cs.AI · 2026-06-12 · unverdicted · novelty 7.0

Introduces applicability condition extraction for therapeutic drug-disease relations, creates first annotated dataset of 1,119 pairs, and proposes enhanced LoRA method outperforming baselines.

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

AfriSUD supplies new SUD-annotated dependency treebanks for nine Sub-Saharan African languages and demonstrates that existing models exhibit clear limitations on their syntax.

Doc-to-Atom: Learning to Compile and Compose Memory Atoms

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

Doc-to-Atom decomposes documents into composable micro-LoRA adapters selected by a query router for efficient long-context QA.

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

BenSyc is the first benchmark for conversational sycophancy in Bengali, with top LLMs achieving only 61.8 Macro-F1 on binary detection and 61.7 on five-class classification while often generating overly validating responses.

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

cs.CL · 2026-06-06 · unverdicted · novelty 7.0

SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

cs.CL · 2026-06-05 · unverdicted · novelty 7.0

UrduMMLU is a new native-source MCQ benchmark for Urdu that reveals top LLMs reach only ~90% accuracy with large gaps on region-specific humanities content.

citing papers explorer

Showing 50 of 119 citing papers after filters.

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation cs.CL · 2026-05-08 · unverdicted · none · ref 47 · 2 links · internal anchor
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
Don't Lose Focus: Activation Steering via Key-Orthogonal Projections cs.CL · 2026-05-07 · unverdicted · none · ref 31 · internal anchor
SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.
When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models cs.CL · 2026-05-04 · conditional · none · ref 4 · internal anchor
AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.
Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model cs.CL · 2026-04-23 · unverdicted · none · ref 18 · internal anchor
IRM derives implicit reward signals from off-the-shelf LLMs to detect generated text zero-shot and reports better results than prior zero-shot and supervised detectors on the DetectRL benchmark.
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores cs.CL · 2026-04-21 · unverdicted · none · ref 10 · internal anchor
Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific fairness behaviors across millions of dialogues.
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization cs.CL · 2026-04-21 · unverdicted · none · ref 31 · internal anchor
LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.
The Provenance Gap in Clinical AI: Evidence-Traceable Temporal Knowledge Graphs for Rare Disease Reasoning cs.CL · 2026-04-18 · unverdicted · none · ref 37 · internal anchor
HEG-TKG grounds LLM clinical reasoning in hierarchical evidence-based temporal knowledge graphs from 4,512 PubMed records, delivering 100% citation verifiability and error detectability where standard RAG and unprompted LLMs produce none.
Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs cs.CL · 2026-04-15 · unverdicted · none · ref 20 · internal anchor
Causal interventions reveal that coordination islands block filler-gap mechanisms in Transformers in a gradient way matching humans, yielding the hypothesis that 'and' encodes relational dependencies differently in extractable vs. conjunctive uses.
Latent Planning Emerges with Scale cs.CL · 2026-04-14 · unverdicted · none · ref 3 · internal anchor
Latent planning ability in LLMs emerges and strengthens with scale, shown through internal features that represent future words and influence token choices on planning and rhyming tasks.
MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis cs.CL · 2026-04-13 · unverdicted · none · ref 5 · internal anchor
Adversarial evolution of constraint graphs generates diverse mathematical reasoning datasets that enable 1K-sample fine-tuning to outperform standard datasets like LIMO and s1K on eight benchmarks with better out-of-distribution generalization.
Confidence Should Be Calibrated More Than One Turn Deep cs.CL · 2026-04-07 · unverdicted · none · ref 1 · internal anchor
Multi-turn calibration reframes LLM confidence as dynamic across conversation turns, where user feedback degrades it, and new methods MTCal and ConfChat restore calibration while improving factuality.
Multilingual Language Models Encode Script Over Linguistic Structure cs.CL · 2026-04-06 · unverdicted · none · ref 9 · internal anchor
Multilingual LMs encode script over linguistic structure, with orthography shaping units more than word order or typology, and abstraction emerging gradually in deeper layers.
Why Attend to Everything? Focus is the Key cs.CL · 2026-03-12 · conditional · none · ref 12 · internal anchor
Focus learns a few centroids to gate long-range token attention, producing sparse attention that matches or beats full attention quality with up to 8.6x speedup at million-token lengths.
GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant cs.CL · 2026-03-01 · unverdicted · none · ref 62 · internal anchor
GroupGPT decouples intervention timing from response generation via edge-cloud collaboration for multi-user chats, scoring 4.72/5 on the new MUIR benchmark of 2500 segments while cutting token use by up to 3x and adding privacy sanitization.
Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation cs.CL · 2026-02-24 · unverdicted · none · ref 19 · internal anchor
A modified divergence decouples top-K teacher probabilities from the distribution tail during distillation, yielding competitive performance on decoder models with standard compute.
Flow Map Language Models: One-step Language Modeling via Continuous Denoising cs.CL · 2026-02-18 · conditional · none · ref 90 · 2 links · internal anchor
Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.
Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety cs.CL · 2025-12-08 · unverdicted · none · ref 11 · internal anchor
Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
Difficulty-Controllable Cloze Question Distractor Generation cs.CL · 2025-11-03 · unverdicted · none · ref 23 · internal anchor
A new framework creates difficulty-controllable distractors for cloze questions via two-way generation, ensemble QA labeling, and multitask training, outperforming GPT-4o on human-aligned difficulty.
On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization cs.CL · 2025-09-28 · unverdicted · none · ref 30 · internal anchor
Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.
Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation cs.CL · 2025-07-29 · unverdicted · none · ref 19 · internal anchor
CARRIAGE is a RAG framework that improves output diversity in cross-cultural recipe adaptation by enhancing retrieval and context handling, reaching Pareto efficiency on diversity and quality versus closed-book LLMs.
SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding cs.CL · 2025-07-27 · unverdicted · none · ref 26 · internal anchor
SessionIntentBench is a large-scale multimodal benchmark for inter-session intention-shift modeling in e-commerce, with 1.95M intention entries and human-annotated gold labels showing current L(V)LMs struggle but improve when intention is injected.
LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations cs.CL · 2025-05-29 · unverdicted · none · ref 23 · internal anchor
LoVeC uses RL to train LLMs to output verbalized numerical confidence scores for statements in long-form text, achieving better calibration than self-consistency baselines on QA datasets while being 20x faster.
Extracting memorized pieces of (copyrighted) books from open-weight language models cs.CL · 2025-05-18 · conditional · none · ref 261 · internal anchor
A new extraction technique applied to 200 books and 14 LLMs finds that memorization of full books is rare except in specific high-capacity models where entire texts can be recovered verbatim.
Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models cs.CL · 2025-02-20 · unverdicted · none · ref 46 · internal anchor
Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs cs.CL · 2024-12-25 · unverdicted · none · ref 19 · internal anchor
HuatuoGPT-o1 achieves superior medical complex reasoning by using a verifier to curate reasoning trajectories for fine-tuning and then applying RL with verifier-based rewards.
LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation cs.CL · 2024-10-17 · unverdicted · none · ref 14 · internal anchor
LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.
Improve Mathematical Reasoning in Language Models by Automated Process Supervision cs.CL · 2024-06-05 · conditional · none · ref 6 · internal anchor
OmegaPRM automates collection of 1.5 million process supervision labels via binary-search MCTS, raising Gemini Pro math accuracy from 51% to 69.4% on MATH500 and Gemma2 27B from 42.3% to 58.2%.
Argumentative Large Language Models for Explainable and Contestable Claim Verification cs.CL · 2024-05-03 · unverdicted · none · ref 47 · internal anchor
ArgLLMs build argumentation frameworks from LLMs to support explainable and contestable formal reasoning for claim verification.
When the Database Fails: Prompting LLM Dialogue Agents for Safe Recovery in Task-Oriented Dialogue cs.CL · 2026-06-30 · unverdicted · none · ref 6 · internal anchor
Guided-Retry prompting cuts hallucination from 30.5% to 15.3% on MultiWOZ and 20.9% to 12.2% on SGD in LLM dialogue agents facing database failures.
Do We Still Need Fine Tuning? Turkish Sentiment Analysis in the Era of Large Language Model cs.CL · 2026-06-28 · unverdicted · none · ref 16 · internal anchor
Fine-tuned BERTurk models outperform prompted LLMs in three-class Turkish sentiment classification, with neutral class collapse as the key failure mode for LLMs.
Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models cs.CL · 2026-06-12 · unverdicted · none · ref 19 · internal anchor
A 355M-parameter byte-level LM on 80B multilingual tokens exhibits UTF-8 validity converging after 4.2B tokens versus 2.1B for perplexity, with higher validity on rare characters than common ones.
VIA-SD: Verification via Intra-Model Routing for Speculative Decoding cs.CL · 2026-06-10 · unverdicted · none · ref 62 · internal anchor
VIA-SD adds a routed slim-verifier tier between direct acceptance and full-model verification in speculative decoding, cutting rejection rates 0.10-0.22 and yielding 10-20% speedups over prior SD methods.
SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization cs.CL · 2026-06-07 · unverdicted · none · ref 23 · internal anchor
SAEExplainer applies activation-guided preference optimization in two iterative rounds to improve explanations of SAE features and reduce hallucinations.
GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 24 · internal anchor
GRKV applies global ridge regression to KV cache merging for span-based retention in long-context LLMs, claiming to be the only method that improves benchmark performance with minimal overhead.
On the Limits of Model Merging for Multilinguality in Pre-Training cs.CL · 2026-05-25 · unverdicted · none · ref 35 · internal anchor
Merging any combination of monolingual pre-trained models leads to performance collapse due to interference, indicating that merging flexibility from fine-tuning does not extend to pre-training.
Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations? cs.CL · 2026-05-17 · unverdicted · none · ref 41 · 2 links · internal anchor
LLMs assigned high or low status personas in multi-turn dialogues exhibit socio-cognitive effects including language coordination, pronoun patterns, persuasion success, and compliance with unsafe requests.
Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction cs.CL · 2026-05-13 · unverdicted · none · ref 52 · internal anchor
Edit-level majority voting on multiple LLM-generated candidates reduces over-correction in grammatical error correction and outperforms greedy and MBR decoding on nine multilingual benchmarks while remaining stable to prompt variations.
How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation cs.CL · 2026-05-11 · unverdicted · none · ref 34 · internal anchor
Differential privacy reduces measured bias in sentence-scoring tasks but shows no consistent reduction in output-level bias or unfairness across other evaluation paradigms.
Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness cs.CL · 2026-05-10 · unverdicted · none · ref 56 · internal anchor
Phi-4 and Gemma-2-9B maintain high intra-model consistency (ICC > 0.89) and ASR robustness for HADS scoring while Llama-3.1-8B degrades sharply, with all models showing score-evidence dissociation.
RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI cs.CL · 2026-05-01 · unverdicted · none · ref 20 · internal anchor
LoRA fine-tuning of 3-4B SLMs on 162K multi-task radiology data yields strong performance deployable on consumer CPUs at 4-8 tokens/second.
Select to Think: Unlocking SLM Potential with Local Sufficiency cs.CL · 2026-04-29 · unverdicted · none · ref 21 · 2 links · internal anchor
Select to Think reframes LLM help as ranking among SLM top-K candidates and distills the ranking ability back into the SLM for improved single-pass reasoning.
Exploring Concreteness Through a Figurative Lens cs.CL · 2026-04-20 · unverdicted · none · ref 79 · internal anchor
LLMs compress concreteness into a consistent 1D direction in mid-to-late layers that separates literal from figurative noun uses and supports efficient classification plus steering.
StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation cs.CL · 2026-04-16 · unverdicted · none · ref 2 · internal anchor
Reformulating code problems as guided narratives improves zero-shot pass@10 by 18.7% on average across 11 models and three benchmarks.
Testing the Assumptions of Active Learning for Translation Tasks with Few Samples cs.CL · 2026-04-10 · unverdicted · none · ref 28 · internal anchor
Informativeness and diversity of samples selected by active learning show no correlation with test performance on translation tasks using few samples; ordering and pre-training effects dominate instead.
Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning cs.CL · 2026-04-10 · unverdicted · none · ref 35 · internal anchor
Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.
Hallucination Basins: A Dynamic Framework for Understanding and Controlling LLM Hallucinations cs.CL · 2026-04-06 · unverdicted · none · ref 22 · internal anchor
LLM hallucinations arise from task-dependent basins in latent space, with separability varying by task and geometry-aware steering reducing their probability.
TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation cs.CL · 2026-03-09 · unverdicted · none · ref 20 · internal anchor
A new 30B open LLM trained with curriculum learning and upsampling outperforms other multilingual models on European languages, especially low-resource ones, with up to 10x fewer linguistic errors in human evaluations.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs cs.CL · 2025-03-03 · unverdicted · none · ref 44 · internal anchor
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
LLM-based User Profile Management for Recommender System cs.CL · 2025-02-20 · unverdicted · none · ref 4 · internal anchor
PURE is a three-component LLM system that extracts and maintains user profiles from reviews to outperform prior LLM recommenders on sequential Amazon tasks.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model cs.CL · 2025-02-04 · unverdicted · none · ref 228 · internal anchor
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
