super hub Mixed citations

Measuring Massive Multitask Language Understanding

Andy Zou, Collin Burns, Dan Hendrycks, Dawn Song, Mantas Mazeika, Steven Basart · 2020 · cs.CY · arXiv 2009.03300

Mixed citation behavior. Most common role is background (45%).

521 Pith papers citing it

Background 45% of classified citations

open full Pith review browse 521 citing papers more from Andy Zou arXiv PDF

abstract

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 31 dataset 30 method 5 baseline 3

citation-polarity summary

background 31 use dataset 28 use method 5 baseline 3 unclear 2

claims ledger

abstract We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models

authors

Andy Zou Collin Burns Dan Hendrycks Dawn Song Mantas Mazeika Steven Basart

co-cited works

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Sumi: Open Uniform Diffusion Language Model from Scratch

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

Sumi is an openly released 7B parameter uniform diffusion language model pretrained from scratch on 1.5T tokens that matches autoregressive models on several benchmarks.

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

cs.AI · 2026-05-15 · unverdicted · novelty 8.0 · 2 refs

Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

cs.SE · 2026-01-31 · accept · novelty 8.0 · 2 refs

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

Will Scaling Improve Social Simulation with LLMs?

cs.CL · 2026-07-02 · conditional · novelty 7.0

Scaling improves LLM social simulation fidelity in most opinion and behavior tasks but not for human cognitive bias calibration or low-resource domains.

Meta-Benchmarks for Financial-Services LLM Evaluation

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

A meta-benchmarking framework organizes 452 LLM benchmarks into 41 O*NET Generalized Work Activities and 38 BIAN domains, using discrimination-coverage-recency weights to scale K-factors in an Elo tournament for comparable financial-services scores.

ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

cs.DC · 2026-07-01 · unverdicted · novelty 7.0 · 2 refs

ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.

FlipGuard: Defending Large Language Models Against Quantization-Conditioned Backdoor Attacks

cs.CR · 2026-06-27 · unverdicted · novelty 7.0

FlipGuard perturbs LLM weights prior to quantization to neutralize quantization-conditioned backdoor attacks, evaluated via the Defense Effectiveness Ratio on multiple models and quantization schemes.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.

SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

cs.MA · 2026-06-18 · unverdicted · novelty 7.0

SIGMA introduces skill-incidence graphs to compose agents from reusable skills, yielding higher average performance and robustness than topology-only baselines on reasoning and coding benchmarks.

Comparing Linear Probes with Mahalanobis Cosine Similarity

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

For balanced Gaussian class projections, OOD AUROC is a linear function of MCS to the reference probe because both are sigmoid-shaped functions of the probe SNR on test data.

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

Block-size curriculum learning trains an 8B diffusion model to achieve competitive reasoning performance on math and code benchmarks by transitioning from small to large training block sizes.

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

cs.LG · 2026-06-16 · unverdicted · novelty 7.0

Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

citing papers explorer

Showing 50 of 521 citing papers.

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes cs.AI · 2026-05-07 · unverdicted · none · ref 27 · internal anchor
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training cs.CL · 2026-05-06 · unverdicted · none · ref 8 · 3 links · internal anchor
LoPT splits LLM post-training at the midpoint with task loss on the second half and feature reconstruction on the first half to reduce cost and interference.
Benchmarking Local Language Models for Social Robots using Edge Devices cs.RO · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
Benchmarking 25 LLMs on Raspberry Pi hardware shows Granite4 Tiny Hybrid (7B) balances 2.5 tokens/s, 0.90 tokens/J, and 54.6% MMLU while teaching effectiveness does not require high general knowledge scores.
Complexity Horizons of Compressed Models in Analog Circuit Analysis cs.AI · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
GIFT: Guided Fine-Tuning and Transfer for Enhancing Instruction-Tuned Language Models cs.CL · 2026-05-02 · unverdicted · none · ref 29 · internal anchor
GIFT guides adapter fine-tuning on base models with confidence signals from instruction-tuned models before merging, yielding task-specialized models that outperform direct fine-tuning on math and knowledge benchmarks.
RouteProfile: Graph-Based Profiling for Cold-Start LLM Routing cs.NI · 2026-04-30 · unverdicted · none · ref 1 · internal anchor
RouteProfile builds graph-based LLM profiles from public technical report signals to enable training-free cold-start routing and new-LLM integration.
Optimized Deferral for Imbalanced Settings cs.LG · 2026-04-30 · unverdicted · none · ref 39 · internal anchor
MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classification and LLM routing tasks.
Evaluating Large Language Models on Computer Science University Exams in Data Structures cs.CL · 2026-04-25 · unverdicted · none · ref 7 · internal anchor
A new benchmark of Tel Aviv University data structures exams is used to evaluate GPT-4o, Claude 3.5, Mathstral 7B, and LLaMA 3 8B on closed and multiple-choice questions.
Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression cs.AI · 2026-04-21 · unverdicted · none · ref 14 · internal anchor
LightEdit enables scalable lifelong knowledge editing in LLMs via selective knowledge retrieval and probability suppression during decoding, outperforming prior methods on ZSRE, Counterfact, and RIPE while reducing training costs.
HiP-LoRA: Budgeted Spectral Plasticity for Robust Low-Rank Adaptation cs.LG · 2026-04-20 · unverdicted · none · ref 11 · internal anchor
HiP-LoRA decomposes LoRA updates into principal and residual spectral channels with a singular-value-weighted stability budget to reduce forgetting and interference during foundation model adaptation.
Learning Uncertainty from Sequential Internal Dispersion in Large Language Models cs.CL · 2026-04-17 · unverdicted · none · ref 15 · internal anchor
SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.
OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension cs.LG · 2026-04-14 · unverdicted · none · ref 15 · internal anchor
OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering up to 1.78x speedup over W8A8 baselines.
Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance cs.CL · 2026-04-12 · unverdicted · none · ref 17 · internal anchor
A new pre-training task that maps languages bidirectionally in embedding space improves machine translation by up to 11.9 BLEU, cross-lingual QA by 6.72 BERTScore points, and understanding accuracy by over 5% over strong baselines.
EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation cs.LG · 2026-04-10 · unverdicted · none · ref 18 · 2 links · internal anchor
EdgeRazor uses structural mixed-precision quantization, layer-adaptive feature distillation, and entropy-aware KL divergence to achieve 1.88-bit LLMs that outperform prior 2-bit and 3-bit baselines with 4-10x lower training budget.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs cs.LG · 2026-04-08 · unverdicted · none · ref 24 · internal anchor
Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era cs.CL · 2026-04-08 · unverdicted · none · ref 13 · internal anchor
Benchmarking four LLMs on O*NET skills yields SAFI scores showing mathematics and programming as most automatable while active listening and reading comprehension are least, with 78.7% of real AI interactions being augmentation rather than replacement.
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency cs.CL · 2026-04-03 · unverdicted · none · ref 45 · internal anchor
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
Seed1.8 Model Card: Towards Generalized Real-World Agency cs.AI · 2026-03-21 · unverdicted · none · ref 27 · internal anchor
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
Attention Residuals cs.CL · 2026-03-16 · unverdicted · none · ref 13 · internal anchor
Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter model pre-trained on 1.4T tokens.
SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining cs.LG · 2026-02-11 · conditional · none · ref 12 · internal anchor
SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.
BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models cs.LG · 2026-02-04 · unverdicted · none · ref 9 · internal anchor
BPDQ creates variable quantization grids from bit-planes and scalar coefficients, refined iteratively with second-order data to minimize output error, enabling 2-bit serving of Qwen2.5-72B on one RTX 3090 at 83.85% GSM8K accuracy.
On the Limits of Layer Pruning for Generative Reasoning in Large Language Models cs.LG · 2026-02-02 · unverdicted · none · ref 11 · internal anchor
Layer pruning preserves classification performance in LLMs but fundamentally limits recovery of generative reasoning capabilities even after extensive self-supervised finetuning.
NVIDIA Nemotron 3: Efficient and Open Intelligence cs.CL · 2025-12-24 · unverdicted · none · ref 131 · internal anchor
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm cs.CV · 2025-11-06 · unverdicted · none · ref 18 · internal anchor
Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.
Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective cs.AI · 2025-11-01 · conditional · none · ref 9 · internal anchor
The paper analyzes CPU bottlenecks in agentic AI serving, selects representative workloads, and demonstrates that CPU-aware scheduling optimizations COMB and MAS can reduce P50 latency by up to 1.7x and total latency by up to 2.49x on two hardware systems.
TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination cs.LG · 2025-10-26 · unverdicted · none · ref 8 · internal anchor
TALE selectively prunes task-detrimental layers in LLMs at inference time to match or exceed baseline performance with lower computational cost across multiple models and tasks.
Users as Annotators: LLM Preference Learning from Comparison Mode cs.CL · 2025-10-10 · unverdicted · none · ref 15 · internal anchor
Introduces a latent user quality model and EM algorithm to infer and filter noisy user-provided pairwise preferences for improved LLM alignment.
Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts cs.LG · 2025-10-09 · unverdicted · none · ref 6 · internal anchor
Orthogonal growth recycles pre-trained MoE checkpoints via layer copying and noisy expert duplication, delivering 10.6% higher accuracy than training from scratch with equivalent extra compute.
Motivating Next-Gen Accelerators with Flexible (N:M) Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches cs.LG · 2025-09-26 · unverdicted · none · ref 10 · internal anchor
Post-training N:M activation pruning preserves generative performance in LLMs better than equivalent weight pruning, with the 8:16 pattern emerging as a practical hardware-friendly choice.
Proximal Supervised Fine-Tuning cs.LG · 2025-08-25 · unverdicted · none · ref 11 · internal anchor
PSFT modifies supervised fine-tuning by incorporating trust-region ideas from RL to constrain policy changes, yielding better out-of-domain generalization in math and human-value tasks without entropy collapse.
gpt-oss-120b & gpt-oss-20b Model Card cs.CL · 2025-08-08 · unverdicted · none · ref 25 · internal anchor
OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge cs.CE · 2025-07-29 · unverdicted · none · ref 8 · internal anchor
ChemDFM-R is a chemical reasoning LLM trained via a four-stage pipeline on the ChemFG dataset of functional-group annotations for molecules and reactions, reaching performance comparable to or better than commercial models on chemical benchmarks.
Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 24 · internal anchor
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
Establishing a Scale for Kullback-Leibler Divergence in Language Models Across Various Settings cs.CL · 2025-05-21 · unverdicted · none · ref 7 · internal anchor
Log-likelihood vectors establish a consistent KL divergence scale across pretraining, model sizes, seeds, quantization, fine-tuning, and layers, revealing subdiffusive trajectories and early stabilization in Pythia models.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs cs.CL · 2025-03-03 · unverdicted · none · ref 24 · internal anchor
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results cs.DC · 2025-02-13 · unverdicted · none · ref 51 · internal anchor
AIvaluateXR benchmarks 17 LLMs across four XR platforms on performance, speed, memory and battery metrics and proposes a 3D Pareto optimality method to identify optimal on-device model-device pairs.
AdaMCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Multilingual Chain-of-Thought cs.CL · 2025-01-27 · unverdicted · none · ref 3 · internal anchor
AdaMCoT uses dynamic routing of chain-of-thought reasoning in intermediary languages with a reward-based selector to improve cross-lingual factual consistency in LLMs.
Humanity's Last Exam cs.LG · 2025-01-24 · unverdicted · none · ref 22 · internal anchor
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong cs.CL · 2025-01-16 · unverdicted · none · ref 8 · internal anchor
Reasoning before answering MCQs increases LLM confidence more for incorrect answers and degrades calibration on a 57-subject benchmark across seven models.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models cs.CV · 2024-08-09 · unverdicted · none · ref 76 · internal anchor
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression cs.CL · 2024-06-17 · unverdicted · none · ref 19 · internal anchor
Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.
InternLM2 Technical Report cs.CL · 2024-03-26 · unverdicted · none · ref 180 · internal anchor
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cs.CL · 2024-01-11 · unverdicted · none · ref 95 · internal anchor
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
TrustLLM: Trustworthiness in Large Language Models cs.CL · 2024-01-10 · unverdicted · none · ref 217 · internal anchor
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
Mixtral of Experts cs.LG · 2024-01-08 · unverdicted · none · ref 16 · internal anchor
Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices cs.CV · 2023-12-28 · unverdicted · none · ref 48 · internal anchor
MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetson Orin GPU.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration cs.CL · 2023-11-07 · unverdicted · none · ref 22 · internal anchor
mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
Mistral 7B cs.CL · 2023-10-10 · accept · none · ref 12 · internal anchor
Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.
An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning cs.CL · 2023-08-17 · unverdicted · none · ref 5 · internal anchor
Empirical tests show LLMs from 1B to 7B parameters exhibit catastrophic forgetting during continual instruction tuning, with forgetting severity increasing with scale and decoder-only models retaining more than encoder-decoder models.
StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 232 · internal anchor
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Measuring Massive Multitask Language Understanding

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer