pith. sign in

hub

Are we done with mmlu? CoRR, abs/2406.04127

17 Pith papers cite this work. Polarity classification is still indexing.

17 Pith papers citing it

hub tools

citation-role summary

dataset 2

citation-polarity summary

roles

dataset 2

polarities

use dataset 2

representative citing papers

Dynamic Model Merging Made Slim

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.

Qwen3-Omni Technical Report

cs.CL · 2025-09-22 · unverdicted · novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.

Qwen2.5-1M Technical Report

cs.CL · 2025-01-26 · accept · novelty 6.0

Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.

Qwen3.5-Omni Technical Report

cs.CL · 2026-04-17 · unverdicted · novelty 5.0

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.

MiMo-V2-Flash Technical Report

cs.CL · 2026-01-06 · unverdicted · novelty 5.0

MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.

Kimi K2: Open Agentic Intelligence

cs.LG · 2025-07-28 · unverdicted · novelty 5.0

Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

Qwen3 Technical Report

cs.CL · 2025-05-14 · unverdicted · novelty 5.0

Pith review generated a malformed one-line summary.

Qwen2.5-Omni Technical Report

cs.CL · 2025-03-26 · conditional · novelty 5.0

Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.

Qwen2.5-VL Technical Report

cs.CV · 2025-02-19 · unverdicted · novelty 5.0

Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.

Measuring AI Reasoning: A Guide for Researchers

cs.AI · 2026-05-04 · unverdicted · novelty 4.0

Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

Ministral 3

cs.CL · 2026-01-13 · unverdicted · novelty 4.0

Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

citing papers explorer

Showing 17 of 17 citing papers.

  • Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing cs.CL · 2024-06-12 · unverdicted · none · ref 110

    Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.

  • Dynamic Model Merging Made Slim cs.LG · 2026-05-17 · unverdicted · none · ref 30

    DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.

  • Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 26

    Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

  • Qwen3-Omni Technical Report cs.CL · 2025-09-22 · unverdicted · none · ref 11

    Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.

  • Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource cs.CL · 2025-06-13 · conditional · none · ref 11

    MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.

  • Qwen2.5-1M Technical Report cs.CL · 2025-01-26 · accept · none · ref 10

    Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.

  • SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 31 · 2 links

    Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.

  • Qwen3.5-Omni Technical Report cs.CL · 2026-04-17 · unverdicted · none · ref 13

    Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.

  • SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining cs.LG · 2026-02-11 · conditional · none · ref 10

    SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.

  • MiMo-V2-Flash Technical Report cs.CL · 2026-01-06 · unverdicted · none · ref 17

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.

  • Kimi K2: Open Agentic Intelligence cs.LG · 2025-07-28 · unverdicted · none · ref 18

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  • Qwen3 Technical Report cs.CL · 2025-05-14 · unverdicted · none · ref 14

    Pith review generated a malformed one-line summary.

  • Qwen2.5-Omni Technical Report cs.CL · 2025-03-26 · conditional · none · ref 16

    Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.

  • Qwen2.5-VL Technical Report cs.CV · 2025-02-19 · unverdicted · none · ref 11

    Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.

  • Measuring AI Reasoning: A Guide for Researchers cs.AI · 2026-05-04 · unverdicted · none · ref 147

    Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

  • Ministral 3 cs.CL · 2026-01-13 · unverdicted · none · ref 20

    Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

  • Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead cs.LG · 2025-07-30 · unverdicted · none · ref 27

    Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.