AfriVoices-KE is a 3,000-hour multilingual speech dataset for Dholuo, Kikuyu, Kalenjin, Maasai, and Somali with 750 hours scripted and 2,250 hours spontaneous speech from 4,777 speakers.
hub Canonical reference
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Canonical reference. 79% of citing Pith papers cite this work as background.
abstract
Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable is curation and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3/7.5B parameters language models trained on it.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
With enough compute, large models benefit from training on unfiltered data that includes low-quality and distractor examples instead of requiring high-quality filtered data.
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.
MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
SDLLM is a spike-driven LLM that uses gamma-SQP two-step encoding, bidirectional symmetric quantization, and membrane potential clipping to achieve 7x lower energy consumption and 4.2% higher accuracy than prior spike-based language models.
CoSpaDi introduces a training-free sparse dictionary learning framework for post-training LLM compression that optimizes functional reconstruction error via activation-derived orthonormalization and achieves improved accuracy-compression trade-offs over SVD and pruning baselines.
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.
citing papers explorer
-
Learning Interactive Real-World Simulators
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
-
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
-
ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
-
Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition
Adversarial competition between attacker and defender teams generates diverse multi-turn conversational data that improves LLM performance on secure code generation benchmarks by 18-29%.
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
-
Alignment Imprint: Zero-Shot AI-Generated Text Detection via Provable Preference Discrepancy
LAPD, derived from the provable preference discrepancy in aligned LLMs, improves zero-shot AI text detection by 45.82% over baselines with claimed statistical dominance over Fast-DetectGPT.