MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
Title resolution pending
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1polarities
unclear 1representative citing papers
A compact 25M chess move predictor exceeds larger fine-tuned models on puzzles, indicating memorization in earlier claims, while LLM-Modulo raises general LLM move accuracy from 1.2% to 21.2% and validity to 95.3%.
Synthetic pre-pre-training on structured data improves LLM robustness to noisy pre-training, matching baseline loss with up to 49% fewer natural tokens for a 1B model.
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.
Representations learned by large AI models are converging toward a shared statistical model of reality.
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.
citing papers explorer
-
MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
-
Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models
A compact 25M chess move predictor exceeds larger fine-tuned models on puzzles, indicating memorization in earlier claims, while LLM-Modulo raises general LLM move accuracy from 1.2% to 21.2% and validity to 95.3%.
-
Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data
Synthetic pre-pre-training on structured data improves LLM robustness to noisy pre-training, matching baseline loss with up to 49% fewer natural tokens for a 1B model.
-
Training Language Models to Self-Correct via Reinforcement Learning
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
-
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.
-
The Platonic Representation Hypothesis
Representations learned by large AI models are converging toward a shared statistical model of reality.
-
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
-
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.