MathQA: Towards interpretable math word problem solving with operation-based formalisms
15 Pith papers cite this work.
citing papers explorer
- SimDiff: Depth Pruning via Similarity and Difference
  SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.
- Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents
  PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  BIG-bench is a 204-task benchmark that measures scaling trends, calibration, and absolute limitations of language models across knowledge, reasoning, and social domains.
- Generalization in LLM Problem Solving: The Case of the Shortest Path
  LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail at length scaling because of recursive instability, with data coverage setting hard limits.
- Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models
  An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.
- PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark
  PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric, which penalizes methods according to their trainable parameters, inference speed, and training memory.
- DataComp-LM: In search of the next generation of training sets for language models
  DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
- The Falcon Series of Open Language Models
  Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
- MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
  MAmmoTH models trained via hybrid CoT-PoT instruction tuning on MathInstruct outperform prior open-source LLMs by 16-32% average accuracy on nine math datasets, reaching 33% and 44% on MATH for 7B and 34B scales.
- Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
  UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
- IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression
  IO-SVD performs SVD-based LLM compression by constructing a KL-aware double-sided whitening space and using first-order loss estimates for heterogeneous rank allocation.
- Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
  Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.
- Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees
  The paper proposes a router-norm and variance-based bit allocation strategy for quantizing MoE models and claims higher accuracy and lower cost than prior mixed-precision methods.
- PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
  PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.
- PaLM 2 Technical Report
  PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.