A survey of 172 open educational datasets from 204 papers across LAK, EDM, and AIED conferences reveals trends, 143 previously uncatalogued datasets, field gaps, and an 8-item PRACTICE checklist for better data publication.
hub
RACE : Large-scale R e A ding comprehension dataset from examinations
27 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
BBQ is a new benchmark dataset showing that QA models often default to social stereotypes, achieving up to 3.4 points higher accuracy when the correct answer aligns with bias.
Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT fine-tuned on MultiNLI reaches 80.4% accuracy against 90% human performance.
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
SHINE trains a scalable in-context hypernetwork to generate high-quality LoRA adapters from contexts in one pass, enabling efficient LLM adaptation that saves time and compute compared to standard fine-tuning.
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
Representation engineering uses population-level representations in deep neural networks to monitor and manipulate cognitive phenomena like honesty and harmlessness, providing simple effective baselines for LLM safety.
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.
SMoE substitutes low-importance experts with cached similar ones in MoE inference on edge devices to achieve 48% lower decoding latency and over 60% cache hit rate with nearly lossless accuracy.
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
citing papers explorer
-
Open Datasets in Learning Analytics: Trends, Challenges, and Best PRACTICE
A survey of 172 open educational datasets from 204 papers across LAK, EDM, and AIED conferences reveals trends, 143 previously uncatalogued datasets, field gaps, and an 8-item PRACTICE checklist for better data publication.
-
Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms
Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.
-
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
-
SimDiff: Depth Pruning via Similarity and Difference
SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
BBQ: A Hand-Built Bias Benchmark for Question Answering
BBQ is a new benchmark dataset showing that QA models often default to social stereotypes, achieving up to 3.4 points higher accuracy when the correct answer aligns with bias.
-
The Power of Scale for Parameter-Efficient Prompt Tuning
Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
-
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
-
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT fine-tuned on MultiNLI reaches 80.4% accuracy against 90% human performance.
-
Attention to Mamba: A Recipe for Cross-Architecture Distillation
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
-
SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass
SHINE trains a scalable in-context hypernetwork to generate high-quality LoRA adapters from contexts in one pass, enabling efficient LLM adaptation that saves time and compute compared to standard fine-tuning.
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
-
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
-
Representation Engineering: A Top-Down Approach to AI Transparency
Representation engineering uses population-level representations in deep neural networks to monitor and manipulate cognitive phenomena like honesty and harmlessness, providing simple effective baselines for LLM safety.
-
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.
-
Atlas: Few-shot Learning with Retrieval Augmented Language Models
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.
-
SMoE: An Algorithm-System Co-Design for Pushing MoE to the Edge via Expert Substitution
SMoE substitutes low-importance experts with cached similar ones in MoE inference on edge devices to achieve 48% lower decoding latency and over 60% cache hit rate with nearly lossless accuracy.
-
InternLM2 Technical Report
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
-
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
-
PaLM 2 Technical Report
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.