Improving language models by retrieving from trillions of tokens
21 Pith papers cite this work. Polarity classification is still indexing.
abstract
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\times$ fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
citation-role summary
roles: background 4
citation-polarity summary
polarities: background 4
citing papers explorer
-
A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation
A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
SD-Search derives step-level supervision for search queries in reasoning agents via on-policy hindsight self-distillation using the policy as both student and teacher.
-
When AI reviews science: Can we trust the referee?
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
-
RAG-GNN: Integrating Retrieved Knowledge with Graph Neural Networks for Precision Medicine
RAG-GNN augments GNNs with retrieved literature knowledge via gated fusion to improve functional clustering of 379 proteins in cancer signaling networks, raising silhouette score by 0.093.
-
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
RAPTOR introduces a tree-organized retrieval method using recursive abstractive summaries, achieving a 20% absolute accuracy improvement on the QuALITY benchmark when paired with GPT-4.
-
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
-
REPLUG: Retrieval-Augmented Black-Box Language Models
REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.
-
Atlas: Few-shot Learning with Retrieval Augmented Language Models
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
Emergent Abilities of Large Language Models
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
-
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergence from initialization).
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
LaMDA: Language Models for Dialog Applications
LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
-
Small Language Models are the Future of Agentic AI
Small language models are sufficiently capable, more suitable, and far more economical than large models for the repetitive tasks that dominate agentic AI systems.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering
Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.
-
KnowPilot: Your Knowledge-Driven Copilot for Domain Tasks
KnowPilot integrates knowledge retrieval and memory systems into generative agents to achieve better results on domain-specific tasks such as text generation.
-
Less LLM, More Documents: Searching for Improved RAG
Corpus scaling in RAG frequently matches the accuracy gains from larger LLMs on open-domain QA tasks, with mid-sized models benefiting most due to better passage coverage.
- Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning