mega hub Canonical reference

LLaMA: Open and Efficient Foundation Language Models

· 2023 · cs.CL · arXiv 2302.13971

Canonical reference. 82% of citing Pith papers cite this work as background.

1181 Pith papers citing it

Background 82% of classified citations

open full Pith review browse 1181 citing papers arXiv PDF

abstract

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 206 method 19 baseline 8 other 6 dataset 1 extension 1

citation-polarity summary

background 198 use method 20 unclear 13 baseline 7 extend 1 support 1 use dataset 1

claims ledger

abstract We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

mega hub controls

export citing contexts JSON export graph JSON export full bundle JSON open full Pith review annotated reader queued

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

co-cited works

representative citing papers

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

eess.AS · 2026-05-31 · unverdicted · novelty 8.0

SVHalluc benchmark shows open-source audio-visual LLMs achieve near-random accuracy on semantic and temporal speech-vision alignment tasks while Gemini 2.5 Pro performs substantially better.

Privacy Auditing with Zero (0) Training Run

cs.CR · 2026-05-14 · unverdicted · novelty 8.0

Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.

Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.

Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

cs.LG · 2026-05-12 · accept · novelty 8.0

Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

Backdoor Attacks on Decentralised Post-Training

cs.CR · 2026-03-31 · conditional · novelty 8.0 · 2 refs

An adversary controlling an intermediate pipeline stage in decentralized LLM post-training can inject a backdoor that reduces alignment from 80% to 6%, with the backdoor persisting in 60% of cases even after subsequent safety training.

Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

cs.SE · 2025-06-16 · conditional · novelty 8.0

First study of 1,899 MCP servers finds eight distinct vulnerabilities (only three traditional), 7.2% with general issues, 5.5% with tool poisoning, and 66% with code smells, urging MCP-specific security practices.

BEAVER: An Enterprise Benchmark for Text-to-SQL

cs.CL · 2024-09-03 · unverdicted · novelty 8.0

BEAVER is the first text-to-SQL benchmark from private enterprise data warehouses, revealing SOTA agentic frameworks achieve only 10.8% accuracy on complex real-world queries.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

cs.IR · 2024-03-06 · unverdicted · novelty 8.0

BLaIR is a new benchmark and 570M-review dataset showing that LLM performance rankings on recommendation tasks have little correlation with rankings on general embedding benchmarks like MTEB.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

cs.CL · 2023-05-17 · accept · novelty 8.0

Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

cs.CL · 2023-04-14 · conditional · novelty 8.0

API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

Language-Assisted Super-Resolution from Real-World Low-Resolution Patches

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

LA-SR redefines unpaired super-resolution in language space by projecting images into a semantically rich representation and applying vision-language model guided losses to handle real-world degradations extracted from depth variations.

Probing Memorization of Tabular In-Context Learning

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

A new probing framework detects moderate parametric memorization signals in tabular in-context learning models under single-task fine-tuning, strongest on low-cardinality tasks, but signals largely disappear under realistic training.

Search for Truth from Reasoning: A Dynamic Representation Editing Framework for Steering LLM Trajectories

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

DynaSteer dynamically steers LLM reasoning trajectories toward truth via pattern clustering, Fisher-LDA projection, and entropy-triggered representation edits, improving performance on MATH and generalizing to coding.

A Sensitivity-Aware Test Collection for Search Among Personal Information

cs.IR · 2026-06-25 · accept · novelty 7.0

A new sensitivity-labeled test collection is released from Enron emails with crowdsourced queries, relevance judgments, and LLM extensions for evaluating sensitivity-aware search.

citing papers explorer

Showing 39 of 39 citing papers after filters.

4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving cs.RO · 2026-05-18 · unverdicted · none · ref 46 · internal anchor
4DLidarOpen is a new open dataset providing synchronized 4D FMCW Lidar velocity measurements, multi-Lidar and camera data, and 3D bounding-box annotations with track IDs to support benchmarks on 3D detection, BEV segmentation, flow prediction, and motion forecasting.
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning cs.RO · 2026-05-13 · unverdicted · none · ref 17 · internal anchor
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 47 · internal anchor
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis cs.RO · 2026-04-23 · unverdicted · none · ref 9 · internal anchor
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness cs.RO · 2026-03-18 · unverdicted · none · ref 31 · internal anchor
HeiSD delivers up to 2.45x faster inference for embodied VLA models by hybridizing speculative decoding with kinematic boundary detection and error-mitigation tricks while preserving task success rates.
KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models cs.RO · 2026-03-02 · unverdicted · none · ref 25 · internal anchor
KERV integrates kinematic Kalman Filter predictions with speculative decoding in VLA models to achieve 27-37% faster inference while maintaining nearly the same task success rates.
Large Video Planner Enables Generalizable Robot Control cs.RO · 2025-12-17 · conditional · none · ref 79 · internal anchor
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
CADET: A Modular Platform for Evaluating Distributed Cooperative Autonomy in Connected Autonomous Vehicles cs.RO · 2026-06-02 · unverdicted · none · ref 34 · internal anchor
CADET is a new modular platform for reproducible evaluation of distributed cooperative autonomy in connected autonomous vehicles using trace-driven emulation.
Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation cs.RO · 2026-06-01 · unverdicted · none · ref 49 · internal anchor
AHEAD augments frozen VLAs with a 4.9M-parameter latent world model that forecasts future visual features using optical-flow motion cues, achieving 79-97% success on dynamic simulation tasks and high real-robot success rates where baselines score near zero.
Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning cs.RO · 2026-06-01 · unverdicted · none · ref 6 · internal anchor
Dexterity-BEV creates 3D vertex-based inputs and BEV-aligned outputs to reduce spatial-temporal misalignments in end-to-end robot policies trained on diverse datasets and embodiments.
From a Single Demonstration to a General Policy for Contact-Rich Manipulation cs.RO · 2026-05-17 · unverdicted · none · ref 133 · internal anchor
A one-shot LfD framework abstracts a single demonstration into environmental-constraint primitives, then uses self-exploration, human corrections, and compliant recovery to produce a policy that generalizes across poses and geometries, achieving over 90% success on seven real-world multi-stage tasks
PCASim: Promptable Closed-loop Adversarial Simulation for Urban Traffic Environment cs.RO · 2026-05-15 · unverdicted · none · ref 11 · internal anchor
PCASim uses LLMs to integrate knowledge, data, and adversarial methods for generating promptable safety-critical urban traffic scenarios, with RL training for vehicle behaviors, reporting 12% better DSL accuracy, 8% higher scenario success rate, and 30% improved obstacle avoidance.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving cs.RO · 2026-05-12 · unverdicted · none · ref 58 · 2 links · internal anchor
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation cs.RO · 2026-05-07 · unverdicted · none · ref 17 · internal anchor
PLMD applies a denoising diffusion model to predict labels for unknown map regions, allowing goal localization in unexplored environments by substituting completed labels into existing navigation pipelines.
Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery cs.RO · 2026-05-02 · unverdicted · none · ref 16 · 2 links · internal anchor
Sentinel-VLA adds metacognitive status monitoring to VLA models for on-demand reasoning and error recovery, reporting over 30% higher real-world task success than prior SOTA.
Make Tracking Easy: Neural Motion Retargeting for Humanoid Whole-body Control cs.RO · 2026-03-23 · unverdicted · none · ref 34 · internal anchor
NMR uses VAE-based clustered expert physics refinement and a CNN-Transformer to learn dynamics-aware retargeting, eliminating joint jumps and self-collisions on Unitree G1 while accelerating downstream control policies.
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors cs.RO · 2026-03-16 · conditional · none · ref 6 · internal anchor
ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation cs.RO · 2026-03-05 · conditional · none · ref 34 · internal anchor
SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.
ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination cs.RO · 2025-12-19 · conditional · none · ref 16 · internal anchor
ImagineNav++ achieves SOTA mapless visual navigation by prompting VLMs to select imagined future views generated from a human-preference-distilled module and maintained via selective foveation memory.
Neuro-Symbolic Control with Large Language Models for Language-Guided Spatial Tasks cs.RO · 2025-12-19 · unverdicted · none · ref 46 · internal anchor
A neuro-symbolic system pairing LLMs for symbolic reasoning with neural delta controllers for execution delivers over 70% step reduction and up to 8.83x speedup in language-guided planar object manipulation while remaining robust to LLM quality.
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training cs.RO · 2025-09-29 · unverdicted · none · ref 18 · internal anchor
World-Env replaces physical robot interactions with a world model-based virtual environment and VLM-guided rewards to enable efficient RL post-training for VLA models, showing gains with only five demonstrations per task.
Policy Contrastive Decoding for Robotic Foundation Models cs.RO · 2025-05-19 · conditional · none · ref 18 · internal anchor
PCD redirects robotic policies toward object-relevant visual features via contrastive decoding on masked inputs, improving generalization without retraining or weight access.
DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies cs.RO · 2025-05-12 · unverdicted · none · ref 51 · internal anchor
DexWild co-trains dexterous robot policies on in-the-wild human hand interactions recorded with a low-cost system and limited robot data, achieving 68.5% success in unseen environments and 5.8x better cross-embodiment generalization.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data cs.RO · 2025-05-06 · unverdicted · none · ref 1 · internal anchor
GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation cs.RO · 2024-11-29 · unverdicted · none · ref 63 · internal anchor
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new robots and objects.
OpenVLA: An Open-Source Vision-Language-Action Model cs.RO · 2024-06-13 · unverdicted · none · ref 23 · internal anchor
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
A Survey on Vision-Language-Action Models for Embodied AI cs.RO · 2024-05-23 · unverdicted · none · ref 291 · internal anchor
This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation cs.RO · 2023-12-20 · conditional · none · ref 146 · internal anchor
A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.
Vision-Language Foundation Models as Effective Robot Imitators cs.RO · 2023-11-02 · conditional · none · ref 20 · internal anchor
RoboFlamingo adapts open-source vision-language models for robot manipulation tasks via single-step comprehension plus an explicit policy head, outperforming prior methods on benchmarks with only light fine-tuning.
Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking cs.RO · 2026-06-02 · unverdicted · none · ref 35 · internal anchor
Humanoid-GPT is a causal Transformer pre-trained on a unified billion-scale motion dataset that tracks dynamic behaviors with zero-shot generalization to unseen motions and tasks.
CLUE: Adaptively Prioritized Contextual Cues by Leveraging a Unified Semantic Map for Effective Zero-Shot Object-Goal Navigation cs.RO · 2026-05-19 · unverdicted · none · ref 4 · internal anchor
CLUE adaptively weights room-type and object-co-location cues from an LLM to construct a unified semantic value map that improves success rate and efficiency in zero-shot object-goal navigation.
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning cs.RO · 2026-04-20 · unverdicted · none · ref 35 · internal anchor
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation cs.RO · 2026-04-13 · unverdicted · none · ref 45 · internal anchor
Compositional Simulation generates scalable real-world robot training data by combining classical simulation with neural simulation in a closed-loop real-sim-real augmentation pipeline.
A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation cs.RO · 2025-07-07 · accept · none · ref 15 · internal anchor
Multi-task pretraining of diffusion policies on diverse robot data produces more successful, robust, and data-efficient policies for dexterous manipulation than single-task baselines, with performance scaling with pretraining size and diversity.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective cs.RO · 2025-07-02 · unverdicted · none · ref 57 · internal anchor
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
VeriGraph: Scene Graphs for Execution Verifiable Robot Planning cs.RO · 2024-11-15 · unverdicted · none · ref 35 · internal anchor
VeriGraph integrates VLMs with scene-graph verification to raise robot task success rates by 30-58% over baselines in manipulation scenarios.
From Technical Metrics to User Perception: A User Study of a Multimodal Human-Robot Interaction System for Object Detection and Grasping cs.RO · 2026-07-01 · unverdicted · none · ref 12 · internal anchor
A user study found that 71% of 24 participants preferred an improved multimodal HRI grasping system over baseline, with significantly higher ratings on three perceptual scales after statistical correction.
RLDX-1 Technical Report cs.RO · 2026-05-05 · unverdicted · none · ref 101 · 2 links · internal anchor
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
Ablation Study of Multimodal Perception, Language Grounding, and Control for Human-Robot Interaction in an Object Detection and Grasping Task cs.RO · 2026-05-01 · unverdicted · none · ref 17 · internal anchor
An ablation study isolates the contributions of LLM choice, visual perception configuration, and motion controller to success rate and execution time in a human-robot grasping task.

LLaMA: Open and Efficient Foundation Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

mega hub controls

Recognition alignment

counterfactual ablation

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer