Title resolution pending

GPT-4 Technical Report , author= · 2023

42 Pith papers cite this work. Polarity classification is still indexing.

42 Pith papers citing it

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Language Acquisition Device in Large Language Models

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

Pre-pretraining on MP-STRUCT matches k-Shuffle Dyck baselines in efficiency while adding human-like resistance to implausible languages and challenges the need for C-RASP definability in effective PPT languages.

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

cs.SE · 2025-02-25 · unverdicted · novelty 7.0

SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

cs.CV · 2023-12-28 · conditional · novelty 7.0

Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.

A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA

cs.CL · 2023-11-28 · unverdicted · novelty 7.0

LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

Detecting Pretraining Data from Large Language Models

cs.CL · 2023-10-25 · conditional · novelty 7.0

Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.

Learning Interactive Real-World Simulators

cs.AI · 2023-10-09 · conditional · novelty 7.0

UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models

cs.LG · 2023-10-03 · conditional · novelty 7.0

Time-LLM reprograms frozen LLMs for time series forecasting via text prototypes and Prompt-as-Prefix, outperforming specialized models in standard, few-shot, and zero-shot settings.

Steering Language Models With Activation Engineering

cs.CL · 2023-08-20 · unverdicted · novelty 7.0

Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.

LIMA: Less Is More for Alignment

cs.CL · 2023-05-18 · conditional · novelty 7.0

Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.

Letting the neural code speak: Automated characterization of monkey visual neurons through human language

q-bio.NC · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Natural language descriptions generated via a closed-loop pipeline with digital twins capture the selectivity of most neurons in macaque V1 and V4, with synthesized images driving 96% of V4 neurons into the top or bottom 5% of natural-image response distributions.

Weighted Rules under the Stable Model Semantics

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.

The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.

Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

cs.CL · 2025-09-17 · unverdicted · novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.

Towards an AI co-scientist

cs.AI · 2025-02-26 · unverdicted · novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

cs.AI · 2024-08-01 · conditional · novelty 6.0

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

cs.LG · 2024-02-22 · conditional · novelty 6.0

REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

cs.CL · 2024-02-18 · unverdicted · novelty 6.0

ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.

Steering Llama 2 via Contrastive Activation Addition

cs.CL · 2023-12-09 · unverdicted · novelty 6.0

Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

cs.CV · 2023-11-16 · unverdicted · novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

citing papers explorer

Showing 42 of 42 citing papers.

Language Acquisition Device in Large Language Models cs.CL · 2026-05-16 · unverdicted · none · ref 74
Pre-pretraining on MP-STRUCT matches k-Shuffle Dyck baselines in efficiency while adding human-like resistance to implausible languages and challenges the need for C-RASP definability in effective PPT languages.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution cs.SE · 2025-02-25 · unverdicted · none · ref 86
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads cs.LG · 2024-01-19 · conditional · none · ref 119
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels cs.CV · 2023-12-28 · conditional · none · ref 70
Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA cs.CL · 2023-11-28 · unverdicted · none · ref 40
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 28
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Detecting Pretraining Data from Large Language Models cs.CL · 2023-10-25 · conditional · none · ref 80
Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
Learning Interactive Real-World Simulators cs.AI · 2023-10-09 · conditional · none · ref 101
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models cs.LG · 2023-10-03 · conditional · none · ref 25
Time-LLM reprograms frozen LLMs for time series forecasting via text prototypes and Prompt-as-Prefix, outperforming specialized models in standard, few-shot, and zero-shot settings.
Steering Language Models With Activation Engineering cs.CL · 2023-08-20 · unverdicted · none · ref 129
Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
LIMA: Less Is More for Alignment cs.CL · 2023-05-18 · conditional · none · ref 13
Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents cs.CL · 2026-05-14 · unverdicted · none · ref 111
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Letting the neural code speak: Automated characterization of monkey visual neurons through human language q-bio.NC · 2026-05-12 · unverdicted · none · ref 83 · 2 links
Natural language descriptions generated via a closed-loop pipeline with digital twins capture the selectivity of most neurons in macaque V1 and V4, with synthesized images driving 96% of V4 neurons into the top or bottom 5% of natural-image response distributions.
Weighted Rules under the Stable Model Semantics cs.AI · 2026-05-10 · unverdicted · none · ref 87
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans cs.CL · 2026-05-09 · unverdicted · none · ref 79
LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives cs.CL · 2026-04-22 · unverdicted · none · ref 179
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution cs.CL · 2025-09-17 · unverdicted · none · ref 54
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
Towards an AI co-scientist cs.AI · 2025-02-26 · unverdicted · none · ref 218
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
Training Language Models to Self-Correct via Reinforcement Learning cs.LG · 2024-09-19 · unverdicted · none · ref 87
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models cs.AI · 2024-08-01 · conditional · none · ref 232
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs cs.LG · 2024-02-22 · conditional · none · ref 86
REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models cs.CL · 2024-02-18 · unverdicted · none · ref 39
ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.
Steering Llama 2 via Contrastive Activation Addition cs.CL · 2023-12-09 · unverdicted · none · ref 47
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection cs.CV · 2023-11-16 · unverdicted · none · ref 66
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
Llemma: An Open Language Model For Mathematics cs.CL · 2023-10-16 · unverdicted · none · ref 118
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets cs.AI · 2023-10-10 · unverdicted · none · ref 17
At sufficient scale, LLMs linearly represent the truth value of factual statements, as shown by visualizations, cross-dataset generalization, and causal interventions that flip truth judgments.
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs cs.CL · 2023-10-03 · conditional · none · ref 19
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.
Chain-of-Verification Reduces Hallucination in Large Language Models cs.CL · 2023-09-20 · unverdicted · none · ref 63
Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback cs.CL · 2023-09-01 · conditional · none · ref 84
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models cs.CL · 2023-08-03 · unverdicted · none · ref 42
Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
Reasoning with Language Model is Planning with World Model cs.CL · 2023-05-24 · unverdicted · none · ref 67
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
Towards Expert-Level Medical Question Answering with Large Language Models cs.CL · 2023-05-16 · unverdicted · none · ref 136
Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models cs.CL · 2023-04-13 · accept · none · ref 24
AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.
Insider Attacks in Multi-Agent LLM Consensus Systems cs.MA · 2026-05-08 · unverdicted · none · ref 64
A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.
Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression cs.AI · 2026-04-21 · unverdicted · none · ref 111
LightEdit enables scalable lifelong knowledge editing in LLMs via selective knowledge retrieval and probability suppression during decoding, outperforming prior methods on ZSRE, Counterfact, and RIPE while reducing training costs.
Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models cs.CV · 2026-04-07 · unverdicted · none · ref 4
VANGUARD is a staged-training VLM framework that reports 94% ROC-AUC and 84% F1 on UCF-Crime while adding chain-of-thought reasoning and spatial grounding to video anomaly detection.
AppAgent: Multimodal Agents as Smartphone Users cs.CV · 2023-12-21 · unverdicted · none · ref 54
AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.
Detecting Language Model Attacks with Perplexity cs.CL · 2023-08-27 · unverdicted · none · ref 77
Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model cs.CV · 2025-02-14 · unverdicted · none · ref 78
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
A Survey on Knowledge Distillation of Large Language Models cs.CL · 2024-02-20 · accept · none · ref 116
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
A Survey of Hallucination in Large Foundation Models cs.AI · 2023-09-12 · accept · none · ref 97
A survey classifying hallucination phenomena specific to large foundation models, establishing evaluation criteria, examining mitigation strategies, and discussing future directions.
Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages cs.CL · 2026-05-16 · unverdicted · none · ref 48
A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer