Pre-pretraining on MP-STRUCT matches k-Shuffle Dyck baselines in efficiency while adding human-like resistance to implausible languages and challenges the need for C-RASP definability in effective PPT languages.
Title resolution pending
42 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
Time-LLM reprograms frozen LLMs for time series forecasting via text prototypes and Prompt-as-Prefix, outperforming specialized models in standard, few-shot, and zero-shot settings.
Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Natural language descriptions generated via a closed-loop pipeline with digital twins capture the selectivity of most neurons in macaque V1 and V4, with synthesized images driving 96% of V4 neurons into the top or bottom 5% of natural-image response distributions.
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.
ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
citing papers explorer
-
Language Acquisition Device in Large Language Models
Pre-pretraining on MP-STRUCT matches k-Shuffle Dyck baselines in efficiency while adding human-like resistance to implausible languages and challenges the need for C-RASP definability in effective PPT languages.
-
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
-
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
Detecting Pretraining Data from Large Language Models
Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
-
Learning Interactive Real-World Simulators
UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.
-
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models
Time-LLM reprograms frozen LLMs for time series forecasting via text prototypes and Prompt-as-Prefix, outperforming specialized models in standard, few-shot, and zero-shot settings.
-
Steering Language Models With Activation Engineering
Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
-
LIMA: Less Is More for Alignment
Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
-
Letting the neural code speak: Automated characterization of monkey visual neurons through human language
Natural language descriptions generated via a closed-loop pipeline with digital twins capture the selectivity of most neurons in macaque V1 and V4, with synthesized images driving 96% of V4 neurons into the top or bottom 5% of natural-image response distributions.
-
Weighted Rules under the Stable Model Semantics
Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
-
The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans
LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.
-
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
-
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Training Language Models to Self-Correct via Reinforcement Learning
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.
-
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.
-
Steering Llama 2 via Contrastive Activation Addition
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
Llemma: An Open Language Model For Mathematics
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
-
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth value of factual statements, as shown by visualizations, cross-dataset generalization, and causal interventions that flip truth judgments.
-
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.
-
Chain-of-Verification Reduces Hallucination in Large Language Models
Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
-
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
-
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
-
Reasoning with Language Model is Planning with World Model
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
-
Towards Expert-Level Medical Question Answering with Large Language Models
Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.
-
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.
-
Insider Attacks in Multi-Agent LLM Consensus Systems
A malicious agent in multi-agent LLM consensus systems can be trained via a surrogate world model and RL to reduce consensus rates and prolong disagreement more effectively than direct prompt attacks.
-
Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression
LightEdit enables scalable lifelong knowledge editing in LLMs via selective knowledge retrieval and probability suppression during decoding, outperforming prior methods on ZSRE, Counterfact, and RIPE while reducing training costs.
-
Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models
VANGUARD is a staged-training VLM framework that reports 94% ROC-AUC and 84% F1 on UCF-Crime while adding chain-of-thought reasoning and spatial grounding to video anomaly detection.
-
AppAgent: Multimodal Agents as Smartphone Users
AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.
-
Detecting Language Model Attacks with Perplexity
Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
-
A Survey on Knowledge Distillation of Large Language Models
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
-
A Survey of Hallucination in Large Foundation Models
A survey classifying hallucination phenomena specific to large foundation models, establishing evaluation criteria, examining mitigation strategies, and discussing future directions.
-
Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages
A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.