Sparks of Artificial General Intelligence: Early experiments with GPT-4

Ece Kamar, Eric Horvitz, Johannes Gehrke, Ronen Eldan, Varun Chandrasekaran · 2023 · cs.CL · arXiv 2303.12712

Canonical reference. 73% of citing Pith papers cite this work as background.

162 Pith papers citing it

abstract

Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

citation-role summary

background 35 · method 4 · baseline 1 · dataset 1

citation-polarity summary

background 30 · support 4 · use method 4 · baseline 2 · unclear 1

authors

Ece Kamar, Eric Horvitz, Johannes Gehrke, Ronen Eldan, Sébastien Bubeck, Varun Chandrasekaran

representative citing papers

Tight Sample Complexity of Transformers

cs.LG · 2026-06-08 · unverdicted · novelty 8.0

Depth-L transformers with W parameters have VC dimension Theta(L W log(T W)), yielding matching O(L W log((T+T')W)) upper and Omega(L W log((T+T')W/L)) lower bounds on sample complexity for chain-of-thought learning.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

q-fin.CP · 2026-04-03 · conditional · novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

RepairAgent: An Autonomous, LLM-Based Agent for Program Repair

cs.SE · 2024-03-25 · conditional · novelty 8.0

RepairAgent autonomously repairs 164 bugs on Defects4J including 39 not fixed by prior techniques by treating an LLM as an agent that invokes tools via a finite state machine and dynamic prompts.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

cs.CL · 2023-04-14 · conditional · novelty 8.0

API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

Generative Agents: Interactive Simulacra of Human Behavior

cs.HC · 2023-04-07 · accept · novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes

cs.CR · 2026-06-24 · unverdicted · novelty 7.0

CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

Rates of forgetting for the sequentially Markov coalescent

math.PR · 2026-04-22 · unverdicted · novelty 7.0

SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.

ROSE: Retrieval-Oriented Segmentation Enhancement

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

ROSE is a retrieval-augmented plug-in that improves MLLM segmentation on novel and emerging entities by fetching web text and images and deciding when to use them.

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

CrossTraffic: An Open-Source Framework for Reproducible and Executable Transportation Analysis and Knowledge Management

cs.CY · 2026-02-08 · unverdicted · novelty 7.0

CrossTraffic encodes transportation methodologies in an executable core and ontology-driven knowledge graph, enabling LLM-assisted analyses with near-zero numerical error and perfect invalid-input detection.

CircuChain: Disentangling Competence and Compliance in LLM Circuit Analysis

cs.SE · 2026-01-29 · unverdicted · novelty 7.0

Stronger LLMs show near-perfect physical reasoning in circuits but violate explicit sign and polarity instructions in trap setups, while weaker models follow instructions better but reason less accurately.

Break Me If You Can: Self-Jailbreaking of Aligned LLMs via Lexical Insertion Prompting

cs.CL · 2026-01-06 · unverdicted · novelty 7.0

SLIP enables self-jailbreaking of aligned LLMs via lexical insertion in breadth-first tree search, reaching 94.7% average ASR on AdvBench and HarmBench across eleven models with ~7.9 calls.

TSVer: A Benchmark for Fact Verification Against Time-Series Evidence

cs.CL · 2025-11-02 · unverdicted · novelty 7.0

TSVer is a new benchmark dataset for fact verification against time-series evidence, with 304 annotated real-world claims, 400 time series, verdicts, and justifications, plus baseline results showing current models struggle.

Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models

cs.SE · 2025-10-16 · unverdicted · novelty 7.0

LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

cs.AI · 2024-10-06 · unverdicted · novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

Deep Multimodal Learning with Missing Modality: A Survey

cs.CV · 2024-09-12 · unverdicted · novelty 7.0

This survey provides the first comprehensive overview of deep multimodal learning methods designed to remain robust when some input modalities are absent.

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

cs.CL · 2024-06-06 · accept · novelty 7.0

This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

cs.CL · 2024-05-29 · unverdicted · novelty 7.0

Introduces YesBut benchmark showing state-of-the-art multimodal models lag humans on interpreting humorous contradictions in comics.

RouterBench: A Benchmark for Multi-LLM Routing System

cs.LG · 2024-03-18 · unverdicted · novelty 7.0

RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.

citing papers explorer

Showing 50 of 86 citing papers after filters.

Tight Sample Complexity of Transformers cs.LG · 2026-06-08 · unverdicted · none · ref 21 · internal anchor
Depth-L transformers with W parameters have VC dimension Theta(L W log(T W)), yielding matching O(L W log((T+T')W)) upper and Omega(L W log((T+T')W/L)) lower bounds on sample complexity for chain-of-thought learning.
PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data q-fin.CP · 2026-04-03 · conditional · none · ref 7 · internal anchor
Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
CrypFormBench: Benchmarking Formal Analysis Capability of Large Language Models for Cryptographic Schemes cs.CR · 2026-06-24 · unverdicted · none · ref 15 · internal anchor
CrypFormBench is a new benchmark jointly covering symbolic and computational security to evaluate LLMs on five formal analysis capabilities, with results showing top model Claude-3.5 scores 48.7/100 and most models struggling on generation, transformation, and correction.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching cs.CL · 2026-05-12 · unverdicted · none · ref 86 · 2 links · internal anchor
Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.
Rates of forgetting for the sequentially Markov coalescent math.PR · 2026-04-22 · unverdicted · none · ref 95 · internal anchor
SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.
ROSE: Retrieval-Oriented Segmentation Enhancement cs.CV · 2026-04-15 · unverdicted · none · ref 2 · internal anchor
ROSE is a retrieval-augmented plug-in that improves MLLM segmentation on novel and emerging entities by fetching web text and images and deciding when to use them.
Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy cs.CL · 2026-04-03 · unverdicted · none · ref 5 · internal anchor
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.
CrossTraffic: An Open-Source Framework for Reproducible and Executable Transportation Analysis and Knowledge Management cs.CY · 2026-02-08 · unverdicted · none · ref 6 · internal anchor
CrossTraffic encodes transportation methodologies in an executable core and ontology-driven knowledge graph, enabling LLM-assisted analyses with near-zero numerical error and perfect invalid-input detection.
CircuChain: Disentangling Competence and Compliance in LLM Circuit Analysis cs.SE · 2026-01-29 · unverdicted · none · ref 13 · internal anchor
Stronger LLMs show near-perfect physical reasoning in circuits but violate explicit sign and polarity instructions in trap setups, while weaker models follow instructions better but reason less accurately.
Break Me If You Can: Self-Jailbreaking of Aligned LLMs via Lexical Insertion Prompting cs.CL · 2026-01-06 · unverdicted · none · ref 1 · internal anchor
SLIP enables self-jailbreaking of aligned LLMs via lexical insertion in breadth-first tree search, reaching 94.7% average ASR on AdvBench and HarmBench across eleven models with ~7.9 calls.
Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG cs.CL · 2026-06-30 · unverdicted · none · ref 11 · internal anchor
The paper characterizes deductive stereotyping in LLMs and introduces Fair-GCG to discover injection phrases that improve fairness across benchmarks, reasoning, and real-world tasks.
Security–Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense cs.CR · 2026-06-29 · unverdicted · none · ref 90 · internal anchor
Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.
Citation Discipline in Spec-Driven Development: A Cross-Model Empirical Study of Output Determinism and Automated Hallucination Detection in LLM-Generated Code cs.SE · 2026-06-28 · unverdicted · none · ref 11 · internal anchor
Mandatory per-line citations in SDD frameworks reduce LLM output determinism but enable reliable automated hallucination detection (TDR 86-88%, FPR 0%), a trade-off replicated across Claude and GLM models.
When AI Reviews Its Own Code: Recursive Self-Training Collapse in Code LLMs cs.SE · 2026-06-26 · unverdicted · none · ref 97 · internal anchor
Experiments across code LLMs show no-review collapses fastest, human-gated filters slow collapse, and AI self-gates lose effect over time, degenerating to ungated self-training under self-confirming acceptance as proven via gated distributional reweighting and spectral analysis.
POISE: Position-Aware Undetectable Skill Injection on LLM Agents cs.CR · 2026-06-06 · unverdicted · none · ref 16 · internal anchor
POISE is a stealthy skill-poisoning attack achieving 89.3% ASR on Skill-Inject by blending a compressed trigger into contextually appropriate positions in skill bodies, outperforming YAML and random-placement baselines while evading static scanners.
CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts cs.CL · 2026-06-03 · unverdicted · none · ref 172 · internal anchor
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention cs.CL · 2026-06-03 · unverdicted · none · ref 29 · internal anchor
Speech LLMs fail logical entity-tracking tasks due to binding failures in continuous speech features; Entity-Aware Chain-of-Thought prompting recovers most of the gap even under misrecognition.
NeuroSymbolic Robustness Analysis for Discrete Systems with Respect to Transition Deviations eess.SY · 2026-06-02 · unverdicted · none · ref 20 · internal anchor
A neurosymbolic method uses LLMs to filter feasible deviation transitions in discrete-event systems and computes symbolic robustness guarantees, shown on three case studies to yield smaller deviation sets with comparable guarantees.
Consistency Training while Mitigating Obfuscation via Rate Matching cs.CL · 2026-06-01 · unverdicted · none · ref 50 · internal anchor
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning cs.CL · 2026-05-29 · unverdicted · none · ref 3 · internal anchor
Experiments reveal that topological cues robustly support LLM navigation planning while incorrect semantic cues derail it, with linguistic format effects varying by model size and compression.
Robust class-gated single-pixel diffractive optical neural network with random-aberration-aware training physics.optics · 2026-05-29 · unverdicted · none · ref 6 · internal anchor
A class-gated single-pixel DONN reaches 90% MNIST and 80% Fashion-MNIST accuracy at 5 kHz readout via random-phase-aware training that tolerates phase aberrations.
Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs cs.AI · 2026-05-26 · unverdicted · none · ref 4 · internal anchor
Helicase proposes an autonomous multi-agent LLM framework for uncertainty-guided supply chain knowledge graph construction evaluated on the new SCQA benchmark of 80 queries.
Repeated Sequences Reveal Gaps between Large Language Models and Natural Language cs.CL · 2026-05-24 · unverdicted · none · ref 2 · internal anchor
Natural language shows stable entropy-growth patterns in repeated subsequences while GPT texts exhibit systematic shifts in exponents with model size.
Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions cs.CL · 2026-05-22 · unverdicted · none · ref 4 · internal anchor
LINK improves cross-lingual knowledge transfer via lexical substitutions in English pretraining data, yielding notable downstream gains and up to 2x training speedup across eight languages and five model sizes.
Robots That Know What to Ask: Recovering Misaligned Rewards through Targeted Explanations cs.RO · 2026-05-21 · unverdicted · none · ref 15 · internal anchor
Robots detect underspecified reward features via demonstration variation and query targeted natural language explanations to improve reward recovery from imperfect demos.
Faster Completion, Less Learning: Generative AI Reduced Study Time on Math Problems and the Knowledge They Build cs.CY · 2026-05-20 · unverdicted · none · ref 2 · internal anchor
Generative AI reduced study time on AI-susceptible math problems by 9-31% across grade levels and produced a 25% decline in retention odds on proctored assessments.
Task Abstention for Large Language Models in Code Generation cs.SE · 2026-05-16 · unverdicted · none · ref 31 · internal anchor
A distribution-free abstention rule grounded in multiple hypothesis testing uses execution consistency to let code LLMs avoid hallucination-prone tasks with theoretical guarantees.
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling cs.LG · 2026-05-14 · conditional · none · ref 11 · internal anchor
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 192 · internal anchor
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
CHAL: Council of Hierarchical Agentic Language cs.AI · 2026-05-12 · unverdicted · none · ref 16 · internal anchor
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space cs.CL · 2026-05-12 · unverdicted · none · ref 142 · internal anchor
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
Making Abstraction Concrete: A Design Space and Interaction Model of Abstraction in Interactive Systems cs.HC · 2026-05-11 · unverdicted · none · ref 25 · internal anchor
A survey of 457 papers yields a six-dimensional design space for abstraction in interactive systems that reframes gulfs of execution and evaluation while articulating cognitive and design processes for bridging abstraction gaps.
When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews cs.CL · 2026-05-11 · unverdicted · none · ref 17 · internal anchor
Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.
FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution cs.LG · 2026-05-08 · unverdicted · none · ref 7 · internal anchor
FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.
Response Time Enhances Alignment with Heterogeneous Preferences cs.LG · 2026-05-07 · unverdicted · none · ref 65 · internal anchor
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
Process Matters more than Output for Distinguishing Humans from Machines cs.AI · 2026-05-07 · unverdicted · none · ref 36 · 2 links · internal anchor
A new battery of 30 cognitive tasks demonstrates that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88), with process-specific fine-tuning improving mimicry but limited cross-task transfer.
DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis cs.CL · 2026-04-29 · unverdicted · none · ref 38 · internal anchor
DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.
R³AG: Retriever Routing for Retrieval-Augmented Generation cs.IR · 2026-04-22 · unverdicted · none · ref 36 · internal anchor
R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform static methods.
Representation-Guided Parameter-Efficient LLM Unlearning cs.CL · 2026-04-19 · unverdicted · none · ref 167 · internal anchor
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents cs.AI · 2026-04-05 · conditional · none · ref 16 · internal anchor
Persistent memory is necessary and sufficient for LLM poker agents to reach ToM levels 3-5 and use strategic deception, while agents without memory stay at level 0.
Can Humans Tell? A Dual-Axis Study of Human Perception of LLM-Generated News cs.CY · 2026-04-04 · conditional · none · ref 1 · internal anchor
Humans cannot reliably distinguish LLM-generated news from human-written news across multiple models, with domain expertise providing only modest help and fatigue reducing accuracy over time.
Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations cs.CV · 2026-03-09 · unverdicted · none · ref 4 · internal anchor
GR3D turns 3D scene geometry into ID-indexed text references, enabling zero-shot MLLM spatial reasoning gains of 9% on VSI-Bench and 12% on MindCube.
Identification of quantum generative circuits with parallel quantum neural network quant-ph · 2026-03-03 · unverdicted · none · ref 27 · internal anchor
ParaQuanNet distinguishes eight quantum generative circuits via 99.5% accurate classification of their output data using parallel quantum embeddings and mutually unbiased measurements.
Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models cs.CL · 2026-02-08 · unverdicted · none · ref 3 · internal anchor
LLMs dynamically construct and causally rely on structured conceptual subspaces in middle-to-late layers for in-context inference.
Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident cs.CL · 2026-02-01 · unverdicted · none · ref 4 · internal anchor
LLMs simulating student think-alouds in multi-step chemistry tutoring produce overly coherent, verbose, and confident reasoning that overestimates learner success compared to 630 human utterances.
Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal cs.LG · 2026-06-10 · unverdicted · none · ref 237 · internal anchor
A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.
Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning cs.CV · 2026-06-10 · unverdicted · none · ref 43 · internal anchor
TASM proposes a task-aware structured memory framework using task-vector compression, bipartite token merging, and a Core Memory plus Latent Bank hierarchy to enable efficient dynamic multi-modal in-context learning.
TAROT: Task-Adaptive Refinement of LLM-prior Graphs for Few-shot Tabular Learning cs.LG · 2026-06-10 · unverdicted · none · ref 5 · internal anchor
TAROT constructs and refines LLM-derived task-adaptive semantic graphs then applies GNN message passing to improve few-shot tabular prediction.
Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models cs.CL · 2026-06-09 · unverdicted · none · ref 64 · internal anchor
The work establishes an evaluation framework for personality induction and switching in MLLMs, reporting improved captioning but impaired VQA performance plus balancing and residual effects during multi-trait and dynamic conditions.
OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind cs.AI · 2026-05-19 · unverdicted · none · ref 10 · internal anchor
OSCToM uses RL-guided generation with an extended DSL and surrogate models to create nested belief conflict tasks, raising FANToM accuracy from 0.2% to 76% while being 6x more efficient.
