Faith and fate: Limits of transformers on compositionality (2023). arXiv preprint arXiv:2305.18654

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D Hwang, et al. · 2023 · arXiv 2305.18654

6 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Training Transformers as a Universal Computer

cs.AI · 2026-04-28 · unverdicted · novelty 7.0

A transformer trained on random meaningless MicroPy programs generalizes to execute diverse human-written programs, providing empirical evidence it can act as a universal computer.

TSVer: A Benchmark for Fact Verification Against Time-Series Evidence

cs.CL · 2025-11-02 · unverdicted · novelty 7.0

TSVer is a new benchmark dataset for fact verification against time-series evidence, with 304 annotated real-world claims, 400 time series, verdicts, and justifications, plus baseline results showing current models struggle.

When Should Users Check? Modeling Confirmation Frequency in Multi-Step Agentic AI Tasks

cs.HC · 2025-10-06 · conditional · novelty 6.0

A decision-theoretic model based on the observed Confirmation-Diagnosis-Correction-Redo user pattern places intermediate confirmations in AI agent tasks, yielding 81% user preference and 13.54% faster completion versus confirm-at-end.

How Do Language Models Compose Functions?

cs.CL · 2025-10-02 · conditional · novelty 6.0

LLMs solve compositional factual recall either by computing intermediates or directly, with mechanism choice correlated to translation geometry in embedding spaces.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models

cs.CL · 2025-05-30 · unverdicted · novelty 5.0

LLM errors concentrate in sparse key tokens (5-10% of sequence) at semantic decision junctions, yielding a new reliability model that explains sustained long-context coherence.

citing papers explorer

Showing 6 of 6 citing papers.

Training Transformers as a Universal Computer · cs.AI · 2026-04-28 · unverdicted · none · ref 4
TSVer: A Benchmark for Fact Verification Against Time-Series Evidence · cs.CL · 2025-11-02 · unverdicted · none · ref 23
When Should Users Check? Modeling Confirmation Frequency in Multi-Step Agentic AI Tasks · cs.HC · 2025-10-06 · conditional · none · ref 25
How Do Language Models Compose Functions? · cs.CL · 2025-10-02 · conditional · none · ref 4
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code · cs.SE · 2024-03-12 · unverdicted · none · ref 95
Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models · cs.CL · 2025-05-30 · unverdicted · none · ref 2
