{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:RIN4LFODOD6LAI57IADUKMTMY2","short_pith_number":"pith:RIN4LFOD","schema_version":"1.0","canonical_sha256":"8a1bc595c370fcb023bf400745326cc699c5b84031d15b20656b81b207f550dc","source":{"kind":"arxiv","id":"2303.17491","version":3},"attestation_state":"computed","paper":{"title":"Language Models can Solve Computer Tasks","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Pre-trained language models solve novel computer tasks by recursively criticizing and improving their own outputs.","cross_cats":["cs.AI","cs.HC","cs.LG"],"primary_cat":"cs.CL","authors_text":"Geunwoo Kim, Pierre Baldi, Stephen McAleer","submitted_at":"2023-03-30T16:01:52Z","abstract_excerpt":"Agents capable of carrying out general tasks on a computer can improve efficiency and productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally, such agents should be able to solve new computer tasks presented to them through natural language commands. However, previous approaches to this problem require large amounts of expert demonstrations and task-specific reward functions, both of which are impractical for new tasks. In this work, we show that a pre-trained large language model (LLM) agent can execute computer tasks guided by natural language using a si"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2303.17491","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2023-03-30T16:01:52Z","cross_cats_sorted":["cs.AI","cs.HC","cs.LG"],"title_canon_sha256":"3052200996bccbdd3aea210ba8a81dc343126e1d0a133066f47e02fa852b0ed5","abstract_canon_sha256":"69132e31a7eb5356f32a2e7c61b3915b3437ad300418ab3fa4792d45de0579dc"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.140197Z","signature_b64":"oBJnyI5EqkyBsyydTAcnShFKcWxiIDwsxofWKcO+gMoRmYxxM0NC+8mPRwiUQ5RlO9kh7DjWYSx+0ey66/8FCQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"8a1bc595c370fcb023bf400745326cc699c5b84031d15b20656b81b207f550dc","last_reissued_at":"2026-05-17T23:38:14.139474Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.139474Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Language Models can Solve Computer Tasks","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Pre-trained language models solve novel computer tasks by recursively criticizing and improving their own outputs.","cross_cats":["cs.AI","cs.HC","cs.LG"],"primary_cat":"cs.CL","authors_text":"Geunwoo Kim, Pierre Baldi, Stephen McAleer","submitted_at":"2023-03-30T16:01:52Z","abstract_excerpt":"Agents capable of carrying out general tasks on a computer can improve efficiency and productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally, such agents should be able to solve new computer tasks presented to them through natural language commands. However, previous approaches to this problem require large amounts of expert demonstrations and task-specific reward functions, both of which are impractical for new tasks. In this work, we show that a pre-trained large language model (LLM) agent can execute computer tasks guided by natural language using a si"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"RCI with the InstructGPT-3+RLHF LLM is state-of-the-art on MiniWoB++, using only a handful of demonstrations per task rather than tens of thousands, and without a task-specific reward function.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the pre-trained LLM already contains sufficient world knowledge and self-critique capability to generate and iteratively refine correct computer actions for novel tasks when given only a few demonstrations and a simple prompting template.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Pre-trained language models solve novel computer tasks by recursively criticizing and improving their own outputs.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9b0556daa55dfa0a6e65c238492af6a8c225eb30764b8d14b8fed3fd1b0521be"},"source":{"id":"2303.17491","kind":"arxiv","version":3},"verdict":{"id":"1241c1f4-e686-44a2-ac3b-ba39bb44abba","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T12:11:24.124658Z","strongest_claim":"RCI with the InstructGPT-3+RLHF LLM is state-of-the-art on MiniWoB++, using only a handful of demonstrations per task rather than tens of thousands, and without a task-specific reward function.","one_line_summary":"Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the pre-trained LLM already contains sufficient world knowledge and self-critique capability to generate and iteratively refine correct computer actions for novel tasks when given only a few demonstrations and a simple prompting template.","pith_extraction_headline":"Pre-trained language models solve novel computer tasks by recursively criticizing and improving their own outputs."},"references":{"count":102,"sample":[{"doi":"","year":2022,"title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","work_id":"037320f1-b0a9-4cbe-a639-bfb25409ce71","ref_index":1,"cited_arxiv_id":"2204.01691","is_internal_anchor":true},{"doi":"","year":2022,"title":"Flamingo: a visual language model for few-shot learning","work_id":"906c9ca4-12b9-4ec9-ba04-d0bad471451d","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","ref_index":3,"cited_arxiv_id":"2212.08073","is_internal_anchor":true},{"doi":"","year":2022,"title":"Video pretraining (vpt): Learning to act by watching unlabeled online videos","work_id":"8791fba2-3b4c-4358-bba5-ae454e771d6d","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1901,"title":"Language models are few-shot learners","work_id":"06921215-168b-4266-a8bd-53d84ad473f0","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":102,"snapshot_sha256":"dba6a23200abf7752bbe4c2a1b607dfec157ee1c6c389dfc2fa1ee3e3df8363e","internal_anchors":19},"formal_canon":{"evidence_count":1,"snapshot_sha256":"08f158b359c71aeb903bd6d4069871459bc25234c403aab33639b5381de32531"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2303.17491","created_at":"2026-05-17T23:38:14.139581+00:00"},{"alias_kind":"arxiv_version","alias_value":"2303.17491v3","created_at":"2026-05-17T23:38:14.139581+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2303.17491","created_at":"2026-05-17T23:38:14.139581+00:00"},{"alias_kind":"pith_short_12","alias_value":"RIN4LFODOD6L","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"RIN4LFODOD6LAI57","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"RIN4LFOD","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":18,"internal_anchor_count":18,"sample":[{"citing_arxiv_id":"2503.09572","citing_title":"Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2409.12917","citing_title":"Training Language Models to Self-Correct via Reinforcement Learning","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2401.10935","citing_title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2309.02427","citing_title":"Cognitive Architectures for Language Agents","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2309.17421","citing_title":"The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2401.01614","citing_title":"GPT-4V(ision) is a Generalist Web Agent, if Grounded","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2305.18323","citing_title":"ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2403.07718","citing_title":"WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2309.03409","citing_title":"Large Language Models as Optimizers","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2304.03277","citing_title":"Instruction Tuning with GPT-4","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2402.02716","citing_title":"Understanding the planning of LLM agents: A survey","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2401.10774","citing_title":"Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads","ref_index":88,"is_internal_anchor":true},{"citing_arxiv_id":"2304.05128","citing_title":"Teaching Large Language Models to Self-Debug","ref_index":101,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08904","citing_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2305.15334","citing_title":"Gorilla: Large Language Model Connected with Massive APIs","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2403.07974","citing_title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","ref_index":146,"is_internal_anchor":true},{"citing_arxiv_id":"2303.11366","citing_title":"Reflexion: Language Agents with Verbal Reinforcement Learning","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21375","citing_title":"VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation","ref_index":36,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/RIN4LFODOD6LAI57IADUKMTMY2","json":"https://pith.science/pith/RIN4LFODOD6LAI57IADUKMTMY2.json","graph_json":"https://pith.science/api/pith-number/RIN4LFODOD6LAI57IADUKMTMY2/graph.json","events_json":"https://pith.science/api/pith-number/RIN4LFODOD6LAI57IADUKMTMY2/events.json","paper":"https://pith.science/paper/RIN4LFOD"},"agent_actions":{"view_html":"https://pith.science/pith/RIN4LFODOD6LAI57IADUKMTMY2","download_json":"https://pith.science/pith/RIN4LFODOD6LAI57IADUKMTMY2.json","view_paper":"https://pith.science/paper/RIN4LFOD","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2303.17491&json=true","fetch_graph":"https://pith.science/api/pith-number/RIN4LFODOD6LAI57IADUKMTMY2/graph.json","fetch_events":"https://pith.science/api/pith-number/RIN4LFODOD6LAI57IADUKMTMY2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/RIN4LFODOD6LAI57IADUKMTMY2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/RIN4LFODOD6LAI57IADUKMTMY2/action/storage_attestation","attest_author":"https://pith.science/pith/RIN4LFODOD6LAI57IADUKMTMY2/action/author_attestation","sign_citation":"https://pith.science/pith/RIN4LFODOD6LAI57IADUKMTMY2/action/citation_signature","submit_replication":"https://pith.science/pith/RIN4LFODOD6LAI57IADUKMTMY2/action/replication_record"}},"created_at":"2026-05-17T23:38:14.139581+00:00","updated_at":"2026-05-17T23:38:14.139581+00:00"}