{"paper":{"title":"Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs","license":"http://creativecommons.org/licenses/by/4.0/","headline":"MRBTs generated by LLMs and verified by SMT solvers deliver reactive reward shaping plus action masking for compositional RL tasks.","cross_cats":[],"primary_cat":"cs.LG","authors_text":"Ankita Samaddar, Nicholas Potteiger, Taylor T. Johnson, Xenofon Koutsoukos","submitted_at":"2026-05-07T07:33:08Z","abstract_excerpt":"Decomposing complex tasks into a sequence of simpler subtasks can improve learning efficiency for an autonomous agent. Reinforcement learning (RL) can be used to optimize agent policies to complete subtasks, but requires well-defined subtask rewards and benefits from action masking. Recent work uses large language models (LLMs) to automate reward shaping and action masking, however none of them fully address reactivity to subtask failure and modularity to varying objects for compositional tasks. To overcome these challenges, we develop masking reward behavior tree (MRBT), a symbolic structure "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experiments demonstrate successful generation and refinement of five MRBTs, consistently improving training efficiency and task success rates over baselines and MRBTs without action masking.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That LLMs can reliably produce MRBTs that remain correct and modular across varying task objects, and that the derived logical specifications fully capture reactivity to subtask failure without missing edge cases.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MRBTs generated via LLMs and verified by SMT solvers deliver modular, reactive reward shaping and action masking that improves RL training efficiency and success rates on compositional tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"MRBTs generated by LLMs and verified by SMT solvers deliver reactive reward shaping plus action masking for compositional RL tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a55b4a192b1c00f64c08bee096a4a33c4559a974fb7473f7d2b86ef2e0f94a80"},"source":{"id":"2605.05795","kind":"arxiv","version":2},"verdict":{"id":"2afb84a1-fa2f-4d23-a7cd-286086dd4b9d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-09T15:32:25.590175Z","strongest_claim":"Experiments demonstrate successful generation and refinement of five MRBTs, consistently improving training efficiency and task success rates over baselines and MRBTs without action masking.","one_line_summary":"MRBTs generated via LLMs and verified by SMT solvers deliver modular, reactive reward shaping and action masking that improves RL training efficiency and success rates on compositional tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That LLMs can reliably produce MRBTs that remain correct and modular across varying task objects, and that the derived logical specifications fully capture reactivity to subtask failure without missing edge cases.","pith_extraction_headline":"MRBTs generated by LLMs and verified by SMT solvers deliver reactive reward shaping plus action masking for compositional RL tasks."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.05795/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"claim_evidence","ran_at":"2026-05-20T13:42:04.641393Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-20T09:35:05.550679Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T20:01:19.401481Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T13:15:27.364307Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"4b186a11873e4d96fbde8136480b80b81d28169fb3fe9797063d5e1a13a3c892"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}