PersonalHomeBench: Evaluating Agents in Personalized Smart Homes

InJung Yang; Kevin Ferreira; KoKeun Kim; Manasa Bharadwaj; Nikhil Verma; Sungil Kim; Yolanda Liu; YoungJoon Kim

arxiv: 2604.16813 · v3 · submitted 2026-04-18 · 💻 cs.AI · cs.CL · cs.DB


Nikhil Verma, InJung Yang, Sungil Kim, KoKeun Kim, YoungJoon Kim, Manasa Bharadwaj, Yolanda Liu, Kevin Ferreira

Pith reviewed 2026-05-15 06:37 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.DB

keywords PersonalHomeBench · smart home agents · agentic AI · personalized environments · counterfactual reasoning · partial observability · foundation models · benchmark evaluation


The pith

Agents in personalized smart homes show clear performance drops as task complexity rises, especially failing at counterfactual reasoning and partial observability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PersonalHomeBench to evaluate how foundation models perform as agents in simulated personalized smart home environments. It builds household states iteratively to create context-specific tasks and supplies PersonalHomeTools for agents to gather information, control devices, and interpret situations. Testing covers both reactive and proactive behaviors across different observation types. Experiments demonstrate that success rates fall steadily with greater task difficulty, and agents particularly struggle when they must reason about what might have happened differently or when they lack full information and must actively use tools to find it. The benchmark is positioned as a way to measure where current agent systems are not yet ready for realistic personalized use.

Core claim

PersonalHomeBench is constructed by iteratively building rich household states that then generate personalized, context-dependent tasks, paired with PersonalHomeTools that let agents retrieve household information, control appliances, and build situational understanding. Evaluation of foundation models shows a systematic performance reduction as task complexity increases, with pronounced failures in counterfactual reasoning and under partial observability where effective tool-based information gathering is required.

What carries the argument

PersonalHomeBench benchmark, built through iterative household-state construction and personalized task generation, together with the PersonalHomeTools toolbox for retrieval, control, and understanding.
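
The iterative construction can be pictured as a chain of stages, each adding one layer to a shared household state that later stages may read. The sketch below is only an illustration under that assumption: the stage names, the dictionary layout, and the sample values are hypothetical and are not taken from the paper's pipeline.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Stage = Callable[[State], State]


def add_personas(state: State) -> State:
    # Hypothetical first layer: who lives in the household.
    return {**state, "personas": ["member_1", "member_2"]}


def add_devices(state: State) -> State:
    # Hypothetical second layer: appliances and where they are.
    return {**state, "devices": {"thermostat": "living_room"}}


def add_memories(state: State) -> State:
    # Later stages can read earlier layers: one memory per persona.
    memories = {p: f"{p} prefers a warm living room" for p in state["personas"]}
    return {**state, "memories": memories}


def add_context(state: State) -> State:
    # Hypothetical final layer: the current situation.
    return {**state, "context": {"time": "evening"}}


def build_household(household_id: str, stages: List[Stage]) -> State:
    """Apply each stage in order, so the state grows one layer at a time."""
    state: State = {"household_id": household_id}
    for stage in stages:
        state = stage(state)
    return state


home = build_household("h1", [add_personas, add_devices, add_memories, add_context])
```

Because every stage returns a new dictionary, intermediate states can be kept and inspected, which is one simple way to study how each construction step affects the generated tasks.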

If this is right

Agent performance declines steadily as task complexity grows in personalized settings.
Counterfactual reasoning remains a clear weakness even for current foundation models.
Partial observability demands effective tool-based information gathering that most agents lack.
Reactive and proactive abilities can be compared directly within the same benchmark.
The platform enables systematic analysis of robustness in personalized agentic planning.
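
The partial-observability point can be made concrete with a toy example: the household state is hidden, and the only way to learn it is to call a tool. Everything in this sketch (the `ToyHome` class, `get_device_state`, `report_state`) is a hypothetical stand-in and not the PersonalHomeTools interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyHome:
    """Hidden household state; callers are expected to go through get_device_state."""
    devices: Dict[str, str]
    call_log: List[str] = field(default_factory=list)

    def get_device_state(self, device_id: str) -> str:
        # Every lookup is logged, so tool usage can be counted afterwards.
        self.call_log.append(device_id)
        return self.devices.get(device_id, "unknown")


def report_state(home: ToyHome, device_id: str, use_tools: bool) -> str:
    """Without a tool call the agent has no observation and can only say 'unknown'."""
    if not use_tools:
        return "unknown"
    return home.get_device_state(device_id)


home = ToyHome(devices={"oven": "on"})
blind = report_state(home, "oven", use_tools=False)    # no information gathered
informed = report_state(home, "oven", use_tools=True)  # one tool call made
```

The call log gives a per-task count of tool calls, the kind of quantity that Figure 5 groups into low, medium, and high usage.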

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-world smart-home assistants may need targeted training on incomplete-information scenarios before reliable deployment.
The benchmark could be extended to multi-step dynamic replanning when household states change over time.
Failures identified here point to specific gaps in tool-use integration that future agent architectures should address.
Similar evaluation pipelines might be adapted to other personalized physical environments such as offices or vehicles.

Load-bearing premise

The iteratively constructed household states and generated tasks sufficiently capture the complexity and personalization of real-world smart home environments for meaningful agent evaluation.

What would settle it

An experiment in which one or more agents maintain stable performance across rising task complexities with no pronounced failures in counterfactual reasoning or partial-observability tool-use scenarios would falsify the reported pattern.
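
One way to phrase this test operationally is sketched below: given per-difficulty accuracies for a model, check whether accuracy never rises from one level to the next and falls by some minimum amount overall. The function name, the `min_drop` threshold, and the example numbers are placeholders chosen for illustration, not results from the paper.

```python
from typing import Mapping, Sequence

# Difficulty levels as named in the Figure 10 caption.
LEVELS: Sequence[str] = ("easy", "medium", "hard")


def shows_systematic_drop(acc: Mapping[str, float], min_drop: float = 0.05) -> bool:
    """True if accuracy is non-increasing across levels and drops by at least min_drop overall."""
    values = [acc[level] for level in LEVELS]
    non_increasing = all(a >= b for a, b in zip(values, values[1:]))
    return non_increasing and (values[0] - values[-1]) >= min_drop


# Placeholder accuracies, not measured values.
declining = shows_systematic_drop({"easy": 0.9, "medium": 0.7, "hard": 0.5})
stable = shows_systematic_drop({"easy": 0.8, "medium": 0.8, "hard": 0.79})
```

A model for which this check returns `False` across task categories would be the kind of counterexample described above.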

Figures

Figures reproduced from arXiv: 2604.16813 by InJung Yang, Kevin Ferreira, KoKeun Kim, Manasa Bharadwaj, Nikhil Verma, Sungil Kim, Yolanda Liu, YoungJoon Kim.

**Figure 1.** Overview of the PersonalHomeBench data generation pipeline. Detailed household environments are first assembled from personas, devices, memories, and contextual descriptions, with optional video grounding. Five categories of personalized tasks are then generated from these environments, spanning reactive question answering and proactive assistance, enabling evaluation of agent behavior in personalized sm…

**Figure 2.** An agent interprets real-world household situations and triggers, using PersonalHomeTools to generate proactive plans (e.g., safety, first aid, ambience control), which are evaluated by the Role Playing Judge to produce personalized recommendations.

**Figure 3.** Statistics of generated household and persona diversity in PersonalHomeBench. (a) Age and gender distribution of household members. (b) Distribution of the number of pets per household type. (c) Distribution of video lengths (sec). (d) Word cloud of common persona attributes, illustrating diversity in interests, habits, and lifestyle characteristics.

**Figure 4.** Agentic Counterfactual Reasoning (CF) performance across models. The heatmap reports accuracy on original questions, on counterfactual cases with changed and unchanged answers.

**Figure 5.** Tool usage distributions across models for reactive (top) and proactive (bottom) tasks, grouped by low (< 5), medium (5–10), and high (> 10) numbers of calls per task instance.

**Figure 6.** Qualitative examples illustrating personalized agent behavior in PersonalHomeBench, covering reactive question answering (top) and multimodal proactive planning (bottom).

**Figure 7.** Additional statistics of generated household in PersonalHomeBench. (a) Distribution of household size and composition. (b) Distribution of appliances owned in the households. (c) Common appliances owned by the households.

**Figure 8.** Error distribution across models, tasks, and failure types. Bubble plots show the average frequency of error types across five task settings (IG, CF, MC, FR, PG). Rows correspond to models and columns to error categories, with bubble size proportional to error frequency.

**Figure 9.** Counterfactual Reasoning (CF) performance across models under settings with and without tool access. The heatmap reports accuracy for original questions, counterfactual queries where the correct answer changes, and counterfactual queries where the answer remains unchanged.

**Figure 10.** Accuracy across models for three reactive task categories, Information Grounding (IG), Counterfactual Reasoning (CF), and Multiple-Choice (MCQ), evaluated at easy, medium, and hard difficulty levels. Each point corresponds to a model–task pair; marker size reflects the average number of turns taken to complete the task.

**Figure 11.** Mean Average Precision at rank k (MAP@k) across models evaluated with and without tool access. Solid lines denote performance with tools, while dashed lines correspond to settings without tools. Results are reported for varying values of k, illustrating ranking performance trends across evaluation conditions.

**Figure 12.** Distribution of human agreement scores for Role Playing Judge evaluations. Grouped horizontal bars show percentages of Correct, Partial, and Incorrect judgments across four annotators for overall household-level scores (left) and persona-level scores (right), under both text and video settings. (Footnote 4: Two annotators (A1, A2) independently annotated the same set of 50 plans, and another two annotators (A3, A4) i…)
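
For readers unfamiliar with the MAP@k metric named in Figure 11, a minimal sketch of one common formulation follows. It assumes binary relevance (an item is either in the ground-truth set or not), no duplicate items in a ranking, and normalization by `min(k, |relevant|)`; the paper may define relevance or normalization differently, so treat this as an illustration rather than the authors' scoring code.

```python
from typing import Hashable, List, Sequence, Set


def average_precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """AP@k for one ranked list, assuming binary relevance and unique items."""
    if k <= 0 or not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    # Normalization is an assumption: divide by the best achievable hit count.
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(
    rankings: List[Sequence[Hashable]], relevants: List[Set[Hashable]], k: int
) -> float:
    """Mean of AP@k over all task instances."""
    if not rankings:
        return 0.0
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)


# Toy rankings with made-up item identifiers.
ap = average_precision_at_k(["a", "b", "c"], {"a", "c"}, k=3)
map_k = mean_average_precision_at_k([["a", "b", "c"], ["x", "y"]], [{"a", "c"}, {"z"}], k=3)
```

In the toy call, the relevant items sit at ranks 1 and 3, so the precision values 1/1 and 2/3 are averaged over two relevant items.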

Original abstract

Agentic AI systems are rapidly advancing toward real-world applications, yet their readiness in complex and personalized environments remains insufficiently characterized. To address this gap, we introduce PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. The benchmark is constructed through an iterative process that progressively builds rich household states, which are then used to generate personalized, context-dependent tasks. To support realistic agent-environment interaction, we provide PersonalHomeTools, a comprehensive toolbox enabling household information retrieval, appliance control, and situational understanding. PersonalHomeBench evaluates both reactive and proactive agentic abilities under unimodal and multimodal observations. Thorough experimentation reveals a systematic performance reduction as task complexity increases, with pronounced failures in counterfactual reasoning and under partial observability, where effective tool-based information gathering is required. These results position PersonalHomeBench as a rigorous evaluation platform for analyzing the robustness and limitations of personalized agentic reasoning and planning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PersonalHomeBench introduces an iterative synthetic benchmark for smart-home agents and reports clear performance drops on complex tasks, but the abstract leaves the methodology and numbers too thin to judge how general those drops really are.


The main thing here is a new benchmark called PersonalHomeBench that tests foundation-model agents in personalized smart homes. They build household states iteratively, generate context-dependent tasks from them, and supply PersonalHomeTools for retrieval, control, and situational understanding. The experiments show performance falling as task complexity rises, with particular trouble in counterfactual reasoning and partial-observability cases that require active tool use.

That setup is the clearest novelty: the progressive state construction and the split between reactive and proactive evaluation under different observation modes are not standard in prior agent benchmarks. The paper does a reasonable job of framing why these capabilities matter for real home assistants and of pointing out where current models fall short.

The soft spot is the level of detail. The abstract describes the outcomes at a high level but does not give the actual numbers, error breakdowns, or exact task-generation rules, so it is difficult to tell how robust the reported reductions are. The iterative synthetic construction also raises the usual question of whether the generated states and tasks over-represent certain edge cases that would be rarer in actual user homes; without some external validation or real-home comparison, the failures could be partly benchmark-specific.

This is aimed at researchers working on agentic systems and domestic HCI. Anyone building or evaluating home assistants would find the task-generation approach and the tool interface useful to look at. I would send it for peer review. The benchmark idea is concrete enough that referees can tighten the methodology and check for construction artifacts.

Referee Report

2 major / 0 minor

Summary. The paper introduces PersonalHomeBench, a benchmark for evaluating foundation models as agentic assistants in personalized smart home environments. It describes an iterative construction process that builds rich household states and generates context-dependent tasks, provides PersonalHomeTools for information retrieval and control, and evaluates reactive and proactive abilities under unimodal and multimodal observations. Experimental results indicate systematic performance reductions as task complexity increases, with notable failures in counterfactual reasoning and partial observability scenarios requiring tool-based information gathering.

Significance. If the benchmark construction is representative and the empirical findings are robustly supported, PersonalHomeBench could provide a valuable platform for diagnosing limitations in current agentic systems for personalized, real-world environments, particularly highlighting challenges in handling partial observability and counterfactual tasks.

major comments (2)

Abstract: The central claims of systematic performance reduction and pronounced failures in counterfactual reasoning and partial observability are presented only at a high level without any quantitative metrics, specific results, tables, error analysis, or baseline comparisons, leaving the empirical support for these load-bearing findings insufficiently substantiated.
Construction process (iterative household state building and task generation): The synthetic iterative process risks embedding generation artifacts that could artifactually produce the reported failures in tool-based information gathering and counterfactual reasoning; without explicit validation against real-world personalized smart home data or ablation studies on the construction steps, it is unclear whether the observed performance drops generalize beyond benchmark-specific artifacts.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and indicate the revisions we will incorporate to strengthen the manuscript.


Referee: Abstract: The central claims of systematic performance reduction and pronounced failures in counterfactual reasoning and partial observability are presented only at a high level without any quantitative metrics, specific results, tables, error analysis, or baseline comparisons, leaving the empirical support for these load-bearing findings insufficiently substantiated.

Authors: We agree that the abstract would benefit from greater specificity to substantiate the central claims. In the revised manuscript, we will expand the abstract to include key quantitative results, such as the magnitude of performance reductions across task complexity levels, failure rates in counterfactual reasoning and partial observability scenarios, and brief baseline comparisons drawn from our experimental tables.

Revision: yes

Referee: Construction process (iterative household state building and task generation): The synthetic iterative process risks embedding generation artifacts that could artifactually produce the reported failures in tool-based information gathering and counterfactual reasoning; without explicit validation against real-world personalized smart home data or ablation studies on the construction steps, it is unclear whether the observed performance drops generalize beyond benchmark-specific artifacts.

Authors: We acknowledge the concern that the synthetic iterative construction could introduce artifacts. The process is deliberately structured around realistic household dynamics and user-specific preferences to promote ecological validity. To address this directly, we will add ablation studies on the construction steps (e.g., varying iteration depth and state enrichment rules) to demonstrate that the reported performance trends are robust. We cannot provide direct validation against real-world personalized smart home datasets in this work, as no suitable public datasets exist at the required scale and granularity; we will explicitly note this limitation and outline it as future work.

Revision: partial

standing simulated objections not resolved

Direct empirical validation of the benchmark construction against real-world personalized smart home data

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and evaluation


The paper introduces PersonalHomeBench via iterative synthetic household state building and context-dependent task generation, then reports empirical agent performance results under varying complexity, observability, and modalities. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations of uniqueness theorems appear. The construction process is explicitly described as a methodological choice for creating the evaluation platform rather than a self-referential definition or reduction. All reported findings (performance drops with complexity, failures in counterfactual reasoning) are direct experimental outcomes on the generated benchmark and do not collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark introduction paper with no mathematical derivations or theoretical claims, resulting in no free parameters, axioms, or invented entities required for the central claim.

pith-pipeline@v0.9.0 · 5478 in / 997 out tokens · 35899 ms · 2026-05-15T06:37:39.214753+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: iterative process that progressively builds rich household states... PersonalHomeTools... five categories of personalized tasks

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: systematic performance reduction as task complexity increases... failures in counterfactual reasoning and under partial observability

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning
cs.AI · 2026-05 · unverdicted · novelty 5.0

TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning that improves success rates on the FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.


[1] [1]

Partnr: A benchmark for planning and reasoning in embodied multi-agent tasks.arXiv preprint arXiv:2411.00081, 2024

URL https://proceedings.mlr.press/ v267/bonatti25a.html. Chang, M., Chhablani, G., Clegg, A., Cote, M. D., Desai, R., Hlavac, M., Karashchuk, V ., Krantz, J., Mottaghi, R., Parashar, P., et al. Partnr: A benchmark for planning and reasoning in embodied multi-agent tasks.arXiv preprint arXiv:2411.00081, 2024a. Chang, M., Zhang, J., Zhu, Z., Yang, C., Yang,...

work page arXiv 2025

[2] [2]

doi: 10.18653/v1/2024.acl-long

URL https://api.semanticscholar. org/CorpusID:273375463. Maharana, A., Lee, D.-H., Tulyakov, S., Bansal, M., Barbi- eri, F., and Fang, Y . Evaluating very long-term conver- sational memory of LLM agents. In Ku, L.-W., Martins, A., and Srikumar, V . (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: L...

work page doi:10.18653/v1/2024.acl-long 2024

[3] [3]

acl-long.747/

URL https://aclanthology.org/2024. acl-long.747/. Meyer, Y . and Corneil, D. Nemotron-Personas- USA: Synthetic personas aligned to real- world distributions, June 2025. URL https: //huggingface.co/datasets/nvidia/ Nemotron-Personas-USA. Puig, X., Ra, K. K., Boben, M., Li, J., Wang, T., Fi- dler, S., and Torralba, A. Virtualhome: Simulating household activ...

work page 2024

[4] [4]

OpenAI GPT-5 System Card

URL https://openreview.net/forum? id=dHng2O0Jjr. Salemi, A., Mysore, S., Bendersky, M., and Zamani, H. LaMP: When large language models meet person- alization. In Ku, L.-W., Martins, A., and Srikumar, V . (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pp. 7370–7392, Bangkok, Thailan...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.acl-long.399 2024

[5] [5]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

URL https://aclanthology.org/2024. emnlp-industry.37/. Tang, C., Li, Y ., Yang, Y ., Zhuang, J., Sun, G., Li, W., Ma, Z., and Zhang, C. video-salmonn 2: Captioning- enhanced audio-visual large language models.arXiv preprint arXiv:2506.15220, 2025. Wang, N., Peng, Z., Que, H., Liu, J., Zhou, W., Wu, Y ., Guo, H., Gan, R., Ni, Z., Yang, J., et al. Rolellm: ...

work page doi:10.52202/079017-1650 2024

[6] [6]

arXiv:2406.12639 [cs] doi:10.48550/arXiv.2406.12639

URL https://proceedings.neurips.cc/paper_files/paper/2022/file/82ad13ec01f9fe44c01cb91814fd7b8c-Paper-Conference.pdf.

Zhang, X., Deng, Y., Ren, Z., Ng, S.-K., and Chua, T.-S. Ask-before-plan: Proactive language agents for real-world planning. ArXiv, abs/2406.12639,

[7]

URL https://api.semanticscholar.org/CorpusID:270561990.

Zhao, S., Zhu, A., Mozannar, H., Sontag, D. A., Talwalkar, A., and Chen, V. Codinggenie: A proactive llm-powered programming assistant. Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering,

[8]

URL https://api.semanticscholar.org/CorpusID:277112990.

Zhu, M. Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, 2(30):6, 2004.

A. Outline

The appendix provides supplementary material that supports and extends the...

[9]

Use the Household Persona as the Anchor
• Use the household-level description to infer: demographics, living situation, socioeconomic context, lifestyle preferences, environment and context.
• All individual personas must align with this household information.

[10]

Respect the Occupancy Type
• If occupancy_type = roommate:
  – All individuals should be unrelated.
  – Ages should fall within a similar age range, unless the household persona describes otherwise.
  – Income levels, occupations, and lifestyles may differ, but should still plausibly co-exist in a shared-living situation.
  – Personalities should not be identical—ma...

[11]

Use the Samples as Style Guidance
• Use samples only as examples of tone, structure, and level of detail.
• Do not copy content.
• Maintain similar attributes

[12]

Produce a Concise, Self-Contained Persona for Each Individual
• For each individual, provide:
  – member_id: must be {household_id}_{n} where n starts at 1
  – name: extract name from persona description
  – role: parent, child, roommate, partner, etc.
  – age (estimate if not explicitly stated)
  – gen...

[13]

Output Format
Return the result as a list of personas, one entry per individual, labeled:
"members": [
  {
    "member_id": "{household_id}_1",
    "name": "",
    "role": "",
    "age": ,
    "gender": "",
    "occupation": "",
    "persona": "",
    "hobbies": ["", ""],
    "lifestyle": "",
    "preference": "",
    "major_event": [
      { "date": "", "description": "" }
    ]
  },
  {
    "member_id": "{household_...
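
A minimal sketch, in Python, of how output in this format could be checked after generation. The function name `validate_members` and the example household ID `H001` are hypothetical; only the field names and the `{household_id}_{n}` rule (with n starting at 1) come from the prompt text. The sketch also assumes that members appear in ID order, which the prompt does not state.

```python
# Field names taken from the output format in the prompt.
REQUIRED_FIELDS = {
    "member_id", "name", "role", "age", "gender", "occupation",
    "persona", "hobbies", "lifestyle", "preference", "major_event",
}


def validate_members(household_id, members):
    """Return a list of problems found in a generated `members` list."""
    problems = []
    for position, member in enumerate(members, start=1):
        missing = REQUIRED_FIELDS - set(member)
        if missing:
            problems.append(f"member {position}: missing {sorted(missing)}")
        # Assumption: the n-th entry carries the ID {household_id}_{n}.
        expected_id = f"{household_id}_{position}"
        if member.get("member_id") != expected_id:
            problems.append(f"member {position}: expected id {expected_id!r}")
        for event in member.get("major_event", []):
            if not {"date", "description"}.issubset(event):
                problems.append(f"member {position}: incomplete major_event")
    return problems
```

An empty return value means the list passed every check; each string otherwise names one violation, which makes it easy to feed the problems back into a retry prompt.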

[14]

Take a grounded QA pair

[15]

Generate a counterfactual modification to the household/context/memory/appliance data

[16]

Provide:
a) The counterfactual condition
b) The new derived answer under this counterfactual world
c) A short explanation of the reasoning shift

Input Format
You will be given a single QA item in JSON format with the following structure:
{{
  "difficulty": "easy | medium | hard",
  "question": "<generic question using IDs>",
  "personalized_question": "<questio...

[17]

Select exactly 10 features from the input list

[18]

Assign a correct ranking from 1 (most relevant) to 10 (least relevant)

[19]

Rankings must reflect explicit evidence from:
- Household structure (members, ages, hobbies, roles, pets)
- Device information (appliance types, sensors, locations)
- Contextual data (current time, weather, schedules, states)
- Long-term memories (habits, major events, preferences)

[20]

Provide a short explanation for each feature, describing why it is placed in its specific ranking position.

RANKING REQUIREMENTS:
- The ranking must reflect actual priorities inferred from the household’s behavior and needs.
- Reasoning should show:
  - Observation (e.g., someone frequently forgets to turn off appliances)
  - Inference (e.g., a room with high...
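
The selection and ranking constraints above (exactly 10 features drawn from the input list, ranks from 1 to 10, one explanation per feature) can be checked mechanically. The following Python sketch is illustrative only: the record shape with `feature_id`, `rank`, and `explanation` keys is an assumption, since the output schema for this prompt is not visible in the extracted text.

```python
def check_feature_ranking(candidates, ranked, k=10):
    """Check that `ranked` picks exactly k distinct candidates ranked 1..k.

    `ranked` is assumed to be a list of dicts with the hypothetical keys
    "feature_id", "rank", and "explanation".
    """
    ids = [item["feature_id"] for item in ranked]
    ranks = sorted(item["rank"] for item in ranked)
    return (
        len(ranked) == k
        and len(set(ids)) == k                      # no feature chosen twice
        and set(ids) <= set(candidates)             # only features from the input list
        and ranks == list(range(1, k + 1))          # ranks 1 (most) to k (least relevant)
        and all(item.get("explanation", "").strip() for item in ranked)
    )
```

A check of this kind covers only the structural rules; whether the ranking reflects the household's actual priorities still needs a reference ranking or a human judge.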

[21]

Answer the QUESTION using only the provided FULL_CONTEXT

[22]

Do not invent facts. If the answer is not derivable, output "unknown".

## Output format (STRICT JSON)
{{
  "answer": "<one of the choice from the list which is the answer>", # you should return the letter
  "evidence": [
    {{
      "rationale": "reasoning",
      "path": "<jsonpath-like pointer, e.g. appliances.microwave.location>",
    }}
  ]
}}

## Rules
- Keep the answer short and ...
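
A hedged Python sketch of how a response in this format might be verified against the context it was given. The helper names `resolve_path` and `check_answer` are hypothetical, and the sketch assumes the model returns valid JSON (without the inline comment and trailing comma shown in the template) and that evidence paths are plain dotted keys into nested dictionaries, as in the `appliances.microwave.location` example.

```python
import json


def resolve_path(context, path):
    """Follow a dotted pointer through nested dicts; return None if absent."""
    node = context
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def check_answer(raw, choices, context):
    """Check that the answer is a listed letter or "unknown" and that every
    evidence path points at something present in the context."""
    output = json.loads(raw)
    answer_ok = output["answer"] == "unknown" or output["answer"] in choices
    evidence_ok = all(
        resolve_path(context, item["path"]) is not None
        for item in output.get("evidence", [])
    )
    return answer_ok and evidence_ok
```

Resolving each path is one way to enforce the "do not invent facts" rule automatically: an evidence pointer that does not exist in the context is a sign of a hallucinated justification.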

[23]

Sort features from most relevant to least relevant

[24]

Rankings must reflect actual household priorities, not generic recommendations

[25]

Do not introduce feature IDs that are not provided

[26]

Use only information present in the household record.

---

## OUTPUT REQUIREMENTS
Return only the following JSON object and nothing else:
```json
{{
  "ranked_features": [
    "<feature_id_1>",
    "<feature_id_2>",
    "<feature_id_3>",
    "...",
    "<feature_id_n>"
  ]
}}
```
Note:
- The first element is the most relevant feature.
- The last element is the least relevant feature.
-...
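
As an illustration of how a `ranked_features` response could be consumed, the Python sketch below strips an optional json code fence, parses the object, and rejects IDs that were not provided. The function name `parse_ranked_features` is hypothetical, and the duplicate check is an added assumption rather than a rule stated in the visible prompt text.

```python
import json

FENCE = "`" * 3  # the three-backtick code fence the prompt asks the model to emit


def parse_ranked_features(raw, provided_ids):
    """Parse a ranked_features response, most relevant feature first."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text[len(FENCE):]
        if text.startswith("json"):
            text = text[len("json"):]
        text = text.rsplit(FENCE, 1)[0]  # drop the closing fence if present
    ranked = json.loads(text)["ranked_features"]
    if len(set(ranked)) != len(ranked):
        # Assumption: a feature should appear at most once in the ranking.
        raise ValueError("duplicate feature IDs in ranking")
    unknown = set(ranked) - set(provided_ids)
    if unknown:
        raise ValueError(f"feature IDs not provided: {sorted(unknown)}")
    return ranked
```

Raising on unknown IDs mirrors the prompt's rule that no new feature IDs may be introduced; a caller could instead drop such IDs and score the remainder if a softer policy is preferred.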

[27]

A set of candidate options labeled with alphabet letters (e.g., "A", "B", "C", ...). Each letter corresponds to one LLM-generated smart-home program.

[28]

A household profile containing members, preferences, routines, constraints, and recent context.

Your task is to rank the options from:
- The perspective of the household as a whole (smart_home)
- The perspective of each household member individually

Evaluation Guidelines

Household (smart_home) Perspective
Rank options based on:
• Safety and anomaly preven...

[29]

Safety and prevention

[30]

Multi-member benefit

[31]

Lower disruption

Output Format (STRICT)
Return one JSON object only, following this schema exactly:
{
  "overall_order": [<LETTER_1>, <LETTER_2>, <LETTER_3>...],
  "persona_order": {
    "household_member_id_1": [<LETTER_1>, <LETTER_2>, <LETTER_3>...],
    "household_member_id_2": [<LETTER_1>, <LETTER_2>, <LETTER_3>...]
  },
  "rationale": "<1-3 sentence exp...
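
A small Python sketch of a structural check for this ranking output. The function name `is_valid_ranking_output` is hypothetical, and the sketch assumes that every order must rank all candidate letters exactly once and that `persona_order` has one entry per household member; the truncated schema suggests this but does not spell it out.

```python
def is_valid_ranking_output(output, letters, member_ids):
    """Check overall_order and persona_order against the candidate letters."""

    def is_permutation(order):
        # Assumption: each order lists every candidate letter exactly once.
        return sorted(order) == sorted(letters)

    persona_order = output.get("persona_order", {})
    return (
        is_permutation(output.get("overall_order", []))
        and set(persona_order) == set(member_ids)
        and all(is_permutation(order) for order in persona_order.values())
    )
```

Checking structure before scoring keeps malformed responses (missing members, repeated letters) from being silently counted as poor rankings rather than as format failures.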
