TeamLLM: A Human-Like Team-Oriented Collaboration Framework for Multi-Step Contextualized Tasks

Chanjin Zheng; Haoran Shi; Jiarui Yu; Jin Wu; Wei Xia; Xiangyu Wang

arxiv: 2604.06765 · v1 · submitted 2026-04-08 · 💻 cs.CL · cs.AI

Xiangyu Wang, Jin Wu, Haoran Shi, Wei Xia, Jiarui Yu, Chanjin Zheng

Pith reviewed 2026-05-10 18:11 UTC · model grok-4.3

keywords multi-LLM collaboration · team roles · contextualized tasks · CGPST benchmark · role division · multi-step reasoning · LLM framework · collaboration phases

The pith

Dividing large language models into four human-like team roles and coordinating them through three phases improves results on multi-step contextual tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix a gap in existing multi-LLM systems for complex tasks that unfold over many steps and depend on ongoing context. These systems often operate from a single viewpoint because they skip the kind of role division people use in teams. TeamLLM therefore defines four distinct roles and routes the work through three explicit collaboration phases. The authors also introduce the CGPST benchmark, which tests contextual grounding, procedural structure, and multi-dimensional scoring. When ten popular LLMs are evaluated on this benchmark, the team-structured version produces clearly higher scores at the overall, step, and dimension levels, and the full set of scenarios, responses, and human ratings is released for further use.

Core claim

TeamLLM adopts four team roles with distinct division and employs a three-phase multi-LLM collaboration for multi-step contextualized tasks, resulting in substantial performance improvements on the CGPST benchmark.

What carries the argument

Four human-inspired team roles together with a three-phase collaboration process that structures how multiple LLMs exchange and refine outputs.
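The role-and-phase routing can be pictured with a short sketch. This is only an illustration under stated assumptions, not the authors' implementation: the role names are the ones this review lists in its ledger, the phase names are the ones quoted from the paper further down this page, and the `LLM` callable, `Turn`, `run_step`, `stub_llm`, and the choice of which role speaks in which phase are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Role names as listed in this review's ledger; the paper's own labels may differ.
ROLES = ["Leader", "Critic", "Executor", "Recorder"]
# Phase names as quoted from the paper further down this page.
PHASES = ["task initiation", "perspective sharing", "consensus building"]

LLM = Callable[[str], str]  # hypothetical interface: prompt in, reply out


@dataclass
class Turn:
    phase: str
    role: str
    text: str


def run_step(step_prompt: str, llm: LLM) -> List[Turn]:
    """Run one task step through the three phases (illustrative routing only)."""
    transcript: List[Turn] = []

    def speak(phase: str, role: str) -> None:
        # Each speaker sees the task plus everything said so far.
        history = "\n".join(f"[{t.role}] {t.text}" for t in transcript)
        prompt = (
            f"Role: {role}\nPhase: {phase}\n"
            f"Task: {step_prompt}\nHistory:\n{history}"
        )
        transcript.append(Turn(phase, role, llm(prompt)))

    speak(PHASES[0], ROLES[0])   # assumed: one role frames the step
    for role in ROLES:           # assumed: every role contributes a view
        speak(PHASES[1], role)
    speak(PHASES[2], ROLES[0])   # assumed: one role merges the views
    return transcript


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a deterministic reply."""
    return f"reply to a {len(prompt)}-character prompt"


turns = run_step("Identify challenges in the scenario", stub_llm)
```

Swapping `stub_llm` for a real model client would leave the routing unchanged, which is the point of the sketch: the structure lives in who speaks when, not in any single prompt.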

If this is right

Multi-step tasks that require holding context across sequential steps become more tractable for LLM-based systems.
Performance gains register not only overall but also when scoring individual steps and separate assessment dimensions.
A new benchmark is now available with scenarios, full-process responses, and human scores to support standardized testing.
Structured role assignment can reduce the narrow-perspective problem that arises when LLMs collaborate without division of labor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same role-and-phase structure could be tried on tasks such as long-horizon planning or iterative design where viewpoint diversity matters.
Fixed roles might be compared against versions that allow roles to shift dynamically during a task.
The framework offers a possible base for mixing LLM teams with human participants in shared workflows.
Success on CGPST could be checked as an indicator for performance in applied settings like research assistance or project coordination.

Load-bearing premise

Assigning fixed human-inspired roles and running a three-phase process will reliably produce better outcomes than existing multi-LLM methods without introducing new coordination failures or role-specific biases.

What would settle it

Direct evaluation on the CGPST benchmark where the TeamLLM setup yields equal or lower scores than baseline multi-LLM methods that lack explicit roles and phases.

Figures

Figures reproduced from arXiv: 2604.06765 by Chanjin Zheng, Haoran Shi, Jiarui Yu, Jin Wu, Wei Xia, Xiangyu Wang.

**Figure 1.** Benchmark Comparison. … human team role division, which may lead to a single perspective and exacerbate output homogenization. This limitation may weaken performance on multi-step contextualized tasks (Xu et al., 2025; Wenger and Kenett, 2025; Lu et al., 2024a; Fukumura and Ito, 2025). Moreover, frameworks such as LLM Discussion (Lu et al., 2024a) are designed for single-step tasks and may not be directly …

**Figure 2.** TeamLLM: A Human-Like Team-Oriented Collaboration Framework.

**Figure 3.** Step-level performance of TeamLLM (right bars) and the baseline (left bars), with different colors …

**Figure 4.** Comparison of model performance in Step-3 across Flexibility Efficiency, Originality Efficiency, and …

**Figure 5.** Ablation study results. $\text{Originality\_Efficiency} = \frac{\text{Originality}}{\text{Fluency}} \in [0, 2]$ (9). These metrics normalize for solution quantity, providing a clearer measure of solution quality.

**Figure 6.** Comparison of Diversity and Flexibility in …

**Figure 7.** Comparison of Diversity and Flexibility in …

**Figure 8.** Dimension-level performance of TeamLLM (red) and baseline (blue) across all steps.

**Figure 9.** Two representative pages of the Excel scoring sheet designed for human evaluation, with some annotated …
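The Originality Efficiency ratio quoted alongside the Figure 5 caption can be made concrete with a small worked example. This is a sketch under two assumptions that the page does not confirm: that Fluency is the number of solutions produced, and that each solution carries an originality score from 0 to 2, which would explain the stated range of [0, 2]. The function name and the sample scores are hypothetical.

```python
from typing import Sequence


def originality_efficiency(item_scores: Sequence[int]) -> float:
    """Total originality divided by fluency (assumed: number of solutions)."""
    fluency = len(item_scores)
    if fluency == 0:
        raise ValueError("at least one solution is required")
    if any(s < 0 or s > 2 for s in item_scores):
        raise ValueError("assumed per-solution scale is 0 to 2")
    return sum(item_scores) / fluency


# Hypothetical ratings for four solutions, not data from the paper.
ratio = originality_efficiency([2, 1, 0, 2])
```

Under these assumptions a response with many low-scoring solutions gets a lower ratio than one with a few strong solutions, which is what normalizing for solution quantity is meant to capture.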

Original abstract

Recently, multi-Large Language Model (LLM) frameworks have been proposed to solve contextualized tasks. However, these frameworks do not explicitly emulate human team role division, which may lead to a single perspective, thereby weakening performance on multi-step contextualized tasks. To address this issue, we propose TeamLLM, a human-like Team-Oriented Multi-LLM Collaboration Framework. TeamLLM adopts four team roles with distinct division and employs a three-phase multi-LLM collaboration for multi-step contextualized tasks. To evaluate the effectiveness of TeamLLM on multi-step contextualized tasks, we propose Contextually-Grounded and Procedurally-Structured tasks (CGPST) and construct the CGPST benchmark. This benchmark has four core features: contextual grounding, procedural structure, process-oriented evaluation and multi-dimensional assessment. We evaluate ten popular LLMs on CGPST at overall-level, step-level, and dimension-level. Results show that TeamLLM substantially improves performance on CGPST. We release the benchmark with scenarios, full-process responses and human scores from ten LLMs. The code and data are available at https://anonymous.4open.science/r/TeamLLM-anonymous-C50E/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TeamLLM adds four fixed roles and a three-phase workflow plus a new CGPST benchmark, but the gains are not yet isolated from extra interaction rounds.

The key takeaway is that TeamLLM assigns four specific roles and runs a three-phase process to improve multi-LLM performance on sequential tasks, backed by the new CGPST benchmark. The results claim clear gains across ten models. What stands out as new is the precise role set and workflow, combined with a benchmark that stresses contextual grounding, procedural steps, and multi-dimensional scoring. The authors also release the full data and responses, which is straightforward and helpful for replication. The paper does a decent job of breaking down results by step and dimension rather than reporting only overall scores, which gives more insight into where the collaboration helps.

The soft spot is the lack of targeted ablations. We do not see tests that keep the multi-turn interaction but remove the fixed roles or collapse the phases, so the gains could come from extra compute or prompting rather than from the team structure itself. The abstract mentions positive results but skips details on exact baselines, error analysis, and statistical tests.

This work suits people who build or study applied multi-agent LLM systems for tasks that unfold over several steps. Someone needing a concrete starting point and a new test set will find value, even before the claims are fully locked down. It should go to peer review: the ideas are testable, and the benchmark adds something concrete that referees can engage with and build on.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes TeamLLM, a framework for multi-LLM collaboration that incorporates four human-like team roles—Leader, Critic, Executor, and Recorder—along with a three-phase collaboration process to address multi-step contextualized tasks. It introduces the CGPST benchmark, which emphasizes contextual grounding, procedural structure, process-oriented evaluation, and multi-dimensional assessment. The authors evaluate ten popular LLMs on this benchmark at overall, step, and dimension levels, reporting that TeamLLM substantially improves performance, and provide the benchmark data including scenarios, full-process responses, and human scores.

Significance. Should the central claims hold under rigorous scrutiny, this work offers a novel human-inspired structure for multi-agent LLM systems, potentially leading to better handling of complex, multi-step tasks. The public release of the benchmark, responses, and scores supports reproducibility and further research in the area. However, the current presentation of results limits the ability to fully assess the framework's advantages over existing methods.

major comments (3)

[Results and Evaluation] The claim of substantial performance improvements on CGPST lacks supporting details such as the specific single-LLM and multi-LLM baselines used, any statistical tests performed, error bars, or number of runs. Without these, the central empirical claim cannot be properly evaluated.
[CGPST Benchmark Construction] There is insufficient description of how the benchmark scenarios were chosen or constructed, including potential selection biases or how they ensure coverage of multi-step contextualized tasks.
[Ablation and Control Experiments] The manuscript does not report ablation studies, such as comparing the full TeamLLM (with fixed roles and three phases) against variants with generic multi-agent interactions or without phase structure. This is critical to establish that the specific design, rather than increased interaction or token usage, drives the observed gains.

minor comments (2)

[Abstract] The code and data link is provided as an anonymous URL, which is appropriate for blind review but should be replaced with a permanent link in the final version.
[Framework Description] The four roles are introduced without a clear table or diagram summarizing their responsibilities and interactions in the three phases.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments identify important areas where additional clarity and rigor will strengthen the presentation of TeamLLM and the CGPST benchmark. We address each major comment below and commit to incorporating the suggested improvements in a revised version.

Referee: [Results and Evaluation] The claim of substantial performance improvements on CGPST lacks supporting details such as the specific single-LLM and multi-LLM baselines used, any statistical tests performed, error bars, or number of runs. Without these, the central empirical claim cannot be properly evaluated.

Authors: We acknowledge that the current manuscript would benefit from more comprehensive reporting of the experimental details to allow full assessment of the performance claims. In the revised version, we will explicitly list the single-LLM baselines (direct prompting of each of the ten evaluated models) and multi-LLM baselines (including standard multi-agent collaboration without role specialization), report that all experiments were run five times with different random seeds, include error bars showing standard deviation, and add statistical significance testing (paired t-tests with reported p-values) between TeamLLM and the baselines to substantiate the improvements. revision: yes
Referee: [CGPST Benchmark Construction] There is insufficient description of how the benchmark scenarios were chosen or constructed, including potential selection biases or how they ensure coverage of multi-step contextualized tasks.

Authors: We agree that the benchmark construction section requires expansion for transparency. In the revision, we will add a detailed subsection describing the scenario selection process, the criteria applied to ensure coverage of multi-step contextualized tasks (e.g., varying numbers of steps, domains, and contextual dependencies), the sources used for scenario generation, and steps taken to reduce selection bias such as stratified sampling across complexity levels and independent review by multiple annotators. revision: yes
Referee: [Ablation and Control Experiments] The manuscript does not report ablation studies, such as comparing the full TeamLLM (with fixed roles and three phases) against variants with generic multi-agent interactions or without phase structure. This is critical to establish that the specific design, rather than increased interaction or token usage, drives the observed gains.

Authors: The referee is correct that ablation studies are missing from the current manuscript. We will add a new subsection with ablation experiments that compare the full TeamLLM framework against (i) a generic multi-agent baseline without fixed roles, (ii) a version that removes the three-phase structure while retaining roles, and (iii) controls that match interaction count and token budget. These results will be presented alongside the main evaluation to isolate the contribution of the human-inspired role division and phased process. revision: yes
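The paired t-test promised in the first response can be sketched with the standard library alone. This is a minimal illustration, not the authors' analysis code: the function name and the five per-run scores are hypothetical, and the sketch stops at the t statistic and degrees of freedom. A p-value would need a t-distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t(baseline: Sequence[float], treated: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equally sized samples with at least two pairs")
    # Work on per-pair differences so run-to-run variation shared by both
    # conditions cancels out.
    diffs = [t - b for b, t in zip(baseline, treated)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Hypothetical scores from five seeded runs, not results from the paper.
baseline_runs = [1, 2, 3, 4, 5]
teamllm_runs = [2, 3, 5, 5, 7]
t_stat, dof = paired_t(baseline_runs, teamllm_runs)
```

Pairing by seed matters here: with only five runs, an unpaired comparison would fold seed-level noise into the variance and could hide a consistent per-run gain.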

Circularity Check

0 steps flagged

No circularity: empirical evaluation on newly constructed benchmark

The paper introduces TeamLLM as a framework with four fixed roles and a three-phase process, then constructs the CGPST benchmark to evaluate it directly against ten LLMs. All claims reduce to reported performance numbers on this benchmark at overall, step, and dimension levels, with no equations, fitted parameters, self-referential definitions, or load-bearing self-citations that collapse the central result back to its inputs by construction. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the design choice that four fixed roles plus phased interaction will emulate useful human team behavior; no free parameters are fitted to data in the abstract, but the roles themselves are invented constructs whose effectiveness is tested empirically.

axioms (1)

domain assumption: Distinct LLM roles can be maintained across multiple interaction turns without role drift or prompt leakage.
Implicit in the three-phase collaboration design.

invented entities (1)

Four team roles (Leader, Critic, Executor, Recorder): no independent evidence
purpose: To enforce perspective diversity and procedural structure in multi-LLM interaction.
Newly defined roles introduced by the paper.

pith-pipeline@v0.9.0 · 5525 in / 1212 out tokens · 33313 ms · 2026-05-10T18:11:14.096895+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "TeamLLM adopts four team roles with distinct division and employs a three-phase multi-LLM collaboration... Inspired by Belbin’s team roles"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "three-phase collaboration framework: task initiation, perspective sharing, and consensus building"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages

[1]

Yichun Feng, Jiawei Wang, Lu Zhou, Zhen Lei, and Yixue Li

Creation-mmbench: Assessing context-aware creative intelligence in mllm. Preprint, arXiv:2503.14478. Yichun Feng, Jiawei Wang, Lu Zhou, Zhen Lei, and Yixue Li. 2025. Doctoragent-rl: A multi-agent collaborative reinforcement learning system for multi-turn clinical dialogue. Preprint, arXiv:2505.19630. Kazuma Fukumura and Takayuki Ito. 2025. Can llm- powe...

work page arXiv 2025
[2]

generate ideas and solve difficult problems

Assessing and understanding creativity in large language models. Machine Intelligence Research, 22(3):417–436. Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval,...

work page 2018
[3]

Actively collaborate with your team members, carefully consider their contributions, and work together to advance the task

work page
[4]

Clearly understand and adhere to your assigned team role and responsibilities, ensuring consistency in role-playing

work page
[5]

Follow the six-step problem solving process strictly

work page
[6]

All responses should be in Chinese, with clear language and accurate logic

work page
[7]

You are about to participate in a contextualized task

For each response, only output what you intend to say; do not include any role descriptions, labels, or guiding text. You are about to participate in a contextualized task. In this task, you will receive a scenario about future societal issues. Please follow the task steps and independently complete the task. The future scenario for this task is as fol...

work page
[8]

For each response, only output what you intend to say—do not include any explanations, labels, or guiding text

work page
[9]

Table 12: Meta Prompts for TeamLLM and Baseline Conditions

All responses should be in Chinese, with clear language and accurate logic. Table 12: Meta Prompts for TeamLLM and Baseline Conditions. Team_Role | Role_Speciality | Role_Prompt: Co-Ordinator | Team guidance, task organization, consensus integration | As the Co-Ordinator of the team, your primary responsibility is to organize and guide team collaboration, dri...

work page
[10]

Use declarative sentences

work page
[11]

might,” “could,

Use modal verbs like “might,” “could,” or “should.”

work page
[12]

Explain what the challenge is, why it is a challenge, and how it connects to the scenario

work page
[13]

xxx.” Step-2: Select an Underlying Problem From the challenges in Step-1, select the most impactful one and reformulate it into a focused core problem statement

Number each challenge, e.g., “1. xxx.” Step-2: Select an Underlying Problem From the challenges in Step-1, select the most impactful one and reformulate it into a focused core problem statement. Provide a complete description including:

work page
[14]

Challenge number: the identifier of the specific challenge from Step 1 that is being developed into the underlying problem

work page
[15]

Conditional phrase (CP): a fact or condition drawn from the future scenario, which provides the theoretical or situational basis for the problem

work page
[16]

How might we

Stem + Key Verb Phrase (KVP): the core phrasing of the underlying problem, usually starting with “How might we. . . ” or “In what ways can we. . . ”. The KVP should contain only one active verb specifying the main action or intervention to be taken, and should avoid absolute or overly broad verbs to ensure focus and feasibility

work page
[17]

in order to

Purpose: typically expressed with “in order to” or “so that”, clarifying the intended goal of the KVP

work page
[18]

Step-3: Produce Solutions Generate up to eight possible solutions based on the underlying problem

Future scenario parameters: the three parameters of time, location, and theme that situate the underlying problem within the scenario. Step-3: Produce Solutions Generate up to eight possible solutions based on the underlying problem. Each solution should:

work page
[19]

Each solution must be written as a complete sentence

work page
[20]

will” rather than “might

Use “will” rather than “might” to indicate certainty

work page
[21]

Each solution should address at least three of the following aspects: Who, What, How, Why, When, and Where

work page
[22]

Ensure alignment with the key verb phrase (KVP) and the intended purpose of the underlying problem

work page
[23]

1

Begin each solution with a number, e.g., “1. . . . ”. Step-4: Select Criteria Create five criteria to evaluate the solutions. Each criterion should:

work page
[24]

Be properly phrased: single dimension, superlatives as needed, indicate evaluation direction, phrased as a question

work page
[25]

Be relevant to the underlying problem

work page
[26]

xxx” Step-5: Apply Criteria to Top Solution Evaluate the eight solutions from Step-3 using the criteria from Step-4 in a matrix format

Numbered, e.g., “1. xxx” Step-5: Apply Criteria to Top Solution Evaluate the eight solutions from Step-3 using the criteria from Step-4 in a matrix format. Please provide the answers for this step in the following matrix (grid) format: Solution ID | Criterion 1 | Criterion 2 | Criterion 3 | Criterion 4 | Criterion 5 | Total Score 1 | 5 | 7 | 6 | 4 | 8 |...

work page
[27]

For each criterion, all solutions must be scored

work page
[28]

Scores for each criterion should range from 1 to x, where 1 represents the worst-performing solution and x represents the best

work page
[29]

No two solutions may receive the same score under the same criterion; i.e., each column must be a unique permutation of 1 to x

work page
[30]

healthy ocean

Provide both the full scoring matrix and the ID and content of the highest-scoring solution. Step-6: Develop an Action Plan Develop the top solution from Step-5 into an actionable plan. Develop the highest-scoring solution selected in Step-5 into a comprehensive action plan. The plan should systematically and thoroughly explain how the underlying pro...

work page 2052
[31]

water samples still show an alarming amount of plastic particles

The concentration of microplastics may have exceeded the density of plankton by tenfold, disrupting the energy input of the base food chain. This challenge arises from the warning in the scenario that "water samples still show an alarming amount of plastic particles." Category Elaboration Originality Environment 1 0 Environment 1 0

work page
[32]

after experimenting with several collection methods,

Subsurface robotic collectors may miss low-velocity eddy zones, creating data gaps and masking local ecological collapse points. This is directly related to the scenario’s mention that "after experimenting with several collection methods," weekly weighing is still required, implying sampling limitations. Category Elaboration Originality Technology 1 1 Tec...

work page
[33]

eliminating the need to return to shore for disposal

The plastic-to-fuel conversion system may emit nanoscale black carbon particles, which could exacerbate imbalances in ocean surface heat absorption. This challenge stems from the scenario emphasizing "eliminating the need to return to shore for disposal" without addressing potential secondary emissions. Category Elaboration Originality Technology 1 0 Tech...

work page
[34]

can dissolve pollutants

Endangered species in the northwest islands may face unknown toxicological effects from ingesting micro-fragments of plastics broken down by lasers. The scenario mentions that laser technology "can dissolve pollutants" but does not evaluate the byproducts of fragmentation. Category Elaboration Originality Environment 1 0 Environment 1 1

work page
[35]

altered to reduce impact

Adjustments to eco-tourism routes around floating laboratories may transfer visitor pressure to other more fragile reefs. This challenge is directly related to the scenario’s note that routes were "altered to reduce impact" but without ensuring overall pressure reduction. Category Elaboration Originality Recreation 1 0 Recreation 1 1

work page
[36]

environmental regulations have historically been weak or disregarded

Legal exemptions for manufacturers on both sides of the Pacific may cause Hawaiian regional governance to operate in isolation. This challenge is closely related to the scenario’s statement that "environmental regulations have historically been weak or disregarded." Category Elaboration Originality Law & Justice 1 0 Law & Justice 1 1

work page
[37]

harvesting tons of plastic

The efficiency of the plastic-to-diesel system may suddenly decline due to sea spray corrosion, forcing laboratories to rely on land-based resupply. This challenge is implied in the scenario mentioning "harvesting tons of plastic" without considering long-term durability. Category Elaboration Originality Technology 1 0 Technology 1 1

work page
[38]

dividing responsibilities among agencies

Data protocols among the network of floating labs may be incompatible, hindering multinational collaboration in compiling a comprehensive microplastic hotspot map. This challenge is directly related to the scenario emphasizing "dividing responsibilities among agencies" without a unified standard. Category Elaboration Originality Communication 1 2 Communic...

work page 2035
[39]

The Ola Kai project chemistry team will deploy glycosylated nanosponges, dispersing 2 tons within a 20 km radius of the Ola Kai mooring point by August 2035. Subsurface robots will recover the flocs and recycle them through the onboard plastic-to-diesel system, directly reducing microplastic ingestion by plankton, lowering the proportion of plastics at th...

work page 2035
[40]

photoacoustic unmanned vessel + AR snorkeling goggles

Google X Lab and Hawaiian community divers will run a crowdsourced "photoacoustic unmanned vessel + AR snorkeling goggles" collection program across the northwest islands by December 2035. Unmanned vessels will map microplastic clouds in real-time using laser sonar, while AR glasses guide divers to precise retrieval points, clearing high-density fragments...

work page 2035
[41]

bubble curtain + photocatalytic net

NOAA and Hawaiian Electric will pilot a 2-nautical-mile-diameter "bubble curtain + photocatalytic net" system north of Kaua’i by October 2035. Wave-driven pumps will concentrate microplastics, which are then broken down by photocatalytic nets into short-chain acids absorbable by phytoplankton, reducing microplastic dominance on plankton and restoring base...

work page 2035
[42]

SpaceX and a local high school team will launch the CubeSat constellation "KiloEye" by July 2035. Weekly scans of the 137 Hawaiian Islands will use hyperspectral data to direct Ola Kai drones for targeted microplastic removal, lowering the risk of plankton mis-ingestion at the source and ensuring Pacific ecosystem energy flow is rebalanced. Category Elabo...

work page 2035
[43]

biopolymer-coated kelp ropes

Japan’s SpiraNova and the University of Hawai’i will plant 300 "biopolymer-coated kelp ropes" off the west coast of the Big Island in Q3 2035. Kelp leaves will adsorb microplastics, and harvested ropes will be processed into high-value composites, directly removing plastics at the base of the food chain and generating revenue while protecting endangered s...

work page 2035
[44]

Plastic Sentinel

The Ola Kai project biology team will release living blue-green algae “Plastic Sentinel” strains in a 500-hectare demonstration area off the northwest reef of the main Hawaiian Island by November

work page
[45]

Category Elaboration Originality Technology 2 2 Technology 2 1

These algae continuously secrete degrading enzymes to break down 0.1–1 mm microplastics, reducing plankton ingestion, restoring baseline energy input, and protecting the Pacific ecosystem. Category Elaboration Originality Technology 2 2 Technology 2 1

work page
[46]

body-mounted

MantaSync and the University of Hawai’i will deploy five “body-mounted” manta ray filtration units along the Maui–Ni’ihau route by September 2035. These units capture microplastics in real-time during swimming and ferment them into manta ray body oils. Due to their large feeding area, they significantly dilute plastics at the base of the food chain, reduc...

work page 2035
[47]

container-scale algae farm

A local cruise company and the state government will retrofit the first ferry deck in Honolulu Harbor into a “container-scale algae farm” by October 2035. Chlorella algae will consume microplastics from tourist wastewater and convert them into aviation-grade biodiesel, reducing overall plastic input from tourism, lowering plankton ingestion, and maintaini...

work page 2035
[48]

Which approach can most rapidly reduce the net microplastic content inside plankton? Correctly Written Relevance 1 2 1 2

work page
[49]

Which pathway requires the lowest one-time capital investment? Correctly Written Relevance 1 0 1 1

work page
[50]

Which model has the highest potential for rapid global replication and scaling? Correctly Written Relevance 1 1 1 1

work page
[51]

Which technology poses the least risk of secondary ecological disturbance to endangered species? Correctly Written Relevance 1 1 1 2

work page
[52]

Which solution maintains the lowest full-life-cycle carbon footprint over 15 years? Correctly Written Relevance 1 2 1 2 Step-4 Score: H01: {Correctly Written: 5, Relevance: 6, Total Score: 11} H02: {Correctly Written: 5, Relevance: 8, Total Score: 13} NOTE:

work page
[53]

correctly written

All criteria satisfy the requirements of single dimension, superlatives as needed, clear indication of evaluation direction, and phrasing as questions, and thus are all "correctly written"

work page
[54]

The rubrics for Relevance are defined as follows: 0 points for irrelevance or repetition; 1 point for being vague and unspecific; 2 points for being relatively specific yet improvable; 3 points for being clear, specific and highly relevant

work page
[55]

Table 21: Step-4 (Select Criteria) of A05_FS10

The rubrics and detailed explanations for the two dimensions are available in the scoring sheet. Table 21: Step-4 (Select Criteria) of A05_FS10. Responses Score Item H01 H02 Solution ID | Criterion 1 | Criterion 2 | Criterion 3 | Criterion 4 | Criterion 5 | Total Score 1 | 8 | 6 | 7 | 7 | 6 | 34 2 | 6 | 8 | 5 | 6 | 7 | 32 3 | 7 | 3 | 6 | 3 | 5 | 24 4 | 5 |...

Problem Closure
Nanosponges use a "molecular magnet" mechanism to selectively adsorb 0.1–1 mm microplastics, aggregating them into millimeter-scale flocs that cannot be ingested by plankton. Recovered flocs are immediately converted into diesel, achieving a “collect-convert-use” zero-waste cycle, directly reducing the overwhelming proportion of microplast...

Implementation Steps and Timeline
• Phase A – R&D and Validation (Now–Oct 2025): Ola Kai Chemistry Team × MIT Materials Department iterate the third-generation biodegradable nanosponges and complete biotoxicity-degradation tests.
• Phase B – Pilot Demonstration (Nov 2025–Apr 2026): Deploy 100 kg in South Bay, Oahu; 30-day monitoring shows ≥70% reduction of ...

Resources and Responsibilities
• Funding: Ola Kai Research $300k + NOAA Innovation Fund $400k + State Green Bonds $300k + Carbon Credit Pre-sale; total ≤ $1M.
• Team: Chemistry team handles materials, MIT provides R&D, NOAA provides monitoring platform, State Environmental Department supervises approvals.

Risks and Contingency
• Nanomaterial leakage: Three passive samplers monitor in real-time; >10 µg/L triggers magnetic recovery nets.
• Robot malfunction: 1:1 spare parts + 48-hour offshore repair; if failure rate >15%, NOAA backup ROVs are deployed.
• Regulatory delays: Suspension during typhoon season; stock maintained at 1.5× safety level.

Impacts and Scaling
• Local: By 2028, microplastic content in plankton decreases by 80%, coral spawning rates increase by 30%.
• Regional: By 2030, open “Nanosponges Sharing Depot” allows replication in Guam, Palau, Tuvalu.
• Global: By 2032, included in IMO Green Shipping Guidelines; long-haul fleets can treat plastics in-transit, establishing a Pacific-...


• Actively collaborate with your team members, carefully consider their contributions, and work together to advance the task.
• Clearly understand and adhere to your assigned team role and responsibilities, ensuring consistency in role-playing.
• Follow the six-step problem solving process strictly.
• All responses should be in Chinese, with clear language and accurate logic.
• For each response, only output what you intend to say; do not include any role descriptions, labels, or guiding text.

You are about to participate in a contextualized task. In this task, you will receive a scenario about future societal issues. Please follow the task steps and independently complete the task. The future scenario for this task is as fol...
• For each response, only output what you intend to say; do not include any explanations, labels, or guiding text.
• All responses should be in Chinese, with clear language and accurate logic.

Table 12: Meta Prompts for TeamLLM and Baseline Conditions.

Team_Role | Role_Speciality | Role_Prompt
Co-Ordinator | Team guidance, task organization, consensus integration | As the Co-Ordinator of the team, your primary responsibility is to organize and guide team collaboration, dri...

• Use declarative sentences.
• Use modal verbs like “might,” “could,” or “should.”
• Explain what the challenge is, why it is a challenge, and how it connects to the scenario.
• Number each challenge, e.g., “1. xxx.”

Step-2: Select an Underlying Problem
From the challenges in Step-1, select the most impactful one and reformulate it into a focused core problem statement. Provide a complete description including:
• Challenge number: the identifier of the specific challenge from Step 1 that is being developed into the underlying problem.
• Conditional phrase (CP): a fact or condition drawn from the future scenario, which provides the theoretical or situational basis for the problem.
• Stem + Key Verb Phrase (KVP): the core phrasing of the underlying problem, usually starting with “How might we...” or “In what ways can we...”. The KVP should contain only one active verb specifying the main action or intervention to be taken, and should avoid absolute or overly broad verbs to ensure focus and feasibility.
• Purpose: typically expressed with “in order to” or “so that”, clarifying the intended goal of the KVP.
• Future scenario parameters: the three parameters of time, location, and theme that situate the underlying problem within the scenario.

Step-3: Produce Solutions
Generate up to eight possible solutions based on the underlying problem. Each solution should:
• Each solution must be written as a complete sentence.
• Use “will” rather than “might” to indicate certainty.
• Each solution should address at least three of the following aspects: Who, What, How, Why, When, and Where.
• Ensure alignment with the key verb phrase (KVP) and the intended purpose of the underlying problem.
• Begin each solution with a number, e.g., “1. ...”.

Step-4: Select Criteria
Create five criteria to evaluate the solutions. Each criterion should:
• Be properly phrased: single dimension, superlatives as needed, indicate evaluation direction, phrased as a question.
• Be relevant to the underlying problem.
• Numbered, e.g., “1. xxx”

Step-5: Apply Criteria to Top Solution
Evaluate the eight solutions from Step-3 using the criteria from Step-4 in a matrix format. Please provide the answers for this step in the following matrix (grid) format:
Solution ID | Criterion 1 | Criterion 2 | Criterion 3 | Criterion 4 | Criterion 5 | Total Score
1 | 5 | 7 | 6 | 4 | 8 |...
• For each criterion, all solutions must be scored.
• Scores for each criterion should range from 1 to x, where 1 represents the worst-performing solution and x represents the best.
• No two solutions may receive the same score under the same criterion; i.e., each column must be a unique permutation of 1 to x.
• Provide both the full scoring matrix and the ID and content of the highest-scoring solution.

Step-6: Develop an Action Plan
Develop the top solution from Step-5 into an actionable plan. Develop the highest-scoring solution selected in Step-5 into a comprehensive action plan. The plan should systematically and thoroughly explain how the underlying pro...
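The Step-5 scoring rules (every solution scored on every criterion, each criterion column a permutation of 1 to x, and the highest total selected) can be checked mechanically. The following is a minimal sketch under stated assumptions: the function names `validate_matrix` and `top_solution`, the dictionary layout, and the example numbers are all hypothetical and not from the paper. The prompt does not say how ties in Total Score are broken, so the sketch simply returns the first maximum.

```python
def validate_matrix(matrix):
    """Check the Step-5 rules on {solution_id: [score per criterion]}.

    Returns a list of human-readable problems; an empty list means valid.
    """
    problems = []
    x = len(matrix)  # number of solutions, so each column must use 1..x once
    widths = {len(scores) for scores in matrix.values()}
    if len(widths) != 1:
        return ["every solution must be scored on every criterion"]
    for c in range(widths.pop()):
        column = sorted(scores[c] for scores in matrix.values())
        if column != list(range(1, x + 1)):
            problems.append(f"criterion {c + 1} is not a permutation of 1..{x}")
    return problems


def top_solution(matrix):
    """Return (solution_id, total) for the highest total score.

    Tie-breaking is an assumption: the first solution with the maximum wins.
    """
    totals = {sid: sum(scores) for sid, scores in matrix.items()}
    best = max(totals, key=totals.get)
    return best, totals[best]


# Hypothetical 3-solution, 5-criterion matrix (illustrative numbers only).
example = {
    1: [3, 1, 2, 3, 2],
    2: [1, 3, 3, 2, 1],
    3: [2, 2, 1, 1, 3],
}

print(validate_matrix(example))
print(top_solution(example))
```

A matrix in which two solutions share a score under one criterion, such as `{1: [1, 1], 2: [2, 1]}`, is reported as invalid for that criterion.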

The concentration of microplastics may have exceeded the density of plankton by tenfold, disrupting the energy input of the base food chain. This challenge arises from the warning in the scenario that "water samples still show an alarming amount of plastic particles."
H01: {Category: Environment, Elaboration: 1, Originality: 0}; H02: {Category: Environment, Elaboration: 1, Originality: 0}

Subsurface robotic collectors may miss low-velocity eddy zones, creating data gaps and masking local ecological collapse points. This is directly related to the scenario’s mention that "after experimenting with several collection methods," weekly weighing is still required, implying sampling limitations.
H01: {Category: Technology, Elaboration: 1, Originality: 1}; H02: Tec...

The plastic-to-fuel conversion system may emit nanoscale black carbon particles, which could exacerbate imbalances in ocean surface heat absorption. This challenge stems from the scenario emphasizing "eliminating the need to return to shore for disposal" without addressing potential secondary emissions.
H01: {Category: Technology, Elaboration: 1, Originality: 0}; H02: Tech...

Endangered species in the northwest islands may face unknown toxicological effects from ingesting micro-fragments of plastics broken down by lasers. The scenario mentions that laser technology "can dissolve pollutants" but does not evaluate the byproducts of fragmentation.
H01: {Category: Environment, Elaboration: 1, Originality: 0}; H02: {Category: Environment, Elaboration: 1, Originality: 1}

Adjustments to eco-tourism routes around floating laboratories may transfer visitor pressure to other more fragile reefs. This challenge is directly related to the scenario’s note that routes were "altered to reduce impact" but without ensuring overall pressure reduction.
H01: {Category: Recreation, Elaboration: 1, Originality: 0}; H02: {Category: Recreation, Elaboration: 1, Originality: 1}

Legal exemptions for manufacturers on both sides of the Pacific may cause Hawaiian regional governance to operate in isolation. This challenge is closely related to the scenario’s statement that "environmental regulations have historically been weak or disregarded."
H01: {Category: Law & Justice, Elaboration: 1, Originality: 0}; H02: {Category: Law & Justice, Elaboration: 1, Originality: 1}

The efficiency of the plastic-to-diesel system may suddenly decline due to sea spray corrosion, forcing laboratories to rely on land-based resupply. This challenge is implied in the scenario mentioning "harvesting tons of plastic" without considering long-term durability.
H01: {Category: Technology, Elaboration: 1, Originality: 0}; H02: {Category: Technology, Elaboration: 1, Originality: 1}

Data protocols among the network of floating labs may be incompatible, hindering multinational collaboration in compiling a comprehensive microplastic hotspot map. This challenge is directly related to the scenario emphasizing "dividing responsibilities among agencies" without a unified standard.
H01: {Category: Communication, Elaboration: 1, Originality: 2}; H02: Communic...

The Ola Kai project chemistry team will deploy glycosylated nanosponges, dispersing 2 tons within a 20 km radius of the Ola Kai mooring point by August 2035. Subsurface robots will recover the flocs and recycle them through the onboard plastic-to-diesel system, directly reducing microplastic ingestion by plankton, lowering the proportion of plastics at th...

Google X Lab and Hawaiian community divers will run a crowdsourced "photoacoustic unmanned vessel + AR snorkeling goggles" collection program across the northwest islands by December 2035. Unmanned vessels will map microplastic clouds in real-time using laser sonar, while AR glasses guide divers to precise retrieval points, clearing high-density fragments...

NOAA and Hawaiian Electric will pilot a 2-nautical-mile-diameter "bubble curtain + photocatalytic net" system north of Kaua’i by October 2035. Wave-driven pumps will concentrate microplastics, which are then broken down by photocatalytic nets into short-chain acids absorbable by phytoplankton, reducing microplastic dominance on plankton and restoring base...

SpaceX and a local high school team will launch the CubeSat constellation "KiloEye" by July 2035. Weekly scans of the 137 Hawaiian Islands will use hyperspectral data to direct Ola Kai drones for targeted microplastic removal, lowering the risk of plankton mis-ingestion at the source and ensuring Pacific ecosystem energy flow is rebalanced. Category Elabo...

Japan’s SpiraNova and the University of Hawai’i will plant 300 "biopolymer-coated kelp ropes" off the west coast of the Big Island in Q3 2035. Kelp leaves will adsorb microplastics, and harvested ropes will be processed into high-value composites, directly removing plastics at the base of the food chain and generating revenue while protecting endangered s...

The Ola Kai project biology team will release living blue-green algae “Plastic Sentinel” strains in a 500-hectare demonstration area off the northwest reef of the main Hawaiian Island by November
These algae continuously secrete degrading enzymes to break down 0.1–1 mm microplastics, reducing plankton ingestion, restoring baseline energy input, and protecting the Pacific ecosystem.
H01: {Category: Technology, Elaboration: 2, Originality: 2}; H02: {Category: Technology, Elaboration: 2, Originality: 1}

MantaSync and the University of Hawai’i will deploy five “body-mounted” manta ray filtration units along the Maui–Ni’ihau route by September 2035. These units capture microplastics in real-time during swimming and ferment them into manta ray body oils. Due to their large feeding area, they significantly dilute plastics at the base of the food chain, reduc...

A local cruise company and the state government will retrofit the first ferry deck in Honolulu Harbor into a “container-scale algae farm” by October 2035. Chlorella algae will consume microplastics from tourist wastewater and convert them into aviation-grade biodiesel, reducing overall plastic input from tourism, lowering plankton ingestion, and maintaini...
