
arxiv: 2606.18728 · v1 · pith:3Y7EOFYEnew · submitted 2026-06-17 · 💻 cs.CL

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

Pith reviewed 2026-06-26 20:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords: legal agents · life-cycle simulation · Chinese civil litigation · agent evaluation · LongJud-Bench · causal state chain · procedural faithfulness

The pith

LegalWorld builds a single simulator that carries one Chinese civil dispute through five causally connected litigation stages without resetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing legal benchmarks test isolated tasks and restart each scenario independently, so they miss how early drafting choices shape later trial outcomes. LegalWorld instead constructs an interactive environment that models the full life cycle of Chinese civil litigation as one causally linked chain of five stages, grounded directly in 75,309 paired real judgments. Reusable memory and tool infrastructure keep the same dispute consistent across stages. Human ratings from legal experts confirm the generated trajectories stay procedurally faithful and role-consistent. Stage-by-stage model comparisons on the new LongJud-Bench expose performance gaps that overall scores conceal.

Core claim

LegalWorld is a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios) grounded in 75,309 paired Chinese civil judgments, equipped with local memory, global case memory, and a Skill/Tool library that maintains consistency across the entire dispute, together with LongJud-Bench that evaluates agents across all connected stages.

What carries the argument

The causally connected state chain of five stages together with local memory, global case memory, and Skill/Tool library that enforce role and procedural consistency across the full life cycle.
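
As a rough illustration of what such a chain could look like, the sketch below runs the seven sub-scenarios in sequence against one shared global case memory, while each sub-scenario gets a fresh local memory. The scenario codes (LC, CD, DD, FIT, AD, AR, SIT) appear in the paper's prompt figures; the ordering, the class names, and the `run_scenario` callback are assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

# Sub-scenario codes as they appear in the paper's prompt figures.
# The ordering is an assumption made for this sketch.
SCENARIOS = ["LC", "CD", "DD", "FIT", "AD", "AR", "SIT"]


@dataclass
class GlobalCaseMemory:
    """Case state that persists across the whole dispute (hypothetical structure)."""
    facts: dict[str, str] = field(default_factory=dict)
    history: list[str] = field(default_factory=list)


@dataclass
class LocalMemory:
    """Scratch state that lives only inside one sub-scenario (hypothetical)."""
    turns: list[str] = field(default_factory=list)


# A runner reads the global memory, may write local turns,
# and returns the new facts to commit to the global memory.
Runner = Callable[[str, GlobalCaseMemory, LocalMemory], dict[str, str]]


def run_chain(run_scenario: Runner) -> GlobalCaseMemory:
    """Run every sub-scenario in order without resetting the case state."""
    case = GlobalCaseMemory()
    for code in SCENARIOS:
        local = LocalMemory()              # fresh for each sub-scenario
        new_facts = run_scenario(code, case, local)
        case.facts.update(new_facts)       # later scenarios see earlier outputs
        case.history.append(code)
    return case


def toy_runner(code: str, case: GlobalCaseMemory, local: LocalMemory) -> dict[str, str]:
    """Toy stand-in for agent interaction: records how much came before."""
    local.turns.append(f"{code} sees {sorted(case.facts)}")
    return {code: f"output after {len(case.history)} prior scenarios"}


final = run_chain(toy_runner)
print(final.history)
print(final.facts["SIT"])
```

The point of the design is that nothing is reinitialized between scenarios: whatever an earlier scenario commits to the global memory is what a later scenario has to work with.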

If this is right

  • Trajectories remain procedurally faithful and role-consistent according to 18,992 ratings from 217 legal-background evaluators.
  • Cross-model evaluations on LongJud-Bench reveal sharp capability divergences that aggregate scores do not capture.
  • No single model backbone leads across consultation, drafting, and courtroom advocacy.
  • Agents can be tested on how decisions made in early stages constrain outcomes in later stages.
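
To make the point about aggregate scores concrete, the sketch below compares two hypothetical backbones. All numbers are invented for illustration and are not results from the paper; the model names and the three capability labels are placeholders.

```python
from statistics import mean

# Made-up scores for two hypothetical backbones; NOT results from the paper.
scores = {
    "model_a": {"consultation": 9.0, "drafting": 6.0, "advocacy": 7.5},
    "model_b": {"consultation": 6.5, "drafting": 9.0, "advocacy": 7.0},
}

# Aggregate view: one number per model.
aggregate = {name: mean(per_stage.values()) for name, per_stage in scores.items()}

# Stage-level view: which model leads on each capability.
stages = list(next(iter(scores.values())).keys())
leader_by_stage = {
    stage: max(scores, key=lambda name: scores[name][stage]) for stage in stages
}

print(aggregate)
print(leader_by_stage)
```

In this toy case both models have the same average, yet each leads on a different capability, which is the kind of divergence a single aggregate score hides.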

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents trained inside the chained environment may develop better long-term case strategy than those trained on isolated tasks.
  • The same life-cycle approach could be adapted to evaluate agents in other sequential professional domains such as medical diagnosis chains.
  • Benchmarks that report only aggregate scores will continue to mask the need for stage-specific capabilities.

Load-bearing premise

That real litigation's five stages can be represented as one unbroken causal chain grounded in the judgments without simulation artifacts or omitted complexities that would break the claimed consistency.

What would settle it

Generated trajectories that produce outcomes or procedural steps inconsistent with the actual paired judgments used to ground the five-stage chain.

Figures

Figures reproduced from arXiv: 2606.18728 by Guanying Li, Shengbin Yue, Songhan Zuo, Tao Chiang, Xuanjing Huang, Yun Song, Zhongyu Wei.

Figure 1
Example from LegalWorld. The figure traces a civil dispute from legal consultation to the first-instance civil trial, showing scene-level communication content and the memory flow through which case information is recorded, updated, and carried forward. …collection of independent tasks. A dispute unfolds from initial consultation through document drafting, first-instance trial, appeal, and second-instance…
Figure 2
Overview of LegalWorld. The figure shows the participating client, lawyer, and judge agents, the five-stage life-cycle state chain, in-scenario local memory, global case memory, and Skill/Tool support. Source collection. We collect public civil first-instance and second-instance judgment documents from China Judgments Online (wenshu.court.gov.cn), retaining the judgment text and case number of each docum…
Figure 3
Data foundation for LegalWorld environment construction. (A) Court-level distribution of the 75,309 second-instance judgments used to construct runnable civil-litigation case trajectories. Most cases are decided at the intermediate court level, consistent with the structure of Chinese civil appellate jurisdiction. (B) Top-category cause-of-action distribution across all 75,309 paired cases. The distributi…
Figure 4
Human minus Claude-Sonnet-4.6 LLM-as-Judge score differences across aligned metric-level pairs. Positive values indicate higher human scores; mean difference is +0.67, σ = 0.98, and 64.4% fall within one point (|∆| ≤ 1.0). …scale across procedural compliance and process coherence, covering Civil Procedure Law alignment, procedural-step integrity, information transfer, turn-taking, role boundaries, and prof…
Figure 5
Human rating reason analysis from the 18,992 free-text justifications. Top: the overall score distribution, in which 73% of ratings are ≥ 9 and only 4.5% are ≤ 6. Bottom: selected informative reason themes summarized separately for the high-score band (≥ 9) and the rare low-score band (≤ 6); each bar reports a theme's share within its own score band after assigning each justification one quality theme and omitting t…
Figure 6
Capability heatmap of the six backbones (con…
Figure 7
An anonymized lawyer role-memory excerpt (…
Figure 8
An anonymized lawyer role-memory excerpt (…
Figure 9
An anonymized client role-memory excerpt (…
Figure 10
An anonymized client role-memory excerpt (…
Figure 11
Human evaluation interface. Evaluators inspect a complete case trajectory by stage and assign structured…
Figure 12
Prompt of Client Persona (Chinese Version). Prompt of Client Persona (English Version): You are playing a real legal party. Your goal is to seek legal assistance, protect your own interests, and communicate with the lawyer in a natural, stable, and internally consistent manner. You must always follow the given case facts, must not invent key facts, and must not suddenly change your personality or behavioral…
Figure 13
Prompt of Client Persona (English Version). Prompt of Lawyer in Consultation and Drafting (Chinese Version): LC. <Role>Legal advisor</Role> <Task>Answer the party's legal questions and fill in missing facts by asking questions.</Task> <Rules> 1. Clarify first in the opening turn: you do not know the case background, evidence, claims, or relationships between the people involved; ask about the single most critical item first and do not jump to conclusions. 2. Stay focused: respond to the current question directly and concisely; if more information is needed, ask only one key question at a time. 3. Ending control: only the party may decide to end the legal consultation scenario. Do not draft documents in this scenario; if the party needs a document drafted, guide them to end the consultation.</Rules> CD. <Role>Plaintiff's attorney</Role> <Task>Collect information from the client through dialogue and use tools to draft the Civil Complaint in a timely manner.</Task> [Opening] Open naturally by continuing the previous topic…
Figure 14
Prompt of Lawyer in Consultation and Drafting (…
Figure 15
Prompt of Lawyer in Consultation and Drafting (…
Figure 16
Prompt of Lawyer in Trial (Chinese Version). Prompt of Lawyer in Trial (English Version): FIT plaintiff side. <Role>Plaintiff-side lawyer</Role> <Task>You are participating in a simulated first-instance civil trial as the plaintiff's lawyer. Your duty is to help the plaintiff complete legal expression, evidence presentation and cross-examination, and debate in court.</Task> <Rules> 1. Follow the presiding j…
Figure 17
Prompt of Lawyer in Trial (English Version)
Figure 18
Prompt of Judge in Trial (Chinese Version). Prompt of Judge in Trial (English Version): FIT judge. <Role>Presiding judge in a first-instance civil trial</Role> Cause of action: case_cause Case number: case_number Procedure: ordinary procedure <Rules> 1. Maintain neutrality. Participants include both parties and their lawyers. During opening and mediation, mainly question the parties themselves; during court…
Figure 19
Prompt of Judge in Trial (English Version). Prompt of LongJud-Bench LLM-as-Judge Scoring (Chinese Version): Stage-specific system prompt. LC: You are the evaluation judge for the legal consultation stage. A reference answer is given, but do not mechanically match keywords; output JSON only. CD: You are the evaluation judge for civil complaints. A reference answer is given, but do not mechanically match keywords; output JSON only. DD: You are the evaluation judge for statements of defense. A reference answer is given, but do not mechanically match keywords; output JSON only. AD: You are the evaluation judge for appeal petitions. A reference answer is given, but do not mechanically match keywords; output JSON only. AR: You are the evaluation judge for responses to appeal. A reference answer is given, but do not mechanically match keywords; output JSON only. FIT: You are the evaluation judge for first-instance trial hearings. A reference answer is given, but do not mechanically match keywords; output JSON only. SIT: You are the evaluation judge for second-instance trial hearings. A reference…
Figure 20
Prompt of LongJud-Bench LLM-as-Judge Scoring (…
Figure 21
Prompt of LongJud-Bench LLM-as-Judge Scoring (…
Figure 22
Prompt of LC Full-Dialog Benchmark Scorer (…
Figure 23
Prompt of LC Full-Dialog Benchmark Scorer (…
Figure 24
Prompt of Persona-Validation LLM-as-Judge (…
Figure 25
Prompt of Persona-Validation LLM-as-Judge (…
Figure 26
Prompt of Experimental LLM-as-Judge Evaluation (…
Figure 27
Prompt of Experimental LLM-as-Judge Evaluation (…
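
The Figure 4 caption reports three agreement statistics between human and LLM-as-Judge scores: a mean difference, a standard deviation, and the share of pairs within one point. A minimal sketch of how such statistics could be computed is given below. The function name and the sample scores are hypothetical, and the use of the population standard deviation is an assumption, since the caption does not say which estimator the authors used.

```python
from statistics import mean, pstdev


def agreement_stats(human, llm, tolerance=1.0):
    """Summarize human-minus-LLM score differences over aligned pairs."""
    if len(human) != len(llm):
        raise ValueError("score lists must be aligned and of equal length")
    diffs = [h - m for h, m in zip(human, llm)]
    within = sum(abs(d) <= tolerance for d in diffs) / len(diffs)
    return {
        "mean_diff": mean(diffs),   # positive means humans scored higher
        "std": pstdev(diffs),       # population std; an assumption of this sketch
        "within_tol": within,       # share of pairs with |diff| <= tolerance
    }


# Made-up aligned scores, for illustration only.
human = [9.0, 8.0, 10.0, 7.0]
llm = [8.0, 8.5, 8.0, 7.0]
stats = agreement_stats(human, llm)
print(stats)
```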
Original abstract

Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. 18,992 ratings from 217 legal-background evaluators confirm that LegalWorld trajectories are procedurally faithful and role-consistent; and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LegalWorld, an interactive environment modeling Chinese civil litigation as a causally connected five-stage (seven sub-scenario) state chain grounded directly in 75,309 paired judgments. It supplies reusable infrastructure (local/global memory, Skill/Tool library) to maintain consistency across the full life cycle and pairs the environment with LongJud-Bench for cross-stage agent evaluation. Human ratings (18,992 from 217 legal-background evaluators) are reported to confirm procedural faithfulness and role consistency, while cross-model results show capability divergences not captured by aggregate scores.

Significance. If the claimed causal chain and faithfulness hold, the work would fill a clear gap between isolated legal subtasks and realistic life-cycle dependencies, offering a reusable testbed that exposes model strengths/weaknesses across consultation, drafting, and advocacy. The scale of the judgment corpus and the public release of resources are explicit strengths.

major comments (2)
  1. [Abstract / environment construction] Abstract and environment-construction description: the central claim that the five-stage chain is 'grounded in' and 'directly' derived from the 75,309 paired judgments provides no account of the extraction procedure, state-transition rules, or enforcement mechanism for causal dependencies. Without this, it is impossible to assess whether real-world constraints (e.g., evolving admissibility or jurisdiction triggers) are preserved or abstracted away.
  2. [Human-evaluation protocol] Human-evaluation section: the 18,992 ratings are presented as confirming 'procedural faithfulness,' yet the protocol does not state whether evaluators received side-by-side access to the source judgment pairs. If evaluators only saw generated trajectories, the ratings cannot reliably detect modeling artifacts or omitted complexities that the skeptic note identifies as the weakest assumption.
minor comments (1)
  1. [Abstract] The abstract states 'detailed resources will be released publicly' but supplies no concrete inventory of what will be released (code, judgment pairs, transition rules, or evaluator guidelines).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on the points raised.

Point-by-point responses
  1. Referee: [Abstract / environment construction] Abstract and environment-construction description: the central claim that the five-stage chain is 'grounded in' and 'directly' derived from the 75,309 paired judgments provides no account of the extraction procedure, state-transition rules, or enforcement mechanism for causal dependencies. Without this, it is impossible to assess whether real-world constraints (e.g., evolving admissibility or jurisdiction triggers) are preserved or abstracted away.

    Authors: We agree the current description lacks sufficient detail on the derivation process. In the revised manuscript we will add a dedicated subsection under environment construction that specifies: (1) the extraction procedure used to identify the five stages and seven sub-scenarios from the 75,309 paired judgments, (2) the legal-procedural rules defining state transitions (e.g., how filing triggers consultation and how admissibility evolves), and (3) the enforcement mechanisms (local/global memory and rule-based validators) that maintain causal consistency. This will allow readers to evaluate how real-world constraints are modeled versus abstracted. revision: yes

  2. Referee: [Human-evaluation protocol] Human-evaluation section: the 18,992 ratings are presented as confirming 'procedural faithfulness,' yet the protocol does not state whether evaluators received side-by-side access to the source judgment pairs. If evaluators only saw generated trajectories, the ratings cannot reliably detect modeling artifacts or omitted complexities that the skeptic note identifies as the weakest assumption.

    Authors: The manuscript does not currently detail the evaluator protocol. We will revise the human-evaluation section to explicitly describe the full protocol, including whether evaluators received the source judgment pairs alongside generated trajectories, the rating rubrics, and any blinding procedures. This clarification will directly address concerns about detecting modeling artifacts. revision: yes
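
The rebuttal mentions state-transition rules and rule-based validators without specifying them. Purely as an illustration of what a rule-based check on stage order might look like, the sketch below validates a trajectory against a table of allowed transitions. The transition table, the requirement to start at LC, and the function name are all assumptions made for this example; the paper's actual rules are not given on this page.

```python
# Hypothetical transition table over the paper's sub-scenario codes.
# Which transitions are legal is an assumption of this sketch.
ALLOWED_NEXT = {
    "LC": {"CD"},
    "CD": {"DD"},
    "DD": {"FIT"},
    "FIT": {"AD"},
    "AD": {"AR"},
    "AR": {"SIT"},
    "SIT": set(),
}


def validate_trajectory(codes):
    """Return a list of violations; an empty list means the order is valid."""
    problems = []
    if not codes or codes[0] != "LC":
        problems.append("trajectory must start with LC")
    # Check every consecutive pair against the transition table.
    for prev, nxt in zip(codes, codes[1:]):
        if nxt not in ALLOWED_NEXT.get(prev, set()):
            problems.append(f"illegal transition {prev} -> {nxt}")
    return problems


print(validate_trajectory(["LC", "CD", "DD", "FIT", "AD", "AR", "SIT"]))
print(validate_trajectory(["LC", "FIT"]))
```

A check of this kind only covers ordering. Constraints the referee raises, such as evolving admissibility, would need rules over the case content as well.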

Circularity Check

0 steps flagged

No circularity: environment built from external data and validated externally

full rationale

The paper constructs LegalWorld as a state chain grounded directly in 75,309 external paired Chinese civil judgments, then validates procedural faithfulness and role-consistency via 18,992 independent ratings from 217 legal-background evaluators. No equations, fitted parameters, predictions, or self-citation chains appear in the provided text. The central claims do not reduce to inputs by construction; the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or invented physical entities are described; the work is infrastructure and benchmark construction based on existing judgment data.


Reference graph

Works this paper leans on

70 extracted references · 21 canonical work pages · 4 internal anchors

  1. [2]

    Proceedings of the NeurIPS 2024 Workshop on Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI , year =

    Motivations for Reframing Large Language Model Benchmarking for Legal Applications , author =. Proceedings of the NeurIPS 2024 Workshop on Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI , year =

  2. [3]

    2026 , eprint=

    Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments , author=. 2026 , eprint=

  3. [12]

    Guha, Neel and Nyarko, Julian and Ho, Daniel E. and R. Advances in Neural Information Processing Systems 36 (NeurIPS 2023) Datasets and Benchmarks Track , year =

  4. [13]

    2025 , url =

    Kyung, Daeun and Chung, Hyunseung and Bae, Seongsu and Kim, Jiho and Sohn, Jae Ho and Kim, Taerim and Kim, Soo Kyung and Choi, Edward , booktitle =. 2025 , url =

  5. [14]

    and Zhang, Hao and Gonzalez, Joseph E

    Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , booktitle =. Judging. 2023 , url =

  6. [15]

    2024 , eprint =

    Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model , author =. 2024 , eprint =

  7. [26]

    Executable Code Actions Elicit Better

    Wang, Xingyao and Chen, Yangyi and Yuan, Lifan and Zhang, Yizhe and Li, Yunzhu and Peng, Hao and Ji, Heng , booktitle =. Executable Code Actions Elicit Better. 2024 , publisher =

  8. [28]

    2024 , url =

    Li, Haitao and Chen, You and Ai, Qingyao and Wu, Yueyue and Zhang, Ruizhe and Liu, Yiqun , booktitle =. 2024 , url =

  9. [29]

    2024 , url =

    Qin, Yujia and Liang, Shihao and Ye, Yining and Zhu, Kunlun and Yan, Lan and Lu, Yaxi and Lin, Yankai and Cong, Xin and Tang, Xiangru and Qian, Bill and Zhao, Sihan and Hong, Lauren and Tian, Runchu and Xie, Ruobing and Zhou, Jie and Gerstein, Mark and Li, Dahai and Liu, Zhiyuan and Sun, Maosong , booktitle =. 2024 , url =

  10. [30]

    Advances in Neural Information Processing Systems 36 (NeurIPS 2023) , year =

    Toolformer: Language Models Can Teach Themselves to Use Tools , author =. Advances in Neural Information Processing Systems 36 (NeurIPS 2023) , year =

  11. [31]

    Evaluating Very Long-Term Conversational Memory of

    Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei , booktitle =. Evaluating Very Long-Term Conversational Memory of. 2024 , publisher =

  12. [32]

    2024 , url =

    Zhou, Xuhui and Zhu, Hao and Mathur, Leena and Zhang, Ruohong and Yu, Haofei and Qi, Zhengyang and Morency, Louis-Philippe and Bisk, Yonatan and Fried, Daniel and Neubig, Graham and Sap, Maarten , booktitle =. 2024 , url =

  13. [33]

    2026 , month = feb, howpublished =

    Claude Sonnet 4.6 System Card , author =. 2026 , month = feb, howpublished =

  14. [34]

    2025 , month = dec, howpublished =

    Update to GPT-5 System Card: GPT-5.2 , author =. 2025 , month = dec, howpublished =

  15. [36]

    Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments , url =

    Jia, Zheng and Yue, Shengbin and Chen, Wei and Wang, Siyuan and Liu, Yidong and Li, Zejun and Song, Yun and Wei, Zhongyu , urldate =. Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments , url =. 2026 , langid =. doi:10.48550/arXiv.2507.04037 , shorttitle =. 2507.04037 [cs] , keywords =

  16. [37]

    arXiv preprint arXiv:2502.06882 , year =

    Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction , author =. arXiv preprint arXiv:2502.06882 , year =

  17. [38]

    2024 , langid =

    He, Zhitao and Cao, Pengfei and Wang, Chenhao and Jin, Zhuoran and Chen, Yubo and Xu, Jiexin and Li, Huaijun and Jiang, Xiaojian and Liu, Kang and Zhao, Jun , urldate =. 2024 , langid =. doi:10.48550/arXiv.2403.02959 , shorttitle =. 2403.02959 [cs] , keywords =

  18. [39]

    CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

    Qu, Ao and Zheng, Han and Zhou, Zijian and Yan, Yihao and Tang, Yihong and Ong, Shao Yong and Hong, Fenglu and Zhou, Kaichen and Jiang, Chonghe and Kong, Minwei and Zhu, Jiacheng and Jiang, Xuan and Li, Sirui and Wu, Cathy and Low, Bryan Kian Hsiang and Zhao, Jinhua and Liang, Paul Pu , urldate =. 2026 , langid =. doi:10.48550/arXiv.2604.01658 , shorttitl...

  19. [40]

    and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik R

    Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik R. , booktitle =. 2024 , url =

  20. [41]

    2023 , url =

    Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , booktitle =. 2023 , url =

  21. [42]

    Advances in Neural Information Processing Systems 36 (NeurIPS 2023) , year =

    Reflexion: Language Agents with Verbal Reinforcement Learning , author =. Advances in Neural Information Processing Systems 36 (NeurIPS 2023) , year =

  22. [43]

    Transactions on Machine Learning Research (TMLR) , year =

    Voyager: An Open-Ended Embodied Agent with Large Language Models , author =. Transactions on Machine Learning Research (TMLR) , year =

  23. [44]

    2024 , address =

    He, Zhitao and Cao, Pengfei and Wang, Chenhao and Jin, Zhuoran and Chen, Yubo and Xu, Jiexin and Li, Huaijun and Liu, Kang and Zhao, Jun , booktitle =. 2024 , address =

  24. [45]

    2412.15204 , archivePrefix =

    Bai, Yushi and Tu, Shangqing and Zhang, Jiajie and Peng, Hao and Wang, Xiaozhi and Lv, Xin and Cao, Shulin and Xu, Jiazheng and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi , year =. 2412.15204 , archivePrefix =

  25. [46]

    Frontiers of Computer Science , year =

    A Survey on Large Language Model based Autonomous Agents , author =. Frontiers of Computer Science , year =. 2308.11432 , archivePrefix =

  26. [47]

    arXiv preprint arXiv:2602.02276 , year =

    Kimi K2.5: Visual Agentic Intelligence , author =. arXiv preprint arXiv:2602.02276 , year =

  27. [48]

    2026 , month = apr, howpublished =

    DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence , author =. 2026 , month = apr, howpublished =

  28. [49]

    arXiv preprint arXiv:2508.06471 , year =

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models , author =. arXiv preprint arXiv:2508.06471 , year =

  29. [50]

    Anthropic . 2026. https://anthropic.com/claude-sonnet-4-6-system-card Claude sonnet 4.6 system card . System card

  30. [51]

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. https://doi.org/10.18653/v1/2024.acl-long.172 LongBench : A bilingual, multitask benchmark for long context understanding . In Proceedings of the 62nd Annual Meeting of the Association for Co...

  31. [52]

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2025. https://doi.org/10.18653/v1/2025.acl-long.183 L ong B ench v2: Towards deeper understanding and reasoning on realistic long-context multitasks . In Proceedings of the 63rd Annual Meeting of the Association fo...

  32. [53]

    Guhong Chen, Liyang Fan, Zihan Gong, Nan Xie, Zixuan Li, Ziqiang Liu, Chengming Li, Qiang Qu, Hamid Alinejad-Rokny, Shiwen Ni, and Min Yang. 2025. https://doi.org/10.18653/v1/2025.findings-acl.304 AgentCourt : Simulating court with adversarial evolvable lawyer agents . In Findings of the Association for Computational Linguistics: ACL 2025, pages 5850--586...

  33. [54]

    Jiaxi Cui, Munan Ning, Zongjian Li, Bohua Chen, Yang Yan, Hao Li, Bin Ling, Yonghong Tian, and Li Yuan. 2024. https://arxiv.org/abs/2306.16092 Chatlaw: A multi-agent collaborative legal assistant with knowledge graph enhanced mixture-of-experts large language model . Preprint, arXiv:2306.16092

  34. [55]

    DeepSeek-AI . 2025. https://arxiv.org/abs/2512.02556 Deepseek-v3.2: Pushing the frontier of open large language models . arXiv preprint arXiv:2512.02556

  35. [56]

    Chenlong Deng, Kelong Mao, and Zhicheng Dou. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.73 Learning interpretable legal case retrieval via knowledge-guided case reformulation . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1253--1265, Miami, Florida, USA. Association for Computational Linguistics

  36. [57]

    Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Alan Huang, Songyang Zhang, Kai Chen, Zhixin Yin, Zongwen Shen, Jidong Ge, and Vincent Ng. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.452 L aw B ench: Benchmarking legal knowledge of large language models . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Proc...

  37. [58]

    Cheng Gao, Chaojun Xiao, Zhenghao Liu, Huimin Chen, Zhiyuan Liu, and Maosong Sun. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.402 Enhancing legal case retrieval via scaling high-quality synthetic query-candidate pairs . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7086--7100, Miami, Florida, USA. A...

  38. [59]

    Ho, Christopher R \'e , Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N

    Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher R \'e , Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, and 21 others. 2023. https://openreview.net/forum?id=WqSPQF...

  39. [60]

    Zhitao He, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Jiexin Xu, Huaijun Li, Kang Liu, and Jun Zhao. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.549 AgentsCourt : Building judicial decision-making agents with court debate simulation and legal knowledge augmentation . In Findings of the Association for Computational Linguistics: EMNLP 202...

  40. [61]

    Zheng Jia, Shengbin Yue, Wei Chen, Siyuan Wang, Yidong Liu, Zejun Li, Yun Song, and Zhongyu Wei. 2026. https://arxiv.org/abs/2507.04037 Ready jurist one: Benchmarking language agents for legal intelligence in dynamic environments . Preprint, arXiv:2507.04037

  41. [62]

    Sheng Jin, Haoming Wang, Zhiqi Gao, Yongbo Yang, Bao Chunjia, and Chengliang Wang. 2025. https://doi.org/10.48550/arXiv.2510.11290 Evolution in simulation: AI -agent school with dual memory for high-fidelity educational dynamics . Preprint, arxiv:2510.11290 [cs]

  42. [63]

    Daeun Kyung, Hyunseung Chung, Seongsu Bae, Jiho Kim, Jae Ho Sohn, Taerim Kim, Soo Kyung Kim, and Edward Choi. 2025. https://openreview.net/forum?id=1THAjdP4QJ PatientSim : A persona-driven simulator for realistic doctor-patient interactions . In Advances in Neural Information Processing Systems 39 (NeurIPS 2025) Datasets and Benchmarks Track

  43. [64]

    Chance Jiajie Li, Jiayi Wu, Zhenze Mo, Ao Qu, Yuhan Tang, Kaiya Ivy Zhao, Yulu Gan, Jie Fan, Jiangbo Yu, Jinhua Zhao, Paul Liang, Luis Alonso, and Kent Larson. 2025 a . https://doi.org/10.48550/arXiv.2506.06958 Simulating society requires simulating thought . Preprint, arxiv:2506.06958 [cs]

  44. [65]

    Haitao Li, You Chen, Qingyao Ai, Yueyue Wu, Ruizhe Zhang, and Yiqun Liu. 2024. https://doi.org/10.52202/079017-0790 LexEval : A comprehensive C hinese legal benchmark for evaluating large language models . In Advances in Neural Information Processing Systems 38 (NeurIPS 2024) Datasets and Benchmarks Track

  45. [66]

    Junkai Li, Yunghwei Lai, Weitao Li, Jingyi Ren, Meng Zhang, Xinhui Kang, Siyu Wang, Peng Li, Ya-Qin Zhang, Weizhi Ma, and Yang Liu. 2025 b . https://doi.org/10.48550/arXiv.2405.02957 Agent hospital: A simulacrum of hospital with evolvable medical agents . Preprint, arxiv:2405.02957 [cs]

  46. [67]

    Shuang Liu, Ruijia Zhang, Ruoyun Ma, Yujia Deng, Lanyi Zhu, Jiayu Li, Zelong Li, Zhibin Shen, and Mengnan Du. 2026. https://doi.org/10.48550/arXiv.2601.06216 LLM agents in law: Taxonomy, applications, and challenges . Preprint, arxiv:2601.06216 [cs]

  47. [68]

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. https://aclanthology.org/2024.acl-long.747/ Evaluating very long-term conversational memory of LLM agents . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851--13870. Associati...

  48. [69]

    OpenAI . 2025. https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf Update to gpt-5 system card: Gpt-5.2 . System card update

  49. [70]

    Patil, Ion Stoica, and Joseph E

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2024. https://arxiv.org/abs/2310.08560 MemGPT : Towards LLMs as operating systems . Preprint, arXiv:2310.08560

  50. [71]

    Generative agents: Interactive simulacra of human behavior,

    Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. https://doi.org/10.1145/3586183.3606763 Generative agents: Interactive simulacra of human behavior . In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23), New York, NY, USA. Association for ...

  51. [72]

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. https://openreview.net/forum?id=dHng2O0Jjr ToolLLM : Facilitating large language models to master 16000+ real-world APIs ....

  52. [73]

    Riya Ranjan and Megan Ma. 2024. https://neurips.cc/virtual/2024/104203 Motivations for reframing large language model benchmarking for legal applications. In Proceedings of the NeurIPS 2024 Workshop on Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI

  53. [74]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. https://openreview.net/forum?id=Yacmpz84TH Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023)

  54. [75]

    Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, and Yun-Nung Chen. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.969 Two tales of persona in LLMs: A survey of role-playing and personalization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16612--16631, Miami, Florida, USA. Ass...

  55. [76]

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2024a. https://doi.org/10.1007/s11704-024-40231-1 A survey on large language model based autonomous agents. Frontiers of Computer Science, arXiv:2308.11432

  56. [77]

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024b. https://proceedings.mlr.press/v235/wang24h.html Executable code actions elicit better LLM agents. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 50208--50232. PMLR

  57. [78]

    Yiding Wang, Yuxuan Chen, Fanxu Meng, Xifan Chen, Xiaolei Yang, and Muhan Zhang. 2025. https://doi.org/10.48550/arXiv.2510.24442 Law in silico: Simulating legal society with LLM-based agents. Preprint, arXiv:2510.24442 [cs]

  58. [79]

    Chaojun Xiao, Haoxi Zhong, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, and Jianfeng Xu. 2018. https://arxiv.org/abs/1807.02478 CAIL2018: A large-scale legal dataset for judgment prediction. Preprint, arXiv:1807.02478

  59. [80]

    Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, Shujun Liu, Yuxuan Zhou, Yao Xiao, Song Yun, Xuanjing Huang, and Zhongyu Wei. 2023. https://arxiv.org/abs/2309.11325 DISC-LawLLM: Fine-tuning large language models for intelligent legal services. Preprint, arXiv:2309.11325

  60. [81]

    Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. 2026. https://doi.org/10.48550/arXiv.2602.02474 MemSkill: Learning and evolving memory skills for self-evolving agents. Preprint, arXiv:2602.02474 [cs]

  61. [82]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. https://openreview.net/forum?id=uccHPGDlao Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023...

  62. [83]

    Haoxi Zhong, Chaojun Xiao, Zhipeng Guo, Cunchao Tu, Zhiyuan Liu, Maosong Sun, Yansong Feng, Xianpei Han, Zhen Hu, Heng Wang, and Jianfeng Xu. 2018. https://arxiv.org/abs/1810.05851 Overview of CAIL2018: Legal judgment prediction competition. Preprint, arXiv:1810.05851

  63. [84]

    Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. 2024. https://openreview.net/forum?id=mM7VurbA4r SOTOPIA: Interactive evaluation for social intelligence in language agents. In The Twelfth International Conference on Learning Representations (ICLR)
