pith. machine review for the scientific record.

arxiv: 2604.18041 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.CY

Recognition: unknown

JudgeMeNot: Personalizing Large Language Models to Emulate Judicial Reasoning in Hebrew

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:08 UTC · model grok-4.3

classification 💻 cs.CL cs.CY
keywords LLM personalization · judicial reasoning · instruction tuning · synthetic data · Hebrew · causal language modeling · low-resource settings · model fine-tuning

The pith

A pipeline using synthetic data from court rulings lets language models emulate specific judges' reasoning in Hebrew.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to personalize large language models for individual judges by converting existing decisions into instruction examples. It applies causal language modeling first and then tunes the model with the synthetic instructions. This approach works in settings with limited data such as Hebrew legal texts. The resulting models produce outputs that match human judges closely in wording, style, and meaning. If the claim holds, it shows that efficient personalization is possible without large volumes of new private data.
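The paper's "parameter-efficient fine-tuning" most plausibly refers to adapter methods in the LoRA family (the abstract cites LoRA-style work). As a minimal sketch of that idea, assuming nothing about the paper's actual architecture or ranks, a LoRA-style update freezes the base weight and trains only a low-rank correction:

```python
# Illustrative LoRA-style low-rank update on one linear layer.
# All shapes and values here are invented for illustration; the paper's
# actual adapter configuration is not specified in the abstract.

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * A (B x); W stays frozen, only A and B are trained."""
    h = matvec(B, x)       # down-project to low rank r
    delta = matvec(A, h)   # up-project back to the output dimension
    base = matvec(W, x)
    return [b + scale * d for b, d in zip(base, delta)]

# Frozen 2x2 base weight and a rank-1 adapter (A: 2x1, B: 1x2).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5], [0.5]]
B = [[1.0, 1.0]]
x = [2.0, 4.0]

print(lora_forward(W, A, B, x))  # → [5.0, 7.0]: base output plus correction
```

The appeal in low-resource settings is that only the small A and B matrices are updated per judge, so one frozen base model can carry many per-judge adapters.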

Core claim

The central claim, quoted from the abstract, is that Causal Language Modeling followed by synthetically generated instruction tuning "significantly outperforms all other baselines, providing significant improvements across lexical, stylistic, and semantic similarity," and that "our model-generated outputs are indistinguishable from the reasoning of human judges."

What carries the argument

The synthetic-organic supervision pipeline that transforms raw judicial decisions into instruction-tuning data for parameter-efficient fine-tuning.
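To make the pipeline concrete: a toy version of turning a ruling into instruction-tuning pairs might look like the sketch below. The keyword markers and instruction template are invented for illustration; the paper itself uses LLM prompts with validation stages, not keyword rules.

```python
# Toy sketch of a "synthetic-organic supervision" step: extract sentences
# that look like legal reasoning, then wrap each as an instruction example.
# Markers and templates are hypothetical, not the paper's actual prompts.

REASONING_MARKERS = ("therefore", "because", "in light of", "the court finds")

def extract_reasoning(ruling: str) -> list[str]:
    """Keep sentences that look like legal reasoning (toy heuristic)."""
    sentences = [s.strip() for s in ruling.split(".") if s.strip()]
    return [s for s in sentences if any(m in s.lower() for m in REASONING_MARKERS)]

def to_instruction_pairs(ruling: str) -> list[dict]:
    """Wrap each reasoning sentence as an (instruction, response) example."""
    return [
        {"instruction": "Explain the judge's reasoning on this point.",
         "response": sent}
        for sent in extract_reasoning(ruling)
    ]

ruling = ("The defendant appealed. The court finds the evidence insufficient. "
          "Therefore the conviction is vacated.")
pairs = to_instruction_pairs(ruling)
print(len(pairs))  # → 2: two reasoning sentences become two training examples
```

The real pipeline replaces the heuristic filter with a chain-of-thought extraction prompt plus a separate validator, but the data shape, organic reasoning text wrapped in synthetic instructions, is the same.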

If this is right

  • The method delivers higher similarity to individual judges than existing personalization techniques on three tasks.
  • Outputs match human reasoning so closely that they cannot be reliably distinguished.
  • Personalization succeeds even with limited training resources in languages like Hebrew.
  • The two-stage process of causal modeling then synthetic instruction tuning drives the gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthetic generation step could help personalize models for other experts who produce consistent but individual reasoning, such as physicians or financial analysts.
  • Future tests on entirely new cases could show whether the tuned models preserve a judge's preferences beyond the training examples.
  • Using synthetic data may lower privacy risks by reducing the need to expose full personal decision histories during training.

Load-bearing premise

Synthetically generated instruction data from judicial decisions accurately captures each judge's individual reasoning style and details without introducing biases or artifacts that distort the similarity measures.

What would settle it

A blind evaluation in which independent raters fail to distinguish the model's outputs from actual human judges' decisions at rates above chance would confirm the claim; conversely, lexical, stylistic, and semantic similarity scores showing no advantage over standard personalization baselines would undercut it.
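"Above chance" can be made precise with an exact binomial test on rater accuracy in a blind pairing task. The counts below are invented for illustration; the paper's actual evaluation protocol is not described in the abstract.

```python
# Exact two-sided binomial test for whether blind raters identify the
# machine-written output at better than chance (p = 0.5). Counts invented.
from math import comb

def binom_p_two_sided(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided p-value via the point-probability method."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] + 1e-12))

# Hypothetical: 53 correct identifications in 100 blind trials.
print(binom_p_two_sided(53, 100))  # well above 0.05: consistent with chance
```

A non-significant result here supports "indistinguishable"; 70/100 correct, by contrast, would reject chance decisively.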

Figures

Figures reproduced from arXiv: 2604.18041 by Arnon Sturm, Itay Razumenko, Nir Grinberg.

Figure 1: Illustrative example: Two personalized LLMs … (figures/full_fig_p001_1.png)
Figure 2: Each point shows the performance gap be… (figures/full_fig_p006_2.png)
Figure 3: Ablation study on three judges. Top row: … (figures/full_fig_p007_3.png)
Original abstract

Despite significant advances in large language models, personalizing them for individual decision-makers remains an open problem. Here, we introduce a synthetic-organic supervision pipeline that transforms raw judicial decisions into instruction-tuning data, enabling parameter-efficient fine-tuning of personalized models for individual judges in low-resource settings. We compare our approach to state-of-the-art personalization techniques across three different tasks and settings. The results show that Causal Language Modeling followed by synthetically generated instruction-tuning significantly outperforms all other baselines, providing significant improvements across lexical, stylistic, and semantic similarity. Notably, our model-generated outputs are indistinguishable from the reasoning of human judges, highlighting the viability of efficient personalization, even in low-resource settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces JudgeMeNot, a synthetic-organic supervision pipeline that converts raw judicial decisions into instruction-tuning data for parameter-efficient fine-tuning of personalized LLMs to emulate individual Hebrew judges' reasoning. It compares Causal Language Modeling followed by synthetically generated instruction-tuning against state-of-the-art personalization techniques across three tasks and settings, claiming significant outperformance on lexical, stylistic, and semantic similarity metrics, with model outputs described as indistinguishable from human judges' reasoning.

Significance. If the results hold under detailed scrutiny, the work would offer a practical advance in low-resource personalized language modeling for high-stakes domains like judicial reasoning. The pipeline's use of synthetic data derived from real decisions provides an efficient path to capturing individual nuances without extensive manual annotation, with potential applicability to other specialized decision-making contexts in under-resourced languages.

major comments (3)
  1. [Abstract] The central claims of outperformance and indistinguishability from human judges are stated without any information on datasets (e.g., number of decisions or judges), specific similarity metrics, baselines, statistical significance testing, or the human evaluation protocol used to establish indistinguishability. These details are load-bearing for verifying the headline result that the approach 'significantly outperforms all other baselines' and produces outputs 'indistinguishable' from human judges.
  2. [Evaluation] The manuscript provides no ablations, human validation of synthetic examples, or controls to test whether the synthetically generated instruction data preserves individual judicial nuances or instead introduces artifacts from the generator LLM (especially relevant in low-resource Hebrew). Without such checks, improvements on automatic similarity metrics may not demonstrate faithful emulation of judicial reasoning.
  3. [Methods] The transformation process from raw judicial decisions to instruction-tuning data is described at a high level but lacks concrete details on prompting strategies, data filtering, or how individual judge-specific patterns are isolated and preserved, hindering assessment of reproducibility and bias risks.
minor comments (1)
  1. [Abstract] The abstract introduces the term 'synthetic-organic supervision pipeline' without a brief definition or pointer to related hybrid supervision literature, which would aid reader comprehension.
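For concreteness on the referee's first point: the kind of lexical and stylistic similarity measures the claims would need to be pinned to might look like the following stand-ins (illustrative only; the paper's actual metrics are unspecified in the abstract).

```python
# Illustrative stand-ins for the unreported metrics: token-level Jaccard
# overlap for lexical similarity, and mean words-per-sentence gap for style.
# These are assumptions, not the paper's actual measures.

def lexical_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens (case-folded)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def style_length_gap(a: str, b: str) -> float:
    """Absolute difference in mean words-per-sentence between two texts."""
    def mean_len(t):
        sents = [s for s in t.split(".") if s.strip()]
        return sum(len(s.split()) for s in sents) / len(sents)
    return abs(mean_len(a) - mean_len(b))

judge = "The appeal is denied. The evidence supports the lower court."
model = "The appeal is denied. The evidence supports the ruling below."
print(lexical_jaccard(judge, model), style_length_gap(judge, model))
```

Whatever the paper's true metrics are, the referee's concern stands: surface scores like these can be high while the underlying legal reasoning diverges, which is why the human validation requested in comment 2 matters.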

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and commit to revisions that enhance clarity, rigor, and reproducibility without altering the core contributions.

Point-by-point responses
  1. Referee: [Abstract] The central claims of outperformance and indistinguishability from human judges are stated without any information on datasets (e.g., number of decisions or judges), specific similarity metrics, baselines, statistical significance testing, or the human evaluation protocol used to establish indistinguishability. These details are load-bearing for verifying the headline result that the approach 'significantly outperforms all other baselines' and produces outputs 'indistinguishable' from human judges.

    Authors: We agree that the abstract would be strengthened by incorporating these specifics. In the revised manuscript, we will expand the abstract to report the dataset composition (number of decisions and judges), enumerate the lexical, stylistic, and semantic similarity metrics, identify the baselines, reference the statistical significance testing performed, and outline the human evaluation protocol used to assess indistinguishability from human judges' reasoning. revision: yes

  2. Referee: [Evaluation] The manuscript provides no ablations, human validation of synthetic examples, or controls to test whether the synthetically generated instruction data preserves individual judicial nuances or instead introduces artifacts from the generator LLM (especially relevant in low-resource Hebrew). Without such checks, improvements on automatic similarity metrics may not demonstrate faithful emulation of judicial reasoning.

    Authors: We acknowledge that additional validation would further substantiate the pipeline's fidelity. The submitted manuscript emphasized end-to-end task performance, but we will incorporate ablations isolating the synthetic instruction-tuning component, a human validation study on sampled synthetic examples to verify preservation of judge-specific nuances, and controls comparing synthetic outputs against original decisions to assess potential generator artifacts in the Hebrew setting. revision: yes

  3. Referee: [Methods] The transformation process from raw judicial decisions to instruction-tuning data is described at a high level but lacks concrete details on prompting strategies, data filtering, or how individual judge-specific patterns are isolated and preserved, hindering assessment of reproducibility and bias risks.

    Authors: We concur that greater methodological detail is necessary for reproducibility. The revised Methods section will include the exact prompting templates used for converting decisions to instruction data, the filtering criteria applied, and the specific techniques for isolating and retaining individual judge patterns. We will also add discussion of bias risks and corresponding safeguards. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on external human similarity metrics

Full rationale

The paper describes an empirical pipeline that converts raw judicial decisions into synthetic instruction data for fine-tuning, then evaluates the resulting models via independent lexical, stylistic, and semantic similarity metrics against held-out human judge outputs. No equations, fitted parameters, or derivations are presented that reduce the reported performance gains to the inputs by construction. The central claim of outperforming baselines and producing indistinguishable outputs is supported by comparative experiments rather than self-referential fitting or self-citation chains. Evaluation relies on external benchmarks, satisfying the criteria for a self-contained, non-circular result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard machine learning assumptions about data representativeness and model capacity for style emulation; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Raw judicial decisions contain sufficient signal to train models that emulate individual reasoning styles
    Invoked by the pipeline that transforms decisions into instruction-tuning data.
  • domain assumption Parameter-efficient fine-tuning can achieve high fidelity personalization in low-resource settings
    Central to the claim of viability in low-resource Hebrew judicial data.

pith-pipeline@v0.9.0 · 5411 in / 1233 out tokens · 37626 ms · 2026-05-10T04:08:26.978850+00:00 · methodology

discussion (0)

