ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law
Pith reviewed 2026-06-29 07:07 UTC · model grok-4.3
The pith
Fine-tuning a 3B model on a new source-grounded immigration dataset lifts its mean score 27% above an 8B base model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors assembled 17,058 QA pairs from 10,056 validated documents across 13 immigration subdomains, generated with Claude Sonnet 4.6 via five mode-specific prompts, and fine-tuned Llama 3.2 3B Instruct with LoRA. On LLM-as-judge evaluation of a 101-example stratified sample from the 993-pair held-out set, the fine-tuned model reached a mean score of 1.08/3.0 (16.8% fully correct) versus 0.85/3.0 (4% fully correct) for the Llama 3 8B base model, a relative improvement of 27%.
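The headline arithmetic can be checked directly from the reported figures. The sketch below recomputes the relative improvement and converts the fully-correct percentages into example counts, assuming (this is not stated explicitly in the source) that each percentage is taken over the 101-example sample. The function names are illustrative only.

```python
def relative_improvement(new_score: float, base_score: float) -> float:
    """Gain of new_score over base_score, as a fraction of base_score."""
    return (new_score - base_score) / base_score


def implied_count(percent: float, sample_size: int) -> int:
    """Number of examples a reported percentage corresponds to, rounded."""
    return round(percent / 100 * sample_size)


SAMPLE_SIZE = 101  # stratified evaluation sample reported in the paper

# Mean judge scores (out of 3.0) and fully-correct rates as reported.
reported = {
    "Llama 3.2 3B + LoRA": (1.08, 16.8),
    "Llama 3 8B base": (0.85, 4.0),
    "Claude Sonnet zero-shot": (1.52, 25.0),
}

gain = relative_improvement(1.08, 0.85)
print(f"relative improvement over 8B base: {gain:.1%}")

for name, (mean_score, pct_correct) in reported.items():
    # Assumption: percentages are fractions of the 101-example sample.
    count = implied_count(pct_correct, SAMPLE_SIZE)
    print(f"{name}: mean {mean_score}/3.0, ~{count} of {SAMPLE_SIZE} fully correct")
```

The 27% figure is relative: the absolute gain is 0.23 points on a 3-point scale.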
What carries the argument
The ImmigrationQA dataset of source-grounded QA pairs extracted from official policy manuals, regulations, and precedents and used for LoRA fine-tuning of the small model.
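The paper reports rejecting pairs with insufficient source-span overlap but, as summarized here, does not specify the overlap measure. A minimal sketch of one plausible check follows, assuming a simple token-overlap ratio between the answer and its source chunk. The tokenizer, the `MIN_OVERLAP` threshold, and all function names are hypothetical and are not the authors' implementation.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def span_overlap(answer: str, chunk: str) -> float:
    """Fraction of answer tokens that also occur in the source chunk."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    chunk_vocab = set(tokens(chunk))
    hits = sum(1 for tok in answer_tokens if tok in chunk_vocab)
    return hits / len(answer_tokens)


MIN_OVERLAP = 0.6  # hypothetical threshold, not taken from the paper


def is_grounded(answer: str, chunk: str, min_overlap: float = MIN_OVERLAP) -> bool:
    """Keep a QA pair only if enough of its answer is supported by the chunk."""
    return span_overlap(answer, chunk) >= min_overlap


# Toy placeholder text, not real legal content.
chunk = "The applicant files the form with the agency before the deadline."
print(is_grounded("The applicant files the form before the deadline.", chunk))
print(is_grounded("Approval is guaranteed within ten days.", chunk))
```

A lexical check of this kind can confirm that an answer reuses the source wording, but it cannot confirm that the answer states the law correctly, which is the gap the referee's dataset-construction comment points at.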
If this is right
- The fine-tuned model shows concentrated improvement in procedural subdomains such as travel documents, adjustment of status, and nonimmigrant visas.
- A zero-shot Claude Sonnet baseline still scores higher at 1.52/3.0 with 25% fully correct answers.
- The full dataset construction and adaptation pipeline required approximately $29 in cloud compute.
- All artifacts including the dataset, model, code, and prompt templates are released publicly.
Where Pith is reading between the lines
- The source-grounded dataset construction approach could be repeated for other regulatory domains that publish large volumes of official text.
- The model would likely require periodic retraining to stay current with policy changes that occur after the corpus crawl date.
- Pairing the fine-tuned model with retrieval over updated official sources could mitigate weaknesses on time-sensitive statistics.
Load-bearing premise
LLM-as-judge scoring on the 101-example sample of generated QA pairs reliably indicates overall model quality on immigration questions, and the generated pairs faithfully capture the legal content of their source chunks.
What would settle it
A side-by-side comparison by immigration attorneys of model outputs against source documents on a fresh set of real petitioner questions would show whether the 27% score gain corresponds to higher factual accuracy.
Original abstract
U.S. immigration law spans thousands of pages of official policy, federal regulations, and procedural guidance that change frequently and carry high stakes for petitioners who lack legal representation. We describe the construction of ImmigrationQA, a source-grounded question-answering dataset of 17,058 pairs across 13 immigration subdomains, and the fine-tuning of a Llama 3.2 3B Instruct model on that dataset using parameter-efficient LoRA. The corpus was assembled from 11 primary and secondary sources -- including the USCIS Policy Manual, 8 CFR, BIA precedent decisions, and community Q&A -- yielding 10,056 validated canonical documents and 18,308 text chunks. Structured QA pairs were generated from these chunks using Claude Sonnet 4.6 via five mode-specific prompts, with 22 pairs rejected for insufficient source-span overlap. The fine-tuned model was evaluated against a held-out split of 993 pairs using LLM-as-judge scoring on a 101-example stratified sample. The fine-tuned model scored a mean of 1.08/3.0 (16.8% fully correct; 101-example stratified eval) versus the Llama 3 8B base model at 0.85/3.0 (4% fully correct), a relative improvement of 27% in mean score; a zero-shot Claude Sonnet baseline scored 1.52/3.0 (25% fully correct). The fine-tuned model shows concentrated improvement in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas) while remaining weak on complex legal reasoning and time-sensitive statistics. The full pipeline ran for approximately $29 in cloud compute. All artifacts -- dataset, model, code, and prompt templates -- are publicly released. The system is not a substitute for legal counsel and does not reflect regulatory changes after the corpus crawl date.
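The abstract describes a 101-example stratified sample drawn from the 993-pair held-out split, without giving the allocation rule. One common way to draw such a sample is proportional allocation across subdomains with largest-remainder rounding, sketched below. The allocation rule, the synthetic held-out set, and the `subdomain_NN` labels are assumptions for illustration; the paper's actual per-subdomain counts are not given here.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, k, seed=0):
    """Draw k items with per-stratum quotas proportional to stratum size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    total = len(items)
    # Exact proportional share per stratum, then floor it.
    exact = {name: k * len(members) / total for name, members in strata.items()}
    quota = {name: int(share) for name, share in exact.items()}

    # Hand the leftover slots to the strata with the largest remainders.
    leftover = k - sum(quota.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - quota[name], reverse=True)
    for name in by_remainder[:leftover]:
        quota[name] += 1

    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, quota[name]))
    return sample


# Synthetic stand-in for the held-out split: 993 rows over 13 subdomains.
held_out = [{"id": i, "subdomain": f"subdomain_{i % 13:02d}"} for i in range(993)]
sample = stratified_sample(held_out, key=lambda row: row["subdomain"], k=101)
print(len(sample))
```

With 13 strata and 101 slots, each subdomain contributes only a handful of examples, so per-subdomain comparisons rest on very few judged answers.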
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the construction of ImmigrationQA, a source-grounded QA dataset of 17,058 pairs across 13 subdomains, derived from 10,056 canonical documents and 18,308 chunks drawn from the USCIS Policy Manual, 8 CFR, BIA decisions, and related sources. QA pairs were generated by Claude Sonnet 4.6 via five mode-specific prompts, with 22 pairs rejected for insufficient source-span overlap. A Llama 3.2 3B Instruct model was fine-tuned with LoRA and evaluated on a 101-example stratified sample from the 993-pair held-out set using LLM-as-judge scoring, reporting a mean of 1.08/3.0 (16.8% fully correct) versus 0.85/3.0 (4% fully correct) for the Llama 3 8B base model (27% relative improvement); a zero-shot Claude Sonnet baseline scored 1.52/3.0. All artifacts are released publicly, and the full pipeline cost approximately $29 in cloud compute.
Significance. If the evaluation is validated, the work supplies a publicly released, source-grounded dataset and reproducible LoRA adaptation pipeline for a high-stakes legal domain, demonstrating that small models can be cheaply specialized on procedural immigration topics. The explicit release of dataset, model, code, and prompt templates is a clear strength supporting further research.
major comments (3)
- [Evaluation section] Evaluation section (abstract and corresponding evaluation paragraph): the headline 27% relative improvement (1.08 vs 0.85 mean score on the 3-point scale) rests on LLM-as-judge scoring of a 101-example stratified sample drawn from the 993-pair held-out set. No human expert validation, inter-annotator agreement statistics, or judge-model ablation is reported; because the judge shares the same generator distribution as the training data, the observed lift may reflect surface-form matching rather than improved legal fidelity.
- [Dataset construction] Dataset construction (abstract and § on corpus assembly): all 17,058 pairs were produced by Claude Sonnet 4.6; only overlap filtering (22 pairs rejected) is described and no error rate or human legal-accuracy audit of the generated QA pairs is provided. This directly affects the central claim that the dataset faithfully represents U.S. immigration law content.
- [Baselines] Baselines paragraph: the comparison is drawn against the Llama 3 8B base model rather than an 8B Instruct model, which weakens attribution of gains specifically to the LoRA adaptation on ImmigrationQA.
minor comments (1)
- The 3-point scoring rubric used by the LLM judge is not reproduced or exemplified in the manuscript, making it difficult to interpret the absolute scores (1.08/3.0).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We respond point-by-point to the major comments below, proposing revisions where they strengthen the work without misrepresenting our contributions or resources.
- Referee: [Evaluation section] Evaluation section (abstract and corresponding evaluation paragraph): the headline 27% relative improvement (1.08 vs 0.85 mean score on the 3-point scale) rests on LLM-as-judge scoring of a 101-example stratified sample drawn from the 993-pair held-out set. No human expert validation, inter-annotator agreement statistics, or judge-model ablation is reported; because the judge shares the same generator distribution as the training data, the observed lift may reflect surface-form matching rather than improved legal fidelity.
  Authors: We agree that LLM-as-judge scoring without human validation or ablation is a limitation and that shared generator distribution raises the possibility of surface-form bias. The 101-example sample and subdomain-specific gains were reported transparently as an initial proxy; full legal-expert annotation exceeded available resources. In revision we will expand the limitations paragraph to discuss judge bias risks and the proxy nature of the metric, while retaining the reported numbers with their caveats.
  Revision: partial
- Referee: [Dataset construction] Dataset construction (abstract and § on corpus assembly): all 17,058 pairs were produced by Claude Sonnet 4.6; only overlap filtering (22 pairs rejected) is described and no error rate or human legal-accuracy audit of the generated QA pairs is provided. This directly affects the central claim that the dataset faithfully represents U.S. immigration law content.
  Authors: Generation used five mode-specific prompts followed by overlap filtering to enforce source grounding. A full human legal audit was not conducted due to cost and domain-expertise requirements. We will revise the dataset-construction section to state this limitation explicitly, add further prompt and filtering details, and note that public release enables external audits.
  Revision: partial
- Referee: [Baselines] Baselines paragraph: the comparison is drawn against the Llama 3 8B base model rather than an 8B Instruct model, which weakens attribution of gains specifically to the LoRA adaptation on ImmigrationQA.
  Authors: The Llama 3 8B base was selected to contrast adaptation gains against a larger but untuned model. We acknowledge that an 8B Instruct baseline would allow cleaner attribution to the ImmigrationQA fine-tuning. We will revise the baselines paragraph to include or discuss the 8B Instruct comparison and clarify the original rationale.
  Revision: yes
- Human expert validation, inter-annotator agreement, or judge ablation for the LLM-as-judge evaluation
- Human legal-accuracy audit or error-rate measurement for the generated QA pairs
Circularity Check
No circularity: empirical pipeline uses external generator, held-out split, and separate judge model
The paper describes dataset construction from public sources via Claude Sonnet 4.6 prompts, LoRA fine-tuning of Llama 3.2 3B, and evaluation on a 993-pair held-out set scored by LLM-as-judge on a 101-example sample. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the derivation; the reported 1.08 vs 0.85 mean scores rest on an independent held-out split and external judge rather than reducing to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption LLM-generated QA pairs validated only by source-span overlap checks form a sufficiently accurate training set for legal reasoning
- domain assumption LLM-as-judge scores on a 101-example sample accurately reflect model performance on the full held-out set
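The second assumption depends partly on how much a 101-example sample can say about the full held-out set. As a rough illustration, the sketch below computes 95% Wilson score intervals for the fully-correct rates, assuming the reported percentages correspond to 17, 4, and 25 of 101 examples (counts inferred from the percentages, not stated in the source) and treating the stratified sample as a simple random sample, which is a simplification.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


N = 101  # evaluation sample size reported in the paper

# Fully-correct counts inferred from the reported percentages (assumption).
inferred_counts = {
    "Llama 3.2 3B + LoRA": 17,
    "Llama 3 8B base": 4,
    "Claude Sonnet zero-shot": 25,
}

for name, count in inferred_counts.items():
    low, high = wilson_interval(count, N)
    print(f"{name}: {count}/{N} fully correct, 95% interval {low:.1%} to {high:.1%}")
```

With n = 101 each interval spans several percentage points on either side of the point estimate, and this accounts only for sampling noise, not for any bias in the judge itself.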