ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law

Nazarii Shportun

arxiv: 2605.30589 · v1 · pith:DD6D5PL5new · submitted 2026-05-28 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-29 07:07 UTC · model grok-4.3

Classification: cs.CL, cs.AI

Keywords: immigration law, question answering, fine-tuning, Llama, LoRA, legal dataset, source-grounded QA, small language models

The pith

Fine-tuning a 3B model on a new source-grounded immigration dataset lifts its LLM-judged mean score 27% above that of an 8B base model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds ImmigrationQA, a dataset of 17,058 question-answer pairs drawn directly from official U.S. immigration documents including the USCIS Policy Manual, federal regulations, and precedent decisions. It fine-tunes Llama 3.2 3B Instruct on this data with parameter-efficient LoRA and measures performance on a held-out set using LLM-as-judge scoring. The fine-tuned model reaches a mean of 1.08 out of 3.0 on a 101-example sample, compared with 0.85 for the Llama 3 8B base model, with the largest gains appearing in procedural subdomains. The full pipeline cost approximately $29 in cloud compute, and the work publicly releases the dataset, model, code, and prompts while stating that the system is not legal advice and may not reflect later regulatory changes.

Core claim

The authors assembled 17,058 QA pairs from 10,056 validated documents across 13 immigration subdomains, generated via five mode-specific prompts on Claude Sonnet 4.6, and fine-tuned Llama 3.2 3B Instruct using LoRA, yielding a mean score of 1.08/3.0 (16.8% fully correct) on LLM-as-judge evaluation of a 101-example stratified sample from the 993-pair held-out set versus 0.85/3.0 (4% fully correct) for the Llama 3 8B base model, a relative improvement of 27%.
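The 27% figure is a relative change in mean judge score, not a difference in percentage points. A minimal sketch of that arithmetic, using only the three mean scores quoted above (the function and variable names are illustrative, not taken from the paper's code):

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Return the relative change of `new` over `baseline` as a fraction."""
    return (new - baseline) / baseline


# Mean LLM-as-judge scores out of 3.0, as reported in the paper.
fine_tuned_mean = 1.08  # fine-tuned Llama 3.2 3B Instruct
base_8b_mean = 0.85     # Llama 3 8B base model
claude_mean = 1.52      # zero-shot Claude Sonnet baseline

gain_vs_base = relative_improvement(fine_tuned_mean, base_8b_mean)
gap_to_claude = relative_improvement(claude_mean, fine_tuned_mean)

print(f"fine-tuned 3B vs 8B base: {gain_vs_base:.1%}")
print(f"Claude vs fine-tuned 3B:  {gap_to_claude:.1%}")
```

The same function applied in the other direction shows how far the fine-tuned model still sits below the zero-shot Claude baseline.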

What carries the argument

The ImmigrationQA dataset of source-grounded QA pairs extracted from official policy manuals, regulations, and precedents and used for LoRA fine-tuning of the small model.
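For intuition on why LoRA keeps this adaptation cheap: the technique freezes each adapted weight matrix W (d_out x d_in) and trains only two low-rank factors, B (d_out x r) and A (r x d_in), so the trainable parameters per matrix number r(d_in + d_out). The sketch below works through that arithmetic. The 3072 x 3072 projection size and the rank of 16 are illustrative assumptions, not values reported in the paper.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in) and leaves W frozen.
    """
    return rank * (d_in + d_out)


# Hypothetical square projection and rank, chosen only for illustration.
d_in, d_out, rank = 3072, 3072, 16

full_params = d_in * d_out
lora_params = lora_extra_params(d_in, d_out, rank)
fraction = lora_params / full_params

print(f"frozen weights:   {full_params:,}")
print(f"LoRA parameters:  {lora_params:,}")
print(f"trainable share:  {fraction:.2%}")
```

The trainable share shrinks as the matrix grows and rises linearly with the rank, which is the main knob trading adapter capacity against cost.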

If this is right

The fine-tuned model shows concentrated improvement in procedural subdomains such as travel documents, adjustment of status, and nonimmigrant visas.
A zero-shot Claude Sonnet baseline still scores higher at 1.52/3.0 with 25% fully correct answers.
The full dataset construction and adaptation pipeline required approximately $29 in cloud compute.
All artifacts including the dataset, model, code, and prompt templates are released publicly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The source-grounded dataset construction approach could be repeated for other regulatory domains that publish large volumes of official text.
The model would likely require periodic retraining to stay current with policy changes that occur after the corpus crawl date.
Pairing the fine-tuned model with retrieval over updated official sources could mitigate weaknesses on time-sensitive statistics.

Load-bearing premise

That LLM-as-judge scoring on the 101-example sample from generated QA pairs reliably indicates overall model quality for immigration questions and that the pairs faithfully capture the legal content of the source chunks.

What would settle it

A side-by-side comparison by immigration attorneys of model outputs against source documents on a fresh set of real petitioner questions would show whether the 27% score gain corresponds to higher factual accuracy.

Original abstract

U.S. immigration law spans thousands of pages of official policy, federal regulations, and procedural guidance that change frequently and carry high stakes for petitioners who lack legal representation. We describe the construction of ImmigrationQA, a source-grounded question-answering dataset of 17,058 pairs across 13 immigration subdomains, and the fine-tuning of a Llama 3.2 3B Instruct model on that dataset using parameter-efficient LoRA. The corpus was assembled from 11 primary and secondary sources -- including the USCIS Policy Manual, 8 CFR, BIA precedent decisions, and community Q&A -- yielding 10,056 validated canonical documents and 18,308 text chunks. Structured QA pairs were generated from these chunks using Claude Sonnet 4.6 via five mode-specific prompts, with 22 pairs rejected for insufficient source-span overlap. The fine-tuned model was evaluated against a held-out split of 993 pairs using LLM-as-judge scoring on a 101-example stratified sample. The fine-tuned model scored a mean of 1.08/3.0 (16.8% fully correct; 101-example stratified eval) versus the Llama 3 8B base model at 0.85/3.0 (4% fully correct), a relative improvement of 27% in mean score; a zero-shot Claude Sonnet baseline scored 1.52/3.0 (25% fully correct). The fine-tuned model shows concentrated improvement in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas) while remaining weak on complex legal reasoning and time-sensitive statistics. The full pipeline ran for approximately $29 in cloud compute. All artifacts -- dataset, model, code, and prompt templates -- are publicly released. The system is not a substitute for legal counsel and does not reflect regulatory changes after the corpus crawl date.
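The abstract does not say how the 101-example stratified sample was drawn from the 993-pair held-out split. One common approach is proportional allocation per subdomain with largest-remainder rounding, sketched below under that assumption. The `subdomain` field name, the helper name, and the synthetic records are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, k, seed=0):
    """Draw k items so each stratum is represented in proportion to its size."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    total = len(items)
    # Floor of each stratum's proportional share of the k slots.
    quotas = {g: (len(members) * k) // total for g, members in groups.items()}
    # Hand out the leftover slots to the strata with the largest remainders.
    by_remainder = sorted(
        groups, key=lambda g: (len(groups[g]) * k) % total, reverse=True
    )
    shortfall = k - sum(quotas.values())
    for g in by_remainder[:shortfall]:
        quotas[g] += 1

    sample = []
    for g, members in groups.items():
        sample.extend(rng.sample(members, quotas[g]))
    return sample


# Synthetic stand-in for the held-out split: 993 records over 13 subdomains.
held_out = [{"id": n, "subdomain": f"subdomain_{n % 13}"} for n in range(993)]
eval_sample = stratified_sample(held_out, key=lambda r: r["subdomain"], k=101)
print(len(eval_sample))
```

With proportional allocation, very small strata can receive zero slots; a variant that guarantees a minimum per subdomain would trade proportionality for coverage.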

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper ships a new public 17k-pair immigration-law QA dataset and modest LoRA gains on a 3B model, but its evaluation rests on unvalidated LLM-as-judge scores from Claude-generated data.

The main things to know are that this work releases a source-grounded dataset of 17,058 QA pairs drawn from USCIS manuals, 8 CFR, and related documents across 13 subdomains, and that LoRA fine-tuning lifts a 3B Llama 3.2 Instruct model to 1.08/3.0 on a 101-example LLM-as-judge sample versus 0.85/3.0 for an 8B base model.

What is actually new is the dataset itself and its public release of the pairs, the fine-tuned model, code, and prompts. The construction follows a standard pipeline: chunk official sources, generate pairs with Claude Sonnet via mode-specific prompts, filter for overlap, then adapt with LoRA. The $29 compute cost and explicit domain focus on procedural immigration topics are practical pluses. The held-out split and separate judge model avoid the most obvious circularity.
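The overlap filter in that pipeline can be pictured as a check that a generated answer is lexically supported by its source chunk. The paper's exact criterion is not reproduced here, so the following is a minimal sketch under assumptions: token-level overlap, a hypothetical 0.6 threshold, and made-up example text that is not quoted from any source document.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def span_overlap(answer: str, chunk: str) -> float:
    """Fraction of answer tokens that also occur in the source chunk."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    chunk_vocab = set(tokens(chunk))
    hits = sum(1 for t in answer_tokens if t in chunk_vocab)
    return hits / len(answer_tokens)


def keep_pair(answer: str, chunk: str, min_overlap: float = 0.6) -> bool:
    """Keep a QA pair only if enough of the answer is grounded in the chunk.

    The 0.6 threshold is an illustrative assumption, not the paper's value.
    """
    return span_overlap(answer, chunk) >= min_overlap


# Illustrative text only.
chunk = "An applicant must file Form I-131 to request a travel document."
grounded = "The applicant must file Form I-131."
ungrounded = "Premium processing takes fifteen days."

print(keep_pair(grounded, chunk), keep_pair(ungrounded, chunk))
```

A lexical filter like this catches answers that drift away from the source text but cannot tell whether a well-overlapping answer states the law correctly, which is the gap the referee report raises.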

The soft spots sit in the evaluation. All training and test pairs come from Claude, yet the headline numbers rest on LLM-as-judge scoring of only 101 stratified examples with no reported human legal review or agreement stats. The 27% relative lift could partly reflect stylistic alignment between generator and judge rather than better legal fidelity. Comparing the fine-tuned 3B instruct model to an 8B base model (not an 8B instruct) also makes attribution to the adaptation step less clean. The paper notes the system is not legal advice and stops at the crawl date, which is appropriate but underscores the narrow scope.
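To put the 101-example sample size in perspective, one can attach a confidence interval to the reported fully-correct rates. The sketch below uses the Wilson score interval at roughly 95% confidence. The counts of 17 and 4 out of 101 are inferred from the reported 16.8% and 4% and are an assumption, as is treating each example as an independent trial.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


n_eval = 101  # size of the stratified evaluation sample
# Counts inferred from the reported percentages (assumption).
fully_correct = {"fine-tuned 3B": 17, "Llama 3 8B base": 4}

intervals = {name: wilson_interval(k, n_eval) for name, k in fully_correct.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.1%} to {high:.1%}")
```

An interval of this kind only captures sampling noise; it says nothing about whether the judge's notion of "fully correct" matches a lawyer's.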

This is for researchers or practitioners who need a starting corpus in U.S. immigration QA and are willing to do their own validation. The data release gives it more value than a pure methods paper would have.

I would send it to peer review. The artifact is concrete and the domain application is clear even if the adaptation techniques are established.

Referee Report

3 major / 1 minor

Summary. The manuscript describes the construction of ImmigrationQA, a source-grounded QA dataset of 17,058 pairs across 13 subdomains derived from 10,056 canonical documents and 18,308 chunks from the USCIS Policy Manual, 8 CFR, BIA decisions, and related sources. QA pairs were generated via five mode-specific prompts to Claude Sonnet 4.6, with 22 pairs rejected for insufficient overlap. A Llama 3.2 3B Instruct model was fine-tuned with LoRA and evaluated on a 101-example stratified sample from the 993-pair held-out set using LLM-as-judge scoring, reporting a mean of 1.08/3.0 (16.8% fully correct) versus 0.85/3.0 (4% fully correct) for the Llama 3 8B base model (27% relative improvement); a zero-shot Claude baseline scored 1.52/3.0. All artifacts are publicly released, and the full pipeline cost approximately $29 in cloud compute.

Significance. If the evaluation is validated, the work supplies a publicly released, source-grounded dataset and reproducible LoRA adaptation pipeline for a high-stakes legal domain, demonstrating that small models can be cheaply specialized on procedural immigration topics. The explicit release of dataset, model, code, and prompt templates is a clear strength supporting further research.

major comments (3)

[Evaluation section] Abstract and corresponding evaluation paragraph: the headline 27% relative improvement (1.08 vs 0.85 mean score on the 3-point scale) rests on LLM-as-judge scoring of a 101-example stratified sample drawn from the 993-pair held-out set. No human expert validation, inter-annotator agreement statistics, or judge-model ablation is reported; because the judge shares the same generator distribution as the training data, the observed lift may reflect surface-form matching rather than improved legal fidelity.
[Dataset construction] Abstract and section on corpus assembly: all 17,058 pairs were produced by Claude Sonnet 4.6; only overlap filtering (22 pairs rejected) is described and no error rate or human legal-accuracy audit of the generated QA pairs is provided. This directly affects the central claim that the dataset faithfully represents U.S. immigration law content.
[Baselines] Baselines paragraph: the comparison is drawn against the Llama 3 8B base model rather than an 8B Instruct model, which weakens attribution of gains specifically to the LoRA adaptation on ImmigrationQA.

minor comments (1)

The 3-point scoring rubric used by the LLM judge is not reproduced or exemplified in the manuscript, making it difficult to interpret the absolute scores (1.08/3.0).

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for the constructive feedback. We respond point-by-point to the major comments below, proposing revisions where they strengthen the work without misrepresenting our contributions or resources.

Referee: [Evaluation section] Abstract and corresponding evaluation paragraph: the headline 27% relative improvement (1.08 vs 0.85 mean score on the 3-point scale) rests on LLM-as-judge scoring of a 101-example stratified sample drawn from the 993-pair held-out set. No human expert validation, inter-annotator agreement statistics, or judge-model ablation is reported; because the judge shares the same generator distribution as the training data, the observed lift may reflect surface-form matching rather than improved legal fidelity.

Authors: We agree that LLM-as-judge scoring without human validation or ablation is a limitation and that shared generator distribution raises the possibility of surface-form bias. The 101-example sample and subdomain-specific gains were reported transparently as an initial proxy; full legal-expert annotation exceeded available resources. In revision we will expand the limitations paragraph to discuss judge bias risks and the proxy nature of the metric, while retaining the reported numbers with their caveats.

Revision: partial

Referee: [Dataset construction] Abstract and section on corpus assembly: all 17,058 pairs were produced by Claude Sonnet 4.6; only overlap filtering (22 pairs rejected) is described and no error rate or human legal-accuracy audit of the generated QA pairs is provided. This directly affects the central claim that the dataset faithfully represents U.S. immigration law content.

Authors: Generation used five mode-specific prompts followed by overlap filtering to enforce source grounding. A full human legal audit was not conducted due to cost and domain-expertise requirements. We will revise the dataset-construction section to state this limitation explicitly, add further prompt and filtering details, and note that public release enables external audits.

Revision: partial

Referee: [Baselines] Baselines paragraph: the comparison is drawn against the Llama 3 8B base model rather than an 8B Instruct model, which weakens attribution of gains specifically to the LoRA adaptation on ImmigrationQA.

Authors: The Llama 3 8B base was selected to contrast adaptation gains against a larger but untuned model. We acknowledge that an 8B Instruct baseline would allow cleaner attribution to the ImmigrationQA fine-tuning. We will revise the baselines paragraph to include or discuss the 8B Instruct comparison and clarify the original rationale.

Revision: yes

standing simulated objections not resolved

Human expert validation, inter-annotator agreement, or judge ablation for the LLM-as-judge evaluation
Human legal-accuracy audit or error-rate measurement for the generated QA pairs

Circularity Check

0 steps flagged

No circularity: empirical pipeline uses external generator, held-out split, and separate judge model

The paper describes dataset construction from public sources via Claude Sonnet 4.6 prompts, LoRA fine-tuning of Llama 3.2 3B, and evaluation on a 993-pair held-out set scored by LLM-as-judge on a 101-example sample. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the derivation; the reported 1.08 vs 0.85 mean scores rest on an independent held-out split and external judge rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is an empirical NLP paper with no mathematical derivations. It rests on standard domain assumptions about automated data generation and evaluation quality rather than new axioms or entities.

axioms (2)

Domain assumption: LLM-generated QA pairs validated only by source-span overlap checks form a sufficiently accurate training set for legal reasoning.
Generation used Claude Sonnet with five prompts; only 22 pairs were rejected on overlap grounds, with the remainder treated as reliable.
Domain assumption: LLM-as-judge scores on a 101-example sample accurately reflect model performance on the full held-out set.
All reported scores and the 27% improvement claim derive from this automated evaluation method.

Reference graph

Works this paper leans on

12 extracted references · 5 canonical work pages · 3 internal anchors

[1] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. LEGAL-BERT: The muppets straight out of law school. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898-2904, 2020.

[2] Ilias Chalkidis, Nicolas Garneau, Catalina Goanta, Daniel Martin Katz, and Anders Søgaard. MultiEURLEX: A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. arXiv preprint arXiv:2109.00904, 2021.

[3] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.

[4] Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, et al. LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models. In Advances in Neural Information Processing Systems, volume 36, 2023.

[5] Peter Henderson, Mark S. Krass, Lucia Zheng, Neel Guha, Christopher D. Manning, Dan Jurafsky, and Daniel E. Ho. The Pile of Law: Learning responsible data filtering from the law and a 256GB open-source legal dataset. In Advances in Neural Information Processing Systems, volume 35, 2022.

[6] Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. CUAD: An expert-annotated NLP dataset for legal contract review. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.

[7] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

[8] Robert Mahari, Alex Pentland, and Markus Alber. LLM-assisted legal question answering with a retrieval-augmented architecture. In Proceedings of the Natural Legal Language Processing Workshop at EMNLP, 2024.

[9] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Kelsey Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, 2022.

[10] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2022.

[11] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023a.

[12] Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. LawInstruct: A general-purpose legal instruction dataset. arXiv preprint arXiv:2306.09027, 2023b.
